Bug 145017 – intltool-merge --pass-through causes problems in entity_encode

After an evaluation, GNOME has moved from Bugzilla to GitLab. Learn more about GitLab.
No new issues can be reported in GNOME Bugzilla anymore.
To report an issue in a GNOME project, go to GNOME GitLab.
Do not go to GNOME Gitlab for: Bluefish, Doxygen, GnuCash, GStreamer, java-gnome, LDTP, NetworkManager, Tomboy.

Bug 145017 - intltool-merge --pass-through causes problems in entity_encode


Summary:	intltool-merge --pass-through causes problems in entity_encode


Status:	RESOLVED FIXED

Product:	intltool
Classification:	Deprecated
Component:	general
Version:	unspecified
Hardware:	Other Linux

Importance:	Normal normal
Target Milestone:	---
Assigned To:	intltool maintainers
QA Contact:	intltool maintainers

URL:
Whiteboard:

Depends on:
Blocks:

Reported:	2004-06-26 14:21 UTC by Nickolay V. Shmyrev
Modified:	2004-12-22 21:47 UTC

See Also:
GNOME target:	---
GNOME version:	---

Attachments
t.tar.gz with test (10.66 KB, application/x-compressed-tar) 2004-10-12 10:09 UTC, Nickolay V. Shmyrev		Details
Deprecate --pass-through option (7.24 KB, patch) 2004-10-14 12:09 UTC, Danilo Segan	none	Details \| Review
Patch to update regression test to work with patch for bug 145017 (939 bytes, patch) 2004-10-14 16:18 UTC, Rodney Dawes	none	Details \| Review
Test: XML file (106 bytes, text/plain) 2004-10-19 20:57 UTC, Danilo Segan		Details
Test: PO file (590 bytes, text/plain) 2004-10-19 20:57 UTC, Danilo Segan		Details
Deprecate --pass-through option, update testcases (8.38 KB, patch) 2004-11-05 11:23 UTC, Danilo Segan	committed	Details \| Review

Description Nickolay V. Shmyrev 2004-06-26 14:21:23 UTC

Translation of xml file with intltool is correct in POSIX locale, but in
ru_RU.UTF-8 locale it is not correct, the translated string translated from UTF8
twice. 

The reason for that function entity_encode in intltool-merge

sub entity_encode
{
    my ($pre_encoded) = @_;
                                                                                
    my @list_of_chars = unpack ('C*', $pre_encoded);
                                                                                
    if ($PASS_THROUGH_ARG)
    {
        return join ('', map (&entity_encode_int_even_high_bit, @list_of_chars))
   }
    else
    {
        # with UTF-8 we only encode minimalistic
        return join ('', map (&entity_encode_int_minimalist, @list_of_chars));
    }
}

unpack incorrectly works with my perl-5.8.0 and russian utf8-string. I suggest
to relace this code with:

my @list_of_chars = unpack ('U*', $pre_encoded);

Comment 1 Rodney Dawes 2004-10-11 20:34:26 UTC

How can I test this? What is the end result that I can check for to make sure
this works?

Comment 2 Nickolay V. Shmyrev 2004-10-12 10:09:03 UTC

Created attachment 32504 [details]
t.tar.gz with test

Here is the test attached. Some comments:

Unfortunately, entity_encode doesn't called for intltool-0.31.3, but it was
executed for intltool-0.30.0. This is not the problem, since it is still called

for gconf schemas merge :)
									       

So the results
5.8.0 (Redhat 5.8.0-55.i386.rpm) - original (0.31.3) - broken both with -p or
wi5.8.0 (Redhat 5.8.0-55.i386.rpm) - fixed (0.31.3) with C* changed to U*
5.8.3 (Redhat 5.8.3-18.i386.rpm) - original and fixed partially broken - for -p

									       

Why entity_encode doesn't called for 0.31.3 - that is serious problem and
a big one I think.


Generally, all this is just a perl magic. But I think it should work anyhow.

Comment 3 Danilo Segan 2004-10-12 12:17:09 UTC

Nickolay, it's *not* a problem, since we're handling arbitrary XML (and want to
allow even nested tags) -- see bug 130802 for details. Schemas is a format where
we know what to expect, so we're making it easier for translators.

What you seem to be referring to is that intltool-merge encoded UTF-8 two-byte
sequences incorrectly, i.e. single character encoded as two UTF-8 bytes becomes
something like "&#208;&#173;".

This also happens only with "-p" (--pass-through) option for me, and using "U*"
makes it work for Perl 5.6.1 in your example, but it would make it not
"pass-through" anymore, but rather "treat input as UTF-8".  Try converting your
translation to eg. KOI8-R (iconv -f UTF-8 -t KOI8-R <ru.po |sed
s/UTF-8/KOI8-R/>blah.po), and then merging again: "intltool-merge -s -p .
test.schemas.in test.schemas".

So, if you're certain to have UTF-8 input (i.e. that's why you might want to use
"-p" with UTF-8 .po files), then why don't you use "-u" option instead of "-p"?

I simply mean that pass-through cannot be made to cope with both UTF-8 and
single-byte encodings as it is, so it's hard to decide what would "correct"
behaviour be. Best might be not to use entity_encode if it's used, or to use
entity_encode_int_minimalist instead.

Comment 4 Danilo Segan 2004-10-12 14:22:57 UTC

Uhm, to clarify a bit: this seems to be a bug with "-p" from my POV. It
shouldn't run entity_encode on any strings which are basically "passed through"
(meaning, no processing, right?).

Comment 5 Nickolay V. Shmyrev 2004-10-12 14:29:21 UTC

Agree about no entity encode for -x, also probably agree with -p.

But I wish to note that there is bug for -s -u options and redhat's perl-5.8.0.
So probably, there should be more portable implementation of entity_encode.

Comment 6 Danilo Segan 2004-10-12 15:00:50 UTC

Yes, I agree entity_encode is buggy when "-p" is used, but I'm not sure how to
fix it appropriately. Blindly using UTF-8 is as bad as blindly using single-byte
encodings. Perhaps not as bad. Rodney, Kenneth, what's your take on this?

OTOH, I see from your examples a problem with "-s -u" without "-p" in Perl 5.8.0
but not in 5.8.3 (as if UTF-8 encoded Cyrillic was treated as ISO-8859-1 and
then converted to UTF-8; basically, doing "iconv -f ISO-8859-1 -t UTF-8" on data
which is UTF-8 already; I've seen this too many times so I recognize it right
away :). This certainly looks like Perl bug if 5.8.3 works fine.

entity_encode is currently used for merging .schemas, bonobo-activation files
and XML attributes.  This means that whenever "-p" is used, these may run into
problems. Where is "-p" actually useful? We should at least discourage its' use.

Comment 7 Rodney Dawes 2004-10-12 15:16:17 UTC

I don't know. Sounds like a perl bug that has been fixed already, I guess. I
don't know this code that well.

Comment 8 Danilo Segan 2004-10-12 16:08:13 UTC

I think it's time to deprecate "-p". It was introduced long time ago, along with
"-u":

2001-08-18  Cyrille Chepelov  <chepelov@calixo.net>

        * xml-i18n-merge.in.in: added two new flags, --pass-through(-p)
        and --utf8(-u). The former asks for previous XML behaviour, the
        latter complies better with the XML spec. If neither is asked,
        xml-i18n-merge.in.in will complain vehemently but do -p.

I'd like to remove it right away, since "-p" is not used anywhere that I know of
(intltool.m4 uses "-u" in all the macros AFAICT). It makes sense to finally
switch over to "-u" by default, instead of this broken "-p".

This will also eliminate need for entity_encode as it is, and we'll use
*_minimalist version only.

I'm adjusting bug summary accordingly. I'll come up with a proposed patch later
on, unless someone does it before me (I wouldn't mind, just go ahead folks ;-).

Comment 9 Nickolay V. Shmyrev 2004-10-12 16:25:13 UTC

Could you also add test for 5.8.0 at least at configure.

Comment 10 Kenneth Rohde Christiansen 2004-10-12 16:42:09 UTC

I find it OK to remove the -p option. Maybe we should exit(1) intltool with an
error message when -p is used so that people know what they need to change.

Comment 11 Rodney Dawes 2004-10-12 16:56:16 UTC

Or maybe just a warning and continue on? It seems better than exiting,
especially since they will end up just removing the -p, and continuing until
further errors, themselves. I don't see any reason to have the extra step required.

Comment 12 Danilo Segan 2004-10-12 17:19:02 UTC

Nickolay, all .m4 macros and tests/selftest.pl have been using LC_ALL=C for a
while already, I guess simply to go around those Perl bugs.  So, you're probably
alright, unless you're running it by hand, when you'll want to do the same (run
as "LC_ALL=C intltool-merge ...").

So, you're right that there's still a problem with Perl 5.8.0, and your original
bug report could be split in two: one that concerns "-p" option problems (this
one), and other which concerns Perl 5.8.0 problems (but since that's fixed in
5.8.3, it might be enough to just document that people should use LC_ALL=C with
Perl 5.8.0).

Rodney, Kenneth: I fine with both.  I really don't think we'll run into anyone
still using "-p", since intltool.m4 doesn't use it (otoh, our own selftest.pl.in
uses it in a number of places).

Comment 13 Rodney Dawes 2004-10-12 17:31:06 UTC

I prefer a warning and just following through. Our regression tests need to be
rewritten anyway. If you just type ./configure && make check, it fails, because
intltool-merge isn't created yet. And it causes distcheck to fail, because it
generates a bunch of files that it doesn't clean up. It could probably be split
up into various individual tests too. It seems odd to have ~18 tests, and then
have it say "0 of 1 tests failed". But selftest fixing is probably another
series of bugs. I want to get all of this cruft cleaned up and working nicely
for a bigger release milestone. We can at least fix selftest.pl.in for now to
not use -p. Nothing should depend on it, I think.

Comment 14 Danilo Segan 2004-10-14 12:09:05 UTC

Created attachment 32600 [details] [review]
Deprecate --pass-through option

2004-10-14  Danilo Šegan  <dsegan@gmx.net>

	Fixes #145017.
	
	* NEWS: Document changes.

	* tests/selftest.pl.in (check_multimerge_result): Removed "-p" and
	"-u" options.

	* intltool-merge.in.in (print_help): modify -p, -u help messages.
	(utf8_sanity_check): default to UTF-8.
	(entity_encode): only minimalistic encoding.
	(entity_encode_int_even_high_bit): removed.
	Added utf8_sanity_check for all styles except RFC822.

Comment 15 Danilo Segan 2004-10-14 12:11:55 UTC

With these changes, test 2. fails (Bonobo UI merging), but the new XML is
equivalent to what we had before (we get &quot; now instead of &#34; etc, and
UTF-8 is used for "é", instead of &#2..;).

Most other merging steps already used "-u", which I removed now (it's the
default, and cannot be turned off).

Comment 16 Rodney Dawes 2004-10-14 16:09:20 UTC

Test 18 (xml:space) also fails.

Comment 17 Rodney Dawes 2004-10-14 16:17:21 UTC

Oh. Nevermind. Test 18 was failing because I was trying to implement a test for
bug 48489, and it was pulling stuff from that po file for some reason.

Comment 18 Rodney Dawes 2004-10-14 16:18:45 UTC

Created attachment 32610 [details] [review]
Patch to update regression test to work with patch for bug 145017

Here's a patch to update the regression test for merge1.xml, so that we pass
with the patch for this bug.

Comment 19 Rodney Dawes 2004-10-14 16:20:39 UTC

The patch looks ok, I think, after I updated the merge1.xml in results, so that
make check passes still. I'd like to get more testing though, before we commit
it. Like running make dist in a project where there are a LOT of xml file
translations that get merged in, where some of the translations might not be in
UTF-8 encodings.

Comment 20 Danilo Segan 2004-10-14 19:47:41 UTC

gok, "Locations.xml" in gnome-applets and xkeyboard_config (at Translation
Project) seem likely candidates for testing this (they use a lot of XML
messages), but I believe all of them have all translations in UTF-8 (just what
Gnome does everywhere). 

So, it's hard to get such test data, and the best I know of is in tests/ already:

$ intltool-merge -x cases cases/merge5p.sheet.in cases/merge5p.sheet
WARNING: cases/fr.po is not in UTF-8 but iso-8859-1, converting...
WARNING: cases/test.po is not in UTF-8 but iso-8859-1, converting...
WARNING: cases/fr_BE.po is not in UTF-8 but ISO-8859-1, converting...
Merging translations into cases/merge5p.sheet.
CREATED cases/merge5p.sheet

(just what test does, but with --quiet option)

Comment 21 Danilo Segan 2004-10-19 20:52:14 UTC

I found more problems which happen in certain cases (when fixing bug 155843).

What happens is that when used with "-m" option, we get the same crap (even with
"-u" option, or this patch), whereas if there's no "-m" option, it works fine. 
I'm not sure where the problem is, since same code paths are being used in both
cases AFAICS.

I'll attach testcases so someone can take a look. Note that you will probably
need patch from bug #155843 to test this.

Comment 22 Danilo Segan 2004-10-19 20:57:26 UTC

Created attachment 32795 [details]
Test: XML file

Comment 23 Danilo Segan 2004-10-19 20:57:51 UTC

Created attachment 32796 [details]
Test: PO file

Comment 24 Danilo Segan 2004-10-19 21:03:21 UTC

To make it clear: this is not relevant to the above patch: it happens with and
without it, so even when "pass-through" is not used.

Comment 25 Rodney Dawes 2004-11-03 19:37:35 UTC

Danilo: Any progress on this patch? Are the side effects bugs or what? It would
be nice to get the fix in soon.

Comment 26 Danilo Segan 2004-11-04 16:22:49 UTC

I'll look a bit more into it later today. I'll probably move off unpack()
instead, which will hopefully eliminate problems for good.

Comment 27 Danilo Segan 2004-11-05 11:14:49 UTC

Actually, the problem is with this line (around 730):
	print $fh "<$nodename$outattr";

If I put it instead as
	print $fh "<$nodename",$outattr;

it strangely works just fine ($outattr is not messed up anymore). I'll post
updated patch shortly, and commit it later on.

It worked for non-multiple-output cases because it actually used a different
code path (the line above was used for non-translated attributes string, which
is ASCII commonly, so problems were not obvious):
    print $fh "<", $nodename, " xml:lang=\"", $lang, "\"", $localattrs;

Comment 28 Danilo Segan 2004-11-05 11:23:05 UTC

Created attachment 33467 [details] [review]
Deprecate --pass-through option, update testcases

This should now be complete patch which resolves this bug in its entirety.

2004-11-05  Danilo Šegan  <dsegan@gmx.net>
 
	Fixes #145017.
	
	* NEWS: Document changes.

	* tests/selftest.pl.in (check_multimerge_result): Removed "-p" and
	"-u" options.

	* intltool-merge.in.in (print_help): modify -p, -u help messages.
	(utf8_sanity_check): default to UTF-8.
	(entity_encode): only minimalistic encoding.
	(entity_encode_int_even_high_bit): removed.
	Added utf8_sanity_check for all styles except RFC822.
	(traverse): s/"<$nodename$outattr"/"<$nodename", $outattr/.

Comment 29 Rodney Dawes 2004-11-05 16:46:32 UTC

Looks good except for a couple small style issues...

 elsif ($BA_STYLE_ARG && @ARGV > 2) 
 {
+        &utf8_sanity_check;
 	&preparation;

It looks like there is an extra space character here? The other additions like
this in the patch all line up correctly.

+	print $fh "<$nodename", $outattr;

Shouldn't $outattr be enclosed in double-quotes also?

Comment 30 Danilo Segan 2004-11-05 17:06:57 UTC

On the extra space: I probably put 8 spaces there instead of a tab, which is
used everywhere else I guess. I'll fix that.

On the second issue: no need to use quotes around $outattr, though it doesn't
hurt (I've tried both, and they work).  Quotes are not used anywhere else where
variables are passed standalone to print, so it would be breaking the (silent)
convention to do it here.

Comment 31 Danilo Segan 2004-11-06 08:40:10 UTC

Comment on attachment 33467 [details] [review]
Deprecate --pass-through option, update testcases

Committed with a few style fixes.