GNOME Bugzilla – Bug 145017
intltool-merge --pass-through causes problems in entity_encode
Last modified: 2004-12-22 21:47:04 UTC
Translation of xml file with intltool is correct in POSIX locale, but in ru_RU.UTF-8 locale it is not correct, the translated string translated from UTF8 twice. The reason for that function entity_encode in intltool-merge sub entity_encode { my ($pre_encoded) = @_; my @list_of_chars = unpack ('C*', $pre_encoded); if ($PASS_THROUGH_ARG) { return join ('', map (&entity_encode_int_even_high_bit, @list_of_chars)) } else { # with UTF-8 we only encode minimalistic return join ('', map (&entity_encode_int_minimalist, @list_of_chars)); } } unpack incorrectly works with my perl-5.8.0 and russian utf8-string. I suggest to relace this code with: my @list_of_chars = unpack ('U*', $pre_encoded);
How can I test this? What is the end result that I can check for to make sure this works?
Created attachment 32504 [details] t.tar.gz with test Here is the test attached. Some comments: Unfortunately, entity_encode doesn't called for intltool-0.31.3, but it was executed for intltool-0.30.0. This is not the problem, since it is still called for gconf schemas merge :) So the results 5.8.0 (Redhat 5.8.0-55.i386.rpm) - original (0.31.3) - broken both with -p or wi5.8.0 (Redhat 5.8.0-55.i386.rpm) - fixed (0.31.3) with C* changed to U* 5.8.3 (Redhat 5.8.3-18.i386.rpm) - original and fixed partially broken - for -p Why entity_encode doesn't called for 0.31.3 - that is serious problem and a big one I think. Generally, all this is just a perl magic. But I think it should work anyhow.
Nickolay, it's *not* a problem, since we're handling arbitrary XML (and want to allow even nested tags) -- see bug 130802 for details. Schemas is a format where we know what to expect, so we're making it easier for translators. What you seem to be referring to is that intltool-merge encoded UTF-8 two-byte sequences incorrectly, i.e. single character encoded as two UTF-8 bytes becomes something like "Э". This also happens only with "-p" (--pass-through) option for me, and using "U*" makes it work for Perl 5.6.1 in your example, but it would make it not "pass-through" anymore, but rather "treat input as UTF-8". Try converting your translation to eg. KOI8-R (iconv -f UTF-8 -t KOI8-R <ru.po |sed s/UTF-8/KOI8-R/>blah.po), and then merging again: "intltool-merge -s -p . test.schemas.in test.schemas". So, if you're certain to have UTF-8 input (i.e. that's why you might want to use "-p" with UTF-8 .po files), then why don't you use "-u" option instead of "-p"? I simply mean that pass-through cannot be made to cope with both UTF-8 and single-byte encodings as it is, so it's hard to decide what would "correct" behaviour be. Best might be not to use entity_encode if it's used, or to use entity_encode_int_minimalist instead.
Uhm, to clarify a bit: this seems to be a bug with "-p" from my POV. It shouldn't run entity_encode on any strings which are basically "passed through" (meaning, no processing, right?).
Agree about no entity encode for -x, also probably agree with -p. But I wish to note that there is bug for -s -u options and redhat's perl-5.8.0. So probably, there should be more portable implementation of entity_encode.
Yes, I agree entity_encode is buggy when "-p" is used, but I'm not sure how to fix it appropriately. Blindly using UTF-8 is as bad as blindly using single-byte encodings. Perhaps not as bad. Rodney, Kenneth, what's your take on this? OTOH, I see from your examples a problem with "-s -u" without "-p" in Perl 5.8.0 but not in 5.8.3 (as if UTF-8 encoded Cyrillic was treated as ISO-8859-1 and then converted to UTF-8; basically, doing "iconv -f ISO-8859-1 -t UTF-8" on data which is UTF-8 already; I've seen this too many times so I recognize it right away :). This certainly looks like Perl bug if 5.8.3 works fine. entity_encode is currently used for merging .schemas, bonobo-activation files and XML attributes. This means that whenever "-p" is used, these may run into problems. Where is "-p" actually useful? We should at least discourage its' use.
I don't know. Sounds like a perl bug that has been fixed already, I guess. I don't know this code that well.
I think it's time to deprecate "-p". It was introduced long time ago, along with "-u": 2001-08-18 Cyrille Chepelov <chepelov@calixo.net> * xml-i18n-merge.in.in: added two new flags, --pass-through(-p) and --utf8(-u). The former asks for previous XML behaviour, the latter complies better with the XML spec. If neither is asked, xml-i18n-merge.in.in will complain vehemently but do -p. I'd like to remove it right away, since "-p" is not used anywhere that I know of (intltool.m4 uses "-u" in all the macros AFAICT). It makes sense to finally switch over to "-u" by default, instead of this broken "-p". This will also eliminate need for entity_encode as it is, and we'll use *_minimalist version only. I'm adjusting bug summary accordingly. I'll come up with a proposed patch later on, unless someone does it before me (I wouldn't mind, just go ahead folks ;-).
Could you also add test for 5.8.0 at least at configure.
I find it OK to remove the -p option. Maybe we should exit(1) intltool with an error message when -p is used so that people know what they need to change.
Or maybe just a warning and continue on? It seems better than exiting, especially since they will end up just removing the -p, and continuing until further errors, themselves. I don't see any reason to have the extra step required.
Nickolay, all .m4 macros and tests/selftest.pl have been using LC_ALL=C for a while already, I guess simply to go around those Perl bugs. So, you're probably alright, unless you're running it by hand, when you'll want to do the same (run as "LC_ALL=C intltool-merge ..."). So, you're right that there's still a problem with Perl 5.8.0, and your original bug report could be split in two: one that concerns "-p" option problems (this one), and other which concerns Perl 5.8.0 problems (but since that's fixed in 5.8.3, it might be enough to just document that people should use LC_ALL=C with Perl 5.8.0). Rodney, Kenneth: I fine with both. I really don't think we'll run into anyone still using "-p", since intltool.m4 doesn't use it (otoh, our own selftest.pl.in uses it in a number of places).
I prefer a warning and just following through. Our regression tests need to be rewritten anyway. If you just type ./configure && make check, it fails, because intltool-merge isn't created yet. And it causes distcheck to fail, because it generates a bunch of files that it doesn't clean up. It could probably be split up into various individual tests too. It seems odd to have ~18 tests, and then have it say "0 of 1 tests failed". But selftest fixing is probably another series of bugs. I want to get all of this cruft cleaned up and working nicely for a bigger release milestone. We can at least fix selftest.pl.in for now to not use -p. Nothing should depend on it, I think.
Created attachment 32600 [details] [review] Deprecate --pass-through option 2004-10-14 Danilo Šegan <dsegan@gmx.net> Fixes #145017. * NEWS: Document changes. * tests/selftest.pl.in (check_multimerge_result): Removed "-p" and "-u" options. * intltool-merge.in.in (print_help): modify -p, -u help messages. (utf8_sanity_check): default to UTF-8. (entity_encode): only minimalistic encoding. (entity_encode_int_even_high_bit): removed. Added utf8_sanity_check for all styles except RFC822.
With these changes, test 2. fails (Bonobo UI merging), but the new XML is equivalent to what we had before (we get " now instead of " etc, and UTF-8 is used for "é", instead of ..;). Most other merging steps already used "-u", which I removed now (it's the default, and cannot be turned off).
Test 18 (xml:space) also fails.
Oh. Nevermind. Test 18 was failing because I was trying to implement a test for bug 48489, and it was pulling stuff from that po file for some reason.
Created attachment 32610 [details] [review] Patch to update regression test to work with patch for bug 145017 Here's a patch to update the regression test for merge1.xml, so that we pass with the patch for this bug.
The patch looks ok, I think, after I updated the merge1.xml in results, so that make check passes still. I'd like to get more testing though, before we commit it. Like running make dist in a project where there are a LOT of xml file translations that get merged in, where some of the translations might not be in UTF-8 encodings.
gok, "Locations.xml" in gnome-applets and xkeyboard_config (at Translation Project) seem likely candidates for testing this (they use a lot of XML messages), but I believe all of them have all translations in UTF-8 (just what Gnome does everywhere). So, it's hard to get such test data, and the best I know of is in tests/ already: $ intltool-merge -x cases cases/merge5p.sheet.in cases/merge5p.sheet WARNING: cases/fr.po is not in UTF-8 but iso-8859-1, converting... WARNING: cases/test.po is not in UTF-8 but iso-8859-1, converting... WARNING: cases/fr_BE.po is not in UTF-8 but ISO-8859-1, converting... Merging translations into cases/merge5p.sheet. CREATED cases/merge5p.sheet (just what test does, but with --quiet option)
I found more problems which happen in certain cases (when fixing bug 155843). What happens is that when used with "-m" option, we get the same crap (even with "-u" option, or this patch), whereas if there's no "-m" option, it works fine. I'm not sure where the problem is, since same code paths are being used in both cases AFAICS. I'll attach testcases so someone can take a look. Note that you will probably need patch from bug #155843 to test this.
Created attachment 32795 [details] Test: XML file
Created attachment 32796 [details] Test: PO file
To make it clear: this is not relevant to the above patch: it happens with and without it, so even when "pass-through" is not used.
Danilo: Any progress on this patch? Are the side effects bugs or what? It would be nice to get the fix in soon.
I'll look a bit more into it later today. I'll probably move off unpack() instead, which will hopefully eliminate problems for good.
Actually, the problem is with this line (around 730): print $fh "<$nodename$outattr"; If I put it instead as print $fh "<$nodename",$outattr; it strangely works just fine ($outattr is not messed up anymore). I'll post updated patch shortly, and commit it later on. It worked for non-multiple-output cases because it actually used a different code path (the line above was used for non-translated attributes string, which is ASCII commonly, so problems were not obvious): print $fh "<", $nodename, " xml:lang=\"", $lang, "\"", $localattrs;
Created attachment 33467 [details] [review] Deprecate --pass-through option, update testcases This should now be complete patch which resolves this bug in its entirety. 2004-11-05 Danilo Šegan <dsegan@gmx.net> Fixes #145017. * NEWS: Document changes. * tests/selftest.pl.in (check_multimerge_result): Removed "-p" and "-u" options. * intltool-merge.in.in (print_help): modify -p, -u help messages. (utf8_sanity_check): default to UTF-8. (entity_encode): only minimalistic encoding. (entity_encode_int_even_high_bit): removed. Added utf8_sanity_check for all styles except RFC822. (traverse): s/"<$nodename$outattr"/"<$nodename", $outattr/.
Looks good except for a couple small style issues... elsif ($BA_STYLE_ARG && @ARGV > 2) { + &utf8_sanity_check; &preparation; It looks like there is an extra space character here? The other additions like this in the patch all line up correctly. + print $fh "<$nodename", $outattr; Shouldn't $outattr be enclosed in double-quotes also?
On the extra space: I probably put 8 spaces there instead of a tab, which is used everywhere else I guess. I'll fix that. On the second issue: no need to use quotes around $outattr, though it doesn't hurt (I've tried both, and they work). Quotes are not used anywhere else where variables are passed standalone to print, so it would be breaking the (silent) convention to do it here.
Comment on attachment 33467 [details] [review] Deprecate --pass-through option, update testcases Committed with a few style fixes.