GNOME Bugzilla – Bug 98988
intltool-merge works incorrectly on utf-8 locale if .po is already in utf-8
Last modified: 2009-08-15 18:40:50 UTC
The problem is in the Perl function map() called in entity_encode. It works incorrectly with LANG=ru_RU.UTF-8 (it is fine for C and ru_RU.KOI8-R). It actually performs a second UTF-8 encoding pass on characters that are already UTF-8.
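To make the symptom concrete, here is a minimal sketch of what a second UTF-8 encoding pass does to one Cyrillic letter. It is not the code from intltool-merge; it uses the core Encode module (which ships with Perl 5.8 and later), and the variable names are only illustrative.

```perl
use strict;
use warnings;
use Encode qw(encode);

# One Cyrillic letter (U+043F) as a Perl character string.
my $chars = "\x{43f}";

# Correct output: encode once. The result is the two bytes D0 BF.
my $once = encode('UTF-8', $chars);

# The reported symptom: the two bytes are treated as two separate
# characters (U+00D0 and U+00BF) and each one is encoded again.
my $twice = encode('UTF-8', $once);

printf "once:  %s\n", unpack('H*', $once);   # once:  d0bf
printf "twice: %s\n", unpack('H*', $twice);  # twice: c390c2bf
```

A file containing the doubled bytes displays as two accented Latin characters per original letter when read as UTF-8.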
Do you have any idea how to work around this? Should we just check the language and skip the encoding, or do you have a better suggestion? If you could supply a patch (with ChangeLog entry) that would be wonderful.
I found the solution! At least it works for me (with perl 5.8.0-55 from RH80). Just before this map function, there is an unpack function:
    my @list_of_chars = unpack ('C*', $pre_encoded);
Just change C to U:
    my @list_of_chars = unpack ('U*', $pre_encoded);
And it works! But I am not sure earlier perl versions will be OK here. For more information about unpack see, for example, http://perlhelp.web.cern.ch/PerlHelp/lib/Pod/perlunicode.html
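For readers unfamiliar with the two templates, the following sketch shows the difference on a current Perl: 'C*' yields one number per octet of a byte string, while 'U*' yields one Unicode code point per character of a character string. The sample text and variable names are assumptions made for illustration, and the behaviour of both templates on mixed input has changed between Perl releases, so this only covers the two unambiguous cases.

```perl
use strict;
use warnings;
use Encode qw(encode);

my $chars = "\x{43f}\x{440}";            # two Cyrillic letters, as characters
my $bytes = encode('UTF-8', $chars);     # the same text as bytes: D0 BF D1 80

# 'C*' on the byte string: one value per octet.
my @octets = unpack('C*', $bytes);       # (208, 191, 209, 128)

# 'U*' on the character string: one value per code point.
my @code_points = unpack('U*', $chars);  # (1087, 1088)

print scalar(@octets), " octets, ", scalar(@code_points), " code points\n";
```

If each of the four octets is then written out as its own character entity, the result is the double encoding described in the report.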
Thanks! Fixed in CVS
Invalid type in unpack: 'U' at ./xml-i18n-merge line 464.
$ perl -v
This is perl, version 5.005_03 built for i386-linux
(the perl of Red Hat 6.2)
Would there be any problems with requiring a newer perl? We *need* to support UTF-8 today.
What's the symptom of this bug? Could you provide a simple test case? What happens if the perl tools are run explicitly in the C locale?
This change broke 'make install' for me. The .schemas files seem to be populated with strings from the locale encoding not UTF-8 even though the .po files the strings are fetched from are UTF-8. Changing unpack() from U* to C* around line 461 in intltool-merge fixes it for me.
This really breaks the tinderbox at the moment so I'll revert it for now until things get sorted out.
OK, it seems that in some situations U is better than C, but in other situations it causes problems. Can we determine the combinations and use some kind of "if" statement to switch? As a start: for UTF-8 based encodings and UTF-8 based translation files "U" works and "C" does not (for perl 5.8.0 at least). For 8-bit encodings (for LANG=C at least) and UTF-8 translation files "U" still works (at least for me). What are the situations when "U" breaks things?
I think it breaks when perl isn't new enough to support the option :/ There must be some Perl way to check the perl version and use U when perl is new enough, and C when not, and maybe output an error. Will you look at this, Sergey?
Well, I do not think I have enough expertise with Perl (actually, I have written no Perl code in my life:). I will try to figure out how to 'require' the perl version - but not sure how long it will take for me...
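One possible shape for such a version check, as a rough sketch only: Perl exposes the interpreter version as a number in the built-in variable $] (for example 5.00503 for 5.005_03). The helper name unpack_template and the 5.008 threshold are assumptions chosen for illustration; 5.008 is simply the version the "U" change was reported to work on, and this check only avoids the "Invalid type in unpack" error, it does not decide whether "U" is the right choice for a given locale.

```perl
use strict;
use warnings;

# Hypothetical helper: pick the unpack template from a numeric perl
# version such as the one stored in $].
sub unpack_template {
    my ($version) = @_;
    return $version >= 5.008 ? 'U*' : 'C*';
}

my $template = unpack_template($]);
print "perl $] would use template $template\n";
```

A script that cannot work at all on older interpreters could instead state `use 5.008;` at the top, which makes perl stop with a clear version error.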
No problem. Give it a try - it is always good to learn :) If you haven't fixed it before my next release I will take a look at it
Created attachment 14020 [details] PNG showing the results of this bug :)
I really wonder what is going on. I don't really understand it. Can you check whether it uses cached translations, and whether that matters? Do you get this warning: "WARNING: $po_file is not in UTF-8 but $encoding, converting..."?
Try changing this in intltool-merge....
    sub entity_encode
    {
        my ($pre_encoded) = @_;
        my @list_of_chars;
        if (%ENV{'LANG'} =~ /\.UTF-8$/) {
            @list_of_chars = unpack ('U*', $pre_encoded);
        } else {
            @list_of_chars = unpack ('C*', $pre_encoded);
        }
        if ($PASS_THROUGH_ARG) {
            return join ('', map (&entity_encode_int_even_high_bit, @list_of_chars));
        } else {
            # with UTF-8 we only encode minimalistic
            return join ('', map (&entity_encode_int_minimalist, @list_of_chars));
        }
    }
syntax error at ../intltool-merge line 465, near "%ENV{"
Glad that you're working on it :)
The output (without the patch) is all normal, btw.
    bert@saphir component $ make File_Roller_Component.server
    sed -e "s|\@BONOBODIR\@|/usr/lib/bonobo|" File_Roller_Component.server.in.in > File_Roller_Component.server.in
    ../intltool-merge ../po File_Roller_Component.server.in File_Roller_Component.server -o -u -c ../po/.intltool-merge-cache
    Generating and caching the translation database
    Merging translations into File_Roller_Component.server.
My bad, the syntax isn't %ENV{'LANG'} but $ENV{"LANG"}. Try changing that!
Yeah, now works as expected. Thanks.
On Markus' request I did some tests with perl 5.6.1 and a patched intltool-merge.
With an unpatched version of intltool-merge and the locale either unset (blank) or set to en_US.UTF-8, a certain translated file's md5sum was:
    80ea2cdb06f844592a86769b70a2dd90 File_Roller_Component.server
Now with patched intltool-merge and LANG unset:
    80ea2cdb06f844592a86769b70a2dd90 File_Roller_Component.server
(notice that here I get several lines of 'Use of uninitialized value in pattern match (m//) at ../intltool-merge line 466.' in the make output)
Now with patched intltool-merge and LANG set to en_US.UTF-8:
    c344cd26b265821d58ee953f275a31ea File_Roller_Component.server
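The 'Use of uninitialized value in pattern match' warnings appear because $ENV{LANG} is undefined when LANG is unset, and the patch matches it against a regex directly. A minimal sketch of a guarded check follows; the helper name lang_is_utf8 is hypothetical and not part of intltool-merge, and the regex is the one from the proposed patch.

```perl
use strict;
use warnings;

# Hypothetical helper: true only when the given LANG value ends in
# ".UTF-8". An unset (undefined) value counts as "not UTF-8" and
# produces no warning.
sub lang_is_utf8 {
    my ($lang) = @_;
    return 0 unless defined $lang;
    return $lang =~ /\.UTF-8$/ ? 1 : 0;
}

# $ENV{LANG} (scalar sigil) reads one entry of the %ENV hash.
print "UTF-8 locale by name: ", lang_is_utf8($ENV{LANG}), "\n";
```

This only silences the warning; it does not explain why the UTF-8 branch produced a different file in the test above.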
Where 80ea2cdb06f844592a86769b70a2dd90 is the md5sum of the correct file.
Which means that it doesn't work. I really don't understand what is happening here. :/ I hope you can try to debug. Does this only affect Russian?
It affects all utf-8 locales. Refer to bug 91289, which I think is a dup.
And the patch worked for you, but not for foser? Can you and foser look into this and try to figure out why it works for you and not for him? Different perl version? I really need this bug fixed.
Oh yes, forgot to mention, he has got something < 5.8 and I have got 5.8.0
Reminds me of something. A quote from http://www.cl.cam.ac.uk/~mgk25/unicode.html#activate "you can use a line such as utf8_mode = (strcmp(nl_langinfo(CODESET), "UTF-8") == 0); in order to detect whether the current locale uses the UTF-8 encoding." That would be at least a saner way than searching for UTF-8 in the locale name :)
I have locale LANG=ru_RU.UTF-8 and no CODESET envvar at all. So the way with CODESET can be used - but LANG should be analyzed anyway...
It says nowhere that this C code reads an environment variable :) From the nl_langinfo man page: CODESET (LC_CTYPE) Return a string with the name of the character encoding used in the selected locale, such as "UTF-8", "ISO-8859-1", or "ANSI_X3.4-1968" (better known as US-ASCII). This is the same string that you get with "locale charmap". For a list of character encoding names, try "locale -m", cf. locale(1).
You are right. Sorry for my ignorance. So for me that way is perfectly ok.
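For reference, the quoted C line has a close Perl counterpart in the core I18N::Langinfo module. The following is only a sketch of how the check could look, not the code that went into intltool-merge; the variable names are illustrative.

```perl
use strict;
use warnings;
use POSIX qw(setlocale LC_CTYPE);
use I18N::Langinfo qw(langinfo CODESET);

# Adopt the character-type locale from the environment
# (LC_ALL, LC_CTYPE or LANG, in that order of precedence).
setlocale(LC_CTYPE, '');

# Ask the C library which character encoding that locale uses.
my $codeset   = langinfo(CODESET);
my $utf8_mode = ($codeset eq 'UTF-8') ? 1 : 0;

print "codeset: $codeset, utf8_mode: $utf8_mode\n";
```

Because the answer comes from the locale database rather than from the spelling of the LANG value, it also covers locales selected through LC_ALL or LC_CTYPE.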
Changing C to U as Sergey suggested in his first comment doesn't work for me ( perl 5.8, redhat beta 8.0.93 ), I get the same results in xml files made with intltool-merge.
I wanted to fix this by letting the m4 macro substitute INTLTOOL_MERGE with LANG=C intltool-merge --blah ... but I cannot get this working. It seems like make doesn't like two = on the same line. If anyone has a fix, please commit it :)
Can't you set the locale in intltool-merge yourself? It's POSIX::setlocale() from what I read.
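A minimal sketch of what such a call looks like is below. One thing it illustrates is that setlocale() changes the C library locale of the running process but leaves %ENV alone, so any code that inspects $ENV{LANG} still sees the original value. The LANG value assigned here is just an example, and whether switching the locale this way would be enough for intltool-merge is not something this sketch shows.

```perl
use strict;
use warnings;
use POSIX qw(setlocale LC_ALL);

# Example value standing in for the user's environment.
$ENV{LANG} = 'ru_RU.UTF-8';

# Switch the process to the C locale; the return value is the name of
# the locale now in effect, or undef on failure.
my $result = setlocale(LC_ALL, 'C');

print "setlocale returned: ", (defined $result ? $result : 'undef'), "\n";
print "LANG is still: $ENV{LANG}\n";
```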
Maybe. If you can test it, that will be great.
I'd love to.
I tried adding
    use POSIX qw(locale_h);
    setlocale (LC_ALL, "C");
(also with LANG, "C") and it doesn't work. :(
I need help using LANG=C intltool-merge in the Makefile. Doing something like this will get me into trouble because INTLTOOL_MERGE is used in lines like [2]:
    [1] INTLTOOL_MERGE = LANG=C $(top_builddir)/intltool-merge
    [2] INTLTOOL_XML_RULE = %.xml: %.xml.in $(INTLTOOL_MERGE) $(wildcard $(top_srcdir)/po/*.po) ; $(INTTLTOOL_MERGE) $(top_srcdir)/po $< $@ -x -u -c $(top_builddir)/po/.intltool-merge-cache
I hope someone has some input.
OK, apparently I am just stupid :) Bug fixed in cvs!
Tested on RH 9, it works. Closing.