Bug 98988 – intltool-merge works incorrectly on utf-8 locale if .po is already in utf-8

Summary:	intltool-merge works incorrectly on utf-8 locale if .po is already in utf-8


Status:	VERIFIED FIXED

Product:	intltool
Classification:	Deprecated
Component:	general
Version:	unspecified
Hardware:	Other Linux

Importance:	High major
Target Milestone:	---
Assigned To:	intltool maintainers
QA Contact:	intltool maintainers

URL:
Whiteboard:

Depends on:
Blocks:

Reported:	2002-11-19 13:14 UTC by Sergey V. Udaltsov
Modified:	2009-08-15 18:40 UTC

See Also:
GNOME target:	---
GNOME version:	2.1/2.2

Attachments
PNG showing the results of this bug :) (192.82 KB, image/png) 2003-02-01 16:14 UTC, Markus Bertheau

Description Sergey V. Udaltsov 2002-11-19 13:14:46 UTC

The problem is in the Perl function map() called in entity_encode. It works
incorrectly with LANG=ru_RU.UTF-8 (it is fine for C and ru_RU.KOI8-R): it
effectively applies a second UTF-8 encoding to characters that are already
UTF-8.
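
For illustration only, the double encoding described above can be reproduced in isolation. This is a minimal sketch, not code from intltool-merge: it assumes the Encode module (shipped with Perl 5.8 and later), and the sample character and the helper name hex_bytes are made up for the example.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# U+0416 CYRILLIC CAPITAL LETTER ZHE as a character string.
my $char = "\x{0416}";

# Correct: one UTF-8 encoding pass yields two bytes, D0 96.
my $once = encode('UTF-8', $char);

# Double encoding: the two bytes are misread as two Latin-1 characters
# (U+00D0 and U+0096), and each of those is encoded to UTF-8 again.
my $twice = encode('UTF-8', decode('ISO-8859-1', $once));

# Hypothetical helper: render a byte string as space-separated hex.
sub hex_bytes { join ' ', map { sprintf '%02X', $_ } unpack 'C*', $_[0] }

print "encoded once:  ", hex_bytes($once),  "\n";   # D0 96
print "encoded twice: ", hex_bytes($twice), "\n";   # C3 90 C2 96
```

The second line of output is the typical signature of this kind of bug: every non-ASCII character grows from two bytes to four.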

Comment 1 Kenneth Rohde Christiansen 2002-11-21 16:04:05 UTC

Do you have any idea how to work around this? Should we just check
LANG and then skip the encoding, or do you have a better suggestion?

If you could supply a patch (with ChangeLog entry) that would be
wonderful.

Comment 2 Sergey V. Udaltsov 2002-11-21 16:25:08 UTC

I found the solution! At least it works for me (with perl 5.8.0-55
from RH80). Just before this map function, there is an unpack function.

my @list_of_chars = unpack ('C*', $pre_encoded);

Just change C to U:

my @list_of_chars = unpack ('U*', $pre_encoded);

And it works! But I am not sure earlier perl versions will be ok here.

For more information about unpack see, for example,
http://perlhelp.web.cern.ch/PerlHelp/lib/Pod/perlunicode.html
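
The difference between the two templates is whether the string is seen as bytes or as characters. The sketch below shows the two views for a short UTF-8 string. It deliberately obtains the character view through Encode::decode rather than through the U template, which is an assumption made for this example and not the change proposed in this comment; the sample bytes are also made up.

```perl
use strict;
use warnings;
use Encode qw(decode);

# UTF-8 bytes for two Cyrillic letters, U+0434 and U+0430.
my $bytes = "\xD0\xB4\xD0\xB0";

# Byte view, as produced by the 'C*' template: four values.
my @byte_values = unpack 'C*', $bytes;

# Character view: decode first, then take one code point per character.
my @code_points = map { ord } split //, decode('UTF-8', $bytes);

print "bytes:       @byte_values\n";   # 208 180 208 176
print "code points: @code_points\n";   # 1076 1072
```

An entity encoder that walks the byte view emits four values for two letters, whereas one that walks the character view emits two.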

Comment 3 Kenneth Rohde Christiansen 2002-11-21 16:34:33 UTC

Thanks! Fixed in CVS

Comment 4 Yanko Kaneti 2002-11-21 22:59:23 UTC

Invalid type in unpack: 'U' at ./xml-i18n-merge line 464.

$ perl -v

This is perl, version 5.005_03 built for i386-linux



This is the Perl shipped with Red Hat 6.2.

Comment 5 Kenneth Rohde Christiansen 2002-11-21 23:31:44 UTC

Would there be any problems with requiring a newer perl? We *need* to
support UTF-8 today.

Comment 6 Yanko Kaneti 2002-11-22 00:29:50 UTC

What's the symptom of this bug? Could you provide a simple test case?

What happens if the Perl tools are run explicitly in the C locale?

Comment 7 Kjartan Maraas 2002-11-23 16:07:42 UTC

This change broke 'make install' for me. The .schemas files seem to be
populated with strings in the locale encoding rather than UTF-8, even
though the .po files the strings are fetched from are UTF-8.

Changing unpack() from U* to C* around line 461 in intltool-merge
fixes it for me.

Comment 8 Yanko Kaneti 2002-11-23 21:58:49 UTC

This really breaks the tinderbox at the moment so I'll revert it for
now until things get sorted out.

Comment 9 Sergey V. Udaltsov 2002-11-24 21:35:04 UTC

OK, it seems in some situations U is better than C - but in some
situations this causes problems. Can we determine the combinations -
and use some kind of "if" statement to switch?

As a start - for UTF-8 based encodings and UTF-8 based translation
files "U" works - and "C" does not (for perl 5.8.0 at least).

For 8-bit encodings (for LANG=C at least) and UTF-8
translation files "U" still works (at least for me).

What are the situations when "U" breaks things?

Comment 10 Kenneth Rohde Christiansen 2003-01-05 17:07:36 UTC

I think it breaks when Perl isn't new enough to support the option :/

There must be some Perl way to check the Perl version and use U
when Perl is new enough, and C when not, and maybe output an error.

Will you look at this, Sergey?
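
One way to sketch the version check proposed here is with Perl's built-in $] variable, which holds the interpreter version as a number (5.00503 for 5.005_03, 5.008 for 5.8.0). The helper name and the 5.008 threshold are assumptions taken from the versions mentioned in this thread, not a tested compatibility boundary.

```perl
use strict;
use warnings;

# Hypothetical helper: pick the unpack template from a Perl version number
# in the numeric form used by $] (for example 5.008 for 5.8.0).
sub unpack_template_for {
    my ($perl_version) = @_;
    return $perl_version >= 5.008 ? 'U*' : 'C*';
}

my $template = unpack_template_for($]);
print "perl $] would use unpack template '$template'\n";
```

Passing the version in as an argument keeps the decision testable without having to run several interpreters.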

Comment 11 Sergey V. Udaltsov 2003-01-07 11:59:29 UTC

Well, I do not think I have enough expertise with Perl (actually, I
have written no Perl code in my life:). I will try to figure out how
to 'require' the perl version - but not sure how long it will take for
me...

Comment 12 Kenneth Rohde Christiansen 2003-01-07 14:07:48 UTC

No problem. Give it a try - it is always good to learn :) If you
haven't fixed it before my next release I will take a look at it

Comment 13 Markus Bertheau 2003-02-01 16:14:48 UTC

Created attachment 14020
PNG showing the results of this bug :)

Comment 14 Kenneth Rohde Christiansen 2003-02-01 16:51:49 UTC

I really wonder what is going on. I don't really understand it. Can you
check whether it uses cached translations, and whether that matters? Do
you get this warning: "WARNING: $po_file is not in UTF-8 but $encoding,
converting..."?

Comment 15 Kenneth Rohde Christiansen 2003-02-01 16:57:29 UTC

Try changing this in intltool-merge....

sub entity_encode
{
    my ($pre_encoded) = @_;
    my @list_of_chars;

    if (%ENV{'LANG'} =~ /\.UTF-8$/)
    {
	@list_of_chars = unpack ('U*', $pre_encoded);
    }
    else
    {
	@list_of_chars = unpack ('C*', $pre_encoded);
    }

    if ($PASS_THROUGH_ARG) 
    {
        return join ('', map (&entity_encode_int_even_high_bit,
@list_of_chars));
    } 
    else 
    {
	# with UTF-8 we only encode minimalistic
        return join ('', map (&entity_encode_int_minimalist,
@list_of_chars));
    }
}

Comment 16 Markus Bertheau 2003-02-01 17:07:56 UTC

syntax error at ../intltool-merge line 465, near "%ENV{"

Glad that you're working on it :)

Comment 17 Markus Bertheau 2003-02-01 17:09:42 UTC

The output (without the patch) is all normal, btw.

bert@saphir component $ make File_Roller_Component.server
sed -e "s|\@BONOBODIR\@|/usr/lib/bonobo|"
File_Roller_Component.server.in.in > File_Roller_Component.server.in
../intltool-merge ../po File_Roller_Component.server.in
File_Roller_Component.server -o -u -c ../po/.intltool-merge-cache
Generating and caching the translation database
Merging translations into File_Roller_Component.server.

Comment 18 Kenneth Rohde Christiansen 2003-02-01 17:13:49 UTC

My bad, the syntax isn't %ENV{'LANG'} but $ENV{"LANG"}

Try changing that!
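
As a minimal sketch of the corrected check: a single hash element is read with the $ sigil, and the value may be undefined when LANG is not set (a later comment in this thread reports "Use of uninitialized value" warnings in exactly that case). The helper name lang_is_utf8 is hypothetical and not part of intltool-merge.

```perl
use strict;
use warnings;

# Hypothetical helper: true when a locale name such as "ru_RU.UTF-8"
# ends in a UTF-8 codeset. The value is passed in as an argument so that
# an unset LANG (undef) is handled without an "uninitialized" warning.
sub lang_is_utf8 {
    my ($lang) = @_;
    return 0 unless defined $lang;
    return $lang =~ /\.UTF-8$/ ? 1 : 0;
}

printf "LANG=%s -> %s\n",
    defined $ENV{LANG} ? $ENV{LANG} : '(unset)',
    lang_is_utf8($ENV{LANG}) ? 'unpack with U*' : 'unpack with C*';
```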

Comment 19 Markus Bertheau 2003-02-01 17:20:08 UTC

Yeah, now works as expected. Thanks.

Comment 20 Marinus Schraal 2003-02-01 18:36:55 UTC

On Markus' request I did some tests with perl 5.6.1 and a patched
intltool-merge.

With an unpatched version of intltool-merge and locale either unset
(blank) or set to en_US.UTF-8 a certain translated file's md5sum was :
80ea2cdb06f844592a86769b70a2dd90  File_Roller_Component.server

Now with patched intltool-merge and LANG unset :
80ea2cdb06f844592a86769b70a2dd90  File_Roller_Component.server
(notice that here I get several lines of 'Use of uninitialized value
in pattern match (m//) at ../intltool-merge line 466.' in the make output)

Now with patched intltool-merge and LANG set to en_US.UTF-8 :
c344cd26b265821d58ee953f275a31ea  File_Roller_Component.server

Comment 21 Markus Bertheau 2003-02-01 18:51:56 UTC

Where 80ea2cdb06f844592a86769b70a2dd90 is the md5sum of the correct file.

Comment 22 Kenneth Rohde Christiansen 2003-02-01 19:26:15 UTC

Which means that it doesn't work. I really don't understand what is
happening here. :/ - I hope you can try to debug. Does this only
affect Russian?

Comment 23 Markus Bertheau 2003-02-01 19:50:01 UTC

It affects all utf-8 locales. Refer to bug 91289, which I think is a dup.

Comment 24 Kenneth Rohde Christiansen 2003-02-01 19:58:22 UTC

And the patch worked for you? but not for foser? Can you and foser
look into this? Try to figure out why it works for you and not him?
Different perl version? I really need this bug fixed.

Comment 25 Markus Bertheau 2003-02-01 20:15:37 UTC

Oh yes, forgot to mention, he has got something < 5.8 and I have got 5.8.0

Comment 26 Markus Bertheau 2003-02-06 22:40:54 UTC

Reminds me of something. A quote from

http://www.cl.cam.ac.uk/~mgk25/unicode.html#activate

"you can use a line such as

  utf8_mode = (strcmp(nl_langinfo(CODESET), "UTF-8") == 0);

in order to detect whether the current locale uses the UTF-8 encoding."

That would be at least a saner way than searching for UTF-8 in the
locale name :)

Comment 27 Sergey V. Udaltsov 2003-02-06 23:15:09 UTC

I have locale LANG=ru_RU.UTF-8 and no CODESET envvar at all. So the
way with CODESET can be used - but LANG should be analyzed anyway...

Comment 28 Markus Bertheau 2003-02-08 16:05:23 UTC

Nowhere does it say that this C code reads an environment variable :) From
the nl_langinfo man page:

       CODESET (LC_CTYPE)
              Return a string with the name of the character encoding used in
              the selected locale, such as "UTF-8", "ISO-8859-1", or
              "ANSI_X3.4-1968" (better known as US-ASCII). This is the same
              string that you get with "locale charmap". For a list of
              character encoding names, try "locale -m", cf. locale(1).
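
Perl exposes the same query through the core module I18N::Langinfo, so the C snippet quoted earlier could be approximated as below. This is a sketch under assumptions: the helper name codeset_is_utf8 is hypothetical, the case-insensitive match with an optional hyphen is a choice made for this example, and the explicit setlocale call is there only to make sure the locale from the environment is in effect before it is queried.

```perl
use strict;
use warnings;
use POSIX qw(setlocale LC_CTYPE);
use I18N::Langinfo qw(langinfo CODESET);

# Hypothetical helper: compare a codeset name against UTF-8, ignoring
# case and the optional hyphen ("UTF-8", "utf8").
sub codeset_is_utf8 {
    my ($codeset) = @_;
    return 0 unless defined $codeset;
    return $codeset =~ /\Autf-?8\z/i ? 1 : 0;
}

# Adopt the locale from the environment, then ask the C library for its
# character encoding, as "locale charmap" would report it.
setlocale(LC_CTYPE, '');
my $codeset = langinfo(CODESET);

print "codeset: $codeset, UTF-8: ", codeset_is_utf8($codeset), "\n";
```

Unlike a pattern match on LANG, this also reflects LC_ALL and LC_CTYPE, which take precedence over LANG when they are set.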

Comment 29 Sergey V. Udaltsov 2003-02-09 00:19:58 UTC

You are right. Sorry for my ignorance. So for me that way is perfectly ok.

Comment 30 Marius Andreiana 2003-02-10 13:42:53 UTC

Changing C to U as Sergey suggested in his first comment doesn't work
for me (perl 5.8, Red Hat beta 8.0.93); I get the same results in XML
files made with intltool-merge.

Comment 31 Kenneth Rohde Christiansen 2003-02-14 15:30:51 UTC

I wanted to fix this by letting the m4 substitute INTLTOOL_MERGE with

LANG=C intltool-merge --blah ...

but I cannot get this working. It seems like make doesn't like two = on
the same line. If anyone has a fix, please commit it :)

Comment 32 Markus Bertheau 2003-02-14 15:40:56 UTC

Can't you set the locale in intltool-merge yourself? It's
POSIX::setlocale() from what I read.

Comment 33 Kenneth Rohde Christiansen 2003-02-14 15:48:12 UTC

Maybe. If you can test it, that will be great.

Comment 34 Markus Bertheau 2003-02-15 10:05:22 UTC

I'd love to.

Comment 35 Kenneth Rohde Christiansen 2003-03-11 22:39:05 UTC

I tried adding

use POSIX qw(locale_h);
setlocale (LC_ALL, "C");

(also with LANG, "C")

and it doesn't work. :(

I need help using LANG=C intltool-merge in the Makefile

Doing something like [1] will get me into trouble because
INTLTOOL_MERGE is also used in lines like [2]

[1] INTLTOOL_MERGE = LANG=C $(top_builddir)/intltool-merge

[2] INTLTOOL_XML_RULE = %.xml:       %.xml.in       $(INTLTOOL_MERGE)
$(wildcard $(top_srcdir)/po/*.po) ; $(INTTLTOOL_MERGE)
$(top_srcdir)/po $< $@ -x -u -c $(top_builddir)/po/.intltool-merge-cache

I hope someone has some input

Comment 36 Kenneth Rohde Christiansen 2003-03-12 00:20:56 UTC

OK, apparently I am just stupid :) 

Bug fixed in cvs!

Comment 37 Marius Andreiana 2003-07-24 06:47:46 UTC

Tested on RH 9, it works. Closing.