Bug 320721 - intltool is not UTF-8 ready, may corrupt PO files
Status: RESOLVED FIXED
Product: intltool
Classification: Deprecated
Component: general
Version: 0.34.x
Hardware: Other All
Importance: Normal normal
Target Milestone: ---
Assigned To: intltool maintainers
QA Contact: intltool maintainers
Depends on:
Blocks:
 
 
Reported: 2005-11-04 21:44 UTC by Simos Xenitellis
Modified: 2005-11-25 20:01 UTC
See Also:
GNOME target: ---
GNOME version: 2.13/2.14


Attachments
Make intltool UTF-8 ready, open files in UTF-8 mode, XML::Parser::Expat accesses nodes as UTF-8 strings (4.03 KB, patch)
2005-11-04 21:46 UTC, Simos Xenitellis, status: none
Addition of testcase to check for encoding problems when parsing XML files. (2.10 KB, patch)
2005-11-05 20:46 UTC, Simos Xenitellis, status: none
Patch to extract comments in XML files without processing them and spoiling the encoding (522 bytes, patch)
2005-11-06 00:05 UTC, Simos Xenitellis, status: none
Patch to make intltool (xml) encoding-neutral, also adds two test cases. (5.53 KB, patch)
2005-11-24 23:36 UTC, Simos Xenitellis, status: committed

Description Simos Xenitellis 2005-11-04 21:44:08 UTC
Please describe the problem:
As Roozbeh noticed
(http://mail.gnome.org/archives/gnome-i18n/2005-November/msg00005.html),
intltool is not UTF-8 ready.
The bug appears when producing the POT file for
http://cvs.gnome.org/viewcvs/gnome-applets/gweather/Locations.xml.in
by running "intltool-update -P" in
http://cvs.gnome.org/viewcvs/gnome-applets/po-locations/
It might also corrupt the .po files when updating them.

Steps to reproduce:
1. Create the POT file for po-locations (gnome-applets/po-locations) by running
"intltool-update -P".
2. Check whether the file is valid UTF-8 with:
iconv -f utf8 -t utf8 < gnome-applets/po-locations/gnome-applets-2.0.pot
3. iconv reports invalid UTF-8 sequences in the file.
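
As a rough illustration of the check in steps 2 and 3, here is a Python sketch (not part of intltool, which is written in Perl). The function name and the sample bytes are made up for the example and are not taken from the real POT file.

```python
def first_invalid_utf8_offset(data: bytes):
    """Return the byte offset of the first invalid UTF-8 sequence, or None."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as err:
        return err.start
    return None


# Hypothetical sample data: the same word with "e acute" stored two ways.
valid = 'msgid "Orl\u00e9ans"\n'.encode("utf-8")   # 0xC3 0xA9
broken = b"#. Orl\xe9ans\n" + valid                 # lone 0xE9 (ISO-8859-1)

print(first_invalid_utf8_offset(valid))
print(first_invalid_utf8_offset(broken))
```

A file that passes this check would also pass the iconv round trip; a file that mixes encodings fails at the first non-UTF-8 byte.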


Actual results:
The generated POT file is not valid UTF-8, meaning that a person starting a
translation may get a corrupted file.

Expected results:
The resulting file should be valid UTF-8.
The intltool scripts should be UTF-8 ready.

Does this happen every time?
Yes, for the specific POT file that has UTF-8 characters.

Other information:
intltool does not open files in UTF-8 mode.
The XML code manages to bypass this issue by accessing nodes through
"original_string()" (perldoc XML::Parser::Expat), which returns the original node
verbatim, not as UTF-8. However, this "workaround" does not work with the
comments in http://cvs.gnome.org/viewcvs/gnome-applets/gweather/Locations.xml.in
so the resulting file is not valid UTF-8.

The solution is to open the files in Perl as UTF-8, as in
# Open the file, then mark the handle as UTF-8
open(MYFILE, "myfile.po") or die "Cannot open myfile.po: $!";
binmode MYFILE, ":utf8";

Also, access the nodes from XML files using recognized_string().

Patch follows.
Comment 1 Simos Xenitellis 2005-11-04 21:46:06 UTC
Created attachment 54323
Make intltool UTF-8 ready, open files in UTF-8 mode, XML::Parser::Expat accesses nodes as UTF-8 strings

Patch created for HEAD
Comment 2 Danilo Segan 2005-11-05 13:32:38 UTC
Care to provide a test as well, please? :)

(test that will fail with current intltool, but which would work with new one:
i.e. UTF-8 in a comment)
Comment 3 Rodney Dawes 2005-11-05 14:02:05 UTC
How does this deal with po files that are not in UTF-8?
Comment 4 Simos Xenitellis 2005-11-05 14:19:51 UTC
> Care to provide a test as well, please? :)
> 
> (test that will fail with current intltool, but which would work with new one:
> i.e. UTF-8 in a comment)

Use the current intltool to create the POT file for gnome-applets-locations,
(run "intltool-update -P" in gnome-applets/po-locations/). 
[The file is also available at
http://l10n-status.gnome.org/gnome-2.14/PO/gnome-applets-locations.HEAD.pot]

The resulting file is not UTF-8 encoded:
1. you can inspect it with your favourite text editor
2. you can test it with "iconv -f utf8 -t utf8 < gnome-applets-locations.HEAD.pot"
3. you can run "file" on it:
% file gnome-applets-locations.HEAD.pot
gnome-applets-locations.HEAD.pot: Non-ISO extended-ASCII English text
%

If you apply the patch and create the POT file again, it passes as valid UTF-8 :)
Comment 5 Simos Xenitellis 2005-11-05 14:32:04 UTC
> How does this deal with po files that are not in UTF-8?

I am not aware of such a case in GNOME translations.
Can you point me to such a project/PO file?

Currently all POT files contain only characters that fit in US-ASCII, which
effectively means that they pass as UTF-8 (UTF-8 is compatible with US-ASCII, as
the characters with codepoints from 0-127 are represented exactly the same).
All POT files that I know of in GNOME CVS simply use US-ASCII, so the tools
happened to work. Once non-US-ASCII characters appear in POT files, there is a
need to use UTF-8.
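
To make the US-ASCII point concrete, here is a small Python sketch (illustrative only; the strings and the helper name are hypothetical). It shows that text limited to US-ASCII produces identical bytes in UTF-8 and ISO-8859-1, while a single accented character does not.

```python
def encodings_agree(text: str) -> bool:
    """True if the text has the same byte form in UTF-8 and ISO-8859-1."""
    return text.encode("utf-8") == text.encode("iso-8859-1")


ascii_text = "Project-Id-Version: gnome-applets"   # hypothetical header line
accented = "Orl\u00e9ans"                          # contains U+00E9

print(encodings_agree(ascii_text))
print(encodings_agree(accented))
print(accented.encode("utf-8"), accented.encode("iso-8859-1"))
```

This is why tools that never look at the encoding appear to work until the first non-ASCII character shows up.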

Using encodings such as iso-8859-x is not an easy task, as the files themselves
do not reliably state their encoding. There is a field in the PO headers, but it
acts effectively as a suggestion. GNOME, since 2.0, moved to UTF-8 for all
translation work, even if it is Australian English, French, or Malay (they
effectively use US-ASCII).

I am not sure if there are specific active GNOME projects where non-UTF-8
encodings (such as iso-8859-x or windows-125x) are still a requirement.
Comment 6 Danilo Segan 2005-11-05 14:35:33 UTC
Re comment 4: Simos, I was thinking of a short regression test for
intltool/tests/ infrastructure.

Re comment 5: Simos, intltool is not used only in GNOME projects, and breaking
backwards compatibility is not a wise idea anyway.  Providing a test-case which
passes "make check" would implicitly check if it passes for all the non-UTF-8
PO files as well (we have regression tests for that as well).
Comment 7 Danilo Segan 2005-11-05 14:41:50 UTC
Btw, the problem is not with non-UTF-8 PO files, since only intltool-merge works
with them, but rather, with other content encoded as non-UTF-8.

expat (and by extension, our XML parser) should be able to handle <?xml
encoding="<something-not-UTF-8>" ?> as well, right?

Not to mention other file types, such as RFC822 as used in Debian files, which
even insist on encoding NOT being UTF-8.
Comment 8 Simos Xenitellis 2005-11-05 20:46:27 UTC
Created attachment 54367
Addition of testcase to check for encoding problems when parsing XML files.

Checks whether the encoding of the text is handled properly for an XML file when
intltool-extract is used to parse it.

An issue arises when the XML file has translator comments and translatable text
in an encoding other than US-ASCII. If these are US-ASCII, the problem does not
arise.

The current version of intltool-extract creates a file with a problem:
The comments are saved as iso-8859-1
The messages are saved as UTF-8 (no conversion done).

Then, when editing this file, editors such as vi/gedit/?? would save as
iso-8859-1, corrupting it.
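
The mixed-encoding output described above can be imitated with a short Python sketch (illustrative only, with made-up sample lines; it does not run intltool-extract). It builds a file whose comment is ISO-8859-1 and whose message is UTF-8, then shows what an editor that guesses ISO-8859-1 would display.

```python
# Hypothetical two-line extract: comment in ISO-8859-1, message in UTF-8.
comment = "#. Orl\u00e9ans\n".encode("iso-8859-1")
message = 'msgid "Orl\u00e9ans"\n'.encode("utf-8")
pot = comment + message

# An editor that treats the whole file as ISO-8859-1 shows the comment
# correctly but turns the two UTF-8 bytes of the message into two characters.
as_latin1 = pot.decode("iso-8859-1")
for line in as_latin1.splitlines():
    print(line)

# The same bytes are rejected outright by a strict UTF-8 decoder.
try:
    pot.decode("utf-8")
    is_valid_utf8 = True
except UnicodeDecodeError:
    is_valid_utf8 = False
print(is_valid_utf8)
```

Neither reading of the file is correct for every line, which is why any text saved on top of it ends up inconsistent.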
Comment 9 Simos Xenitellis 2005-11-05 21:42:04 UTC
Danilo: I see what you mean by keeping intltool encoding-neutral.
There is a way to fix the encoding problem with a simple change similar to the
following:

--- intltool-old/intltool-extract.in.in      2005-08-01 11:34:42.000000000 +0500
+++ intltool/intltool-extract.in.in      2005-11-05 20:55:09.000000000 +0500
@@ -485,11 +485,10 @@
 sub intltool_tree_comment
 {
     my $expat = shift;
-    my $data  = shift;
     my $clist = $expat->{Curlist};
     my $pos   = $#$clist;

-    push @$clist, 1 => $data;
+    push @$clist, 1 => $expat->original_string();
 }

 # Verbatim copy from intltool-merge.in.in
================

By using "original_string()" we avoid any implicit encoding conversion.

However, the string returned contains "<!--" and "-->", as in
"<!-- Comment for *both* attributes and content -->"
instead of the correct
" Comment for *both* attributes and content "

Is there an elegant way to remove those "<!--" and "-->" apart from using
something like substr?
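
One byte-level way to do the stripping asked about here, sketched in Python rather than in intltool's Perl. The function name is hypothetical, and the patch that was eventually committed may handle this differently. Working on raw bytes keeps the approach encoding-neutral, since the delimiters themselves are plain ASCII.

```python
def strip_comment_delimiters(raw: bytes) -> bytes:
    """Remove a leading "<!--" and trailing "-->" without decoding the body."""
    prefix, suffix = b"<!--", b"-->"
    long_enough = len(raw) >= len(prefix) + len(suffix)
    if long_enough and raw.startswith(prefix) and raw.endswith(suffix):
        return raw[len(prefix):len(raw) - len(suffix)]
    return raw  # not a complete comment; leave it untouched


original = b"<!-- Comment for *both* attributes and content -->"
print(strip_comment_delimiters(original))

# Bytes outside US-ASCII pass through unchanged (here a lone ISO-8859-1 0xE9).
print(strip_comment_delimiters(b"<!-- Orl\xe9ans -->"))
```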
Comment 10 Simos Xenitellis 2005-11-06 00:05:31 UTC
Created attachment 54375
Patch to extract comments in XML files without processing them and spoiling the encoding

Patch to allow XML files to have comments in any encoding, without processing
them and spoiling the encoding.
Fixes the issue with gnome-applets/po-locations/ where the POT file is created
wrongly: it has strings in UTF-8 and ISO-8859-1 encodings in the same file,
making editors regard the file as ISO-8859-1 and spoiling any new work done to
it.
Comment 11 Simos Xenitellis 2005-11-15 22:10:30 UTC
I consider these patches ready to use:
1. the patch that makes intltool really encoding-agnostic
2. the patch that adds a test case with a DocBook document that has comments in
a non-ASCII encoding.

There is a need to fix up some of the .po files in
http://cvs.gnome.org/viewcvs/gnome-applets/po-locations/

These files contain both extended characters from ISO-8859-1 and UTF-8 text:
a. it.po
b. ja.po
c. sr@Latn.po
d. sr.po
In other words, these files are not UTF-8 sane.

The Ukrainian translation appears corrupted, though UTF-8 sane:
http://cvs.gnome.org/viewcvs/gnome-applets/po-locations/uk.po?view=markup
Check the end of the file; there are some mathematical symbols which, afaik, do
not belong to the Ukrainian alphabet.
Comment 12 Rodney Dawes 2005-11-24 21:02:33 UTC
Do we have a regression test for this?
Comment 13 Rodney Dawes 2005-11-24 21:11:14 UTC
Sorry. Just noticed the testcase patch for the specific case this bug is
targeted at. Do we have a test case to check that it doesn't break for files
encoded in ISO-8859-1 or KOI8-R or other charsets?
Comment 14 Simos Xenitellis 2005-11-24 23:36:27 UTC
Created attachment 55204
Patch to make intltool (xml) encoding-neutral, also adds two test cases.

This patch 
1. makes intltool encoding-neutral
2. adds a test case with XML file and comments/msgid text in UTF-8
3. adds a test case with XML file and comments/msgid text in ISO-8859-1
Comment 15 Rodney Dawes 2005-11-25 20:01:01 UTC
Committed to CVS. Thanks for the patch.