GNOME Bugzilla – Bug 320721
intltool is not UTF-8 ready, may corrupt PO files
Last modified: 2005-11-25 20:01:01 UTC
Please describe the problem:
As Roozbeh noticed (http://mail.gnome.org/archives/gnome-i18n/2005-November/msg00005.html), intltool is not UTF-8 ready. The bug appears when producing the POT file for http://cvs.gnome.org/viewcvs/gnome-applets/gweather/Locations.xml.in by running "intltool-update -P" in http://cvs.gnome.org/viewcvs/gnome-applets/po-locations/. It may also corrupt the .po files when updating them.

Steps to reproduce:
1. Create the POT file for po-locations (gnome-applets/po-locations) by running "intltool-update -P".
2. The file is not valid UTF-8; test with:
   iconv -f utf8 -t utf8 < gnome-applets/po-locations/gnome-applets-2.0.pot
3. iconv detects UTF-8 errors in the file.

Actual results:
The generated POT file is not valid UTF-8, meaning that a person starting a translation may get a corrupted file.

Expected results:
The resulting file should be valid UTF-8. The intltool scripts should be UTF-8 ready.

Does this happen every time?
Yes, for the specific POT file, which contains UTF-8 characters.

Other information:
intltool does not open files in UTF-8 mode. The XML code manages to bypass this issue by accessing nodes through original_string() (see perldoc XML::Parser::Expat), which returns the original node verbatim, not decoded as UTF-8. However, this "workaround" does not cover the comments in http://cvs.gnome.org/viewcvs/gnome-applets/gweather/Locations.xml.in, so the resulting file is not valid UTF-8. The solution is to open the files in Perl as UTF-8, as in

  open(MYFILE, "myfile.po");
  binmode MYFILE, ":utf8";

and to access the nodes of XML files using recognized_string(). Patch follows.
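For illustration, a rough sketch of that approach (this is not the attached patch; the file names and handler wiring are placeholders):

  # Sketch only: open PO/POT files with an explicit UTF-8 layer and take
  # XML node text from recognized_string(), which returns UTF-8, rather
  # than original_string(), which returns the verbatim bytes.
  use strict;
  use warnings;
  use XML::Parser;

  open(my $po, '<', 'myfile.po') or die "cannot open myfile.po: $!";
  binmode $po, ':utf8';
  my @po_lines = <$po>;
  close $po;

  my $parser = XML::Parser->new(Handlers => {
      Comment => sub {
          my ($expat, $data) = @_;
          # recognized_string() is the current token, converted to UTF-8
          print $expat->recognized_string(), "\n";
      },
  });
  $parser->parsefile('Locations.xml.in');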
Created attachment 54323 [details] [review]
Make intltool UTF-8 ready: open files in UTF-8 mode, access XML::Parser::Expat nodes as UTF-8 strings.

Patch created for HEAD.
Care to provide a test as well, please? :)

(A test that will fail with the current intltool, but which would work with the new one: i.e., UTF-8 in a comment.)
How does this deal with po files that are not in UTF-8?
> Care to provide a test as well, please? :)
>
> (A test that will fail with the current intltool, but which would work with
> the new one: i.e., UTF-8 in a comment.)

Use the current intltool to create the POT file for gnome-applets-locations (run "intltool-update -P" in gnome-applets/po-locations/). [The file is also available at http://l10n-status.gnome.org/gnome-2.14/PO/gnome-applets-locations.HEAD.pot]

The resulting file is not UTF-8 encoded:
1. You can inspect it with your favourite text editor.
2. You can test it with "iconv -f utf8 -t utf8 < gnome-applets-locations.HEAD.pot".
3. You can run file(1):
   % file gnome-applets-locations.HEAD.pot
   gnome-applets-locations.HEAD.pot: Non-ISO extended-ASCII English text
   %

If you apply the patch and create the POT file again, it passes as valid UTF-8 :)
> How does this deal with po files that are not in UTF-8?

I am not aware of such a case in GNOME translations. Can you point me to such a project/PO file?

Currently all PO files contain characters that fit in US-ASCII, which effectively means that they pass as UTF-8 (UTF-8 is compatible with US-ASCII, since the characters with codepoints 1-127 are represented exactly the same). The situation with all POT files that I know of in GNOME CVS is that they simply use US-ASCII, and the tools happened to work. Once non-US-ASCII characters appear in POT files, there is a need to use UTF-8. Using encodings such as ISO-8859-x is not an easy task, as the files do not contain encoding information; there is a field in the PO header, but it effectively acts only as a suggestion.

GNOME, since 2.0, has moved to UTF-8 for all translation work, even for languages such as Australian English, French, or Malay (which effectively use US-ASCII). I am not sure whether there are active GNOME projects where a non-UTF-8 encoding (such as ISO-8859-x or Windows-125x) is still a requirement.
Re comment 4: Simos, I was thinking of a short regression test for the intltool/tests/ infrastructure.

Re comment 5: Simos, intltool is not used only in GNOME projects, and breaking backwards compatibility is not a wise idea anyway. Providing a test case which passes "make check" would implicitly check whether it still passes for all the non-UTF-8 PO files as well (we have regression tests for those, too).
Btw, the problem is not with non-UTF-8 PO files, since only intltool-merge works with them, but rather with other content encoded as non-UTF-8. expat (and, by extension, our XML parser) should be able to handle <?xml encoding="<something-not-UTF-8>"?> as well, right? Not to mention other file types, such as the RFC 822 format used in Debian files, which even insists on the encoding NOT being UTF-8.
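For instance, expat happily decodes a document declared as ISO-8859-1 and hands the handlers UTF-8, while original_string() still returns the raw bytes in the document encoding (a throwaway check, not intltool code):

  use strict;
  use warnings;
  use XML::Parser;

  # "caf\xe9" is "café" in ISO-8859-1
  my $doc = qq{<?xml version="1.0" encoding="ISO-8859-1"?>\n<msg>caf\xe9</msg>\n};

  my $p = XML::Parser->new(Handlers => {
      Char => sub {
          my ($expat, $data) = @_;
          # $data has already been converted to UTF-8 by expat;
          # original_string() is still the ISO-8859-1 bytes from the input.
          print "decoded:  $data\n";
          print "original: ", $expat->original_string(), "\n";
      },
  });
  $p->parse($doc);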
Created attachment 54367 [details] [review]
Addition of a test case to check for encoding problems when parsing XML files.

The test checks whether the encoding of the text is handled properly when intltool-extract is used to parse an XML file. An issue arises when the XML file has translator comments and translatable text in an encoding other than US-ASCII; if these are US-ASCII, the problem does not arise.

The current version of intltool-extract creates a file with a problem:
- the comments are saved as ISO-8859-1;
- the messages are saved as UTF-8 (no conversion done).
Then, when editing this file, editors such as vi or gedit will save it as ISO-8859-1, corrupting it.
Danilo: I see what you mean by keeping intltool encoding-neutral. There is a way to fix the encoding problem with a simple change similar to the following:

--- intltool-old/intltool-extract.in.in	2005-08-01 11:34:42.000000000 +0500
+++ intltool/intltool-extract.in.in	2005-11-05 20:55:09.000000000 +0500
@@ -485,11 +485,10 @@
 sub intltool_tree_comment
 {
     my $expat = shift;
-    my $data = shift;
     my $clist = $expat->{Curlist};
     my $pos = $#$clist;
 
-    push @$clist, 1 => $data;
+    push @$clist, 1 => $expat->original_string();
 }
 
 # Verbatim copy from intltool-merge.in.in
================

By using original_string() we avoid any implicit encoding conversion. However, the string returned contains "<!--" and "-->", as in

  "<!-- Comment for *both* attributes and content -->"

instead of the correct

  " Comment for *both* attributes and content "

Is there an elegant way to remove those "<!--" and "-->", apart from using something like substr?
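(For illustration, one obvious alternative to substr would be a pair of substitutions on the verbatim token, e.g.:

  # strip the leading "<!--" and trailing "-->" from the verbatim comment
  my $comment = $expat->original_string();
  $comment =~ s/\A<!--//;
  $comment =~ s/-->\z//;

but that does not feel much more elegant.)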
Created attachment 54375 [details] [review]
Patch to extract comments in XML files without processing them and spoiling the encoding.

The patch allows XML files to have comments in any encoding, without any implicit conversion. It fixes the issue with gnome-applets/po-locations/ where the POT file is created wrongly: it has strings in UTF-8 and ISO-8859-1 encodings in the same file, making editors regard the file as ISO-8859-1 and spoiling any new work done to it.
I consider these patches ready to use:
1. the patch that makes intltool really encoding-agnostic;
2. the patch that adds a test case with a DocBook document that has comments in a non-ASCII encoding.

There is a need to fix up some of the .po files in http://cvs.gnome.org/viewcvs/gnome-applets/po-locations/. These contain both extended characters from ISO-8859-1 and UTF-8 text:
a. it.po
b. ja.po
c. sr@Latn.po
d. sr.po
In other words, these files are not UTF-8 sane.

The Ukrainian translation appears corrupted, though it is UTF-8 sane: http://cvs.gnome.org/viewcvs/gnome-applets/po-locations/uk.po?view=markup
Check the end of the file; there are some mathematical symbols which, afaik, do not belong to the Ukrainian alphabet.
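For reference, a rough one-off script to flag such files (not part of intltool, just a quick check using the Encode module):

  use strict;
  use warnings;
  use Encode qw(decode FB_CROAK);

  # Report .po files in the current directory that are not valid UTF-8.
  for my $file (glob '*.po') {
      open(my $fh, '<', $file) or next;
      local $/;                 # slurp mode
      my $bytes = <$fh>;
      close $fh;
      eval { decode('UTF-8', $bytes, FB_CROAK) };
      print "$file is not valid UTF-8\n" if $@;
  }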
Do we have a regression test for this?
Sorry, I just noticed the test-case patch for the specific case this bug is targeted at. Do we have a test case to check that it doesn't break for files encoded in ISO-8859-1, KOI8-R, or other charsets?
Created attachment 55204 [details] [review]
Patch to make intltool (XML) encoding-neutral; also adds two test cases.

This patch:
1. makes intltool encoding-neutral;
2. adds a test case with an XML file whose comments/msgid text are in UTF-8;
3. adds a test case with an XML file whose comments/msgid text are in ISO-8859-1.
Committed to CVS. Thanks for the patch.