GNOME Bugzilla – Bug 320721
intltool is not UTF-8 ready, may corrupt PO files
Last modified: 2005-11-25 20:01:01 UTC
Please describe the problem:
As Roozbeh noticed (http://mail.gnome.org/archives/gnome-i18n/2005-November/msg00005.html), intltool is not UTF-8 ready. The bug appears when producing the POT file for http://cvs.gnome.org/viewcvs/gnome-applets/gweather/Locations.xml.in by running "intltool-update -P" in http://cvs.gnome.org/viewcvs/gnome-applets/po-locations/. It may also corrupt the .po files when updating them.

Steps to reproduce:
1. Create the POT file for po-locations (gnome-applets/po-locations) by running "intltool-update -P".
2. The file is not valid UTF-8; test with:
   iconv -f utf8 -t utf8 < gnome-applets/po-locations/gnome-applets-2.0.pot
3. iconv detects UTF-8 errors in the file.

Actual results:
The generated POT file is not valid UTF-8, meaning that a person starting a translation may get a corrupted file.

Expected results:
The resulting file should be valid UTF-8. The intltool scripts should be UTF-8 ready.

Does this happen every time?
Yes, for the specific POT file, which contains UTF-8 characters.

Other information:
intltool does not open files in UTF-8 mode. The XML code manages to bypass this issue by accessing nodes through original_string() (see perldoc XML::Parser::Expat), which returns the original node verbatim, not decoded as UTF-8. However, this "workaround" does not cover the comments in http://cvs.gnome.org/viewcvs/gnome-applets/gweather/Locations.xml.in, so the resulting file is not valid UTF-8. The solution is to open the files in Perl as UTF-8, as in

  open(MYFILE, "myfile.po");
  binmode MYFILE, ":utf8";

and to access the nodes of XML files using recognized_string(). Patch follows.
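For illustration, a rough sketch of that approach (this is not the attached patch; the file names and handler wiring are placeholders):

  # Sketch only: open PO/POT files with an explicit UTF-8 layer and take
  # XML node text from recognized_string(), which returns UTF-8, rather
  # than original_string(), which returns the verbatim bytes.
  use strict;
  use warnings;
  use XML::Parser;

  open(my $po, '<', 'myfile.po') or die "cannot open myfile.po: $!";
  binmode $po, ':utf8';
  my @po_lines = <$po>;
  close $po;

  my $parser = XML::Parser->new(Handlers => {
      Comment => sub {
          my ($expat, $data) = @_;
          # recognized_string() is the current token, converted to UTF-8
          print $expat->recognized_string(), "\n";
      },
  });
  $parser->parsefile('Locations.xml.in');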
Created attachment 54323 [details] [review]
Make intltool UTF-8 ready: open files in UTF-8 mode, access XML::Parser::Expat nodes as UTF-8 strings.

Patch created for HEAD.
Care to provide a test as well, please? :)

(A test that will fail with the current intltool, but which would work with the new one: i.e., UTF-8 in a comment.)
How does this deal with po files that are not in UTF-8?
> Care to provide a test as well, please? :)
>
> (A test that will fail with the current intltool, but which would work with
> the new one: i.e., UTF-8 in a comment.)

Use the current intltool to create the POT file for gnome-applets-locations (run "intltool-update -P" in gnome-applets/po-locations/). [The file is also available at http://l10n-status.gnome.org/gnome-2.14/PO/gnome-applets-locations.HEAD.pot]

The resulting file is not UTF-8 encoded:
1. You can inspect it with your favourite text editor.
2. You can test it with "iconv -f utf8 -t utf8 < gnome-applets-locations.HEAD.pot".
3. You can run file(1):
   % file gnome-applets-locations.HEAD.pot
   gnome-applets-locations.HEAD.pot: Non-ISO extended-ASCII English text
   %

If you apply the patch and create the POT file again, it passes as valid UTF-8 :)
> How does this deal with po files that are not in UTF-8?

I am not aware of such a case in GNOME translations. Can you point me to such a project/PO file?

Currently all PO files contain characters that fit in US-ASCII, which effectively means that they pass as UTF-8 (UTF-8 is compatible with US-ASCII, since the characters with codepoints 1-127 are represented exactly the same). The situation with all POT files that I know of in GNOME CVS is that they simply use US-ASCII, and the tools happened to work. Once non-US-ASCII characters appear in POT files, there is a need to use UTF-8. Using encodings such as ISO-8859-x is not an easy task, as the files do not contain encoding information; there is a field in the PO header, but it effectively acts only as a suggestion.

GNOME, since 2.0, has moved to UTF-8 for all translation work, even for languages such as Australian English, French, or Malay (which effectively use US-ASCII). I am not sure whether there are active GNOME projects where a non-UTF-8 encoding (such as ISO-8859-x or Windows-125x) is still a requirement.
Re comment 4: Simos, I was thinking of a short regression test for the intltool/tests/ infrastructure.

Re comment 5: Simos, intltool is not used only in GNOME projects, and breaking backwards compatibility is not a wise idea anyway. Providing a test case which passes "make check" would implicitly check whether it still passes for all the non-UTF-8 PO files as well (we have regression tests for those, too).
Btw, the problem is not with non-UTF-8 PO files, since only intltool-merge works with them, but rather with other content encoded as non-UTF-8. expat (and, by extension, our XML parser) should be able to handle <?xml encoding="<something-not-UTF-8>"?> as well, right? Not to mention other file types, such as the RFC 822 format used in Debian files, which even insists on the encoding NOT being UTF-8.
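For instance, expat happily decodes a document declared as ISO-8859-1 and hands the handlers UTF-8, while original_string() still returns the raw bytes in the document encoding (a throwaway check, not intltool code):

  use strict;
  use warnings;
  use XML::Parser;

  # "caf\xe9" is "café" in ISO-8859-1
  my $doc = qq{<?xml version="1.0" encoding="ISO-8859-1"?>\n<msg>caf\xe9</msg>\n};

  my $p = XML::Parser->new(Handlers => {
      Char => sub {
          my ($expat, $data) = @_;
          # $data has already been converted to UTF-8 by expat;
          # original_string() is still the ISO-8859-1 bytes from the input.
          print "decoded:  $data\n";
          print "original: ", $expat->original_string(), "\n";
      },
  });
  $p->parse($doc);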
Created attachment 54367 [details] [review]
Addition of a test case to check for encoding problems when parsing XML files.

The test checks whether the encoding of the text is handled properly when intltool-extract is used to parse an XML file. An issue arises when the XML file has translator comments and translatable text in an encoding other than US-ASCII; if these are US-ASCII, the problem does not arise.

The current version of intltool-extract creates a file with a problem:
- the comments are saved as ISO-8859-1;
- the messages are saved as UTF-8 (no conversion done).
Then, when editing this file, editors such as vi or gedit will save it as ISO-8859-1, corrupting it.
Danilo: I see what you mean by keeping intltool encoding-neutral. There is a way to fix the encoding problem with a simple change similar to the following:

--- intltool-old/intltool-extract.in.in	2005-08-01 11:34:42.000000000 +0500
+++ intltool/intltool-extract.in.in	2005-11-05 20:55:09.000000000 +0500
@@ -485,11 +485,10 @@
 sub intltool_tree_comment
 {
     my $expat = shift;
-    my $data = shift;
     my $clist = $expat->{Curlist};
     my $pos = $#$clist;
 
-    push @$clist, 1 => $data;
+    push @$clist, 1 => $expat->original_string();
 }
 
 # Verbatim copy from intltool-merge.in.in
================

By using original_string() we avoid any implicit encoding conversion. However, the string returned contains "<!--" and "-->", as in

  "<!-- Comment for *both* attributes and content -->"

instead of the correct

  " Comment for *both* attributes and content "

Is there an elegant way to remove those "<!--" and "-->", apart from using something like substr?
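(For illustration, one obvious alternative to substr would be a pair of substitutions on the verbatim token, e.g.:

  # strip the leading "<!--" and trailing "-->" from the verbatim comment
  my $comment = $expat->original_string();
  $comment =~ s/\A<!--//;
  $comment =~ s/-->\z//;

but that does not feel much more elegant.)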
Created attachment 54375 [details] [review]
Patch to extract comments in XML files without processing them and spoiling the encoding.

The patch allows XML files to have comments in any encoding, without any implicit conversion. It fixes the issue with gnome-applets/po-locations/ where the POT file is created wrongly: it has strings in UTF-8 and ISO-8859-1 encodings in the same file, making editors regard the file as ISO-8859-1 and spoiling any new work done to it.
I consider these patches ready to use:
1. the patch that makes intltool really encoding-agnostic;
2. the patch that adds a test case with a DocBook document that has comments in a non-ASCII encoding.

There is a need to fix up some of the .po files in http://cvs.gnome.org/viewcvs/gnome-applets/po-locations/. These contain both extended characters from ISO-8859-1 and UTF-8 text:
a. it.po
b. ja.po
c. sr@Latn.po
d. sr.po
In other words, these files are not UTF-8 sane.

The Ukrainian translation appears corrupted, though it is UTF-8 sane: http://cvs.gnome.org/viewcvs/gnome-applets/po-locations/uk.po?view=markup
Check the end of the file; there are some mathematical symbols which, afaik, do not belong to the Ukrainian alphabet.
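For reference, a rough one-off script to flag such files (not part of intltool, just a quick check using the Encode module):

  use strict;
  use warnings;
  use Encode qw(decode FB_CROAK);

  # Report .po files in the current directory that are not valid UTF-8.
  for my $file (glob '*.po') {
      open(my $fh, '<', $file) or next;
      local $/;                 # slurp mode
      my $bytes = <$fh>;
      close $fh;
      eval { decode('UTF-8', $bytes, FB_CROAK) };
      print "$file is not valid UTF-8\n" if $@;
  }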
Do we have a regression test for this?
Sorry, I just noticed the test-case patch for the specific case this bug is targeted at. Do we have a test case to check that it doesn't break for files encoded in ISO-8859-1, KOI8-R, or other charsets?
Created attachment 55204 [details] [review]
Patch to make intltool (XML) encoding-neutral; also adds two test cases.

This patch:
1. makes intltool encoding-neutral;
2. adds a test case with an XML file whose comments/msgid text are in UTF-8;
3. adds a test case with an XML file whose comments/msgid text are in ISO-8859-1.
Committed to CVS. Thanks for the patch.