GNOME Bugzilla – Bug 61967
intltool-extract does not handle multiline XML correctly
Last modified: 2004-12-22 21:47:04 UTC
The way intltool-extract handles XML files is conceptually broken. Strings can't be extracted from XML files with a tool that does not know anything about XML. Simple search-and-replace patterns won't work here. You need to either drop XML support from this tool or use one of the available Perl XML modules. A decent tool for extracting translatable strings from an XML file would parse the file and output the translatable values as the parser interprets them, not as they appear in the file. This becomes obvious when extracting strings from an XML file with multiline values. Adjacent whitespace and newlines are only relevant for the formatting of the XML file and are not part of the string values. Thus, they shouldn't appear in the generated header file (you don't want the translator to see them, do you?). Later, when merging the po files back into XML, the tool should try to do a decent job of outputting well-formatted XML by applying correct indentation and breaking lines where necessary. I'll attach a header generated by intltool-extract to this bug report to illustrate the problem. Sorry if I sound harsh, but I'm very disappointed by the quality of these tools, and I can only congratulate you on the choice to rename this project, since it definitely does not deserve the name xml-i18n-tools.
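To make the request above concrete, here is a minimal sketch in Python (not the Perl that intltool itself is written in) of extracting strings through a real XML parser and collapsing formatting whitespace. The `<tip>` element name, the sample document, and the helper names `normalize` and `extract_strings` are assumptions made for illustration; they are not taken from intltool or from the attached files. Collapsing all whitespace runs is also a policy choice of this sketch, not something the XML specification mandates.

```python
import xml.etree.ElementTree as ET


def normalize(text):
    """Collapse every run of whitespace (including newlines) into one space."""
    return " ".join(text.split())


def extract_strings(xml_source, tag):
    """Return the normalized text of every <tag> element, in document order."""
    root = ET.fromstring(xml_source)
    strings = []
    for elem in root.iter(tag):
        # itertext() yields character data as the parser interpreted it,
        # so entities such as &amp; arrive already decoded.
        value = normalize("".join(elem.itertext()))
        if value:
            strings.append(value)
    return strings


# Hypothetical input: a multiline value indented for readability.
SAMPLE = """<tips>
  <tip>
    Nearly all image operations are performed
    by right-clicking on the image. Undo &amp; redo work too.
  </tip>
</tips>"""

if __name__ == "__main__":
    for string in extract_strings(SAMPLE, "tip"):
        print(string)
```

With this approach the translator sees a single logical string, and the indentation of the XML file can change without invalidating existing translations.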
Created attachment 5789: a header generated by intltool-extract to illustrate the described problems with multiline XML values
Created attachment 5790: a messages.po file created from the generated header. Now try to translate this!
At some point, intltool-extract and intltool-merge should be rewritten to use a real XML parser instead of the kludge that's in there now. In the meantime, is there any practical problem with this? Do you have a real-life example where this is causing you trouble?
I think the attached example files illustrate the problem quite well. We considered switching to a simple XML format for the GIMP tips file. To get an idea of the problems we have with intltool-extract, try to translate the messages.po file I've attached. The extracted strings contain all the whitespace and newlines from the XML, and all available tools that handle po files will require you to create a translation with exactly these newlines. This is unacceptable.
I'm going to add this as a test case and make our current half-assed XML parser handle at least this much. I could use one additional attachment, the file that was passed in to intltool-extract, to set up the test case.
Created attachment 5791: the XML version of gimp_tips.txt that was used to create the other attachments
I added a tiny workaround to extract. This does nothing to address the broader issue of using a real XML parser instead of just doing simple text search and replace.
The workaround for intltool-extract seems to work reasonably well, but either I'm doing something wrong or intltool-merge is not able to merge the translations back into the generated XML file. I have created a German translation (well, only a few strings) for the gimp_tips.xml file based on the message catalog generated by intltool-extract (current CVS version). intltool-merge does not give any warnings and creates a valid XML file for me, but I can't find the translated entries. I guess intltool-merge can't match the translations to the xml.in file because the strings differ in whitespace and newlines.
Oops. This is as expected. I'll have to put the same kind of workaround in intltool-merge. I should have made the test be a complete merge test in the first place!
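The mismatch discussed in the two comments above can be sketched as follows: if extraction normalizes whitespace, the merge step has to apply the same normalization to the raw XML text before looking it up in the catalog. This is a Python illustration under that assumption, not intltool-merge's actual code; `normalize`, `build_catalog`, `translate`, and the sample German string are all hypothetical.

```python
def normalize(text):
    """Collapse every run of whitespace (including newlines) into one space."""
    return " ".join(text.split())


def build_catalog(translations):
    """Key the catalog by normalized msgid so lookups ignore XML formatting."""
    return {normalize(msgid): msgstr for msgid, msgstr in translations.items()}


def translate(raw_xml_text, catalog):
    """Look up raw element text; fall back to the original if untranslated."""
    return catalog.get(normalize(raw_xml_text), raw_xml_text)


# Hypothetical catalog entry; the German text is made up for this example.
CATALOG = build_catalog({
    "Nearly all image operations are performed by right-clicking on the image.":
        "Fast alle Bildoperationen werden per Rechtsklick auf das Bild ausgeführt.",
})

# The same string as it might appear in the .xml.in file, wrapped and indented.
RAW = """
    Nearly all image operations are performed
    by right-clicking on the image.
"""

if __name__ == "__main__":
    print(translate(RAW, CATALOG))
```

Without the `normalize` call inside `translate`, the lookup silently misses and the untranslated text is written out, which matches the symptom reported here: valid XML, no warnings, no translations.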
I fixed intltool-merge to match intltool-extract and added a new test to the intltool test suite. Maybe it's time to mark this bug fixed and open a new one that complains about other specific ways we don't handle XML properly -- we can mark that one fixed when we switch to a real XML parser.
I tried the latest intltool from CVS and I don't think your latest changes to intltool-merge fix the problem. Although the included test passes, the merge fails for a real-world example. The test.po file in the cases directory contains the original message all on one line:
    msgid "Nearly all image operations are performed by right-clicking on the image. And don't worry, you can undo most mistakes..."
After running msgmerge on this file, however, this line will have been split like this:
    msgid "Nearly all image operations are performed by right-clicking on the image. "
    "And don't worry, you can undo most mistakes..."
If you make this change, intltool-merge fails to match the message and no longer passes the test.
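In the po format, adjacent quoted strings after msgid are concatenated, so the one-line and the wrapped entry above denote the same message. A minimal Python sketch of a reader that honors this is shown below. It handles only plain msgid entries and a few common escapes, ignores msgctxt and plural forms, and its function names `unquote` and `read_msgids` are hypothetical; it is not the code used in intltool.

```python
import re

# Escapes handled by this sketch; any other escaped character is kept as-is.
_ESCAPES = {"n": "\n", "t": "\t", '"': '"', "\\": "\\"}


def unquote(token):
    """Strip the surrounding double quotes and decode backslash escapes."""
    body = token.strip()[1:-1]
    return re.sub(r"\\(.)", lambda m: _ESCAPES.get(m.group(1), m.group(1)), body)


def read_msgids(po_text):
    """Return every msgid with its continuation lines concatenated."""
    entries = []
    current = None  # parts of the msgid being read, or None outside a msgid
    for line in po_text.splitlines():
        line = line.strip()
        if line.startswith("msgid "):
            current = [unquote(line[len("msgid "):])]
            entries.append(current)
        elif line.startswith('"') and current is not None:
            current.append(unquote(line))  # continuation of the msgid
        else:
            current = None  # msgstr, comment, or blank line ends the msgid
    return ["".join(parts) for parts in entries]


ONE_LINE = '''msgid "Nearly all image operations are performed by right-clicking on the image. And don't worry, you can undo most mistakes..."
msgstr ""
'''

WRAPPED = '''msgid "Nearly all image operations are performed by right-clicking on the image. "
"And don't worry, you can undo most mistakes..."
msgstr ""
'''

if __name__ == "__main__":
    print(read_msgids(ONE_LINE) == read_msgids(WRAPPED))
```

Comparing the joined values instead of the raw lines means the match no longer depends on where msgmerge chooses to wrap.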
OK. Sorry I made an incorrect test. I'll fix the test and then fix intltool. Just my ignorance of how gettext works.
OK. I fixed the .po code so it can handle multiline entries in the .po file too. Try again?