Bug 61967 - intltool-extract does not handle multiline XML correctly
Status: RESOLVED FIXED
Product: intltool
Classification: Deprecated
Component: general
Version: unspecified
Hardware: Other Linux
Importance: Normal normal
Target Milestone: ---
Assigned To: Darin Adler
Darin Adler
Depends on: 45689
Blocks:
 
 
Reported: 2001-10-08 18:46 UTC by Sven Neumann
Modified: 2004-12-22 21:47 UTC
See Also:
GNOME target: ---
GNOME version: ---


Attachments
a header generated by intltool-extract to illustrate the described problems with multiline XML values (10.18 KB, text/plain)
2001-10-08 18:48 UTC, Sven Neumann
A messages.po file created from the generated header. Now try to translate this! (9.86 KB, text/plain)
2001-10-08 18:50 UTC, Sven Neumann
the XML version of gimp_tips.txt that was used to create the other attachments (11.82 KB, text/plain)
2001-10-08 23:48 UTC, Sven Neumann

Description Sven Neumann 2001-10-08 18:46:19 UTC
The way intltool-extract handles XML files is conceptually broken.
Strings can't be extracted from XML files with a tool that does not
know anything about XML. Simple search-and-replace patterns simply 
won't work here. You need to drop XML support from this tool or use
some of the available perl XML modules. A decent tool to extract 
translatable strings from an XML file would parse the XML file and
output the translatable values after the parser interpreted them, not 
as they appear in the file. This becomes obvious when extracting strings
from an XML file with multiline values. Adjacent whitespace and newlines
are only relevant for the formatting of the XML file and are not part
of the string values. Thus, they shouldn't appear in the generated 
header file (you don't want the translator to see them, do you?). Later
when merging the po files back into XML, the tool should try to do a decent
job at outputting well-formatted XML by applying correct indentation and
breaking lines when necessary. I'll attach a header generated by
intltool-extract to this bug-report to illustrate the problem.
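
As a rough illustration of the parse-then-normalize approach described above, here is a minimal Python sketch. It is not intltool code (intltool is written in Perl); the sample document, the `tip` element name, and the `extract_strings` helper are all hypothetical. It only shows that a real parser yields interpreted values (entities decoded) and that layout whitespace can be collapsed before the string reaches a translator.

```python
import xml.etree.ElementTree as ET

# Hypothetical input; the element names are invented for this example.
SAMPLE = """<tips>
  <tip>
    Nearly all image operations are performed
    by right-clicking on the image.
  </tip>
  <tip>Undo &amp; redo   are in the Edit menu.</tip>
</tips>"""


def extract_strings(xml_text, tag="tip"):
    """Return the whitespace-normalized text of every <tag> element."""
    root = ET.fromstring(xml_text)
    strings = []
    for element in root.iter(tag):
        # itertext() yields the text as the parser interpreted it,
        # so "&amp;" has already become "&".
        raw = "".join(element.itertext())
        # Collapse runs of spaces and newlines that only exist for layout.
        normalized = " ".join(raw.split())
        if normalized:
            strings.append(normalized)
    return strings


if __name__ == "__main__":
    for string in extract_strings(SAMPLE):
        print(string)
```

Collapsing all whitespace is an assumption that fits prose such as tips; content where whitespace is significant would need different handling.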

Sorry if I sound harsh, but I'm very disappointed by the quality
of these tools and I can only congratulate you on the choice to rename
this project, since it definitely does not deserve the name xml-i18n-tools.
Comment 1 Sven Neumann 2001-10-08 18:48:09 UTC
Created attachment 5789 [details]
a header generated by intltool-extract to illustrate the described problems with multiline XML values
Comment 2 Sven Neumann 2001-10-08 18:50:15 UTC
Created attachment 5790 [details]
A messages.po file created from the generated header. Now try to translate this!
Comment 3 Darin Adler 2001-10-08 19:19:21 UTC
At some point, intltool-extract and intltool-merge should be rewritten to use a real XML parser instead of the kludge that's in there now.

In the meantime, is there any practical problem with this? Do you
have a real-life example where this is causing you trouble?
Comment 4 Sven Neumann 2001-10-08 20:55:26 UTC
I think the attached example files illustrate the problem quite well.
We considered switching to a simple XML format for the GIMP tips file.
To get an idea of the problems we have with intltool-extract, try to 
translate the messages.po file I've attached. The extracted strings
contain all whitespace and newlines from the XML and all available
tools that handle po files will require you to create a translation 
with exactly these newlines. This is unacceptable.
Comment 5 Darin Adler 2001-10-08 23:37:41 UTC
I'm going to add this as a test case and make our current half-assed
XML parser handle at least this much.

I could use one additional attachment, the file that was passed in
to intltool-extract, to set up the test case.
Comment 6 Sven Neumann 2001-10-08 23:48:54 UTC
Created attachment 5791 [details]
the XML version of gimp_tips.txt that was used to create the other attachments
Comment 7 Darin Adler 2001-10-09 22:28:37 UTC
I added a tiny workaround to extract.

This does nothing to address the broader issue of using a real XML
parser instead of just doing simple text search and replace.
Comment 8 Sven Neumann 2001-10-15 17:54:56 UTC
The workaround for intltool-extract seems to work reasonably well, but
either I'm doing something wrong or intltool-merge is not able to merge
the translations back into the generated XML file. I have created a 
German translation (well, only a few strings) for the gimp_tips.xml file
based on the message catalog generated from intltool-extract (current
CVS version). intltool-merge does not give any warnings and creates a
valid XML file for me, but I can't find the translated entries. I guess
intltool-merge can't match the translations to the xml.in file because
the strings differ in whitespace and newlines.
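
The suspected mismatch can be sketched in a few lines of Python. This is only an illustration under assumptions, not how intltool-merge works: `normalize`, `build_catalog`, and `translate` are hypothetical helpers, and the translated string is a placeholder. The point is that the lookup succeeds only if the same whitespace normalization is applied both to the catalog keys and to the text read from the XML source.

```python
def normalize(text):
    """Collapse every run of whitespace (including newlines) to one space."""
    return " ".join(text.split())


def build_catalog(translations):
    """Key the catalog by normalized msgid so layout differences do not matter."""
    return {normalize(msgid): msgstr for msgid, msgstr in translations.items()}


def translate(source_text, catalog):
    """Look up the translation; fall back to the source text when none exists."""
    return catalog.get(normalize(source_text), source_text)


if __name__ == "__main__":
    # Text as it might appear in the XML file, with layout whitespace.
    xml_text = """
        Nearly all image operations are performed
        by right-clicking on the image.
    """
    raw = {
        "Nearly all image operations are performed by right-clicking on the image.":
            "(translated text)",
    }
    # An exact-match lookup misses, because the strings differ in whitespace.
    print(raw.get(xml_text))
    # A normalized lookup finds the entry.
    print(translate(xml_text, build_catalog(raw)))
```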
Comment 9 Darin Adler 2001-10-15 18:14:43 UTC
Oops. This is as expected. I'll have to put the same kind of
workaround in intltool-merge. I should have made the test be a
complete merge test in the first place!
Comment 10 Darin Adler 2001-10-16 17:13:14 UTC
I fixed intltool-merge to match intltool-extract and added a new
test to the intltool test suite.

Maybe it's time to mark this bug fixed and open a new one that
complains about other specific ways we don't handle XML properly --
we can mark that one fixed when we switch to a real XML parser.
Comment 11 Sven Neumann 2001-10-25 13:27:13 UTC
I tried the latest intltool from CVS and I don't think your latest
changes to intltool-merge fix the problem. Although the included test
passes, it fails for a real-world example. The test.po file in the
cases directory contains the original message all on one line:

msgid "Nearly all image operations are performed by right-clicking on the image. And don't worry, you can undo most mistakes..."

After running msgmerge on this file, however, this line will be split
like this:

msgid "Nearly all image operations are performed by right-clicking on the image. "
"And don't worry, you can undo most mistakes..."
If you make this change, intltool-merge fails to match the message and
does not pass the test any longer.
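
For reference, a PO reader has to treat adjacent quoted strings as one logical string, which is why both forms above denote the same msgid. The following Python sketch shows that joining step. It is a simplified, hypothetical reader (`parse_po` is not part of intltool or gettext): it ignores plural forms and comments, and it assumes Python's string-literal rules are close enough to PO escapes for these samples.

```python
import ast

ONE_LINE = '''msgid "Nearly all image operations are performed by right-clicking on the image. And don't worry, you can undo most mistakes..."
msgstr "(translated text)"
'''

WRAPPED = '''msgid "Nearly all image operations are performed by right-clicking on the image. "
"And don't worry, you can undo most mistakes..."
msgstr "(translated text)"
'''


def parse_po(po_text):
    """Return {msgid: msgstr}, joining the continuation lines of each entry."""
    entries = []   # list of [msgid, msgstr] pairs
    field = None   # 0 while reading a msgid, 1 while reading a msgstr
    for line in po_text.splitlines():
        line = line.strip()
        if line.startswith("msgid "):
            entries.append(["", ""])
            field, literal = 0, line[len("msgid "):]
        elif line.startswith("msgstr ") and entries:
            field, literal = 1, line[len("msgstr "):]
        elif line.startswith('"') and field is not None:
            literal = line  # continuation of the current field
        else:
            continue  # blank lines, comments, unsupported keywords
        # Assumption: the quoted PO string is also a valid Python literal.
        entries[-1][field] += ast.literal_eval(literal)
    return {msgid: msgstr for msgid, msgstr in entries}


if __name__ == "__main__":
    # Both layouts produce the same catalog.
    print(parse_po(ONE_LINE) == parse_po(WRAPPED))
```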
Comment 12 Darin Adler 2001-10-25 15:21:14 UTC
OK. Sorry I made an incorrect test. I'll fix the test and then
fix intltool. Just my ignorance of how gettext works.
Comment 13 Darin Adler 2001-10-29 19:07:26 UTC
OK. I fixed the .po code so it can handle multiline entries in the .po
file too. Try again?