Bug 170471 - intltool-merge into XML files should check well-formedness
Status: RESOLVED FIXED
Product: intltool
Classification: Deprecated
Component: general
Version: unspecified
Hardware: Other Linux
Importance: Normal critical
Target Milestone: ---
Assigned To: intltool maintainers
Reported: 2005-03-15 16:10 UTC by bill.haneman
Modified: 2005-06-26 11:52 UTC
See Also:
GNOME target: ---
GNOME version: ---



Description bill.haneman 2005-03-15 16:10:02 UTC
See Bugzilla bug 170328.

If, for example, a PO file string destined to be merged into an XML file looks
like this:

msgid "Less"
msgstr "小于号(<)"

then the resulting XML file is not well-formed: the unescaped "<" is read as
the start of a tag, so any attempt to parse the file fails.

In some cases this can cause a whole app to crash in all locales. In fact, this
is the typical case when the XML strings are all merged into one XML file: any
illegal string will break parsing.
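
A minimal Python sketch of the failure described above. The element name
`name` and the `xml:lang` attribute are assumptions made for illustration;
they are not taken from any real intltool output. The translation is pasted
verbatim into the element, as a naive merge would do, and the parser is
expected to reject the first document and accept the second.

```python
import xml.etree.ElementTree as ET

# The translation from the PO entry above, containing a bare "<".
translation = "小于号(<)"

# Hypothetical merged output: the translation substituted verbatim.
merged = '<name xml:lang="zh_CN">' + translation + "</name>"


def parses(document: str) -> bool:
    """Return True if the document is well-formed XML, False otherwise."""
    try:
        ET.fromstring(document)
    except ET.ParseError:
        return False
    return True


print(parses(merged))               # the document with the bare "<"
print(parses("<name>Less</name>"))  # the untranslated original
```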
Comment 1 Danilo Segan 2005-03-15 20:20:38 UTC
This is not an intltool bug; it is a feature. The reason is the following:

<_translate>Hello <b>world</b> &amp; others!</_translate>

If you're able to define correct semantics for extracting and merging messages
like the one above, I'm willing to reconsider. In bug #130802 we decided it was
much more useful to allow nested tags.

Translators are very much capable of breaking programs anyway, and we need them
to be aware of the responsibility they have.  In my ideal world, we'd have a
", xml-format" PO file tag (just like the ", c-format" tag for printf-style
strings), and msgfmt would check the tagged strings for well-formedness.

The alternative is to have intltool-merge check string well-formedness itself,
but I'm not planning on working on that anytime soon.  I ultimately feel that
this belongs in GNU gettext (the "xml-format" tag described above), but we can
put hacks in intltool to at least discard such messages. That is bad because
translators wouldn't know about it, which is why I prefer getting this upstream.

If you insist on having such a workaround in intltool, please open another bug,
or retitle and reopen this one.
Comment 2 bill.haneman 2005-03-15 21:10:00 UTC
I am sorry, but I don't find that explanation satisfactory.
Requiring that all strings in PO files be legal, well-formed XML strings (the
simple alternative, given the unexpected matching that can occur when a
program reuses a translated string in XML) doesn't make sense.

That really means that intltool-merge needs to guard against breaking the XML
during a merge, in the absence of the ",xml-format" approach you suggest above.
One approach would be to require that the merged strings be well-formed XML
fragments (ok, not sure there's such a thing technically as a 'well-formed
fragment', but I'm sure you see what I mean).  This would require for instance
that any xml tags inside the string have matching closing tags, and escape any
other offending special characters.  I don't think this would be unfeasible.
Comment 3 Danilo Segan 2005-03-15 21:38:44 UTC
We can define a fragment to be well formed if <something>fragment</something> is
well-formed, so that shouldn't be too hard to check.
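
The wrapping definition above can be sketched in a few lines of Python. This
is only an illustration: the function name `is_well_formed_fragment` is
hypothetical, and the standard-library `xml.etree.ElementTree` parser stands
in for whatever parser intltool (which is written in Perl) would actually use.

```python
import xml.etree.ElementTree as ET


def is_well_formed_fragment(fragment: str) -> bool:
    """Treat a fragment as well-formed if <something>fragment</something> parses."""
    wrapped = "<something>" + fragment + "</something>"
    try:
        ET.fromstring(wrapped)
    except ET.ParseError:
        return False
    return True


# Nested tags and escaped entities are accepted.
nested_ok = is_well_formed_fragment("Hello <b>world</b> &amp; others!")
# A bare "<" is rejected.
bare_lt_ok = is_well_formed_fragment("小于号(<)")
# A fragment that closes the wrapper early leaves two root elements
# and is rejected as well.
escape_ok = is_well_formed_fragment("</something><something>")
```

Because the parser only answers yes or no, this check reports a problem but
does not attempt to repair the fragment.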

I was not saying that *all* strings in PO files must be well-formed XML strings,
only those which are merged back into XML files (look for source references
containing .xml[.in].h in #:-style comments above messages).  We have worse
requirements than that. See what you get when translators include strange things
such as NULLs (\000 or \0) in translations: breakages are going to happen all
around.
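
As a rough Python sketch of that selection rule: the function name
`comes_from_xml` and the file names in the sample entries are hypothetical,
and the regular expression is one possible reading of ".xml[.in].h", not
intltool's actual logic.

```python
import re

# Matches a source reference ending in .xml.h or .xml.in.h, optionally
# followed by ":<line number>", at the end of a whitespace-separated token.
XML_REFERENCE = re.compile(r"\.xml(?:\.in)?\.h(?::\d+)?(?:\s|$)")


def comes_from_xml(entry_lines):
    """Return True if any '#:' comment of a PO entry points at an XML header."""
    for line in entry_lines:
        if line.startswith("#:") and XML_REFERENCE.search(line):
            return True
    return False


# Hypothetical PO entries, given as lists of lines.
xml_entry = ["#: ../data/example.xml.in.h:1", 'msgid "Less"', 'msgstr "小于号(<)"']
c_entry = ["#: ../src/example.c:42", 'msgid "Less"', 'msgstr "小于号(<)"']

xml_needs_check = comes_from_xml(xml_entry)
c_needs_check = comes_from_xml(c_entry)
```

Only entries for which this returns True would need the well-formedness
check; the same translation reached from a C source file is left alone.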

Nested tags are a big requirement, and we can't simply ignore their usefulness.

This is the same problem we'd have if there were no checks for printf-style
strings in msgfmt: segmentation faults might occur, et cetera.  While checking
this in intltool might not be unfeasible, it's the incorrect way to do it
(translators wouldn't know that there was a problem with their translation).

I'll work on this when time permits, but I'm not planning on implementing a
"correcting parser" (one that would check "that any xml tags inside the string
have matching closing tags, and escape any other offending special
characters"). I have done something like this before for untrusted web-form
input, but it's not in Perl, and it's another 250 lines/8k of code, hardly
something we want to introduce and maintain in intltool. Rather, I'd just reuse
an already existing parser which will only indicate whether or not a fragment
is well-formed, and ignore non-well-formed translations.

Would you consider that suitable for your purposes (basically, we'd only
guarantee that the generated XML file would be well-formed)?
Comment 4 Rodney Dawes 2005-05-08 19:51:10 UTC
No response for 2 months. Marking this as needinfo for now.
Comment 5 bill.haneman 2005-05-09 08:47:32 UTC
Reopening.  We're awaiting Danilo's patch (he said "I'll work on this when time
permits").
Comment 6 Danilo Segan 2005-06-26 11:49:21 UTC
I've fixed this in CVS by *discarding* translations which are not well-formed.
I also need to add a test case so this doesn't get reintroduced.
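
The discarding behaviour can be illustrated with a short Python sketch. It is
not the change that went into CVS (intltool is written in Perl); the function
names and the "de" translation are hypothetical, and the wrapper element
follows the definition given in comment 3.

```python
import xml.etree.ElementTree as ET


def is_well_formed_fragment(fragment):
    """A fragment passes if <something>fragment</something> parses."""
    try:
        ET.fromstring("<something>" + fragment + "</something>")
    except ET.ParseError:
        return False
    return True


def keep_well_formed(translations):
    """Split {language: translation} into (kept dict, discarded languages)."""
    kept = {}
    discarded = []
    for lang, text in translations.items():
        if is_well_formed_fragment(text):
            kept[lang] = text
        else:
            discarded.append(lang)
    return kept, discarded


translations = {
    "zh_CN": "小于号(<)",       # from the report: contains a bare "<"
    "de": "Weniger (&lt;)",    # hypothetical translation with "<" escaped
}
kept, discarded = keep_well_formed(translations)
```

Under these assumptions only the "de" entry would be merged; the "zh_CN" entry
is dropped, so the merged file stays well-formed at the cost of a missing
translation.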
Comment 7 Danilo Segan 2005-06-26 11:52:02 UTC
Btw, do we maybe want to issue a warning on stderr that some translations are
being discarded (we can even go as far as to indicate which messages and in
which language)?

That should at least help spot a problem in translations during build time.
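
One possible shape for such a warning, sketched in Python. The function name
`warn_discarded` and the message format are made up for illustration and are
not what intltool prints.

```python
import sys


def warn_discarded(discarded):
    """Write one warning per (language, msgid) pair to stderr.

    Returns the list of messages so callers or tests can inspect them.
    """
    messages = []
    for lang, msgid in discarded:
        message = (
            f"warning: [{lang}] discarding translation of {msgid!r}: "
            "not well-formed XML"
        )
        print(message, file=sys.stderr)
        messages.append(message)
    return messages


# Example: the entry from the original report.
messages = warn_discarded([("zh_CN", "Less")])
```

Writing to stderr keeps the generated XML on stdout clean while still making
the problem visible in build logs.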