GNOME Bugzilla – Bug 170471
intltool-merge into XML files should check well-formedness
Last modified: 2005-06-26 11:52:02 UTC
See bug 170328. If, for example, a po file string destined to be merged into an XML file looks like this: msgid "Less" msgstr "小于号(<)" you will get a nasty surprise when attempting to parse the resulting XML file. In some cases this can crash a whole app in all locales (in fact this is the typical case if the XML strings are all merged into one XML file: any illegal string will break parsing).
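The failure described above can be reproduced with any conforming XML parser. The following is a minimal sketch in Python (intltool itself is Perl); the element name `message` and the helper `parses` are made up for illustration.

```python
import xml.etree.ElementTree as ET

def parses(xml_text: str) -> bool:
    """Return True if xml_text is a well-formed XML document."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False

# Translation merged verbatim, as in the msgstr above: the bare "<" opens
# markup that is never completed, so the whole document fails to parse.
merged_raw = "<message>小于号(<)</message>"

# The same translation with the special character escaped.
merged_escaped = "<message>小于号(&lt;)</message>"

print(parses(merged_raw))      # False
print(parses(merged_escaped))  # True
```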
This is not an intltool bug, this is a feature. The reason is the following: <_translate>Hello <b>world</b> & others!</_translate> If you're able to define correct semantics for extracting and merging the above message, I'm willing to reconsider. In bug #130802 we decided it was much more useful to allow nested tags. Translators are very much capable of breaking programs anyway, and we need them to be aware of the responsibility they have. In my ideal world, we'd have an ", xml-format" PO file tag (just like the ", c-format" tag for printf-style strings), which would be checked for well-formedness by msgfmt. The alternative is to have intltool-merge try to check string well-formedness itself, but I'm not planning on working on that anytime soon. I ultimately feel that this belongs in GNU gettext (the "xml-format" tag described above), but we can put hacks in intltool to at least discard such messages (this is bad because translators wouldn't know about it, which is why I prefer getting this upstream). If you insist on having such a workaround in intltool, please open another bug, or retitle and reopen this one.
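One way to read the example above: the string mixes markup that must survive the merge (`<b>`) with a character that would need escaping (`&`), so neither escaping everything nor merging verbatim gives the right result. A small Python sketch of both options follows (Python is used only for illustration; the wrapper element `message` and the variable names are invented):

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

message = "Hello <b>world</b> & others!"

# Option 1: escape every special character. The result is safe to embed,
# but the nested <b> tag is turned into literal text.
escaped = escape(message)
print(escaped)  # Hello &lt;b&gt;world&lt;/b&gt; &amp; others!

# Option 2: merge verbatim. The nested tag survives, but the bare "&"
# makes the surrounding document ill-formed.
try:
    ET.fromstring("<message>" + message + "</message>")
    verbatim_ok = True
except ET.ParseError:
    verbatim_ok = False
print(verbatim_ok)  # False
```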
I am sorry, but I don't find that explanation satisfactory. Requiring that all strings in po files be legal, well-formed XML strings (the simple alternative, because of the unexpected matching that can occur when a program reuses a translated string in XML) doesn't make sense. That really means that intltool-merge needs to guard against breaking the XML during a merge, in the absence of the ", xml-format" approach you suggest above. One approach would be to require that the merged strings be well-formed XML fragments (ok, not sure there's such a thing technically as a 'well-formed fragment', but I'm sure you see what I mean). This would require, for instance, that any XML tags inside the string have matching closing tags, and that any other offending special characters be escaped. I don't think this would be infeasible.
We can define a fragment to be well-formed if <something>fragment</something> is well-formed, so that shouldn't be too hard to check. I was not talking about *all* strings in PO files being well-formed XML strings, only those which are merged back into XML files (look for source references containing .xml[.in].h in the #:-style comments above messages). We have worse requirements than that. See what you get when translators include strange things such as NULs (\000 or \0) in translations: breakages are going to happen all around. Nested tags are a big requirement, and we can't simply ignore their usefulness. This is the same problem we'd have if there were no checks for printf-style strings in msgfmt: segmentation faults might occur, et cetera. While checking this in intltool might not be infeasible, it's the incorrect way to do it (the translator wouldn't know that there was a problem with their translation). I'll work on this when time permits, but I'm not planning on implementing a "correcting parser" (one that checks "that any xml tags inside the string have matching closing tags, and escape any other offending special characters"; I have done something like this before for untrusted web-form input, but it's not Perl, and it's another 250 lines/8k of code, hardly something we want to introduce and maintain in intltool). Rather, I'd just reuse an existing parser which will only indicate whether a fragment is well-formed, and ignore non-well-formed translations. Would you consider that suitable for your purposes (basically, we'd only guarantee that the generated XML file would be well-formed)?
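The wrapping rule proposed above can be sketched in a few lines by reusing an existing parser. This is an illustrative Python version, not the Perl that intltool uses; the function names and the exact regular expression for the `#:` references are assumptions.

```python
import re
import xml.etree.ElementTree as ET

# Assumed pattern for "#:" source references that come from XML files,
# e.g. "../data/foo.xml.in.h:1" or "foo.xml.h:12".
XML_REF = re.compile(r"\.xml(\.in)?\.h(:\d+)?$")

def is_well_formed_fragment(fragment: str) -> bool:
    """A fragment is well-formed if <something>fragment</something> parses."""
    try:
        ET.fromstring("<something>" + fragment + "</something>")
        return True
    except ET.ParseError:
        return False

def goes_into_xml(references) -> bool:
    """True if any source reference points at an XML-derived header."""
    return any(XML_REF.search(ref) for ref in references)

print(is_well_formed_fragment("Hello <b>world</b> &amp; others!"))  # True
print(is_well_formed_fragment("Hello <b>world</b> & others!"))      # False
print(is_well_formed_fragment("Hello <b>world"))                    # False
print(goes_into_xml(["../data/foo.xml.in.h:1"]))                    # True
print(goes_into_xml(["src/main.c:42"]))                             # False
```

The parser only reports whether the fragment is well-formed; nothing is corrected, which matches the "no correcting parser" position taken above.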
No response for 2 months. Marking this as needinfo for now.
Reopening. We're awaiting Danilo's patch (he said "I'll work on this when time permits").
I've fixed this in CVS by *discarding* translations which are not well-formed. I need to add a test case as well so this doesn't get reintroduced.
Btw, do we maybe want to issue a warning on stderr that some translations are being discarded (we can even go as far as to indicate which messages and in which language)? That should at least help spot a problem in translations at build time.
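A rough sketch of the discard-with-warning behaviour discussed here, in Python for illustration only. The function name `pick_translation`, the warning text, and the choice to return `None` for a discarded translation are all assumptions, not what the CVS fix actually does.

```python
import sys
import xml.etree.ElementTree as ET

def is_well_formed_fragment(fragment: str) -> bool:
    """A fragment is well-formed if <something>fragment</something> parses."""
    try:
        ET.fromstring("<something>" + fragment + "</something>")
        return True
    except ET.ParseError:
        return False

def pick_translation(msgid, msgstr, lang, log=sys.stderr):
    """Return msgstr if it is safe to merge, or None if it must be discarded.

    Hypothetical helper: a discarded translation is reported on `log`
    together with the language and the original message.
    """
    if is_well_formed_fragment(msgstr):
        return msgstr
    print(
        f"warning: [{lang}] discarding non-well-formed translation of {msgid!r}",
        file=log,
    )
    return None

# The translation from the original report is dropped with a warning;
# the escaped variant is kept.
print(pick_translation("Less", "小于号(<)", "zh_CN"))     # None
print(pick_translation("Less", "小于号(&lt;)", "zh_CN"))  # 小于号(&lt;)
```

Writing the warning to stderr keeps the generated XML untouched while still making the problem visible in build logs.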