GNOME Bugzilla – Bug 116526
intltool merge of XML attributes fails
Last modified: 2004-12-22 21:47:04 UTC
intltool-update finds marked-up XML attributes and extracts them for translation, but intltool-merge doesn't handle them. Note that there's a FIXME in intltool source for this, so it's presumably intended to work someday. This is a blocker for gok (gnome onscreen keyboard, a 2.4 module) localization. Note that what you'd expect is for intltool-merge to produce multiple XML elements, one per translation, substituting the translated attribute strings just as it currently does with CDATA element content.
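The expected behaviour described above can be sketched roughly as follows. This is a minimal Python illustration, not intltool's Perl code: the function name `expand_translations`, the `keyboard`/`key` element names, and the in-memory `translations` dictionary are all hypothetical, and the leading-underscore marker on attribute names is assumed from the examples later in this report. It only expands children of a given parent, so it sidesteps the question of a translatable root element.

```python
import copy
import xml.etree.ElementTree as ET

# ElementTree's spelling of the predefined xml:lang attribute.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def expand_translations(parent, translations):
    """Replace each child that has '_'-marked attributes with one copy per language.

    `translations` maps a language code to a {source string: translated string}
    dictionary (a hypothetical stand-in for the .po catalogs).
    """
    # Walk backwards so that inserting copies does not shift unvisited indices.
    for index, child in reversed(list(enumerate(parent))):
        marked = [name for name in child.attrib if name.startswith("_")]
        if not marked:
            continue

        # The untranslated copy keeps the source string, minus the marker.
        base = copy.deepcopy(child)
        for name in marked:
            base.set(name[1:], base.attrib.pop(name))
        variants = [base]

        # One extra copy per language, tagged with xml:lang.
        for lang in sorted(translations):
            variant = copy.deepcopy(child)
            for name in marked:
                source = variant.attrib.pop(name)
                variant.set(name[1:], translations[lang].get(source, source))
            variant.set(XML_LANG, lang)
            variants.append(variant)

        parent.remove(child)
        for offset, variant in enumerate(variants):
            parent.insert(index + offset, variant)
    return parent
```

Calling it on `<keyboard><key _label="Hello"/></keyboard>` with Spanish and French catalogs would leave three `key` siblings: the original string, then one element per language.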
*** Bug 116529 has been marked as a duplicate of this bug. ***
I unfortunately cannot look into this for the moment as I am on vacation (only internet shop access). :-( I can review patches though, if one is produced.
Marking AP2 to reflect a11y team's assessment of impact (I don't think we consider GOK i18n a release stopper, do we?)
Apologies for spam... marking as GNOMEVER2.3 so it appears on the official GNOME bug list :)
This is still blocking GOK internationalization, thus GNOME 2.4 is not properly localizable! Any help is appreciated!
I am currently working on fixing this bug. However, it isn't completely clear to me what the output should look like. I can imagine two ways that it could be solved:

Solution #1:

    <GOK:accessmethod name="val" displayname="text" displayname:br="translated text for language br" displayname:es="translated text for language es" ...>

(I'm not sure my naming convention is right, but you get the idea. Please let me know what the naming convention should be if this is the right direction.)

Solution #2: We could have separate tags for each language, which would look something like the technique used for translating CDATA elements:

    <GOK:accessmethod name="val" displayname="translated text" ... xml:lang="xyz">

If Solution #1 is right, then things will be pretty easy to implement. If Solution #2 is right, then this will be harder to code, and GOK will probably also require changes, mainly because the GOK:accessmethod tag (which has an attribute that needs to be translated) is the root node, and my understanding is that you can only have one root node in a well-formed XML file. Solution #2 also seems bad to me because a tag with an attribute to be translated could have many internal tags; replicating all the internal tags for each language seems like a lot of bloat.

The current behavior of intltool-merge is to blindly translate the string into multiple languages. If we go with Solution #2 and duplicate the GOK:accessmethod tag once for each language, then intltool-merge will probably need a lot of work to handle this. In other words, the current behavior would create something like this for each language:

    <GOK:accessmethod name="val" displayname="translated text" ... xml:lang="1st">
      <GOK:description>Translated text</GOK:description>
      <GOK:description xml:lang="1st">translated text</GOK:description>
      <GOK:description xml:lang="2nd">translated text</GOK:description>
      ...
    </GOK:accessmethod>
    <GOK:accessmethod name="val" displayname="translated text" ... xml:lang="2nd">
      <GOK:description>Translated text</GOK:description>
      <GOK:description xml:lang="1st">translated text</GOK:description>
      <GOK:description xml:lang="2nd">translated text</GOK:description>
      ...
    </GOK:accessmethod>

Which seems really broken. So, I hope Solution #1 is the right way to go.
#2 is the normal XML'ish way to do this though AFAIK. CC'ing DV, maybe he has some input
    <GOK:accessmethod name="val" displayname="text" displayname:br="translated text for language br" displayname:es="translated text for language es" ...>

is REALLY REALLY BAD !!!!

First, do NOT use a colon ':' in tag names unless it denotes the use of an XML namespace. Second, DO NOT PUT text in attributes !!! Attribute values MUST be changed and normalized by the parser; use a separate element. Third, use the xml:lang attribute defined in XML exactly for this purpose.

Go read the guidelines I put up, and rework the syntax; this really cannot work as is! http://xmlsoft.org/guidelines.html

Do NOT push a solution with such a syntax, you will have tons of problems, guaranteed!

Daniel
Solution #2 is far, far better. It should still have displayname="translated text" as a separate node entry instead of an attribute, but that looks acceptable. But really, #1 must not go through. Daniel
So if Solution #2 is better, then how do you deal with this situation:

    <outer-tag _element="text_to_translate">
      <_inside-tag>text</_inside-tag>
    </outer-tag>

Assuming two languages, "es" and "fr", would the output look like:

    <outer-tag _element="translated-text" xml:lang="es">
      <inside-tag>translated-text</inside-tag>
    </outer-tag>
    <outer-tag _element="translated-text" xml:lang="fr">
      <inside-tag>translated-text</inside-tag>
    </outer-tag>

or would it look like something else?
I don't know the semantics you apply to the tags, so I can't judge how to deal with

    <outer-tag _element="text_to_translate">
      <_inside-tag>text</_inside-tag>
    </outer-tag>

Putting the "text_to_translate" in an attribute is a problem. Read the spec, especially this part: http://www.w3.org/TR/REC-xml#AVNormalize

If you store text in attributes, it WILL be modified by the parser *before* returning it to the application. So don't do that. Put text in element content. If you want to maintain parallel translated strings, keep sibling elements with a distinct xml:lang attribute indicating the language for each piece of text.

Assuming that _element="translated-text" is an indication of the result and not the text to translate (your example is quite confusing), then yes,

    <outer-tag _element="translated-text" xml:lang="es">
      <inside-tag>translated-text</inside-tag>
    </outer-tag>
    <outer-tag _element="translated-text" xml:lang="fr">
      <inside-tag>translated-text</inside-tag>
    </outer-tag>

seems appropriate.

Daniel
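The attribute-value normalization referred to above can be observed with any conformant parser. The following is a small Python illustration using the standard library's expat-based parser (not the tooling discussed in this report); the element and attribute names are made up. The same two-line string is stored once in an attribute and once as element content.

```python
import xml.etree.ElementTree as ET

# The same string, with a literal line break, in an attribute and in content.
doc = '<msg label="first line\nsecond line">first line\nsecond line</msg>'
elem = ET.fromstring(doc)

# The parser normalizes the attribute value: the line break becomes a space.
attribute_value = elem.get("label")

# Element content is handed to the application with the line break intact.
element_text = elem.text

print(repr(attribute_value))
print(repr(element_text))
```

Strings that contain only ordinary spaces are unaffected, so the practical risk depends on whether a translatable attribute could ever need a line break or tab.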
DV: text does need to be stored in attributes in the file format Brian refers to. Perhaps not ideal, but that's the way it works. Brian: for each xml:lang, we expect only one localized copy of each element. This means that if an element contains both translatable CDATA and translatable string-attribute values, both are contained in one element. If we ignore the issue of nested elements (i.e. just process them in-place, using a one-pass expansion) then I think the result will be fine. We certainly don't want to end up with a geometric expansion of elements, but I don't see that as a problem. I do not think the structure of the XML document will change except for duplication of elements; the copying/branching may occur anywhere from the root node (I know, we need to make sure the root node doesn't get translated) onwards.
> DV: text does need to be stored in attributes in the file format Brian
> refers to. Perhaps not ideal, but that's the way it works.

NO, this CANNOT work that way. Any carriage return or line feed MUST be destroyed by the parser on input. Now explain why you expect that to be a workable solution; it ain't so! I'm not going to let this point go. It's a fundamental flaw; if it needs a redesign, then the redesign will have to be done.

Daniel
Apparently the current version of intltool uses a non-conformant XML parser; the whole chain will break all the I18N data as soon as that parser is fixed to follow the XML standard. You must fix the format or be ready to have a huge fuckup as soon as the non-conformant behaviour is fixed or the toolchain needs to evolve to a new parser. This is critically broken and should be fixed ASAP. Honestly, I really can't understand how such a fragile design was not reviewed before!

Daniel
DV: I don't know what you are talking about regarding a fragile design, etc., though I can take no credit or blame for the XML format. The translatable attributes do not contain newlines. I re-read the XML spec and believe the attributes will survive normalization fine, as used. The only whitespace these attributes are allowed to contain is the 'space' character (0x20), which seems to be fine in attributes. intltool doesn't parse the XML at the moment (unless you count regular-expression matching as "parsing").
I agree that it would be reasonable for intltool's XML processing to use a conformant XML parser, by the way.
Okay, it's very hard from the scope of this bug report to assess the limits of the design. If you're sure no newline will ever be needed, then keeping the input string in an attribute should not generate serious trouble. But I'm surprised by the guarantee you can provide about newlines never being needed; unless this is very specific code, this is dangerous anyway. For the result, using different elements with xml:lang is the way to go, no doubt about it.

Concerning the use of a real XML parser: it will guarantee up front that the strings are contained within the Unicode ranges allowed by the XML spec, and avoid silent corruption of the data from an XML viewpoint. http://www.w3.org/TR/REC-xml#NT-Char

Daniel
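To illustrate the point about character ranges: a conformant parser refuses input containing characters outside the Char production, whereas regular-expression substitution would pass them through silently. This is a hedged Python sketch using the standard library parser; the helper name `is_well_formed` is hypothetical.

```python
import xml.etree.ElementTree as ET


def is_well_formed(document):
    """Return True if the parser accepts `document`, False if it rejects it."""
    try:
        ET.fromstring(document)
    except ET.ParseError:
        return False
    return True


# Ordinary text, including non-ASCII characters, is accepted.
print(is_well_formed("<msg>caf\u00e9</msg>"))

# A raw control character (U+0001) is outside the allowed Char range.
print(is_well_formed("<msg>\x01</msg>"))

# The same character written as a numeric reference is rejected as well.
print(is_well_formed("<msg>&#1;</msg>"))
```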
Is anyone working on a fix for this? Or is this still in debate?
Fixing this will require substantial changes to intltool-merge (i.e. real parsing of the XML instead of just string-substitution). As such, it's a big job and we don't know who will be available to do it.
I am currently working on fixing this. As Bill mentions, making the tool use an XML parser is a bit of work, so it will likely take me some time.
Created attachment 21010 [details] [review] patch that fixes the bug
The attached patch fixes this bug. intltool-merge now uses the XML::Parser Perl module to parse the file and produce the appropriate output. This obviously means that XML::Parser is now a dependency of the script, which seems appropriate. I believe I am producing the output appropriately, but would like someone from the intltool team to review and make sure that the patch doesn't need any tweaks.

Note that there are three side effects of using the XML parser:

* intltool-merge is now stricter about requiring well-formed input. The previous logic didn't notice badly formed XML.
* Comments are not retained by the XML parser, so they are not included in the output. I suppose post-processing logic could be added to re-insert them in the appropriate places, though this doesn't seem to be worth the effort. I assume that having comments only in the xml.in files is sufficient.
* Using an XML parser means that the white space placed in the output file is slightly different than in the input file. This shouldn't be a problem, but I'm just mentioning the fact.
From a first look it looks pretty good. I would like to see some tests though. Does it still pass "make distcheck"? Could you add additional tests to make sure that everything is working as it should? The requirement seems appropriate to me, though we ought to ask on gnome-desktop-devel so everyone has been heard. Can you send them a mail? You should probably ask about the comment issue as well; maybe someone has a different opinion.
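Two of the side effects listed above (stricter well-formedness checking and dropped comments) are typical of tree-building parsers in general. The sketch below shows the same effects with Python's standard library parser; it is an analogy only and says nothing about how XML::Parser is configured in the patch. The sample documents are made up.

```python
import xml.etree.ElementTree as ET

# A comment in the source is not kept by the default tree builder,
# so it is missing after a parse/serialize round trip.
source = "<root><!-- note for translators --><item>text</item></root>"
tree = ET.fromstring(source)
round_tripped = ET.tostring(tree, encoding="unicode")
print(round_tripped)

# Mismatched tags, which plain string substitution would not notice,
# are reported as a parse error.
try:
    ET.fromstring("<root><item>text</root>")
    malformed_rejected = False
except ET.ParseError:
    malformed_rejected = True
print(malformed_rejected)
```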
Okay, I ran "make distcheck" and some of the tests fail, but I think the failures are acceptable. The failures fall into the following categories:

1. White space differences. This includes some differences in leading white space. Also, the new intltool-merge changes tags that look like:

    <tag>value</tag>

   to:

    <tag>
    value
    </tag>

   Since the XML parser doesn't keep track of the white spacing of the original document, it isn't really possible to keep it the same.

2. Comments are missing from the output files, as expected.

3. Perhaps the only serious difference is that the old script did not output the translation for a given language if the translated text does not exist. The new script prints out the default (English) text when the translated text does not exist. In other words, you now see things like:

    <bar xml:lang="az">
    This is another ' test
    </bar>

   where before the block for lang="az" would just be left out, since there really isn't a translation. I think that this is necessary because you can have situations as follows:

    <foo element="translated_text" xml:lang="az">
    <bar>
    This is not translated text
    </bar>
    </foo>

   In other words, the attribute of foo does have a translation but the text for bar does not.

I guess if we want different behavior in the intltool-merge script, I'll need to know more clearly exactly how you want it to work.
No. 1 and 2 seem okay to me, but then the tests need to be updated so 'make dist' will still work, i.e. new result files should be made. I don't know about no. 3. I think it would be better if we don't write anything to the XML file if no translation exists. Otherwise we will end up distributing (in RPM form etc.) larger files than necessary. Even if a translator just commits a LANG.po file with just one translated string, it will increase the size of the XML files substantially.
Okay, I suppose it is understandable to not translate tags with CDATA that do not have translations. For example:

    <_tag>text to translate</_tag>

However, what do we do with elements that are not translated? For example:

    <tag1 _element="element text to translate">
      <_tag2>cdata text to translate</_tag2>
      <tag3>just another tag</tag3>
    </tag1>

In this situation, it is straightforward if we have translations for both strings (then we simply create the appropriate translation), or if we do not have translations for either (then we don't create the whole block). But the other situations seem a bit more problematic. For example, what do we do if there is a translation for tag1 but not tag2? What do we do if there is a translation for tag2 but not tag1?

Please be specific; examples of the output you would expect would be useful. I'm happy to code it any way that makes sense. I just thought that putting in the default (English) text was a reasonable way to handle this that wouldn't need so much exception coding for the problematic situations above.
Brian and Kenneth: In the case of tags that don't have translations, since we are also using the 'output multiple files' option, it seems to me that we need to output tags in all cases. It's not clear to me that the multiple-output file would contain all the necessary tags (whether they had been translated or not) unless we output tags even when translations are not available. Perhaps I am misunderstanding what the output now looks like.

Also, I believe that in the case of tags that are children of translated nodes (I think this is the case Brian brought up), we must output all the child tags. Otherwise, if our XML client has successfully located the parent node for the correct xml:lang, it will not be able to find all the relevant children. The alternative would be to implement the heuristics in the client, i.e.

1) find the translated parent _and_ the C-locale parent
2) if the specified child isn't found in the translated parent, read it from the C-locale parent node

I think this sort of heuristic is likely to break for GOK, which in some cases may wish to parse XML files whose topology really _does_ depend on the locale, but which in most cases expects locale-independent topology. This also would (it seems to me) break the multiple-output case, unless we include the C-locale elements in all of the multiple-output files as well. That near doubling of the length of the output files for multiple output would offset any size savings due to omitting untranslated tags, I think.
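For reference, the client-side heuristic described above (find the translated parent and the C-locale parent, then fall back to the latter for missing children) could look roughly like this. It is a hypothetical Python sketch, not GOK code: the function name `find_child_text` and the assumption that the untranslated parent is the sibling without an xml:lang attribute are both illustrative.

```python
import xml.etree.ElementTree as ET

# ElementTree's spelling of the predefined xml:lang attribute.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def find_child_text(root, parent_tag, child_tag, lang):
    """Return the text of `child_tag`, preferring the parent localized for `lang`.

    Falls back to the parent without xml:lang (taken here to be the C locale)
    when the localized parent is missing or lacks the child.
    """
    parents = root.findall(parent_tag)
    localized = next((p for p in parents if p.get(XML_LANG) == lang), None)
    c_locale = next((p for p in parents if p.get(XML_LANG) is None), None)

    for candidate in (localized, c_locale):
        if candidate is None:
            continue
        child = candidate.find(child_tag)
        if child is not None:
            return child.text
    return None
```

The sketch also shows the cost raised in the comment: every lookup has to consult two parents, and it silently assumes that the two have the same shape, which is exactly what may not hold for locale-dependent files.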
Kenneth: I am waiting to hear your feedback on this issue. I would like to finish up this bug, but can not until I understand how to address the issues I mentioned in my last update to this bug.
I think we should just output all tags then, for now anyway. Feel free to check in when you have made sure that 'make distcheck' still works (add new result files etc.). Kenneth
Kenneth/Brian: While we are at it, can we add an intltool rule for INTLTOOL_KBD_RULE ? I can attach a small patch here (which applies on top of Brian's recent patch), or put it in RFE bug #126237.
Patch is committed. I updated the data in the test subdirectory so that "make distcheck" passes. I don't have permission to change this bug's status, though.
oops, yes i do. updating status of the bug.
> 1. white space differences. This includes some differences in
> leading white space. Also, the new intltool-merge changes
> tags that look like:
>
> <tag>value</tag>
>
> to:
>
> <tag>
> value
> </tag>
>
> Since the XML parser doesn't keep track of the white spacing
> of the original document, it isn't really possible to keep it
> the same.

Is this problem still present? If yes, then the bug is not FIXED; that's a very serious problem. White space is significant: you cannot ignore it or add it within tag values. I hope this is fixed and wait for confirmation.

Daniel
Daniel: Unless the XML specifies xml:space="preserve", this isn't really a bug, is it? I don't think there is any guarantee in intltool that leading and trailing whitespace in localized output are preserved. Shouldn't we just document this?
oops, cookie problem. The above annotation should have come from me, not david H. sorry.
I am trying to use Brian's OrigTree module right now, but I am having some trouble getting it to work. Btw, Bill, test no. 18 is still broken since your last commit.
OrigTree module is committed. Changing ->Tree to ->OrigTree in intltool-merge.in.in should activate it, but apparently something is going wrong - any help appreciated.
Kenneth thanks for the reminder.
Yes, it is a bug even if space="preserve" is not defined. First, space="preserve" is the *default* behaviour. Second, space="preserve" only addresses text nodes made solely of characters from the S production, i.e. spaces. What this is doing is *modifying* text nodes, and that, as I said and repeated, is definitely not conformant.

Turning

    <a><b/></a>

into

    <a>
      <b/>
    </a>

should not be done by default, but is okay from a purely formatting perspective. Turning

    <a><b>text</b></a>

into

    <a>
      <b>text</b>
    </a>

is no worse; it's the same: only empty or blank nodes are changed or added. But turning

    <a><b>text</b></a>

into

    <a>
      <b>
        text
      </b>
    </a>

is a clear violation. It changes the text node from containing a non-blank string "text" into another non-blank string " text ", and that is a heresy from an XML processing viewpoint.

Daniel
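The distinction drawn above, between adding blank-only text nodes and altering a non-blank text node, is easy to check mechanically. The following Python sketch is one way a regression test might do it; the function name `same_significant_text` is hypothetical, and it deliberately treats whitespace-only text as ignorable, which would be too lenient for mixed content.

```python
import xml.etree.ElementTree as ET


def _is_blank(text):
    """True for a missing text node or one made only of whitespace."""
    return text is None or text.strip() == ""


def same_significant_text(a, b):
    """True if two trees differ at most in whitespace-only text nodes."""
    if a.tag != b.tag or a.attrib != b.attrib or len(a) != len(b):
        return False
    for left, right in ((a.text, b.text), (a.tail, b.tail)):
        if _is_blank(left) and _is_blank(right):
            continue  # pure indentation: tolerated
        if left != right:
            return False  # a non-blank text node was altered
    return all(same_significant_text(x, y) for x, y in zip(a, b))


original = ET.fromstring("<a><b>text</b></a>")
indented = ET.fromstring("<a>\n  <b>text</b>\n</a>")
padded = ET.fromstring("<a>\n  <b>\n    text\n  </b>\n</a>")

print(same_significant_text(original, indented))
print(same_significant_text(original, padded))
```

Re-indenting around elements passes the check, while padding the content of `<b>` fails it, matching the two cases described in the comment.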
ok, that makes sense. Brian, can we put a fix on the queue? thanks.
I believe the problem that you describe has been fixed in CVS head. Refer to bug #127250. Please verify.
Just wanted to tell that OrigTree is committed and is now the default XML::Parser::Style. Kenneth
problem is fixed in 127250