GNOME Bugzilla – Bug 130802
intltool creates invalid XML files
Last modified: 2004-12-22 21:47:04 UTC
When doing the gimp-2.0pre1 tarball yesterday, make dist choked when trying to validate the gimp-tips.xml file that is generated using intltool. This used to work until recently but it breaks with intltool-0.28. I tried current CVS and got the same result. Downgrading to version 0.27.2 fixed the problem. To test this, please check out gimp/tips and run 'make dist-hook'. This will generate gimp-tips.xml from gimp-tips.xml.in. The resulting XML is completely broken, invalid, not even well-formed.
Please attach the generated gimp-tips.xml file, thanks.
Created attachment 23083 [details] the source file
Created attachment 23084 [details] the generated XML file
I just commited a small fix that corrects the DOCTYPE handling, which is part of the problem. This bug caused the whole document to appear once at the beginning of the file. The other problem is that the XML input isn't really valid. It isn't appropriate to place tags (such as <tt>) in a CDATA section and expect the XML parser to preserve the position of the tags. The intltool-merge program assumes that there won't be tags in CDATA sections, and this causes the output to get messed up. The proper way to specify tags in CDATA sections is as follows: Instead of: <big><tt>test1</tt>Welcome to <tt>test2</tt>The GIMP !<tt>test3</tt></big> You should specify it as follows: <big><![CDATA[<tt>test1</tt>Welcome to <tt>test2</tt>The GIMP !<tt>test3</tt>]]></big> This way the XML parser knows that the tags are actually in the CDATA section. If you do it this way, then it will work with the new intltool-merge, and this is a more appropriate way to specify what you want. This does highlight a possible problem with intltool-merge, though. If the user specifies a CDATA section using the <![CDATA[...]]> syntax, the "<![CDATA[" and the "]]>" get lost in the output. I'm not sure of a way to preserve this. We could do a regular expression on the CDATA section and see if it contains an "<" or ">" character and put it back in this case, but I am not sure that this covers all situations where it would be nice to preserve the "<![CDATA[" and the "]]>". Kenneth, what do you think we should do here?
That is a good question. We could do the regex hack for now - that will leave us with a little time to find a better sollution - before we get the first bug report. - or can we live without "<![CDATA[" and the "]]>" in the output for now?
I don't think the suggested solution is an option for us. I also disagree that the XML input isn't valid. Any XML validator I tried so far seems to agree with me that nesting XML elements is a perfectly fine thing to do. This is a major regression and I suggest you revert your XML code to the state it had with 0.27.2. The use of the of the XML::Parser module is also not acceptable (see #130850). It seems that, until you revert these changes, we will have to stick to intltool-0.27.2.
Why isn't the suggested solution an option for you? If you place the problematic elements into "<![CDATA[...]]>" blocks then the output is exactly what you are looking for. This is just an XML technique for specifying what you want, that tags in CDATA be treated as CDATA.
I don't see why I should make it that complicated to edit the tips file. In this area (documentation) we rely on contributions from non-coders and I don't want to ask people to use such an akward syntax for no good reason. Also if I understand this correctly this syntax would get promoted to the output document which would make changes to the parser necessary.
The "<![CDATA[...]]>" syntax is XML standard, so whatever parser you are using should work fine with it - I assume that you are using an XML parser that supports the XML standards. I disagree that asking people to follow the XML standard is unreasonable awkward. This syntax is the XML standard way of specifying that you want the enclosed text to be treated as CDATA, even if it contains invalid CDATA characters, which include the "<" and ">" characters. Otherwise the XML parser will obviously interpret them as nested tags, and not a part of the CDATA section. You could also use the < and > syntax to specify the "<" and ">" characters if you want to make your CDATA sections valid - though I suspect that you actually want the "<" and ">" characters to be literally in your CDATA section.
Also, the current intltool-merge, does not promote the "<![CDATA[...]]>" syntax to the output file, though it probably should. If we fix intltool-merge to promote this to the output file, it should work fine with whatever parsers you use to process the output file, since this syntax is a part of the XML standard. We could also provide a command-line argument to suppress promoting the syntax to the output file if there are people who are processing the file using XML parsers that somehow do not support this feature.
Well, I doubt GMarkupParser will eat this. Sure, it isn't a real XML parser with all bells and whistles but it used to work reasonably well for this job and I wouldn't want to introduce a dependency on libxml2 just for this.
IMHO intltool shouldn't rely too heavily on advanced XML features and it should also not attempt to enforce all details of the XML specification or to interpret any DTDs used in the input. Well-formedness of the input should be good enough to make intltool do it's job. Perhaps it could at least offer to be run in such a mode.
Let me put it this way: To our parser the formatting tags are not part of the CDATA, they are nested tags and so the XML is perfectly standard-conformant. But I want intltool to treat any nested tags in translatable elements as CDATA and pass it to the translators. This used to work and if I remember correctly there also used to be test cases for this in intltool. We relied on this feature being available and it seems like a regression to remove it. We have been faced with this change without prior warning. This is IMHO highly unreasonable. Please consider to add a mode that allows intltool to be used like this. I am willing to consider a rewrite of our parser. It could actually become simpler if it treated the markup as part of CDATA. But we cannot do this change right now because we are very close to a stable release. So, my options at the moment are begging for a backward-compat mode in intltool or requiring intltool <= 0.27-2 for GIMP-2.0.
It should be able to add the old mode in with an option. What do you think about this Brian? As a temporary sollution at least?
Sven, could you test to see if GMarkupParser handles the "<![CDATA[...]]>" syntax? If so, then that is probably the best approach. intltool-merge currently does not retain the "<![CDATA[...]]>" in the output, so if you use this syntax, then it would work exactly like you want. So there shouldn't be a problem. If we add a feature in the future to retain the "<![CDATA[...]]>" in the output, we can add a command-line argument to disable this feature, so that it works for people in your situation. Adding back the old logic so that it can be used with an argument is also an option, though it won't work for translating entities, obviously (the feature that required adding an XML parser in the first place).
Brian, I agree with Sven and do not understand why you are insisting on having <![CDATA[...]]> where it is not necessary. Input and output files are valid XML. Current tests/results/merge7.xml (which is the gimp-tips.xml file Sven is trying to generate) is fully broken (not valid XML) in CVS, so this test must be fixed. The patch below is a preliminary fix: since you want <![CDATA[...]]> being added to make the XML parser happy, this is done automatically; I also had to fix nested tags so that <_foo>blah <bar>quux</bar> baz</_foo> is not messed up, but parseTree seems to have been patched and patched, and I find its logic hard to follow, so this patch works but parseTree could certainly be improved.
Created attachment 23227 [details] [review] Fix tags in translatable text
I will let Brian comment on the patch, but maybe you should add a comment in the code why you add the <![CDATA[...]]>, maybe refering to this particulair bug? Also could you add the gimp file as a new test? Kenneth
You are right, I complained that parseTree is hard to understand and did not comment my changes :( The gimp file is already part of intltool tests, it is results/merge7.xml.
hmm, but make distcheck passed for me when I did the last release - weird.
Edit tests/results/merge7.xml and search for <tt>, you will see that this file is fully broken. So intltool-merge is broken but outputs the "expected" result, which is why "make check" does not fail.
Brian, what do you think about Denis' patch? Wouldn't it be okay to check this in?
Apologies for taking so long, I missed the fact that you were wanting me to review the patch. I think the idea of making all CDATA sections have the I have a few issues with the patch, though: + I think the regular expression used to find tags will not work properly. It doesn't seem to handle tags like this: <tag attribute=val /> Such tags don't have closing tags. However, it looks like the regular expression used in the patch would try to place an incorrect CDATA section after it. The patch should probably be updated to handle this situation. It would probably be better to use a regular expression like this: s|^(.*?)([ \t]*)<\s*_($w+)\s*>(.*?)<\s*/_\3\s*>([ \t]*\n)?|$1$2$3<![CDATA[$4]]</$3>|s That is very similar to the logic used in the old intltool-merge to find tags and replace them. Since we know that used to work, we should probably stick with something like that. Check out intltool-merge with a date of 2003/09/15 and look at line 674 to see the original regular expression that worked that you should base yours on. + I don't really understand the 1st part of the patch where if $temp_tag is not a word, the data is simply printed out. In what situation is a tag not a word? There should be a comment explaing why this is needed.
Oops, I meant to say that I think the idea of making all CDATA sections have the CDATA syntax is a good one. I just think the patch needs a bit of work in order to function properly.
*** Bug 133879 has been marked as a duplicate of this bug. ***
[Brian Cameron] ] I think the regular expression used to find tags will not ] work properly. It doesn't seem to handle tags like this: ] ] <tag attribute=val /> No, only translatable tags are escaped, and <_tag attribute=val /> does not make sense (at least for me). ] I don't really understand the 1st part of the patch where if ] $temp_tag is not a word, the data is simply printed out. In ] what situation is a tag not a word? There should be a comment ] explaing why this is needed. I do not remember exactly, this was needed to deal with nested tags. Anyway, could someone have a look at tests/results/merge7.xml and fix this file? Search for the first occurence of "xcf", this tip is badly broken. If this test was fixed, it would ease our discussions because current failures would become obvious, and we could check how to best fix this bug.
In my tree, I had to revert the changes to this file from current revision 1.8 to back to revision 1.3. None of the versions in between were well-formed XML. Perhaps you should think about adding a test for well-formedness of the test results.
I'm slightly confused. I believe that the problem with the <tt> tag is addressed by adding the CDATA block. Isn't it just a matter of reviewing this patch and getting it absolutely correct to fix this problem? Looking at the specific Regular Expression in your patch, you are using the following: + $content =~ s{(<_[^>]*>)}{$1<![CDATA[}g; + $content =~ s{(</_[^>]*>)}{]]>$1}g; I guess what I don't like about the above technique is that if there, for any reason were a tag like this "<_tag attribute="val" />", that your script would put a "CDATA" section following this tag where it is inappropriate. I think you are assuming that if any tag starts with "_" that it is a valid tag to be translated. I just think that it is better to avoid this assumption. By breaking the RegularExpression into two lines, you aren't really making sure that there there are matching start/close tags. In my suggestion, it does the RegularExpression in one line and therefore ensures that this sort of problem could never happen. I think that the $temp_tag logic should be removed from the patch if nobody is sure what value it adds to the logic. I would like the following before accepting this part of the patch: 1) An description of the specific problem that it solves. 2) Comments in the code explaining why the logic is necessary. 3) A test case added to one of the "make check" test cases that ensures that the fix is exercised.
Description of the problem: tests/results/merge7.xml is broken. There is no need to add a new test case, this one must be fixed, then patches against intltool-merge to produce the same output can be evaluated.
Solution to the outlined problem: cvs diff -r 1.3 tests/results/merge7.xml | patch -p0 -R
*** This bug has been marked as a duplicate of 135212 ***
Sorry for this mistake. The bug I marked as duplicate is completely unrelated. I hit the wrong button.
Denis pointed me out at this bug when I sent report about this to inlttool@freedesktop.org. I'll resend my comments here which explain a different patch I made for this. ------ cut here --- this is from my email --- This happens because detecting if element is CDATA is broken in parseTree. I had to read XML:Parse man page to discover that $refs in parseTree takes the form of the following: [ "first", [ {}, "_test", [ {}, 0, "this ", "b", [ {}, 0, "doesn't"], 0, " work" ] ] ] for a document like: <first> <_test>this <b>doesn't</b> work</_test> </first> (to see how it (mis)behaves, put this into t.xml.in and run "intltool-merge -u -x . t.xml.in t.xml") The main point here is that $not_cdata should be 1 only if the previous element in the $refs list was 0 (zero). The patch I'm attaching fixes this, but I also had to change space compression to preserve leading and trailing spaces (otherwise, the above example would be rendered as "<test>this<b>doesn't</b>work</test>"). This whitespace is not used when looking up translation, so it doesn't affect compatibility. I still don't like it that intltool-merge doesn't preserve idea of lines ("line is an array of characters terminated with LF") -- it strips the final LF, so we get unfinished line in the resulting file. Unfortunately, I don't have time to look into this right now. Here's the ChangeLog entry: 2004-03-03 Danilo Šegan <dsegan@gmx.net> * intltool-merge.in.in (parseTree): Drop incorrect CDATA section detection. Set $cdata_follows = 1 if $sub == 0, and set $not_cdata for each node as negation of $cdata_follows. When handling CDATA, also keep leading and trailing whitespace. This fixes behaviour with nested tags and CDATA. I hope I haven't misunderstood the code, but if I have, you're encouraged to correct me. I don't want to provide complete test cases yet, because we already have test cases which are incorrect, so lets not build on that :)
Created attachment 25096 [details] [review] Correct detection of CDATA
Created attachment 25097 [details] Correct tests/results/merge7.xml -- please check
Sven, Could you please try this patch with latest intltool? It seems to work fine for me with Gimp CVS checkout of February 18th. Brian, you've got any comments on the patch, or is it ok to commit it? If no one objects merge7.xml I sent, I'll commit it tommorow so we at least have correct tests, if not correct intltool. (All tests except for .schemas one fail, and that is a separate issue -- it failed since February 2nd, I'll post about it on intltool@fd.o)
If it works with Gimp CVS checkout of February 18th, it should also work with a current checkout. We are close to a release and don't do any changes to the tips file.
Yes, the patch looks good to me. Though, weren't we removing the code to compress spaces in CDATA? If so, the patch might need a little tweaking to reflect this. The leading/trailing spaces might no longer be necessary. Is this correct?
Yes, you're right -- this needs to be reworked a bit. OTOH, I'm not sure the current style of entire parseTree would allow that to work well. I'm not entirely sure about this patch anymore. I'll have to consider it a bit more.
Created attachment 25426 [details] [review] Big patch: completely reworked XML merging
Comments for the patch: - I've completely dropped giant parseTree function, it's partially reused in other functions - function getAttributesString is almost completely taken from parseTree, and simply provides a string containing attribute values (translated or not, depending on passed parameters) - function getXMLstring creates a translatable string from XML node (it actually uses second [array] member of node in XML::Parser tree, which has attributes as a hash for the first element, and a list of pairs of all contained nodes) -- it is completely independent of translations, and it can be reused once intltool-extract is switched to XML::Parser as well; it simply returns a string, with all multiple spaces in CDATA (and only CDATA, attribute values are not affected) converted to one space - function traverse gets both elements of a XML node (first is element name or 0 for cdata, other is content array like above or simple text if name was 0); there's some similar code in getXMLstring and traverse, but the latter strips leading and trailing spaces only when final string is gotten, and also handles translations while traversing the tree I believe code is this way much simpler, shorter and more readable. It's is also more correct, since we're actually following the XML::Parse tree definition. At the same time, I've removed special handling of entities (calls to entity_decode and entity_encode), because it causes problems when used this way eg. with nested tags. For some more info (and reasons why we need not special-case entities, or how to actually handle borked input from translations), read http://pdx.freedesktop.org/pipermail/intltool/2004-March/000045.html With this patch, I discovered that tests/results/merge5p.sheet is incorrect as well (apart from merge7.xml -- see above message on intltool list), which means that current xml merging doesn't work that nice at all. Test merge6.xml would have to be modified if this patch was commited (because it tests just for entity_decode and entity_encode being performed, which I removed). This patch was tested to work fine for gimp/tips/gimp-tips.xml and gnome-desktop/gnome-version.xml (apart from testcases). I raised some more issues on intltool list which are not directly related to this issue. Can anyone please comment on the patch (I'm particularly unsure about attribute translations)?
In general it looks very good. I'm a little concerned that you are dropping the handling of entities. I'm pretty sure that some people need the special behavior. I say this because I broke it when I wrote the initial parseTree function and it took me quite a while to deal with all the related bug reports. I agree that the new logic is much more readible and clean. I think this patch should be committed, assuming that people are happy with the way entities will be handled. One minor nit, I notice that the $depth variable isn't used in the myParseTree function and should probably be removed. Another nice idea that I never had time for would be to write a comment handler. If the comments were somehow added to the Tree data structure, we could retain them in the output - which would be nice. I never quite figured out how to add such a thing to the Tree data structure, but I suspect it is possible.
Thanks for your comments Brian. I'll remove $depth from myParseTree (I'll also rename it to parseTree during commit). The problem with entities is that if we handle them, there's simply no way to have nested tags in translations. This means that even things like Pango markup "This <b>hurts</b>." wouldn't work -- it would be transformed to "This <b>hurts</b>.". This will probably mean some bugs for those depending on this, but I think this feature is useful enough to warrant even those bugs. OTOH, putting entity_encode and entity_decode in appropriate places is just changing two lines of code. As I mentioned, I think it's better to add a check for well-formedness of translations: we read the translation in, surround it by some bogus tags like "<test>" and "</test>", and pass that to "parse()" function. This means that we will stop when there's error in the input, which quite good IMHO. XML::Parser Tree style doesn't support comments, so we cannot handle them. (Btw, I've already commited correct results/merge7.xml test file)
It is important for intltool-merge to leave entities as they are, so that "<" in the input remains "<" in the output (rather than being transformed into "<"). I agree with you that removing these functions does allow nested tags, which is more useful. However, I think we are going to break a bunch of translations that'll need to get fixed by making the change you suggest. So, we might have to be careful about making that change. It might be a good idea to leave the entity-encode and entity-decode functions in for now, and remove them when we have time to coordinate with other projects to make sure that all translations are properly converted to a new format that doesn't depend on the entity-encode and entity-decode functions. You think?
Yes, I agree. So, I'll commit this patch (with suggested modifications) tommorow (I'll try building entire Gnome tonight with patched intltool and jhbuild just to check if everything is fine), and ask Mark to make a release of 0.31. After that, I'll post to desktop-devel-list to let Gnome developers know that we will be breaking compatibility in one of next releases and ask them for comments (and complaints) -- we'd not change it before at least 2.6.0 is released. This means that Gimp should require intltool 0.27.2 for the time being, but it will be supported in the proposed new release, so that's when I'll close this bug (with these changes, it should work "almost" -- tips which contain <tt> will appear as <tt> in translations, if new intltool is used, which is not too bad).
Sorry for not being around, but I haven't been home (and still ain't) since I am trying to plan some things for next year. It seems you have everything under control, though. Thanks everybody! Danilo, it is OK if Mark does the release - you just plan that with him.
So Danilo. How well is your "big patch" tested?
Well, I've used it to build entire Gnome soon after I made it, and I've yet to notice any problems with it (gnome-version, gimp, etc. work correctly). I've also tested it with Gimp. I'm planning on giving it one more test. Test "merge5p.sheet" is incorrect by itself; results/merge5p.sheet contains a line: <description xml:lang="fr_FR">Un mélangeur à pâtisserie</description> This is incorrect, since the "é" was actually supposed to be two-byte UTF-8 sequence representing "é". With my patch, we get UTF-8 output here, but to get correct output in "fr_BE" translation (which is not UTF-8 itself), we'd need to pass "-u" option as well in the test (FWIW, I think "-u" option should be default for all XML processing, since UTF-8 header is already emitted for XML). merge6.xml test fails because the semantics are changed (I don't do quoting anymore, because if we do quoting, we cannot embed subtags, such as required in Gimp, since "<" would end up as "<").
Well, just commit that patch and correct the tests. People will complain if there are new bugs and we can probably fix them easily.
Was this ever done?
I don't know if this is committed or not. It looks like "make check" is still giving me 3 failures in the tests/ directory.
New XML parsing has not been committed (I waited for at least Gnome 2.6.1 to be out, but it turned out I ran out of spare time now). I'll commit soon, so we could have it tested for enough time before next Gnome release is due. As for tests, I just checked out a fresh copy from CVS, and it fails only on the relevant test (merge7.xml, test number 3), what the "Big patch" tries to solve. Rodney, can you describe what tests fail for you, and provide differences between cases/xxx and results/xxx (if relevant). Thanks.
4. Merge translations into a multi-line XML file: [FAILED] (merge7.xml) 10. Merge translations that include escaped .po strings: [FAILED] (merge6.xml) 18. Merge translations into multiple files: [FAILED] (merge6.xml) These are the tests that failed. I will attach the diff -u output, and a patch to selftest.pl.in, to fix the alignment of the [FAILED]... bit for test 18.
Created attachment 28210 [details] [review] Patch to fix text alignment in output for test 18
Created attachment 28211 [details] Diff for merge6.xml test results
Created attachment 28212 [details] Diff for merge7.xml test results
merge7.xml is not expected to work yet, so I won't look into that. merge6.xml test is influenced by parameter reordering resulting from XML::Parser processing. Try looking at line 762 of intltool-merge, it seems that you have instead of: foreach my $e (reverse(keys %{ $sub })) { something like: foreach my $e (sort(keys %{ $sub })) { Your example uses sorted attributes, whereas results/merge6.xml does not. There may be some disparity between different intltool versions here, since we get different results. Does anybody else see the same output? Kenneth, can you please run "make check" for the CVS intltool?
is this committed? If it is I will try running 'make check' as soon as I can connect to cvs again (having networking problems)
I still get 3 failures with make check. I've just committed my patch to fix the alignment of the text output for test 18.
Comment on attachment 25426 [details] [review] Big patch: completely reworked XML merging I've finally commited The Big patch. I'll add further comments below.
Several issues left with XML merging in intltool-merge. 1. We must pass clear XML (i.e. no entity encoding/decoding when merging/extracting messages) for things like <tt> subtag to work (like in Gimp tips file) 2. We should probably *always* use UTF-8 when merging XML files (look at results/merge5p.sheet, which new intltool makes obviously erroneous -- it was wrong before, because UTF-8 sequence for eg. "à" was encoded as "à") This is all related to encoding/decoding of of XML entities. Only tests that fail for me with new intltool-merge are: 9. Merge translations into a dia sheet style XML file [broken XML]: [FAILED] (merge5p.sheet) 10. Merge translations that include escaped .po strings: [FAILED] (merge6.xml) 18. Merge translations into multiple files: [FAILED] (merge6.xml) (All due to "&entities;". I may have to look into multiple-file merging as well.) I propose to do the following: 1. call intltool-merge with "-u" parameter for merge5p.sheet 2. Adjust intltool-extract not to worry too much about XML entities 3. Adjust cases/merge6.po to include entities just like they appear in cases/merge6.xml.in (basically, what new post-step2 intltool-extract would produce) If, OTOH, you want to keep entity en/de-coding, I'll add that back to intltool-merge, but that means there's no way to have things like embedded tags in translations coming out of XML files (imagine having "<_this>Blah blah <em>booh</em>.</_this>" which would become "Blah blah <em>booh</em>." in PO file). Rodney, I'll try finding a machine where failures similar to those you see occur.
One more thing: once we agree on whether we want entity_encode/entity_decode, and we opt for not wanting them (to allow for nested tags), I'll write a message to d-d-l@g.o to let everybody know about incompatible changes in intltool (which are already in, but I hope us to move faster now). Another note: latest intltool CVS should be completely useable for Gimp.
Created attachment 28657 [details] [review] Make intltool-extract not en/de-code entities (with testcases fixes) This is a simple fix for intltool-extract to work correctly with new intltool-merge, please review quickly (it's is a must if new behaviour of intltool-merge is kept). All testcases (several adjusted in the patch) except for merge5p.sheet (as above) pass here.
It looks fine and I agree with your reasoning to remove the entity en-/decoding. Sorry I first reply now, but I had no internet access yesterday (dead firewalls - again) Kennneth
Comment on attachment 28657 [details] [review] Make intltool-extract not en/de-code entities (with testcases fixes) Ok, committed. Should I add a NEWS entry?
Marking as RESOLVED/FIXED -- please intervene if you disagree. Should we up version number as well (so we get easier to track bug reports)? I'll discuss other issues (such as merge5p.sheet merging) in other bugs, for instance in bug 138512.
Yes, thanks! actually everytime you commit something that is news worthy then please write it in NEWS. Cheesr, Kenneth