GNOME Bugzilla – Bug 127218
new parser does not re-escape entities when merging
Last modified: 2004-12-22 21:47:04 UTC
If the original text to translate contains an entity eg 'Latex & HTML' The parser unescapes things for translation but does not re-escape for merging. Which results in <foo>Latex & HTML</foo> and parse warnings
It seems caused by the patch for #116526
Okay, I am delaying the release before this is fixed. Will you take a look Brian?
On a semi-related note. I've noticed that - intltool did not branch for gnome-2-4 before adding this major change - intltool does not seem to tag CVS for its releases Both are good release practices that you may want to adopt.
You're right. Next release will be CVS-tagged.
Okay, there are a few ways we could address this problem. 1. We could preprocess the file and change Entity tags to something else before we parse it, and then change them back after the XML parser has processed the file. For example, we could change & to something like INTLTOOLENTITYSTARTampINTLTOOLENTITYEND (or somesuch) via regular expression before parsing the file. Then change it back after the XML parser is done parsing. This is obviously a bit of a hack, but simple to implement if we can come up with keywords that we are sure would never be found in a real XML document. 2. If we are happy handling just the 5 Entities that are Predefined in XML (<, >, &, &apos, ")., then we could just use the Expat::xml_escape function to convert them back into the <, >, &, &apos, " format. 3. If we are happy handling just unsafe HTML Entities, then we could use the HTML::Entities.encode_entities method on CDATA elements. This does what #2 does and also handles some additional HTML Entities. 4. The most robust solution would be to define our own XML::Filter Start, End, and Char handlers which work much the same way as the Tree style, but use $expat->original_string() rather than using the UTF-8 output. This could be done by writing our own custom XML::Style based on the Tree style. This means a slight variation of the Tree.pm file would become a part of the intltool project. However this technique works on all Entities *except* for the 5 Entities that are a part of the XML standard (<, >, &, &apos, "), so it would need to be used in conjunction with solution #2. Let me know what solution works best for intltool-merge's needs, and I'll go ahead and fix the code.
I would think that 2 is enought for our needs. If not 4 seems like the best sollution. the pm modules are written in C and compiled right? Kenneth
Here is patch for solution #4. I realize that my earlier analysis was a bit wrong. Solution #4 does properly handle the Predefined XML tags, so it is not necessary to use xml_escape(). Also, note that I needed to comment out the call to entity_decode() on line 814. This logic was in the code before I ever added the XML::Parser logic. I'm not sure how "&" was ever preserved in the output with this function call. I think that &apos, ", &, and < were never preserved in escape-form. Not sure what is right here. I'm also attaching the new "OrigTree.pm" file. This is not c-code, this is just a perl module that must be installed somewhere in a directory named XML/Parser/Style/OrigTree.pm. intltool-merge will need to set @INC to include whatever directory we decide to place this custom XML::Parser style. The OrigTree.pm uses the original_string() function so that it doesn't change any Entity values.
Created attachment 21591 [details] [review] now uses OrigTree style
Created attachment 21592 [details] OrigTree.pm file
Jody, are you giving this a test run? Brian, you should probably add a few tests so we won't break this functionality again in the future. Kenneth
Kenneth, you never answered my question. I believe that the old logic was calling entity_decode() which would have always converted "&" to "&" anyway. So, Jody's original situation would have failed with the old intltool-merge, unless I am missing something. What is wrong here. Should the entity_decode function calls be removed from intltool-merge? They were there before the XML::Parser logic was added.
What I did was: I looked at the xml.in file. Decoded & to & etc - since that is how they are store in the po files. ie. the xml string "blah & blah" is shows as "blah & blah" in the po file. Then I lookup the translated string. Then I encode the & to & etc and store the "translatedblah & translatedblah" string in the xml output file. If the Parser outputs escaped special letters (& etc) then you don't need this function and can safely remove it from the XML output. Jody's example wouldn't have failed with the old method. Does this enlighten you? Feel free to ask more. Kenneth
Okay, I just committed the fix for bug #127250 which will cause my above patch to fail. However, the change is so simple to make I'm not sure a patch is really necessary. You just need to change the line that says: my $xp = new XML::Parser(Style => 'Tree'); to my $xp = new XML::Parser(Style => 'OrigTree'); Install the OrigTree.pm file to a directory named (base_dir)/XML/Parser/Style, and add the following line to the beginning of intltool-merge: push(@INC, "(base_dir)/lib/"); Please test this, and verify that this fixes the problem. Also, could somebody tell me where the XML/Parser/Style/OrigTree.pm file should be installed? Once this has been established, I can set up the Makefile to install it to the proper location. Note that I am going on vacation for the next 1.5 weeks, so someone else may need to integrate this patch into intltool-merge if you need it done sooner.
I should mention that the issue with entity_decode was fixed in bug #127250. Also, it would be useful if someone told me where the OrigTree.pm file should be stored within the intltool project. Should it go in intltool/lib/perl/XML/Parser/Style? I assume that @INC should also include "./lib/perl" so that it works when running from the source directory as well?
we normally install intltool resources in PREFIX/share/intltool/ Kenneth
Okay, but where should it go in the CVS module?
intltool/module ? intltool/perlmodule ? intltool/parser ? It doesn't really matter so much - if you have better suggestions please write them here. Kenneth
Note, I am going on vacation for the next 1.5 weeks, so someone else will need to integrate this change into the code if it needs to be included sooner than that.
Is it in CVS already ? I tried cvs last night and things did not work. ChangeLog suggested that some of the patch was in.
No it is not in cvs yet. If it works for you we will commit it. Kenneth
Kenneth: I understand from reading bug 116526 that you are having trouble getting the OrigTree XML:Parser Style module to work. What are you seeing that is wrong, specifically?
I figured it out. It is all in cvs now and working AFAICT. Please test this Jody - so we can close this bug.
The 0.28 release seems to almost fix this, except that the OrigTree.pm module is installed where my Perl doesn't see it. I have perl (along with all of gnome) installed locally, under $USER. The 0.28 release of intltool installed OrigTree.pm under $USER/share/intltool/XML/Parser/Style, but perl doesn't find it, and I get problems compiling gnome-vfs when it tries to use intltool-merge...it complains about OrigTree.pm not being installed properly. If I put OrigTree.pm into: $USER/lib/perl5/site_perl/5.8.1/sun4-solaris/XML/Parser/Style/, then everything works fine.
Hi, I believe that OrigTree is useless, the Char handler can be redefined with the Tree style with my $xp = new XML::Parser(Style => 'Tree'); $xp->setHandlers(Char => \&intltool_tree_char); where intltool_tree_char is the Char function in OrigTree.pm. Denis
Created attachment 22327 [details] [review] Use the standard Tree style instead of XML::Parser::Style::OrigTree
That's clever. That is a much better solution to the same problem. However, the intltoolize.in and modules/Makefile.am also contain references to OrigTree.pm which need to be removed.
And you are saying that now :-) Please supply a patch that removes all traces of OrigTree. Please check the ChangeLog to see what files that has been modified. This probably requires a specific version of XML::Parser. A check for this has to be added to intltool.m4
The XML::Parser changelog begins with 2.33 - Fixed Tree style (grantm) I do not know whether these fixes are important. XML::Parser::Style::Tree is less than 50 lines of code, maybe it could be embedded into intltool-merge? Denis
CVS still appears to generate raw '&' in spots. Was this patch in 0.28 ?
it was
Something is still unhappy. 1) Before updating this morning raw ampersands were still being generated in gnumeric/plugins/excel 2) After updating the source text was escaped, but no translations appeared. (again in gnumeric/plugins/excel).
I'm hitting intltool unescaping entities in attributes in the freedesktop mime xml file. For example: <match type="string" value="<!DOCTYPE\ xbel" offset="0:64"/> is going to: <match type="string" value="<DOCTYPE\ xbel" offset="0:64"/> This is preventing a release of the new Mime database.
Brian will you please look into this? I am going on christmas holidays tomorrow, so if you fix this later I will allow you to do a new release afterward, since I wont be able to do it Kenneth
jrb can you test the attached patch. I think this resolves your issues.
Created attachment 22549 [details] [review] patch fixing bug reported by jrb
Brian, why did you reindent the whole file? When applied, I will have to fix my patch sent to #123911 :( I tried to reproduce Jody's problems with gnumeric/plugins/excel. My test shows that it worked fine with intltool 0.28, but CVS is broken (with or without your latest patch). Translations are not incorporated, but if I manually replace msgid "MS Excel (tm) 97/2000/XP & 5.0/95" by msgid "MS Excel (tm) 97/2000/XP & 5.0/95" in PO files, translated tag is displayed again. This means that entities should be decoded before being compared to msgids.
Okay, I'm pretty sure that I just fixed the bug with intltool-merge not properly translating strings. Please verify. Sorry about the re-indenting. I noticed that there was a lot of inconsistancy in the code regarding using space vs. tab for indentation. Since tabs were only used in a few places, I replaced them with spaces. Is it okay if you rebuild the patch for #123911? I think the code is much cleaner looking with consistant spacing.
Note that I am going on vacation starting tomorrow, and will not be back into the office until January 5th. Therefore, let me know if my patch above works okay. If so, I'll apply it to CVS. Otherwise, someone else will need to integrate it.
AFAICT you fixed the bug reported by jrb, not Jody's one. To test it, I downloaded gnumeric-1.2.2 and removed all po/*.po (except po/fr.po), otherwise test case took a long time on my old box. Eventually I ran $ intltool-merge -x . ../plugins/excel/plugin.xml.in out and looked into 'out' file, with several intltool versions. With CVS (your patch being applied or not), translations are discarded when text contains an ampersand, as in the last <description> tag: <service type="file_saver" id="excel_biff8" file_extension="xls"> <information> <description>MS Excel (tm) 97/2000/XP</description> <description xml:lang="fr">MS Excel(tm) 97/2000/XP</description> </information> </service> <service type="file_saver" id="excel_dsf" file_extension="xls"> <information> <description>MS Excel (tm) 97/2000/XP & 5.0/95</description> </information> </service> This entry is translated in fr.po. About indentation, of course I agree that being consistent is much better, but IMO this can be fixed when no patches are sleeping in bugzilla. But if this is necessary, I will provide an updated patch.
Added the conglomerate bugtracking mailing list to the CC list.
I updated the patch that fixes jrb's problem so it applies to CVS head. This patch should really go in, since it fixes some serious problems: 1) It retains the DOCTYPE from the original document in the output. 2) It translates entities in attributes properly (see the intltool_tree_start function 3) No longer uses the OrigTree Style. Now just uses Tree and overrides the start and char handlers. Much cleaner. I updated the patch so it no longer changes white-space. Denis, I am a bit confused by your problem. I copied the following lines into the intltool/tests/cases/merge6.po file: msgid "MS Excel (tm) 97/2000/XP & 5.0/95" msgstr "worked - MS Excel(tm) 97/2000/XP & 5.0/95" And then I added the following test line to the intltool/tests/cases/merge6.xml.in file: <_test>MS Excel (tm) 97/2000/XP & 5.0/95</_test> I then run "make check" in the intltool/tests directory. Note that it complains that test #10 (merge6.xml) fails if you don't also update the intltool/tests/results/merge6.xml file that it compares against, but you can just ignore the "failure" message. Anyway, after running "make check", I see the following 2 lines in the intltool/tests/cases/merge6.xml output file <test>MS Excel (tm) 97/2000/XP & 5.0/95</test> <test xml:lang="merge6">worked - MS Excel(tm) 97/2000/XP & 5.0/95</test> This seems to look right to me. Are you sure that the problem isn't on your end, or that one of the recent fixes to intltool didn't correct your problem?
Created attachment 22950 [details] [review] updated patch
Please commit - it looks fine. Are you going to do anything about comments? Kenneth
Brian, I am all for your patch. It fixes some bugs, but from my tests not this one. Your test seems to show that I am wrong, I will try again after you commit your patch. Denis
patch applied. Denis, please test.
Note that my latest patch removed all references to OrigTree.pm, but I did not delete the OrigTree.pm file from the module. Should this be done?
I guess so...too bad, now a lot of modules need to removed OrigTree from EXTRA_DIST. Can you up the intltool version after this change? Didn't Jrb make a test case? If so, maybe this one should be committed
The test mentioned above works now just fine, so I believe that you fix all bugs mentioned in this bugreport. Denis
I am trusting you and closing this. Re-open if I am closing this too early ;-)
Will there be tarball released with the patch included?
gnumeric head and 1.2.x are happy with this patch, as is libgnomeprint.
Ja, maar je moet even een paar weekjes wachten. (Yes, but you have to wait a couple of weeks.)