Bug 130802 – intltool creates invalid XML files

After an evaluation, GNOME has moved from Bugzilla to GitLab. Learn more about GitLab.
No new issues can be reported in GNOME Bugzilla anymore.
To report an issue in a GNOME project, go to GNOME GitLab.
Do not go to GNOME Gitlab for: Bluefish, Doxygen, GnuCash, GStreamer, java-gnome, LDTP, NetworkManager, Tomboy.

Bug 130802 - intltool creates invalid XML files


Summary:	intltool creates invalid XML files


Status:	RESOLVED FIXED

Product:	intltool
Classification:	Deprecated
Component:	general
Version:	unspecified
Hardware:	Other Linux

Importance:	Normal major
Target Milestone:	---
Assigned To:	intltool maintainers
QA Contact:	intltool maintainers

URL:
Whiteboard:

Duplicates:	133879 (view as bug list)
Depends on:
Blocks:	138512

Reported:	2004-01-07 19:49 UTC by Sven Neumann
Modified:	2004-12-22 21:47 UTC

See Also:
GNOME target:	---
GNOME version:	2.5/2.6

Attachments
the source file (11.47 KB, text/plain) 2004-01-07 20:03 UTC, Sven Neumann		Details
the generated XML file (124.69 KB, text/plain) 2004-01-07 20:04 UTC, Sven Neumann		Details
Fix tags in translatable text (1.29 KB, patch) 2004-01-11 13:19 UTC, Denis Barbier	needs-work	Details \| Review
Correct detection of CDATA (1.63 KB, patch) 2004-03-03 14:17 UTC, Danilo Segan	needs-work	Details \| Review
Correct tests/results/merge7.xml -- please check (2.38 KB, text/plain) 2004-03-03 14:19 UTC, Danilo Segan		Details
Big patch: completely reworked XML merging (14.05 KB, patch) 2004-03-10 01:54 UTC, Danilo Segan	committed	Details \| Review
Patch to fix text alignment in output for test 18 (950 bytes, patch) 2004-05-31 20:57 UTC, Rodney Dawes	committed	Details \| Review
Diff for merge6.xml test results (563 bytes, text/plain) 2004-05-31 21:00 UTC, Rodney Dawes		Details
Diff for merge7.xml test results (1.83 KB, text/plain) 2004-05-31 21:00 UTC, Rodney Dawes		Details
Make intltool-extract not en/de-code entities (with testcases fixes) (4.27 KB, patch) 2004-06-13 12:46 UTC, Danilo Segan	committed	Details \| Review

Description Sven Neumann 2004-01-07 19:49:52 UTC

When doing the gimp-2.0pre1 tarball yesterday, make dist choked when trying
to validate the gimp-tips.xml file that is generated using intltool. This
used to work until recently but it breaks with intltool-0.28. I tried
current CVS and got the same result. Downgrading to version 0.27.2 fixed
the problem.

To test this, please check out gimp/tips and run 'make dist-hook'. This
will generate gimp-tips.xml from gimp-tips.xml.in. The resulting XML is
completely broken, invalid, not even well-formed.

Comment 1 Kenneth Rohde Christiansen 2004-01-07 19:54:30 UTC

Please attach the generated gimp-tips.xml file, thanks.

Comment 2 Sven Neumann 2004-01-07 20:03:54 UTC

Created attachment 23083 [details]
the source file

Comment 3 Sven Neumann 2004-01-07 20:04:54 UTC

Created attachment 23084 [details]
the generated XML file

Comment 4 Brian Cameron 2004-01-07 21:59:00 UTC

I just commited a small fix that corrects the DOCTYPE handling, which
is part of the problem.  This bug caused the whole document to
appear once at the beginning of the file.

The other problem is that the XML input isn't really valid.
It isn't appropriate to place tags (such as <tt>) in a CDATA
section and expect the XML parser to preserve the position of the 
tags.  The intltool-merge program assumes that there won't be tags
in CDATA sections, and this causes the output to get messed up.

The proper way to specify tags in CDATA sections is as follows:

Instead of:

<big><tt>test1</tt>Welcome to <tt>test2</tt>The GIMP !<tt>test3</tt></big>

You should specify it as follows:

<big><![CDATA[<tt>test1</tt>Welcome to <tt>test2</tt>The GIMP
!<tt>test3</tt>]]></big>

This way the XML parser knows that the tags are actually in the
CDATA section.  If you do it this way, then it will work with
the new intltool-merge, and this is a more appropriate way to
specify what you want.

This does highlight a possible problem with intltool-merge, though.

If the user specifies a CDATA section using the <![CDATA[...]]>
syntax, the "<![CDATA[" and the "]]>" get lost in the output.
I'm not sure of a way to preserve this.  We could do a regular
expression on the CDATA section and see if it contains an "<"
or ">" character and put it back in this case, but I am not sure
that this covers all situations where it would be nice to 
preserve the "<![CDATA[" and the "]]>".

Kenneth, what do you think we should do here?

Comment 5 Kenneth Rohde Christiansen 2004-01-07 22:37:33 UTC

That is a good question. We could do the regex hack for now - that
will leave us with a little time to find a better sollution -  before
we get the first bug report. - or can we live without "<![CDATA[" and
the "]]>" in the output for now?

Comment 6 Sven Neumann 2004-01-08 14:21:30 UTC

I don't think the suggested solution is an option for us. I also
disagree that the XML input isn't valid. Any XML validator I tried so
far seems to agree with me that nesting XML elements is a perfectly
fine thing to do.

This is a major regression and I suggest you revert your XML code to
the state it had with 0.27.2. The use of the of the XML::Parser module
is also not acceptable (see #130850).

It seems that, until you revert these changes, we will have to stick
to intltool-0.27.2.

Comment 7 Brian Cameron 2004-01-08 16:04:14 UTC

Why isn't the suggested solution an option for you?  If you place
the problematic elements into "<![CDATA[...]]>" blocks then the
output is exactly what you are looking for.  This is just an XML
technique for specifying what you want, that tags in CDATA be
treated as CDATA.

Comment 8 Sven Neumann 2004-01-08 16:30:41 UTC

I don't see why I should make it that complicated to edit the tips
file. In this area (documentation) we rely on contributions from
non-coders and I don't want to ask people to use such an akward syntax
for no good reason. Also if I understand this correctly this syntax
would get promoted to the output document which would make changes to
the parser necessary.

Comment 9 Brian Cameron 2004-01-08 17:04:19 UTC

The "<![CDATA[...]]>" syntax is XML standard, so whatever parser
you are using should work fine with it - I assume that you are using
an XML parser that supports the XML standards.  I disagree that asking
people to follow the XML standard is unreasonable awkward.

This syntax is the XML standard way of specifying that you want
the enclosed text to be treated as CDATA, even if it contains
invalid CDATA characters, which include the "<" and ">" characters.
Otherwise the XML parser will obviously interpret them as nested
tags, and not a part of the CDATA section.

You could also use the &lt; and &gt; syntax to specify the "<" and
">" characters if you want to make your CDATA sections valid - 
though I suspect that you actually want the "<" and ">" characters
to be literally in your CDATA section.

Comment 10 Brian Cameron 2004-01-08 17:08:15 UTC

Also, the current intltool-merge, does not promote the "<![CDATA[...]]>"
syntax to the output file, though it probably should.  If we fix
intltool-merge to promote this to the output file, it should work
fine with whatever parsers you use to process the output file, since
this syntax is a part of the XML standard.  We could also provide
a command-line argument to suppress promoting the syntax to the output
file if there are people who are processing the file using XML
parsers that somehow do not support this feature.

Comment 11 Sven Neumann 2004-01-08 19:06:49 UTC

Well, I doubt GMarkupParser will eat this. Sure, it isn't a real XML
parser with all bells and whistles but it used to work reasonably well
for this job and I wouldn't want to introduce a dependency on libxml2
just for this.

Comment 12 Sven Neumann 2004-01-08 19:13:11 UTC

IMHO intltool shouldn't rely too heavily on advanced XML features and
it should also not attempt to enforce all details of the XML specification
or to interpret any DTDs used in the input. Well-formedness of the
input should be good enough to make intltool do it's job. Perhaps it
could at least offer to be run in such a mode.

Comment 13 Sven Neumann 2004-01-08 19:53:03 UTC

Let me put it this way: To our parser the formatting tags are not part
of the CDATA, they are nested tags and so the XML is perfectly
standard-conformant. But I want intltool to treat any nested tags in
translatable elements as CDATA and pass it to the translators. This
used to work and if I remember correctly there also used to be test
cases for this in intltool. We relied on this feature being available
and it seems like a regression to remove it. We have been faced with
this change without prior warning. This is IMHO highly unreasonable.
Please consider to add a mode that allows intltool to be used like this. 

I am willing to consider a rewrite of our parser. It could actually
become simpler if it treated the markup as part of CDATA. But we
cannot do this change right now because we are very close to a stable
release. So, my options at the moment are begging for a
backward-compat mode in intltool or requiring intltool <= 0.27-2 for
GIMP-2.0.

Comment 14 Kenneth Rohde Christiansen 2004-01-08 19:58:27 UTC

It should be able to add the old mode in with an option. What do you
think about this Brian? As a temporary sollution at least?

Comment 15 Brian Cameron 2004-01-08 20:20:12 UTC

Sven, could you test to see if GMarkupParser handles the
"<![CDATA[...]]>" syntax?  If so, then that is probably the best approach.

intltool-merge currently does not retain the "<![CDATA[...]]>"
in the output, so if you use this syntax, then it would work
exactly like you want.  So there shouldn't be a problem.

If we add a feature in the future to retain the "<![CDATA[...]]>"
in the output, we can add a command-line argument to disable
this feature, so that it works for people in your situation.

Adding back the old logic so that it can be used with an argument
is also an option, though it won't work for translating entities,
obviously (the feature that required adding an XML parser in the
first place).

Comment 16 Denis Barbier 2004-01-11 13:17:30 UTC

Brian, I agree with Sven and do not understand why you are insisting on
having <![CDATA[...]]> where it is not necessary.  Input and output
files are valid XML.
Current tests/results/merge7.xml (which is the gimp-tips.xml file Sven 
is trying to generate) is fully broken (not valid XML) in CVS, so this
test must be fixed.
The patch below is a preliminary fix: since you want <![CDATA[...]]>
being added to make the XML parser happy, this is done automatically;
I also had to fix nested tags so that <_foo>blah <bar>quux</bar> baz</_foo>
is not messed up, but parseTree seems to have been patched and patched,
and I find its logic hard to follow, so this patch works but parseTree
could certainly be improved.

Comment 17 Denis Barbier 2004-01-11 13:19:11 UTC

Created attachment 23227 [details] [review]
Fix tags in translatable text

Comment 18 Kenneth Rohde Christiansen 2004-01-11 13:39:45 UTC

I will let Brian comment on the patch, but maybe you should add a
comment in the code why you add the <![CDATA[...]]>, maybe refering to
this particulair bug?

Also could you add the gimp file as a new test?

Kenneth

Comment 19 Denis Barbier 2004-01-11 14:18:35 UTC

You are right, I complained that parseTree is hard to understand and
did not comment my changes :(
The gimp file is already part of intltool tests, it is results/merge7.xml.

Comment 20 Kenneth Rohde Christiansen 2004-01-11 14:28:08 UTC

hmm, but make distcheck passed for me when I did the last release - weird.

Comment 21 Denis Barbier 2004-01-11 23:18:58 UTC

Edit tests/results/merge7.xml and search for <tt>, you will see that
this file is fully broken.  So intltool-merge is broken but outputs
the "expected" result, which is why "make check" does not fail.

Comment 22 Kenneth Rohde Christiansen 2004-01-29 12:29:43 UTC

Brian, what do you think about Denis' patch? Wouldn't it be okay to
check this in?

Comment 23 Brian Cameron 2004-01-29 19:26:32 UTC

Apologies for taking so long, I missed the fact that you were wanting
me to review the patch.  I think the idea of making all CDATA sections
have the 

I have a few issues with the patch, though:

+ I think the regular expression used to find tags will not
  work properly.  It doesn't seem to handle tags like this:

  <tag attribute=val />

  Such tags don't have closing tags.  However, it looks like the
  regular expression used in the patch would try to place an
  incorrect CDATA section after it.  The patch should probably
  be updated to handle this situation.  It would probably be 
  better to use a regular expression like this:

  s|^(.*?)([ \t]*)<\s*_($w+)\s*>(.*?)<\s*/_\3\s*>([
\t]*\n)?|$1$2$3<![CDATA[$4]]</$3>|s

  That is very similar to the logic used in the old intltool-merge
  to find tags and replace them.  Since we know that used to work,
  we should probably stick with something like that.  Check out
  intltool-merge with a date of 2003/09/15 and look at line 674
  to see the original regular expression that worked that you
  should base yours on.
 
+ I don't really understand the 1st part of the patch where if
  $temp_tag is not a word, the data is simply printed out.  In
  what situation is a tag not a word?   There should be a comment
  explaing why this is needed.

Comment 24 Brian Cameron 2004-01-29 19:27:31 UTC

Oops, I meant to say that I think the idea of making all CDATA sections
have the CDATA syntax is a good one.  I just think the patch needs
a bit of work in order to function properly.

Comment 25 Sven Neumann 2004-02-09 12:43:12 UTC

*** Bug 133879 has been marked as a duplicate of this bug. ***

Comment 26 Denis Barbier 2004-02-17 21:32:34 UTC

[Brian Cameron]
] I think the regular expression used to find tags will not
] work properly.  It doesn't seem to handle tags like this:
]
]   <tag attribute=val />

No, only translatable tags are escaped, and <_tag attribute=val />
does not make sense (at least for me).

] I don't really understand the 1st part of the patch where if
] $temp_tag is not a word, the data is simply printed out.  In
] what situation is a tag not a word?   There should be a comment
] explaing why this is needed.

I do not remember exactly, this was needed to deal with nested tags.

Anyway, could someone have a look at tests/results/merge7.xml and
fix this file?  Search for the first occurence of "xcf", this tip
is badly broken.
If this test was fixed, it would ease our discussions because current
failures would become obvious, and we could check how to best fix this
bug.

Comment 27 Sven Neumann 2004-02-17 22:21:31 UTC

In my tree, I had to revert the changes to this file from current
revision 1.8 to back to revision 1.3. None of the versions in between
were well-formed XML.

Perhaps you should think about adding a test for well-formedness of
the test results.

Comment 28 Brian Cameron 2004-02-17 22:31:42 UTC

I'm slightly confused.  I believe that the problem with the <tt> tag
is addressed by adding the CDATA block.  Isn't it just a matter of
reviewing this patch and getting it absolutely correct to fix this
problem?

Looking at the specific Regular Expression in your patch, you are
using the following:

+    $content =~ s{(<_[^>]*>)}{$1<![CDATA[}g;
+    $content =~ s{(</_[^>]*>)}{]]>$1}g;

I guess what I don't like about the above technique is that if there,
for any reason were a tag like this "<_tag attribute="val" />", that
your script would put a "CDATA" section following this tag where it is
inappropriate.  I think you are assuming that if any tag starts with
"_" that it is a valid tag to be translated.  I just think that it is
better to avoid this assumption.

By breaking the RegularExpression into two lines, you aren't really
making sure that there there are matching start/close tags.  In my
suggestion, it does the RegularExpression in one line and therefore
ensures that this sort of problem could never happen.

I think that the $temp_tag logic should be removed from the patch if
nobody is sure what value it adds to the logic.  I would like the
following before accepting this part of the patch:

1) An description of the specific problem that it solves.
2) Comments in the code explaining why the logic is necessary.
3) A test case added to one of the "make check" test cases that
   ensures that the fix is exercised.

Comment 29 Denis Barbier 2004-02-18 11:08:52 UTC

Description of the problem: tests/results/merge7.xml is broken.

There is no need to add a new test case, this one must be fixed,
then patches against intltool-merge to produce the same output
can be evaluated.

Comment 30 Sven Neumann 2004-02-18 12:48:43 UTC

Solution to the outlined problem:

cvs diff -r 1.3 tests/results/merge7.xml | patch -p0 -R

Comment 31 Sven Neumann 2004-02-23 17:17:59 UTC


*** This bug has been marked as a duplicate of 135212 ***

Comment 32 Sven Neumann 2004-02-23 17:18:51 UTC

Sorry for this mistake. The bug I marked as duplicate is completely
unrelated. I hit the wrong button.

Comment 33 Danilo Segan 2004-03-03 14:10:29 UTC

Denis pointed me out at this bug when I sent report about this to
inlttool@freedesktop.org.  I'll resend my comments here which explain
a different patch I made for this.

------ cut here --- this is from my email ---

This happens because detecting if element is CDATA is broken in
parseTree.  I had to read XML:Parse man page to discover that $refs
in parseTree takes the form of the following:

  [ "first", [ {}, 
               "_test", [ {}, 
                          0, 
                          "this ", 
                          "b", [ {}, 0, "doesn't"],
                          0,
                          " work" ] ] ]

for a document like:

<first>
  <_test>this <b>doesn't</b> work</_test>
</first>

(to see how it (mis)behaves, put this into t.xml.in and run
"intltool-merge -u -x . t.xml.in t.xml")

The main point here is that $not_cdata should be 1 only if the
previous element in the $refs list was 0 (zero).  The patch I'm
attaching fixes this, but I also had to change space compression to
preserve leading and trailing spaces (otherwise, the above example
would be rendered as "<test>this<b>doesn't</b>work</test>").  This
whitespace is not used when looking up translation, so it doesn't
affect compatibility.

I still don't like it that intltool-merge doesn't preserve idea of
lines ("line is an array of characters terminated with LF") -- it
strips the final LF, so we get unfinished line in the resulting file.
Unfortunately, I don't have time to look into this right now.

Here's the ChangeLog entry:

2004-03-03  Danilo Šegan  <dsegan@gmx.net>

	* intltool-merge.in.in (parseTree): Drop incorrect CDATA section
	detection.  Set $cdata_follows = 1 if $sub == 0, and set
	$not_cdata for each node as negation of $cdata_follows.
	When handling CDATA, also keep leading and trailing whitespace.

        This fixes behaviour with nested tags and CDATA.

I hope I haven't misunderstood the code, but if I have, you're
encouraged to correct me.  I don't want to provide complete test
cases yet, because we already have test cases which are incorrect, so
lets not build on that :)

Comment 34 Danilo Segan 2004-03-03 14:17:46 UTC

Created attachment 25096 [details] [review]
Correct detection of CDATA

Comment 35 Danilo Segan 2004-03-03 14:19:12 UTC

Created attachment 25097 [details]
Correct tests/results/merge7.xml -- please check

Comment 36 Danilo Segan 2004-03-03 14:25:46 UTC

Sven,
Could you please try this patch with latest intltool?  It seems to
work fine for me with Gimp CVS checkout of February 18th.

Brian, you've got any comments on the patch, or is it ok to commit it?

If no one objects merge7.xml I sent, I'll commit it tommorow so we at
least have correct tests, if not correct intltool.

(All tests except for .schemas one fail, and that is a separate issue
-- it failed since February 2nd, I'll post about it on intltool@fd.o)

Comment 37 Sven Neumann 2004-03-03 14:31:39 UTC

If it works with Gimp CVS checkout of February 18th, it should also
work with a current checkout. We are close to a release and don't do
any changes to the tips file.

Comment 38 Brian Cameron 2004-03-03 20:12:58 UTC

Yes, the patch looks good to me.  Though, weren't we removing the code
to compress spaces in CDATA?  If so, the patch might need a little
tweaking to reflect this.  The leading/trailing spaces might no longer
be necessary.  Is this correct?

Comment 39 Danilo Segan 2004-03-04 00:03:27 UTC

Yes, you're right -- this needs to be reworked a bit. OTOH, I'm not
sure the current style of entire parseTree would allow that to work well.

I'm not entirely sure about this patch anymore.  I'll have to consider
it a bit more.

Comment 40 Danilo Segan 2004-03-10 01:54:56 UTC

Created attachment 25426 [details] [review]
Big patch: completely reworked XML merging

Comment 41 Danilo Segan 2004-03-10 02:16:02 UTC

Comments for the patch:
- I've completely dropped giant parseTree function, it's partially
reused in other functions
- function getAttributesString is almost completely taken from
parseTree, and simply provides a string containing attribute values
(translated or not, depending on passed parameters)
- function getXMLstring creates a translatable string from XML node
(it actually uses second [array] member of node in XML::Parser tree,
which has attributes as a hash for the first element, and a list of
pairs of all contained nodes) -- it is completely independent of
translations, and it can be reused once intltool-extract is switched
to XML::Parser as well; it simply returns a string, with all multiple
spaces in CDATA (and only CDATA, attribute values are not affected)
converted to one space
- function traverse gets both elements of a XML node (first is element
name or 0 for cdata, other is content array like above or simple text
if name was 0); there's some similar code in getXMLstring and
traverse, but the latter strips leading and trailing spaces only when
final string is gotten, and also handles translations while traversing
the tree

I believe code is this way much simpler, shorter and more readable.
It's is also more correct, since we're actually following the
XML::Parse tree definition.

At the same time, I've removed special handling of entities (calls to
entity_decode and entity_encode), because it causes problems when used
this way eg. with nested tags.

For some more info (and reasons why we need not special-case entities,
or how to actually handle borked input from translations), read
http://pdx.freedesktop.org/pipermail/intltool/2004-March/000045.html

With this patch, I discovered that tests/results/merge5p.sheet is
incorrect as well (apart from merge7.xml -- see above message on
intltool list), which means that current xml merging doesn't work that
nice at all.

Test merge6.xml would have to be modified if this patch was commited
(because it tests just for entity_decode and entity_encode being
performed, which I removed).

This patch was tested to work fine for gimp/tips/gimp-tips.xml and
gnome-desktop/gnome-version.xml (apart from testcases).

I raised some more issues on intltool list which are not directly
related to this issue.

Can anyone please comment on the patch (I'm particularly unsure about
attribute translations)?

Comment 42 Brian Cameron 2004-03-10 15:17:05 UTC

In general it looks very good.  I'm a little concerned that you are
dropping the handling of entities.  I'm pretty sure that some people
need the special behavior.  I say this because I broke it when I wrote
the initial parseTree function and it took me quite a while to deal
with all the related bug reports.

I agree that the new logic is much more readible and clean.  I think
this patch should be committed, assuming that people are happy with
the way entities will be handled.

One minor nit, I notice that the $depth variable isn't used in 
the myParseTree function and should probably be removed.

Another nice idea that I never had time for would be to write a
comment handler.  If the comments were somehow added to the Tree
data structure, we could retain them in the output - which would be
nice.  I never quite figured out how to add such a thing to the
Tree data structure, but I suspect it is possible.

Comment 43 Danilo Segan 2004-03-10 18:17:52 UTC

Thanks for your comments Brian. I'll remove $depth from myParseTree
(I'll also rename it to parseTree during commit).

The problem with entities is that if we handle them, there's simply no
way to have nested tags in translations. This means that even things
like Pango markup "This <b>hurts</b>." wouldn't work -- it would be
transformed to "This &lt;b>hurts&lt;/b>.".

This will probably mean some bugs for those depending on this, but I
think this feature is useful enough to warrant even those bugs. OTOH,
putting entity_encode and entity_decode in appropriate places is just
changing two lines of code.

As I mentioned, I think it's better to add a check for well-formedness
of translations: we read the translation in, surround it by some bogus
tags like "<test>" and "</test>", and pass that to "parse()" function.
This means that we will stop when there's error in the input, which
quite good IMHO.

XML::Parser Tree style doesn't support comments, so we cannot handle them.

(Btw, I've already commited correct results/merge7.xml test file)

Comment 44 Brian Cameron 2004-03-10 18:35:06 UTC

It is important for intltool-merge to leave entities as they are, so
that "&lt;" in the input remains "&lt;" in the output (rather than
being transformed into "<").  

I agree with you that removing these functions does allow nested tags,
which is more useful.  However, I think we are going to break a bunch
of translations that'll need to get fixed by making the change you
suggest.  So, we might have to be careful about making that change.

It might be a good idea to leave the entity-encode and entity-decode
functions in for now, and remove them when we have time to coordinate
with other projects to make sure that all translations are properly
converted to a new format that doesn't depend on the entity-encode and
entity-decode functions.

You think?

Comment 45 Danilo Segan 2004-03-10 18:51:57 UTC

Yes, I agree.  So, I'll commit this patch (with suggested
modifications) tommorow (I'll try building entire Gnome tonight with
patched intltool and jhbuild just to check if everything is fine), and
ask Mark to make a release of 0.31.

After that, I'll post to desktop-devel-list to let Gnome developers
know that we will be breaking compatibility in one of next releases
and ask them for comments (and complaints) -- we'd not change it
before at least 2.6.0 is released.

This means that Gimp should require intltool 0.27.2 for the time
being, but it will be supported in the proposed new release, so that's
when I'll close this bug (with these changes, it should work "almost"
-- tips which contain <tt> will appear as &lt;tt> in translations, if
new intltool is used, which is not too bad).

Comment 46 Kenneth Rohde Christiansen 2004-03-13 10:57:33 UTC

Sorry for not being around, but I haven't been home (and still 
ain't) since I am trying to plan some things for next year. It seems 
you have everything under control, though. Thanks everybody! Danilo, 
it is OK if Mark does the release - you just plan that with him.

Comment 47 Kenneth Rohde Christiansen 2004-04-09 15:14:18 UTC

So Danilo. How well is your "big patch" tested?

Comment 48 Danilo Segan 2004-04-09 21:41:32 UTC

Well, I've used it to build entire Gnome soon after I made it, and I've yet to
notice any problems with it (gnome-version, gimp, etc. work correctly).  I've
also tested it with Gimp.  I'm planning on giving it one more test.  Test
"merge5p.sheet" is incorrect by itself; results/merge5p.sheet contains a line:
      <description xml:lang="fr_FR">Un m&#195;&#169;langeur &#195;&#160;
p&#195;&#162;tisserie</description>

This is incorrect, since the "&#195;&#169;" was actually supposed to be two-byte
UTF-8 sequence representing "é".  With my patch, we get UTF-8 output here, but
to get correct output in "fr_BE" translation (which is not UTF-8 itself), we'd
need to pass "-u" option as well in the test (FWIW, I think "-u" option should
be default for all XML processing, since UTF-8 header is already emitted for XML).

merge6.xml test fails because the semantics are changed (I don't do quoting
anymore, because if we do quoting, we cannot embed subtags, such as required in
Gimp, since "<" would end up as "&lt;").

Comment 49 Kenneth Rohde Christiansen 2004-04-26 08:49:39 UTC

Well, just commit that patch and correct the tests. People will complain if
there are new bugs and we can probably fix them easily.

Comment 50 Kenneth Rohde Christiansen 2004-05-30 15:34:42 UTC

Was this ever done?

Comment 51 Rodney Dawes 2004-05-30 16:39:32 UTC

I don't know if this is committed or not. It looks like "make check" is still
giving me 3 failures in the tests/ directory.

Comment 52 Danilo Segan 2004-05-31 19:43:22 UTC

New XML parsing has not been committed (I waited for at least Gnome 2.6.1 to be
out, but it turned out I ran out of spare time now). I'll commit soon, so we
could have it tested for enough time before next Gnome release is due.

As for tests, I just checked out a fresh copy from CVS, and it fails only on the
relevant test (merge7.xml, test number 3), what the "Big patch" tries to solve.

Rodney, can you describe what tests fail for you, and provide differences
between cases/xxx and results/xxx (if relevant).  Thanks.

Comment 53 Rodney Dawes 2004-05-31 20:56:12 UTC

 4. Merge translations into a multi-line XML file:                   [FAILED]
(merge7.xml)
10. Merge translations that include escaped .po strings:             [FAILED]
(merge6.xml)
18. Merge translations into multiple files:                          [FAILED]
(merge6.xml)

These are the tests that failed. I will attach the diff -u output, and a patch
to selftest.pl.in, to fix the alignment of the [FAILED]... bit for test 18.

Comment 54 Rodney Dawes 2004-05-31 20:57:38 UTC

Created attachment 28210 [details] [review]
Patch to fix text alignment in output for test 18

Comment 55 Rodney Dawes 2004-05-31 21:00:26 UTC

Created attachment 28211 [details]
Diff for merge6.xml test results

Comment 56 Rodney Dawes 2004-05-31 21:00:57 UTC

Created attachment 28212 [details]
Diff for merge7.xml test results

Comment 57 Danilo Segan 2004-06-01 09:09:21 UTC

merge7.xml is not expected to work yet, so I won't look into that.

merge6.xml test is influenced by parameter reordering resulting from XML::Parser
processing.  Try looking at line 762 of intltool-merge, it seems that you have
instead of:
            foreach my $e (reverse(keys %{ $sub })) {
something like:
            foreach my $e (sort(keys %{ $sub })) {

Your example uses sorted attributes, whereas results/merge6.xml does not.  There
may be some disparity between different intltool versions here, since we get
different results.

Does anybody else see the same output? Kenneth, can you please run "make check"
for the CVS intltool?

Comment 58 Kenneth Rohde Christiansen 2004-06-12 21:25:42 UTC

is this committed? If it is I will try running 'make check' as soon as I can
connect to cvs again (having networking problems)

Comment 59 Rodney Dawes 2004-06-13 00:41:21 UTC

I still get 3 failures with make check. I've just committed my patch to fix the
alignment of the text output for test 18.

Comment 60 Danilo Segan 2004-06-13 11:52:26 UTC

Comment on attachment 25426 [details] [review]
Big patch: completely reworked XML merging

I've finally commited The Big patch.  I'll add further comments below.

Comment 61 Danilo Segan 2004-06-13 12:07:51 UTC

Several issues left with XML merging in intltool-merge.

1. We must pass clear XML (i.e. no entity encoding/decoding when
merging/extracting messages) for things like <tt> subtag to work (like in Gimp
tips file)
2. We should probably *always* use UTF-8 when merging XML files (look at
results/merge5p.sheet, which new intltool makes obviously erroneous -- it was
wrong before, because UTF-8 sequence for eg. "à" was encoded as "&#195;&#160;")

This is all related to encoding/decoding of of XML entities.

Only tests that fail for me with new intltool-merge are:
 9. Merge translations into a dia sheet style XML file [broken XML]: [FAILED]
(merge5p.sheet)
10. Merge translations that include escaped .po strings:             [FAILED]
(merge6.xml)
18. Merge translations into multiple files:                          [FAILED]
(merge6.xml)

(All due to "&entities;".  I may have to look into multiple-file merging as well.)

I propose to do the following:
1. call intltool-merge with "-u" parameter for merge5p.sheet
2. Adjust intltool-extract not to worry too much about XML entities
3. Adjust cases/merge6.po to include entities just like they appear in
cases/merge6.xml.in (basically, what new post-step2 intltool-extract would produce)

If, OTOH, you want to keep entity en/de-coding, I'll add that back to
intltool-merge, but that means there's no way to have things like embedded tags
in translations coming out of XML files (imagine having "<_this>Blah blah
<em>booh</em>.</_this>" which would become "Blah blah
&lt;em&gt;booh&lt;/em&gt;." in PO file).

Rodney, I'll try finding a machine where failures similar to those you see occur.

Comment 62 Danilo Segan 2004-06-13 12:30:49 UTC

One more thing: once we agree on whether we want entity_encode/entity_decode,
and we opt for not wanting them (to allow for nested tags), I'll write a message
to d-d-l@g.o to let everybody know about incompatible changes in intltool (which
are already in, but I hope us to move faster now).

Another note: latest intltool CVS should be completely useable for Gimp.

Comment 63 Danilo Segan 2004-06-13 12:46:44 UTC

Created attachment 28657 [details] [review]
Make intltool-extract not en/de-code entities (with testcases fixes)

This is a simple fix for intltool-extract to work correctly with new
intltool-merge, please review quickly (it's is a must if new behaviour of
intltool-merge is kept).  All testcases (several adjusted in the patch) except
for merge5p.sheet (as above) pass here.

Comment 64 Kenneth Rohde Christiansen 2004-06-14 08:25:35 UTC

It looks fine and I agree with your reasoning to remove the entity en-/decoding.
Sorry I first reply now, but I had no internet access yesterday (dead firewalls
- again)

Kennneth

Comment 65 Danilo Segan 2004-06-14 09:19:31 UTC

Comment on attachment 28657 [details] [review]
Make intltool-extract not en/de-code entities (with testcases fixes)

Ok, committed. Should I add a NEWS entry?

Comment 66 Danilo Segan 2004-06-14 09:24:13 UTC

Marking as RESOLVED/FIXED -- please intervene if you disagree.
Should we up version number as well (so we get easier to track bug reports)?

I'll discuss other issues (such as merge5p.sheet merging) in other bugs, for
instance in bug 138512.

Comment 67 Kenneth Rohde Christiansen 2004-06-14 09:28:08 UTC

Yes, thanks! actually everytime you commit something that is news worthy then
please write it in NEWS.

Cheesr, Kenneth