Bug 116526 – intltool merge of XML attributes fails




Summary:	intltool merge of XML attributes fails


Status:	RESOLVED FIXED

Product:	intltool
Classification:	Deprecated
Component:	general
Version:	unspecified
Hardware:	Other Linux

Importance:	Normal major
Target Milestone:	---
Assigned To:	intltool maintainers
QA Contact:	intltool maintainers

URL:
Whiteboard:	AP2

Duplicates:	116529
Depends on:
Blocks:	90500 127218

Reported:	2003-07-02 12:53 UTC by bill.haneman
Modified:	2004-12-22 21:47 UTC

See Also:
GNOME target:	---
GNOME version:	2.3/2.4

Attachments
patch that fixes the bug (11.65 KB, patch) 2003-10-28 16:14 UTC, Brian Cameron

Description bill.haneman 2003-07-02 12:53:55 UTC

intltool-update finds marked-up XML attributes and extracts them for
translation, but intltool-merge doesn't handle them.  Note that there's a
FIXME in intltool source for this, so it's presumably intended to work someday.

This is a blocker for gok (gnome onscreen keyboard, a 2.4 module) localization.

Note that what you'd expect is for intltool-merge to produce multiple XML
elements, one per translation, substituting the translated attribute
strings just as it currently does with CDATA element content.

Comment 1 bill.haneman 2003-07-02 15:47:23 UTC

*** Bug 116529 has been marked as a duplicate of this bug. ***

Comment 2 Kenneth Rohde Christiansen 2003-07-04 17:16:35 UTC

I unfortunately cannot look into this for the moment as I am on 
vacation (only internet shop access). :-( I can review patches 
though, if one is produced.

Comment 3 Calum Benson 2003-08-01 13:16:54 UTC

Marking AP2 to reflect a11y team's assessment of impact (I don't think
we consider GOK i18n a release stopper, do we?)

Comment 4 Calum Benson 2003-08-07 16:15:37 UTC

Apologies for spam... marking as GNOMEVER2.3 so it appears on the official GNOME
bug list :)

Comment 5 bill.haneman 2003-09-09 13:22:06 UTC

This is still blocking GOK internationalization, thus GNOME 2.4 is not
properly localizable!

Any help is appreciated!

Comment 6 Brian Cameron 2003-09-12 19:41:54 UTC

I am currently working on fixing this bug.  However, it isn't
completely clear to me what the output should look like.  I
can imagine two ways that it could be solved:

Solution #1:

<GOK:accessmethod name="val" displayname="text"
 displayname:br="translated text for language br"
 displayname:es="translated text for language es"
 ...>

(not sure if my naming convention is right, but you get 
the idea.  Please let me know what the naming convention
should be if this is the right direction).

Solution #2:

We could have separate tags for each language, which would
look something like the technique used for translating
CDATA elements.

<GOK:accessmethod name="val" displayname="translated text" ...
xml:lang="xyz">

If Solution #1 is right, then things will be pretty easy to
implement.  If Solution #2 is right, then this will be harder
to code, and GOK will probably also require changes.  Mainly
because the GOK:accessmethod tag (which has an element that
needs to be translated) is the root node.  And my understanding
is that you can only have one root node in a properly formatted
XML file.  

Solution #2 also seems bad to me since a tag with an element
to be translated could have many internal tags, replicating
all the internal tags for each language seems like a lot of
bloat.

The current behavior of the intltool-merge is to blindly
translate the string into multiple languages.  If we go with
Solution #2 and duplicate the GOK:accessmethod tag once for
each language then intltool-merge will probably need a
lot of work to handle this. 

In other words, the current behavior would create something like
this for each language:

<GOK:accessmethod name="val" displayname="translated text" ...
xml:lang="1st">
<GOK:description>Translated text</GOK:description>
<GOK:description xml:lang="1st">translated text</GOK:description>
<GOK:description xml:lang="2nd">translated text</GOK:description>
...
</GOK:accessmethod>
<GOK:accessmethod name="val" displayname="translated text" ...
xml:lang="2nd">
<GOK:description>Translated text</GOK:description>
<GOK:description xml:lang="1st">translated text</GOK:description>
<GOK:description xml:lang="2nd">translated text</GOK:description>
...
</GOK:accessmethod>

Which seems really broken.  So, I hope Solution #1 is the right
way to go.

Comment 7 Kenneth Rohde Christiansen 2003-09-12 20:26:29 UTC

#2 is the normal XML'ish way to do this though AFAIK. CC'ing DV, maybe
he has some input

Comment 8 Daniel Veillard 2003-09-12 21:19:28 UTC

<GOK:accessmethod name="val" displayname="text"
 displayname:br="translated text for language br"
 displayname:es="translated text for language es"
 ...>


 is REALLY REALLY BAD !!!!
do NOT use colon ':' in tag names except if this denotes
a use of XML Namespace.
Second DO NOT PUT text in attributes !!! They MUST be changed 
and normalized by parser, use separate element
Third use the xml:lang tag defined in XML exactly for this purpose

  Go read the guidelines I did put up, and rework the 
syntax this really cannot work as is !
    http://xmlsoft.org/guidelines.html

  Do NOT push a solution with such a syntax, you will have 
tons of problems, guaranteed !

Daniel

Comment 9 Daniel Veillard 2003-09-12 21:25:57 UTC

Solution #2 is far far better .
It should still have displayname="translated text" as a separate
node entry instead of an attribute, but that looks acceptable.
But really #1 must not go though,

Daniel

Comment 10 Brian Cameron 2003-09-12 22:28:44 UTC

So if solution #2 is better, then how do you deal with this
situation:

<outer-tag _element="text_to_translate">
   <_inside-tag>text</_inside-tag>
</outer-tag>

Assuming two languages "es" and "fr", would the output look like:

<outer-tag _element="translated-text" xml:lang="es">
   <inside-tag>translated-text</inside-tag>
</outer-tag>
<outer-tag _element="translated-text" xml:lang="fr">
   <inside-tag>translated-text</inside-tag>
</outer-tag>

or would it look like something else?

Comment 11 Daniel Veillard 2003-09-13 00:37:52 UTC

I don't know the semantic you apply to the tags,
so I can't judge on dealing with
----
<outer-tag _element="text_to_translate">
   <_inside-tag>text</_inside-tag>
</outer-tag>
----

  Putting the "text_to_translate" in an attribute is a problem
Read the spec especially this part
  http://www.w3.org/TR/REC-xml#AVNormalize
if you store text in attributes, it WILL be modified by the
parser *before* returning it to the application. So don't do 
that.
Put text in element content.
If you want to maintain parallel translated strings
keep sibling elements with a distinct xml:lang tag
indicating the language for each piece of text.

Assuming that _element="translated-text" is a indication of
result and not the text to translate (your example is quite
confusing) then yes

<outer-tag _element="translated-text" xml:lang="es">
   <inside-tag>translated-text</inside-tag>
</outer-tag>
<outer-tag _element="translated-text" xml:lang="fr">
   <inside-tag>translated-text</inside-tag>
</outer-tag>

  seems appropriate

Daniel
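
The attribute-value normalization referred to above can be observed with
any conforming parser.  A minimal sketch, assuming Python's
xml.etree.ElementTree (expat-based) rather than anything intltool itself
uses; the element name "msg" and attribute name "label" are made up for
illustration:

```python
import xml.etree.ElementTree as ET

# The same two-line string is stored once in an attribute and once as
# element content ("\n" here is a literal newline in the XML source).
doc = ET.fromstring('<msg label="line one\nline two">line one\nline two</msg>')

attr_value = doc.get("label")   # normalized by the parser
text_value = doc.text           # passed through unchanged

# A newline written as a character reference is not normalized away.
escaped = ET.fromstring('<msg label="line one&#10;line two"/>').get("label")

print(repr(attr_value))
print(repr(text_value))
print(repr(escaped))
```

A literal newline inside the attribute reaches the application as a
space, while the same newline in element content is preserved.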

Comment 12 bill.haneman 2003-09-15 15:26:21 UTC

DV: text does need to be stored in attributes in the file format Brian
refers to.  Perhaps not ideal, but that's the way it works.

Brian: for each xml:lang, we expect only one localized copy of each
element.  This means that if an element contains both translatable
CDATA and translatable string-attribute values, both are contained in
one element.

If we ignore the issue of nested elements (i.e. just process them
in-place, using a one-pass expansion) then I think the result will be
fine.  We certainly don't want to end up with a geometric expansion of
elements, but I don't see that as a problem.  I do not think the
structure of the XML document will change except for duplication of
elements; the copying/branching may occur anywhere from the root node
(I know, we need to make sure the root node doesn't get translated)
onwards.
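
The one-pass expansion described above can be sketched as follows.  This
is only an illustration in Python using xml.etree.ElementTree, not
intltool-merge's code (which is Perl).  The function names "localize" and
"merge", the sample document, and the catalogs are hypothetical, and
strings without a translation are simply left in their original form:

```python
import copy
import xml.etree.ElementTree as ET

XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

def localize(elem, catalog, lang):
    """Copy elem, translating '_'-marked tags and attributes via catalog."""
    out = copy.deepcopy(elem)
    for node in out.iter():
        if node.tag.startswith("_"):
            node.tag = node.tag[1:]                    # <_desc> becomes <desc>
            if node.text is not None:
                node.text = catalog.get(node.text, node.text)
        for name in list(node.attrib):
            if name.startswith("_"):
                value = node.attrib.pop(name)          # _label becomes label
                node.set(name[1:], catalog.get(value, value))
    if lang is not None:
        out.set(XML_LANG, lang)
    return out

def merge(parent, catalogs):
    """Replace each child with one untranslated copy plus one copy per language."""
    expanded = []
    for child in list(parent):
        expanded.append(localize(child, {}, None))
        for lang in sorted(catalogs):
            expanded.append(localize(child, catalogs[lang], lang))
        parent.remove(child)
    parent.extend(expanded)

root = ET.fromstring(
    '<doc><item _label="Hello"><_desc>World</_desc><icon>x.png</icon></item></doc>'
)
merge(root, {"es": {"Hello": "Hola"},
             "fr": {"Hello": "Bonjour", "World": "Monde"}})
print(ET.tostring(root, encoding="unicode"))
```

Each element is copied once per language and nested elements are handled
in place inside the copy, so the expansion stays linear in the number of
languages.  The root element itself is never duplicated.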

Comment 13 Daniel Veillard 2003-09-15 15:51:22 UTC

> DV: text does need to be stored in attributes in the file format Brian
> refers to.  Perhaps not ideal, but that's the way it works.

 NO this CANNOT work that way. Any carriage return or line feed MUST
be destroyed by the parser on input. Now explain why you expect
that to be a workable solution, it ain't so !
 I'm not gonna let this point go. It's a fundamental flaw, if
it need a redesign then redesign will have to be done.

Daniel

Comment 14 Daniel Veillard 2003-09-15 16:02:53 UTC

Apparently the current version of intltool uses a non
conformant XML parser, the whole chain will break all the
I18N data as soon as that parser is fixed to follow the
XML standard. You must fix the format or be ready to have 
a huge fuckup as soon as the non-conformant behaviour is
fixed or the toolchain need to evolve to a new parser.

  This is critically broken and should be fixed ASAP.
Honestly I really can't understand that such a fragile
design had not been reviewed before !

Daniel

Comment 15 bill.haneman 2003-09-15 16:32:53 UTC

DV:
I don't know what you are talking about regarding a fragile design,
etc.  Though I can take no credit or blame for the XML format.

The translatable attributes do not contain newlines.  I re-read the
XML spec and believe the attributes will survive normalization fine,
as-used.  The only whitespace these attributes are allowed to contain
are the 'space' character (0x20), which seems to be fine in attributes.

intltool doesn't parse the XML at the moment (unless you count
regular-expression matching as "parsing").

Comment 16 bill.haneman 2003-09-15 16:35:35 UTC

I agree that it would be reasonable for intltool's XML processing to
use a conformant XML parser, by the way.

Comment 17 Daniel Veillard 2003-09-15 16:47:38 UTC

Okay, it's very hard from the scope of this bug report to
assert the limits of the design. If you're sure no newline
will ever be needed then keeping the input string in an 
attribute should not generate serious trouble. But I'm 
surprised by the guarantee you can provide about newlines never
being needed, unless this is very specific code this is dangerous
anyway. 
For the result, using different elements with xml:lang
is the way to go, no doubt about it though.
Concerning the use of a real XML parser, well it will guarantee
that the strings will be contained within the Unicode ranges
allowed by the XML spec upfront and avoid silent corruption
of the data from an XML viewpoint.
    http://www.w3.org/TR/REC-xml#NT-Char

Daniel
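
The up-front character check mentioned above can be illustrated with a
small sketch.  It assumes Python's xml.etree.ElementTree as the
conforming parser, and the helper name "is_well_formed" is hypothetical:

```python
import xml.etree.ElementTree as ET

def is_well_formed(text):
    """Return True if a conforming parser accepts the document."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

# U+00E9 is inside the Char production; U+0007 (BEL) is not allowed in XML 1.0.
accepted = is_well_formed("<s>caf\u00e9</s>")
rejected = is_well_formed("<s>bell\x07</s>")

print(accepted, rejected)
```

A string containing a forbidden control character is rejected at parse
time instead of being copied silently into the merged output.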

Comment 18 David Bolter 2003-09-24 18:27:19 UTC

Is anyone working on a fix for this?  Or is this still in debate?

Comment 19 bill.haneman 2003-09-25 14:17:58 UTC

Fixing this will require substantial changes to intltool-merge (i.e.
real parsing of the XML instead of just string-substitution).  As
such, it's a big job and we don't know who will be available to do it.

Comment 20 Brian Cameron 2003-09-25 15:38:46 UTC

I am currently working on fixing this.  As Bill mentions, making
the tool use an XML parser is a bit of work, so it will likely
take me some time.

Comment 21 Brian Cameron 2003-10-28 16:14:15 UTC

Created attachment 21010
patch that fixes the bug

Comment 22 Brian Cameron 2003-10-28 17:00:02 UTC

The attached patch fixes this bug.  intltool-merge now uses the
XML::Parser Perl module to parse the file and produce the appropriate
output.  This obviously means that XML::Parser is now a dependency of
the script, which seems appropriate.

I believe I am producing the output appropriately, but would
like someone from the intltool team to review and make sure that 
the patch doesn't need any tweaks.

Note that there are three side effects of using the XML parser:

* intltool-merge will now be stricter about requiring well-formed
  input.  The previous logic didn't notice badly formed XML.

* Comments are not retained by the XML parser, so they are not
  included in the output.  I suppose post-processing logic could
  be added to re-insert them in the appropriate places, though
  this doesn't seem to be worth the effort.  I assume that having
  comments only in the xml.in files is sufficient.

* Using an XML parser means that the white space placed in the
  output file is slightly different than in the input file.  
  Shouldn't be a problem, but I'm just mentioning this fact.
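
The first two side effects are typical of parser-based processing and can
be reproduced with a short sketch.  It uses Python's
xml.etree.ElementTree as a stand-in for XML::Parser, so the details are
an assumption about general parser behaviour, not a description of the
patch; the sample documents are made up:

```python
import xml.etree.ElementTree as ET

# 1. Malformed input is now an error instead of being passed through.
try:
    ET.fromstring('<outer a="unterminated><inner/></outer>')
    rejected_malformed = False
except ET.ParseError:
    rejected_malformed = True

# 2. Comments are not kept by the default tree builder, so they are
#    missing when the tree is written back out.
source = "<tag><!-- translator note --><child>value</child></tag>"
output = ET.tostring(ET.fromstring(source), encoding="unicode")

print(rejected_malformed)
print(output)
```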

Comment 23 Kenneth Rohde Christiansen 2003-10-29 00:26:01 UTC

From a first look it looks pretty good. I would like to see some tests
though. Does it still pass the "make distcheck"? Could you add
additional tests to make sure that everything is working as it should?
The requirement seems appropriate to me, though we ought to ask on
gnome-desktop-devel so every one has been heard. Can you send them a
mail? You should probably ask about the comment issue as well - maybe
someone has a different opinion

Comment 24 Brian Cameron 2003-10-29 22:29:33 UTC

Okay, I ran "make distcheck" and some of the tests fail, but I think
the failures are acceptable.  The failures fall into the following
categories:

1. white space differences.  This includes some differences in
   leading white space.  Also, the new intltool-merge changes
   tags that look like:

   <tag>value</tag>

   to:

   <tag>
     value
   </tag>

   Since the XML parser doesn't keep track of the white spacing
   of the original document, it isn't really possible to keep it
   the same.

2. Comments are missing from the output files, as expected.

3. Perhaps the only serious difference is that the old script 
   did not output the translation for a given language if the
   translated text does not exist.  The new script prints out
   the default (English) text when the translated text does not
   exist.  In other words, you now see things like:

   <bar xml:lang="az">
      This is another ' test
   </bar>

   Where before the block for lang="az" would just be left out
   since there really isn't a translation.  I think that this
   is necessary because you can have situations as follows:

   <foo element="translated_text" xml:lang="az">
      <bar>
          This is not translated text
      </bar>
   </foo>

   In other words, the element for foo does have a translation
   but the text for bar does not.  

   I guess if we want different behavior in the intltool-merge script,
   I'll need to know more clearly exactly how you want it to work.

Comment 25 Kenneth Rohde Christiansen 2003-10-29 22:43:39 UTC

no. 1 and 2 seem okay with me - but then the tests need to be updated
so 'make dist' will still work. Ie. new result files should be made.

I don't know about no. 3. I think it would be better if we don't write
anything to the xml file if no translation exist. Otherwise we will
end up distributing (in rpm form etc) larger files than necessary.
Even if a translator just commits a LANG.po file with just one
translated string, it will increase the size of the xml files substantially.

Comment 26 Brian Cameron 2003-10-30 03:04:38 UTC

okay, i suppose it is understandable to not translate tags with cdata
that do not have translations.  For example

  <_tag>text to translate</_tag>

However, what do we do with elements that are not translated?  For
example:

  <tag1 _element="element text to translate">
     <_tag2>cdata text to translate</_tag2>
     <tag3>just another tag</tag3>
  </tag1>

In this situation, it is straightforward if we have translations for
both strings (then we simply create the appropriate translation), or
if we do not have translations for either (then we don't create the
whole block).

But what seems a bit more problematic are the other situations.  For
example, what do we do if there is a translation for tag1 but not
tag2.  What do we do if there is a translation for tag2 but not
tag1?  Please be specific, examples of the output you would expect 
would be useful.  I'm happy to code it any way that makes sense.

I just thought that putting in the default (English) translation was
a reasonable way to handle this situation that wouldn't have so
much exception coding to handle the above problematic situations.

Comment 27 bill.haneman 2003-10-30 12:14:21 UTC

Brian and Kenneth:

In the case of tags that don't have translations, since we are also
using the 'output multiple files' option, it seems to me that we need
to output tags in all cases.  It's not clear to me that the
multiple-output file would contain all the necessary tags (whether
they had been translated or not) unless we output tags even when
translations are not available.  Perhaps I am misunderstanding what
the output now looks like.

Also I believe that in the case of tags that are children of
translated nodes (I think this is the case Brian brought up), we must
output all the child tags.  Otherwise, if our XML client has
successfully located the parent node for the correct xml:lang, it will
not be able to find all the relevant children.  The alternative would
be to implement the heuristics in the client, i.e.

1) find translated parent _and_ C-locale-parent
2) if specified child isn't found in translated parent, read it from
the C-locale parent node

I think this sort of heuristic is likely to break for GOK, which in
some cases may wish to parse XML files whose topology really _does_
depend on the locale, but which in most cases expects
locale-independent topology.  This also would (it seems to me) break
the multiple-output case, unless we include the C locale elements in
all of the multiple-output files as well.  That near doubling of the
length of the output files for multiple output would offset any size
savings due to omitting untranslated tags, I think.
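
The client-side heuristic in steps 1) and 2) above could look roughly
like the sketch below.  It is written in Python with
xml.etree.ElementTree purely for illustration; GOK is not written this
way, and the function names "find_by_lang" and "lookup_child_text" and
the sample document are hypothetical:

```python
import xml.etree.ElementTree as ET

XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

def find_by_lang(root, tag, lang):
    """Return the first child `tag` whose xml:lang equals lang (None = C locale)."""
    for child in root.findall(tag):
        if child.get(XML_LANG) == lang:
            return child
    return None

def lookup_child_text(root, parent_tag, child_tag, lang):
    """Read child_tag from the translated parent, else from the C-locale parent."""
    translated = find_by_lang(root, parent_tag, lang)
    c_parent = find_by_lang(root, parent_tag, None)
    for candidate in (translated, c_parent):
        if candidate is not None:
            child = candidate.find(child_tag)
            if child is not None:
                return child.text
    return None

# The "es" copy omits the untranslated <desc> child.
doc = ET.fromstring(
    '<doc>'
    '<item label="Hello"><desc>World</desc></item>'
    '<item label="Hola" xml:lang="es"/>'
    '</doc>'
)
es_desc = lookup_child_text(doc, "item", "desc", "es")
fr_desc = lookup_child_text(doc, "item", "desc", "fr")
print(es_desc, fr_desc)
```

The lookup only works when the C-locale parent is present in the same
file and has the same shape as the translated one, which is the
limitation described above for locale-dependent topology and for the
multiple-output case.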

Comment 28 Brian Cameron 2003-11-04 21:34:32 UTC

Kenneth: 

I am waiting to hear your feedback on this issue.  I would like to
finish up this bug, but can not until I understand how to address
the issues I mentioned in my last update to this bug.

Comment 29 Kenneth Rohde Christiansen 2003-11-04 22:46:50 UTC

I think we should just output all tags then - for now anyway. Feel
free to check in when you have made sure that 'make distcheck' still
works. (add new result files etc)

Kenneth

Comment 30 bill.haneman 2003-11-05 00:19:25 UTC

Kenneth/Brian:  While we are at it, can we add an intltool rule for
INTLTOOL_KBD_RULE ?  I can attach a small patch here (which applies on
top of Brian's recent patch), or put it in RFE bug #126237.

Comment 31 Brian Cameron 2003-11-05 17:49:17 UTC

Patch is committed.  I updated the data in the test subdirectory so
that "make distcheck" passes. 

I don't have permission to change this bug's status, though.

Comment 32 Brian Cameron 2003-11-05 17:49:58 UTC

oops, yes i do.  updating status of the bug.

Comment 33 Daniel Veillard 2003-11-26 13:29:17 UTC

>1. white space differences.  This includes some differences in
>   leading white space.  Also, the new intltool-merge changes
>   tags that look like:
>
>   <tag>value</tag>
>
>   to:
>
>   <tag>
>     value
>   </tag>
>
>   Since the XML parser doesn't keep track of the white spacing
>   of the original document, it isn't really possible to keep it
>   the same.

  Is this problem still present ? If yes then the bug is not FIXED
that's a very serious problem. White space is significant, you cannot
ignore them or add them within tag values, I hope this is fixed and
wait for confirmation

Daniel

Comment 34 david.hawthorne 2003-11-26 14:20:28 UTC

Daniel:

Unless the XML specifies xml:space = preserve, this isn't really a
bug, is it? 

I don't think there is any guarantee in intltool that leading and
trailing whitespace in localized output are preserved.  Shouldn't we
just document this?

Comment 35 bill.haneman 2003-11-26 15:01:23 UTC

oops, cookie problem.  The above annotation should have come from me,
not david H.  sorry.

Comment 36 Kenneth Rohde Christiansen 2003-11-26 15:14:40 UTC

I am trying to use Brian's OrigTree module right now - but I am having
some trouble getting it to work. Btw, Bill test no. 18 is still
broken since your last commit

Comment 37 Kenneth Rohde Christiansen 2003-11-26 15:20:39 UTC

OrigTree module is committed. Changing ->Tree to ->OrigTree in
intltool-merge.in.in should activate it, but apparently something is
going wrong - any help appreciated.

Comment 38 bill.haneman 2003-11-26 16:09:52 UTC

Kenneth thanks for the reminder.

Comment 39 Daniel Veillard 2003-11-26 17:08:05 UTC

Yes it is a bug even if space=preserve is not defined.
First:
  - space="preserve" is the *default* behaviour
  - second space="preserve" only addresses text nodes made
    only of characters from the S production, i.e. 
    spaces
what this is doing is *modifying* text nodes. And that, as
I said and repeated, is definitely not conformant.

turning <a><b/></a> into
<a>
  <b/>
</a>
should not be done by default but is okay from purely a 
formatting perspective
turning <a><b>text</b></a> into
<a>
  <b>text</b>
</a>
is no worse, it's the same: only empty or blank nodes are changed
or added
but turning <a><b>text</b></a> into
<a>
  <b>
    text
  </b>
</a>
is a clear violation. It changes the text node from 
containing a non-blank string "text" into another non
blank string "
    text
  "
and that is a heresy from an XML processing viewpoint.

Daniel
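
The distinction drawn above between adding blank nodes and changing a
non-blank text node can be checked directly.  A minimal sketch, assuming
Python's xml.etree.ElementTree as the parser; the variable names are
made up:

```python
import xml.etree.ElementTree as ET

compact = ET.fromstring("<a><b>text</b></a>")
indented = ET.fromstring("<a>\n  <b>text</b>\n</a>")
rewrapped = ET.fromstring("<a>\n  <b>\n    text\n  </b>\n</a>")

# Indenting around <b> only adds a blank text node inside <a>;
# the content of <b> is untouched.
same_content = compact.find("b").text == indented.find("b").text
added_blank = indented.text

# Re-wrapping the content of <b> changes the string an application reads.
changed_content = rewrapped.find("b").text

print(same_content)
print(repr(added_blank))
print(repr(changed_content))
```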

Comment 40 bill.haneman 2003-11-26 17:19:06 UTC

ok, that makes sense.  Brian, can we put a fix on the queue? thanks.

Comment 41 Brian Cameron 2003-12-02 18:12:37 UTC

I believe the problem that you describe has been fixed in CVS
head.  Refer to bug #127250.  Please verify.

Comment 42 Kenneth Rohde Christiansen 2003-12-02 18:40:00 UTC

Just wanted to tell that OrigTree is committed and is now the default
XML::Parser::Style.

Kenneth

Comment 43 bill.haneman 2003-12-03 11:27:25 UTC

problem is fixed in 127250