Bug 169197 – Docbook mode: xml2po extracts markup

Summary:	Docbook mode: xml2po extracts markup


Status:	RESOLVED FIXED

Product:	gnome-doc-utils
Classification:	Deprecated
Component:	xml2po
Version:	CVS HEAD
Hardware:	Other All

Importance:	Normal normal
Target Milestone:	---
Assigned To:	Danilo Segan
QA Contact:	Danilo Segan

URL:
Whiteboard:

Depends on:
Blocks:

Reported:	2005-03-04 14:16 UTC by Christian Rose
Modified:	2019-03-25 23:15 UTC

See Also:
GNOME target:	---
GNOME version:	---

Description Christian Rose 2005-03-04 14:16:46 UTC

#: ../rnutilities.xml:17 (para)
msgid ""
"Many people will be glad to see that it no longer suggests saving changes "
"when all changes have actually been undone. And in addition it even starts "
"faster than previous versions. <figure id=\"rnscreenshot-gedit\"><title>Text "
"Editor</title><screenshot><mediaobject><imageobject><imagedata fileref="
"\"&urlfiguresbase;figure-gedit.png\" format=\"PNG\"/></"
"imageobject><textobject><phrase>The text editor, highlighting the current "
"line.</phrase></textobject></mediaobject></screenshot></figure>"


The above message doesn't look quite right. It seems that the texts that should
have been extracted are actually:

msgid ""
"Many people will be glad to see that it no longer suggests saving changes "
"when all changes have actually been undone. And in addition it even starts "
"faster than previous versions."

msgid "Text Editor"

msgid "The text editor, highlighting the current line."

Everything else is just markup that doesn't seem to have any relevance at all
for translation.

Comment 1 Danilo Segan 2005-03-04 16:51:16 UTC

It's not that easy, because you're missing an important point here.

To illustrate, what should happen if the paragraph was actually:

<para>If you take a look at the lower right of
<figure>...</figure>
you can see that there's something interesting there.</para>

Or, if it had two figures which are not close in the paragraph?

OTOH, this is easy to fix in xml2po (it's about tuning DocBook mode), we just
need to set <figure>, <phrase> and <title> as "final" tags, and a couple of
others as "ignored" tags. This is actually what was done for many other tags
(e.g. see how embedded lists in paragraphs, or footnotes, are handled).  The
problem is that DocBook is very extensive, and it's hard to cover it fully (we
need to be careful at the same time, see below for problems with <phrase>).

Basically, if it's not clear, we need someone to go through the entire DocBook
standard and note which tags should be final, and which shouldn't.  For
instance, <phrase> is not a good candidate here, simply because it can appear
even in cases like "<para>This is a <phrase>short phrase</phrase>.</para>", where
we want it as a single message.  To get the best of such cases, we want our
documentation writers to standardize on a subset of DocBook instead.  For
instance, I see no reason not to use <para> instead of <phrase> in the message
above.

FYI, this *cannot* be determined automatically, because cases such as
"<para>This <em>is</em> some text.</para>" are even more common (and
technically, completely the same, unless we get some NLP in, which is not going
to happen, at least not soon :), and we *must* have this as single message "This
<em>is</em> some text."
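The point about automatic detection can be illustrated with a small sketch. It is not xml2po code; the helper name shape() and the sample strings are made up for this example. It only shows that, once element names are ignored, an inline <em>, an inline <phrase> and a block-level <figure> inside a paragraph all look the same to a parser, so the final/inline decision has to come from a per-tag list.

```python
import xml.etree.ElementTree as ET


def shape(xml_text):
    """Describe a node's structure while ignoring what its children are called."""
    root = ET.fromstring(xml_text)
    return (
        bool((root.text or "").strip()),   # is there text before the first child?
        len(root),                         # number of child elements
        [bool((child.tail or "").strip()) for child in root],  # text after each child?
    )


emphasis = "<para>This <em>is</em> some text.</para>"
inline_phrase = "<para>This is a <phrase>short phrase</phrase>.</para>"
block_figure = "<para>Look at <figure><title>Text Editor</title></figure> here.</para>"

# All three paragraphs are "text, one child element, more text".
same_shape = shape(emphasis) == shape(inline_phrase) == shape(block_figure)
```

Since same_shape is true, structure alone cannot tell the tool that the figure should be split out while the emphasis should stay in the sentence.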

I'm not going to fix this yet (and I can't actually do it without breaking other
cases), because that would mean breaking the strings for all the other
translators of release notes now (and they're supposed to be frozen).

Comment 2 Christian Rose 2005-03-04 17:32:21 UTC

For obvious reasons, I'm not suggesting this be fixed immediately, but please
fix it when you can do so.

Comment 3 Danilo Segan 2005-03-27 17:19:44 UTC

Fixed in CVS, though we will probably get similar stuff for other tags.  I need
a DocBook guru to help tune xml2po DocBook mode.

Comment 4 Christian Rose 2005-03-27 23:35:38 UTC

I agree with comment #1 but I don't think xml2po can, or should, solve
everything here.
We probably need to educate documentation writers that figures are not to be
placed inside paragraphs unless absolutely necessary in the context. But then
again, I can't find a good, valid, reason why it should in some cases be
necessary to do so. To me, it seems that the example of a figure in a paragraph
in comment #1 goes against good documentation writing policy in many ways.

Comment 5 Shaun McCance 2005-03-28 00:02:57 UTC

DocBook allows some fairly insane mixing of inline and block-level content.  The
para element allows almost all block-level content.  I discourage making use of
this for many reasons, not the least of which is that it's harder to process and
format nicely.  So if we want an active campaign to smack writers that do this,
you have my support.

Nonetheless, there are inline media objects, and all media objects should have a
textobject element, so we have to decide how those are to be handled.

Danilo, what sort of DocBook guru questions do you have?  I'm still only a
DocBook wizard, but I'll be finishing the guru certification soon.

Comment 6 Christian Rose 2005-03-28 00:32:33 UTC

xml2po currently extracts the textobject content, and makes it a separate
message,  which I suspect is what most people would expect. So that seems to be
taken care of.
What I think is more problematic is that the contents of the textobject were
embedded in a phrase element in the case above. Is there any (logical) reason
for this organization, or is it just purely for rendering purposes? If it is
only to change the rendering of the textobject, then we probably should tell doc
writers to avoid such tricks as well, as I imagine it makes the job of xml2po
harder, and instead fix their stylesheets to do what they want with textobjects.

Comment 7 Shaun McCance 2005-03-28 04:31:06 UTC

http://www.docbook.org/tdg/en/html/textobject.html

Using phrase is the only way to do inline content with textobject.  You must
either use phrase, textdata, or some mix of block-level content.  That's just
the way DocBook does things.

Note that, if phrase is used, it must be the only child of textobject, except
the optional objectinfo.  Can xml2po recognize this necessarily-singleton child
and avoid the extra message in PO files?

Comment 8 Danilo Segan 2005-03-28 08:39:47 UTC

Shaun, that's exactly the sort of wizardry (until you're a certified guru :) I
need :)

I'd need to know what sorts of tags are allowed as "inline" (not to be separated
out as new messages, such as <em>, <strong>, etc. in HTML), and which are "block
level" (<para>, <title>'s,... in DocBook).  The distinction needs to be: block
level elements are to be translated in their entirety (except if there are other
embedded block level elements, which are replaced with placeholders and
translated separately), and inline elements are those that are part of the
sentence, so don't make sense being standalone.

According to http://www.docbook.org/tdg/en/html/phrase.html, phrase may also
appear in a context like:

<para>Effectivity attributes can be used to keep track of modifications
to a document <phrase revisionflag="deleted">at the word or
sentence level</phrase>...</para>

It's hard for xml2po to treat this case differently from <textobject><phrase>blah
blah</phrase></textobject>.  Actually, in both cases automatic detection would
work fine, but I didn't use it because I don't know whether <textobject>s might
appear in cases such as:

<figure>something blah blah<textobject><phrase>how yes
no</phrase></textobject>here</figure>
or:
<textobject><phrase>This is</phrase> <phrase>only a start</phrase></textobject>

If this is allowed, and textobject is not final, then we'd get two messages
"This is" and "only a start", which is wrong, because they're part of one
sentence, and should be translated as a unit (otoh, if instead of a space, there
was some other text such as a hyphen between two phrases, xml2po would correctly
give only one message).

For a case like this, it is essential that xml2po extracts <textobject> as a
separate message, so I define <textobject> to be a "final" tag in xml2po speak,
which should give us two messages for translation (this is for translators' sake):
  "something blah blah<placeholder-1 />here"
  "<phrase>how yes no</phrase>"
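The splitting described above can be sketched roughly as follows. This is a simplified illustration, not the actual xml2po implementation: the function names outer_xml() and extract() are invented here, the set of final tags is passed in by hand, and text is not re-escaped when the message is assembled.

```python
import xml.etree.ElementTree as ET


def outer_xml(elem):
    """Serialize elem with its tags but without the text that follows it."""
    tail, elem.tail = elem.tail, None
    try:
        return ET.tostring(elem, encoding="unicode")
    finally:
        elem.tail = tail


def extract(elem, final_tags, messages=None):
    """Collect one message for elem plus one for each nested final element."""
    if messages is None:
        messages = []
    slot = len(messages)
    messages.append(None)          # reserve the parent's position first
    parts = [elem.text or ""]
    count = 0
    for child in elem:
        if child.tag in final_tags:
            count += 1
            parts.append("<placeholder-%d />" % count)
            extract(child, final_tags, messages)   # translated separately
        else:
            parts.append(outer_xml(child))         # inline: stays in the sentence
        parts.append(child.tail or "")
    messages[slot] = "".join(parts)
    return messages


figure = ET.fromstring(
    "<figure>something blah blah"
    "<textobject><phrase>how yes no</phrase></textobject>here</figure>"
)
messages = extract(figure, final_tags={"textobject"})
```

With <textobject> in the final set, messages holds the two strings quoted above, in that order.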

However, xml2po has some magic to detect when there's only one nested tag
(basically Shaun, yes, it supports such a 'singleton' child, as you called it), so
it already gives "how yes no" for the second message (remember, I'm a translator too,
and I know of the Glade mark-up problem ;).  So Christian, I don't think there
should be a problem with <phrase> inside textobject, unless you can give me an
example :)
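A guess at what such singleton detection could look like, purely as a sketch: the names unwrap_singleton() and message_text() are hypothetical, and the real check in xml2po may well differ. The idea is to descend while an element has exactly one child element and no text of its own around it.

```python
import xml.etree.ElementTree as ET


def unwrap_singleton(elem):
    """Descend while elem holds exactly one child element and no own text."""
    while (
        len(elem) == 1
        and not (elem.text or "").strip()
        and not (elem[0].tail or "").strip()
    ):
        elem = elem[0]
    return elem


def message_text(elem):
    """Return the inner content of elem after dropping singleton wrappers."""
    inner = unwrap_singleton(elem)
    parts = [inner.text or ""]
    for child in inner:
        # tostring() also emits the text that follows the child (its tail).
        parts.append(ET.tostring(child, encoding="unicode"))
    return "".join(parts)


single = ET.fromstring("<textobject><phrase>how yes no</phrase></textobject>")
mixed = ET.fromstring("<para>This is a <phrase>short phrase</phrase>.</para>")

single_message = message_text(single)
mixed_message = message_text(mixed)
```

Here single_message loses the <phrase> wrapper, while mixed_message keeps it because the phrase is surrounded by other text in the paragraph.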

Actually, it's very simple to remove <textobject> from the final tags in
modes/docbook.py, and I'm willing to do that if it doesn't work correctly as it is.

Comment 9 Shaun McCance 2005-03-28 09:18:40 UTC

Well, you could look at every element that's handled by db2html-inline.xsl as a
start for everything that's inline.  Although I wouldn't take that as a
definitive list of everything that could be treated as inline.

Alternatively, if you just tell me the choices for how each element should be
treated by xml2po (final, inline, etc.), I could go through and tell you how
each element behaves.

Comment 10 Danilo Segan 2005-03-28 09:29:00 UTC

Basically, xml2po needs to be special cased on three types of elements:
1. final (basically, block-level), they will be put into separate messages for
translation
2. ignored (don't output messages for these, even if they're final; this is used
to trigger <placeholder> addition, but not actually output a message which would
be something like
<listitem><placeholder-1/></listitem><listitem><placeholder-2/></listitem>)
3. space-preserving: don't mess with spaces in these ("xml:space='preserve'" can
still be used on other elements to get the same behaviour)

All other elements are by default treated as "inline" or "non-final", i.e. if
they're part of one larger block, don't make a new message out of them, but if
they're standalone, make them a message.
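The three categories plus the default can be written down as a small decision function. This is a sketch under assumptions: the tag sets below are illustrative and not the lists shipped with xml2po, the function names are invented, and collapsing runs of whitespace is only one plausible reading of "messing with spaces".

```python
import re

# Illustrative tag sets only; the real lists live in the DocBook mode.
FINAL = {"para", "title", "textobject"}
IGNORED = {"itemizedlist"}
SPACE_PRESERVE = {"programlisting"}


def emits_message(tag, standalone):
    """Decide whether an element becomes its own message."""
    if tag in IGNORED:
        return False        # type 2: triggers placeholders, but no message
    if tag in FINAL:
        return True         # type 1: always a separate message
    return standalone       # default: inline unless it stands alone


def normalize(text, tag):
    """Collapse whitespace unless the tag is space-preserving (type 3)."""
    if tag in SPACE_PRESERVE:
        return text
    return re.sub(r"\s+", " ", text).strip()
```

For example, emits_message("em", standalone=False) keeps the emphasis inside its parent's message, while emits_message("em", standalone=True) turns it into a message of its own.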

Comment 11 Danilo Segan 2005-03-28 09:31:00 UTC

Also, just take a look at xml2po/modes/docbook.py, the three interesting
functions which just return lists of tags are:
    def getIgnoredTags(self):
    def getFinalTags(self):
    def getSpacePreserveTags(self):
(everything else in docbook.py is there for handling images and translator credits)
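To make the shape of such a mode concrete, here is a stand-in class with the three method names from the comment above. Everything else is an assumption: the class names are invented, the tag lists are placeholders rather than the contents of docbook.py, and the subclass simply shows how dropping <textobject> from the final tags (as offered in comment 8) might be expressed.

```python
class ExampleDocBookMode:
    """Stand-in for a mode class; the tag lists here are illustrative only."""

    def getIgnoredTags(self):
        return ["itemizedlist", "orderedlist"]

    def getFinalTags(self):
        return ["para", "title", "figure", "textobject"]

    def getSpacePreserveTags(self):
        return ["programlisting", "screen"]


class TextobjectInlineMode(ExampleDocBookMode):
    """Variant that stops treating <textobject> as a final tag."""

    def getFinalTags(self):
        return [tag for tag in super().getFinalTags() if tag != "textobject"]


base_final = ExampleDocBookMode().getFinalTags()
variant_final = TextobjectInlineMode().getFinalTags()
```

The variant leaves the ignored and space-preserving lists untouched and only changes which elements are split into separate messages.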