GNOME Bugzilla – Bug 169197
Docbook mode: xml2po extracts markup
Last modified: 2019-03-25 23:15:02 UTC
#: ../rnutilities.xml:17 (para)
msgid ""
"Many people will be glad to see that it no longer suggests saving changes "
"when all changes have actually been undone. And in addition it even starts "
"faster than previous versions. <figure id=\"rnscreenshot-gedit\"><title>Text "
"Editor</title><screenshot><mediaobject><imageobject><imagedata fileref="
"\"&urlfiguresbase;figure-gedit.png\" format=\"PNG\"/></"
"imageobject><textobject><phrase>The text editor, highlighting the current "
"line.</phrase></textobject></mediaobject></screenshot></figure>"

The above message doesn't look quite right. It seems that the interesting texts that should have been extracted are actually:

msgid ""
"Many people will be glad to see that it no longer suggests saving changes "
"when all changes have actually been undone. And in addition it even starts "
"faster than previous versions."

msgid "Text Editor"

msgid "The text editor, highlighting the current line."

Everything else is just markup that doesn't seem to have any relevance at all for translation.
It's not that easy, because you're missing an important point here. To illustrate, what should happen if the paragraph was actually:

<para>If you take a look at the lower right of <figure>...</figure> you can see that there's something interesting there.</para>

Or if it had two figures that are not close together in the paragraph?

OTOH, this is easy to fix in xml2po (it's about tuning DocBook mode): we just need to set <figure>, <phrase> and <title> as "final" tags, and a couple of others as "ignored" tags. This is actually what was done for many other tags (e.g. see how lists embedded in paragraphs, or footnotes, are handled).

The problem is that DocBook is very extensive, and it's hard to cover it fully (we need to be careful at the same time, see below for problems with <phrase>). Basically, if it's not clear, we need someone to go through the entire DocBook standard and note which tags should be final and which shouldn't. For instance, <phrase> is not a good candidate here, simply because it can also appear in cases like "<para>This is a <phrase>short phrase</phrase>.</para>", where we want it as a single message.

To get the best out of such cases, we want our documentation writers to standardize on a subset of DocBook instead. For instance, I see no reason not to use <para> instead of <phrase> in the message above.

FYI, this *cannot* be determined automatically, because cases such as "<para>This <em>is</em> some text.</para>" are even more common (and technically completely the same, unless we get some NLP in, which is not going to happen, at least not soon :), and we *must* have this as the single message "This <em>is</em> some text."

I'm not going to fix this yet (and I can't actually do it without breaking other cases), because that would mean breaking the strings for all the other translators of the release notes now (and they're supposed to be frozen).
For obvious reasons, I'm not suggesting this be fixed immediately, but please fix it when you can do so.
Fixed in CVS, though we will probably get similar stuff for other tags. I need a DocBook guru to help tune xml2po DocBook mode.
I agree with comment #1, but I don't think xml2po can, or should, solve everything here. We probably need to educate documentation writers that figures are not to be placed inside paragraphs unless absolutely necessary in the context. But then again, I can't find a good, valid reason why it should ever be necessary to do so. To me, it seems that the example of a figure in a paragraph in comment #1 goes against good documentation writing policy in many ways.
DocBook allows some fairly insane mixing of inline and block-level content. The para element allows almost all block-level content. I discourage making use of this for many reasons, not the least of which is that it's harder to process and format nicely. So if we want an active campaign to smack writers that do this, you have my support. Nonetheless, there are inline media objects, and all media objects should have a textobject element, so we have to decide how those are to be handled. Danilo, what sort of DocBook guru questions do you have? I'm still only a DocBook wizard, but I'll be finishing the guru certification soon.
xml2po currently extracts the textobject content and makes it a separate message, which I suspect is what most people would expect. So that seems to be taken care of. What I think is more problematic is that the contents of the textobject were embedded in a phrase element in the case above. Is there any (logical) reason for this organization, or is it purely for rendering purposes? If it is only to change the rendering of the textobject, then we should probably tell doc writers to avoid such tricks as well, as I imagine it makes the job of xml2po harder, and to fix their stylesheets instead to do what they want with textobjects.
http://www.docbook.org/tdg/en/html/textobject.html Using phrase is the only way to do inline content with textobject. You must either use phrase, textdata, or some mix of block-level content. That's just the way DocBook does things. Note that, if phrase is used, it must be the only child of textobject, except the optional objectinfo. Can xml2po recognize this necessarily-singleton child and avoid the extra message in PO files?
Shaun, that's exactly the sort of wizardry (until you're a certified guru :) I need :)

I'd need to know which tags are allowed as "inline" (not to be separated out as new messages, such as <em>, <strong>, etc. in HTML), and which are "block level" (<para>, <title>s, ... in DocBook). The distinction needs to be: block-level elements are to be translated in their entirety (except if there are other embedded block-level elements, which are replaced with placeholders and translated separately), and inline elements are those that are part of the sentence, so they don't make sense standing alone.

According to http://www.docbook.org/tdg/en/html/phrase.html, phrase may also appear in a context like:

<para>Effectivity attributes can be used to keep track of modifications to a document <phrase revisionflag="deleted">at the word or sentence level</phrase>...</para>

It's hard for xml2po to treat this case differently from <textobject><phrase>blah blah</phrase></textobject>. Actually, in both cases automatic detection would work fine, but I didn't use it because I don't know whether <textobject>s might appear in cases such as:

<figure>something blah blah<textobject><phrase>how yes no</phrase></textobject>here</figure>

or:

<textobject><phrase>This is</phrase> <phrase>only a start</phrase></textobject>

If this is allowed, and textobject is not final, then we'd get two messages, "This is" and "only a start", which is wrong, because they're part of one sentence and should be translated as a unit (otoh, if instead of a space there was some other text, such as a hyphen, between the two phrases, xml2po would correctly give only one message).
For a case like the <figure> example above, it is essential that xml2po extracts <textobject> as a separate message, so I define <textobject> to be a "final" tag in xml2po speak, which should give us two messages for translation (this is for translators' sake):

"something blah blah<placeholder-1 />here"
"<phrase>how yes no</phrase>"

However, xml2po has some magic to detect when there's only one nested tag (basically, Shaun, yes, it supports such a 'singleton' child, as you called it), so it already gives "how yes no" for the second message (remember, I'm a translator too, and I know of the Glade mark-up problem ;).

So Christian, I don't think there should be a problem with <phrase> inside textobject, unless you can give me an example :) Actually, it's very simple to remove <textobject> from the final tags in modes/docbook.py, and I'm willing to do that if it doesn't work correctly as it is.
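To make the 'singleton' behaviour above concrete, here is a minimal sketch in Python. It is not xml2po's actual code: the function name message_for and the exact unwrapping rule are assumptions made for illustration, based only on the description in this comment.

```python
import xml.etree.ElementTree as ET


def message_for(elem):
    """Return a translatable string for `elem` (hypothetical helper).

    If `elem` holds nothing but a single child element, unwrap that child
    so translators see "how yes no" rather than "<phrase>how yes no</phrase>".
    """
    children = list(elem)
    # Text that belongs to elem itself: its leading text and its children's tails.
    has_own_text = bool((elem.text or "").strip()) or any(
        (child.tail or "").strip() for child in children
    )
    if len(children) == 1 and not has_own_text:
        return message_for(children[0])
    # Otherwise keep the inline markup; ET.tostring() also emits each child's tail.
    inner = (elem.text or "") + "".join(
        ET.tostring(child, encoding="unicode") for child in children
    )
    return " ".join(inner.split())  # normalize whitespace
```

With this rule, a textobject holding one phrase yields just the phrase text, while a textobject holding two phrases keeps both in a single message with their markup, which is the outcome argued for above.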
Well, you could look at every element that's handled by db2html-inline.xsl as a start for everything that's inline. Although I wouldn't take that as a definitive list of everything that could be treated as inline. Alternatively, if you just tell me the choices for how each element should be treated by xml2po (final, inline, etc.), I could go through and tell you how each element behaves.
Basically, xml2po needs to be special cased on three types of elements:

1. final (basically, block-level): they will be put into separate messages for translation
2. ignored: don't output messages for these, even if they're final; this is used to trigger <placeholder> addition, but not actually output a message which would be something like <listitem><placeholder-1/></listitem><listitem><placeholder-2/></listitem>
3. space-preserving: don't mess with spaces in these ("xml:space='preserve'" can still be used on other elements to get the same behaviour)

All other elements are by default treated as "inline" or "non-final", i.e. if they're part of one larger block, don't make a new message out of them, but if they're standalone, make them a message.
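The classification above can be sketched roughly as follows. This is only an illustration under stated assumptions, not xml2po's implementation: the tag sets, the names render and extract, and the sample document are all made up for the example, attributes on inline elements are dropped for brevity, and space-preserving tags are not handled.

```python
import xml.etree.ElementTree as ET

# Illustrative tag sets only; the real lists live in xml2po's DocBook mode.
FINAL_TAGS = {"para", "title", "textobject"}
IGNORED_TAGS = {"figure", "screenshot", "mediaobject", "imageobject"}


def render(elem, pending):
    """Serialize elem's content, keeping inline markup in place.

    Final and ignored children are queued in `pending` and replaced by
    numbered placeholders. Attributes are dropped in this sketch.
    """
    parts = [elem.text or ""]
    for child in elem:
        if child.tag in FINAL_TAGS or child.tag in IGNORED_TAGS:
            pending.append(child)
            parts.append("<placeholder-%d/>" % len(pending))
        else:
            inner = render(child, pending)
            parts.append("<%s>%s</%s>" % (child.tag, inner, child.tag))
        parts.append(child.tail or "")
    return "".join(parts)


def extract(elem):
    """Return the list of messages for elem, outer message first."""
    if elem.tag in IGNORED_TAGS:
        # No message for the element itself; only descend into its children.
        return [msg for child in elem for msg in extract(child)]
    pending = []
    text = " ".join(render(elem, pending).split())
    messages = [text] if text else []
    for child in pending:
        messages.extend(extract(child))
    return messages


SAMPLE = (
    "<para>Startup is <emphasis>much</emphasis> faster. "
    "<figure><title>Text Editor</title></figure> See the figure.</para>"
)
messages = extract(ET.fromstring(SAMPLE))
```

For SAMPLE, the inline emphasis stays inside the paragraph's message, the figure becomes a placeholder, and its title is extracted as a message of its own.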
Also, just take a look at xml2po/modes/docbook.py. The three interesting functions, which just return lists of tags, are:

    def getIgnoredTags(self):
    def getFinalTags(self):
    def getSpacePreserveTags(self):

(everything else in docbook.py is there for handling images and translator credits)
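For readers without the source at hand, a mode built around those three functions might look something like the sketch below. Only the three method names come from this comment; the class name, the classify helper and every tag list are hypothetical placeholders and do not reflect the real contents of docbook.py.

```python
class SketchDocbookMode:
    """Hypothetical stand-in for a mode class; tag lists are examples only."""

    def getIgnoredTags(self):
        # Containers that trigger a placeholder but produce no message themselves.
        return ["figure", "screenshot", "mediaobject", "imageobject"]

    def getFinalTags(self):
        # Block-level elements that become separate messages.
        return ["para", "title", "textobject"]

    def getSpacePreserveTags(self):
        # Verbatim DocBook elements whose whitespace must be left alone.
        return ["programlisting", "screen", "literallayout"]


def classify(mode, tag):
    """Return 'ignored', 'final' or 'inline' for a tag (illustrative helper)."""
    if tag in mode.getIgnoredTags():
        return "ignored"  # ignored wins even if the tag is also listed as final
    if tag in mode.getFinalTags():
        return "final"
    return "inline"  # default for everything not listed
```

Tuning the DocBook mode, as discussed in this bug, would then amount to moving tags between these lists.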