GNOME Bugzilla – Bug 97556
Eel functions for adding message context markers should be moved to glib
Last modified: 2011-02-18 16:07:08 UTC
Eel has some functions for parsing messages marked for translation so that developers can add context to a message when it could otherwise be interpreted in more than one way, or when an English word has multiple meanings that need to be translated differently in other languages. The context comment is then removed prior to display in the user interface. It works like this example (taken from bug 97482):

Q_("Russian[ charset]");

The word "Russian" is what should be translated, and the text inside [] is the contextual comment. In this example it's there so that the message for the "Russian" language can be differentiated from the message for the "Russian" character set in Galeon, which needs to be translated differently in other languages where the two words don't happen to be the same. I suggest this eel feature be moved to glib, since it's sometimes a requirement for proper localization.
This is the function discussed:

/* Remove all text in brackets. Used where context is included in strings to
 * be internationalized, to help translators, and to make sure that strings
 * that may be used in different places with a different meaning may be
 * translated separately. If brackets are not even, it will just return a
 * copy of the original string.
 */
char *eel_str_remove_bracketed_text (const char *text);
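To make the behaviour described in that comment concrete, here is a minimal sketch of such a function. It is not Eel's implementation: the name remove_bracketed_text is hypothetical, and treating a nested '[' as "not even" is an assumption, since the comment does not say how nesting is handled.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for eel_str_remove_bracketed_text(); not Eel's code.
 * Returns a newly allocated string with every [...] span removed, or a plain
 * copy of the input if the brackets do not pair up. The caller frees it. */
char *
remove_bracketed_text (const char *text)
{
  assert (text != NULL);

  size_t len = strlen (text);
  char *out = malloc (len + 1);
  if (out == NULL)
    return NULL;

  size_t n = 0;
  int inside = 0;                 /* 1 while between '[' and ']' */
  for (size_t i = 0; i < len; i++)
    {
      if (text[i] == '[')
        {
          if (inside)
            goto unbalanced;      /* assumption: nesting counts as uneven */
          inside = 1;
        }
      else if (text[i] == ']')
        {
          if (!inside)
            goto unbalanced;      /* ']' without a matching '[' */
          inside = 0;
        }
      else if (!inside)
        out[n++] = text[i];       /* keep only text outside brackets */
    }
  if (inside)
    goto unbalanced;              /* '[' never closed */

  out[n] = '\0';
  return out;

unbalanced:
  memcpy (out, text, len + 1);    /* return an unmodified copy */
  return out;
}
```

Note that every call allocates, so the caller has to free the result whether or not anything was stripped.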
Ugh, but maybe as good as it is going to get. Definitely would need a facility for escaping brackets.
This should be done as explained in the gettext manual. The translations should not contain the contextual comments in brackets, thus in the normal case, no stripping is necessary. Only in the case where no translation is found and the original string is returned, the stripping needs to be done. See http://www.gnu.org/manual/gettext/html_node/gettext_151.html#SEC151 for an example.
My trust that translators can be convinced to translate

Russian[ charset]

as:

Russe

not:

Russe[ charset]

or:

Russe[ <charset translated into French>]

is frankly fairly low (from experience with places where there have been explicit comments telling the translators what to do...) but maybe if it's standard enough...

Note also that the main advantage of the no-strip-if-translated approach can only be achieved if you either:

- Hash the stripped results to avoid the caller having to deallocate.
- Use a form for the comments that can be stripped in place, such as "charset|Russian" (which perhaps makes it less clear to the translator that "Russian" is an adjective that should be translated to the proper form to apply to Russian).

It certainly would be nice not to require the caller to deallocate, though... Maybe you could do:

"Russian[ charset]|Russian"

and accept \| and \\ escapes before the |. intltool could possibly be convinced to check that the translator translated the string properly.
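A rough sketch of the in-place variant with \| and \\ escapes in the context part might look like the following. The function name skip_context is made up for illustration; it returns a pointer into the original string, so nothing has to be deallocated.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical helper: return a pointer to the message part of a
 * "context|message" msgid, honouring \| and \\ escapes inside the context.
 * Returns msgid itself when there is no unescaped '|'. No allocation. */
const char *
skip_context (const char *msgid)
{
  assert (msgid != NULL);

  for (const char *p = msgid; *p != '\0'; p++)
    {
      if (*p == '\\' && p[1] != '\0')
        p++;                      /* skip the escaped character */
      else if (*p == '|')
        return p + 1;             /* first unescaped '|' ends the context */
    }
  return msgid;
}
```

Following the gettext manual's approach, this would only be applied when the lookup returned the msgid unchanged, that is, when no translation was found.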
Sounds good Owen, but maybe we should mark what should be translated instead. That seems more logical to me.

Example string                          Danish translation
Q_("search [after] files|after");       -> "efter"
Q_("[russian] charset|russian");        -> "russisk"
Q_("[view] picture|view");              -> "vis"
Q_("a [view]|view");                    -> "visning"
I think the gettext manual example very clearly shows how to do this properly: use a prefix for the context information and avoid any string copies: "charset|Russian" or, if brackets are considered necessary, "[charset]Russian" We obviously need a convention for handling cases where the message itself starts with a bracketed string. Silly example: "[ and ] are brackets" would have to be entered as something like "[dummy context][ and ] are brackets". I would expect translators to quickly internalize the information that context information must not be translated if they see it come up in the GUI of their translated apps once. And as you mentioned, Owen, the tools could easily check for translated context information.
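For the bracketed-prefix form, the copy-free stripping could be sketched as below. The helper name skip_bracket_prefix is hypothetical, and the rule "a leading [...] is always context" is simply the convention proposed in this comment.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical helper: if msgid starts with "[context]", return a pointer
 * just past the first ']'; otherwise return msgid unchanged. No copy made. */
const char *
skip_bracket_prefix (const char *msgid)
{
  assert (msgid != NULL);

  if (msgid[0] == '[')
    {
      const char *close = strchr (msgid, ']');
      if (close != NULL)
        return close + 1;
    }
  return msgid;
}
```

With this rule, "[dummy context][ and ] are brackets" yields "[ and ] are brackets", which is why a message that itself begins with a bracket needs the dummy prefix.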
I don't see how this:

Q_("search [after] files|after"); -> "efter"

doesn't follow the gettext manual. We use the text after the | to show in the GUI. Instead of just writing "adjective|blah" or whatever, it is better to write the exact sentence in which it will be used, especially because endings change in heavily inflected languages (for instance Finnish has 15 cases, and 2 geni afaik). So when writing more context it is nice to have a standard convention for pointing out the word to translate. I only suggested using [] for this. The function doesn't need to know anything about []. The only thing it might need to do is allow escaping of |.
Sorry Kenneth, my comment was on the initial proposal of using a suffix for the comments. But let me ask you a question: Why would anybody want to translate "after" as a separate string when it appears in a larger context like "search after files"? Surely you would translate the whole string as one unit, putting all available context in the translatable message. Or did I somehow misunderstand your example?
Oh well, it was just an example... probably not very well thought out :). But it could for instance have been:

"Put thumbnail [after] filename|after"
"Put thumbnail [before] filename|before"

if there was a pulldown menu. I have seen something like that before.
> Why would anybody want to translate "after" as a separate string
> when it appears in a larger context like "search after files" ?
> Surely you would translate the whole string as one unit, putting
> all available context in the translatable message. Or did I somehow
> misunderstand your example ?

Happens all the time in practice. One example is

"Search for [files] in [ ] where [size] is [larger than] [ ] MB"

or stuff like that in search dialogs, where the things in brackets are text fields or drop-down boxes. There are many more occasions in GUIs where sentences (unfortunately) cannot possibly be translated in their entirety, since they have widget elements or the like in them.
Another common example is "Time-out after [ ] seconds." or any other case where the unit follows.
But this is "broken as designed" from an I18N perspective anyway, isn't it? What if the "widget elements" which are effectively part of the sentence have to be reordered for the sentence to make sense in a translation? If you want to embed GUI elements into a sentence, you should probably translate something like:

"Search for {files} in {location} where {size} is {larger than} {number} MB"

to

"Suche in {location} nach {files} mit {size} {larger than} {number} MB"

then post-process the translated string and arrange for the proper GUI elements to be inserted in place of the {xyz} placeholders.
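The post-processing step could start from something like the sketch below, which only collects the {placeholder} names in the order the translation uses them; a real dialog would then pack the matching widgets in that order. The function name collect_placeholders and the fixed-size buffers are assumptions made for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define PLACEHOLDER_MAX 32   /* assumed upper bound on a name, incl. NUL */

/* Hypothetical helper: copy the names of the {placeholders} found in a
 * translated string into names[], in order of appearance, and return how
 * many were stored. Over-long names are truncated. */
size_t
collect_placeholders (const char *translated,
                      char names[][PLACEHOLDER_MAX], size_t max_names)
{
  assert (translated != NULL);

  size_t count = 0;
  const char *p = translated;

  while (count < max_names)
    {
      const char *open = strchr (p, '{');
      if (open == NULL)
        break;
      const char *close = strchr (open + 1, '}');
      if (close == NULL)
        break;                    /* unterminated placeholder: stop */

      size_t len = (size_t) (close - open - 1);
      if (len >= PLACEHOLDER_MAX)
        len = PLACEHOLDER_MAX - 1;
      memcpy (names[count], open + 1, len);
      names[count][len] = '\0';
      count++;
      p = close + 1;
    }
  return count;
}
```

For the German string above this reports location, files, size, larger than, number, in that order, so the location widget would come before the files widget even though the English original has them the other way round.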
Please remember that this is just one of the badly needed uses for this Q syntax with more context, and that this is needed now, not when all current GUI l10n problems are solved in the future. Matthias, are you subscribed to gnome-i18n@gnome.org? This has been discussed for years there and been rehashed over and over, and it seems terribly redundant to rehash everything and all consensus that led to the Q thing in eel in this bug report.
Christian, I'm not subscribed to gnome-i18n, so I have probably missed years of interesting discussion... I'm not opposed to the feature, I'm only concerned that after years of discussion, you still ended up with an implementation in eel which dups strings, when the right approach has been explained (with example code) in the gettext manual for even longer.
Syntactically, the solution proposed in the gettext manual isn't much different, but I'd argue that the Q syntax is clearer for the translator. But the big issue is toolkit support. We sometimes have trouble convincing maintainers to change even the most trivial message code so that it will improve the situation for translators, often on the grounds of "this will make my code ugly/I don't want to add that much junk code/I like it the way it is/writing it like this will just introduce memory leaks, no way". And in my own experience, asking developers to reinvent the wheel every time, and to spend time writing bug-free additional code for something that may seem trivial and unimportant to people who don't directly experience the localization problems and their impact, usually isn't a very successful task. This kind of stuff needs support directly in the programming environment.
I think the technically cleanest thing would be to introduce two-argument macros like Q_(msgkey,context) - no need to invent a syntax for encoding of context in msgkey, no danger of translating it, no need to strip the context out of the msgkey. This would of course need support in the extraction tools, which would have to put the context as a comment in the pot file.
We still need the strings to be unique.
We also still need something that doesn't break gettext or any other tool working with po files. The Q syntax in Eel doesn't do that.
I found out that xgettext almost lets you do the two-parameter approach: with xgettext --add-comments,

gettext("Russian" /*Russian[charset]*/)

will yield

#. Russian[charset]
msgid "Russian"
msgstr ""

Unfortunately, the comment syntax can't be hidden behind a macro, since xgettext operates on the unpreprocessed source. And, as you rightly pointed out, this approach doesn't solve the msgid collision problem, so we will have to encode the context in the msgid anyway. Here is a very simple, but efficient implementation:

#include <string.h>    /* strchr */
#include <libintl.h>   /* gettext */

#define Q_(String) g_sgettext (String)

/* If the lookup failed (gettext returned msgid itself), skip the
 * context, i.e. everything up to and including the first '|'. */
const char *
g_strip_context (const char *msgid, const char *msgval)
{
  if (msgval == msgid)
    {
      const char *c = strchr (msgid, '|');
      if (c != NULL)
        return c + 1;
    }
  return msgval;
}

const char *
g_sgettext (const char *msgid)
{
  return g_strip_context (msgid, gettext (msgid));
}

Then Q_("Russian[charset]|Russian") will come out as

msgid "Russian[charset]|Russian"
msgstr ""

and translators can (hopefully) be trained to translate only the part after the first |.

When Q_() is mixed with _(), problems can arise: _("boolean operators are |&^") will leave the translator puzzled whether he has to translate the part before the |. Possible ways to avoid this confusion are:

1) always use Q_() (or at least use Q_() whenever the message contains a |): then the above would have to be coded as

Q_("no context|boolean operators are |&^")

2) give hints to translators like above:

Q_("Russian[charset]|Russian"/* translate after |*/)
_("boolean operators are |&^"/* | is part of the message */)

would come out as

#. translate after |
msgid "Russian[charset]|Russian"
msgstr ""

#. | is part of the message
msgid "boolean operators are |&^"
msgstr ""
'|' occurs infrequently enough in translations that it's probably not a big deal what happens there. And I think in those cases, it should be pretty clear that the part before the | doesn't look like a context.
I see that some functions were added to glib/gi18n.h a few days ago. Which format do they implement? There's no direct API documentation in these header files.

However, this problem would probably be solved even better if support for it were moved to libintl/libc. I started a little bit of discussion on translation-i18n@lists.sourceforge.net, but it's not yet clear whether the guys there can be convinced to include such a solution in gettext/libintl/libc. Here's my posting to that list:

The real problem with this is that there is *no standard* for such non-ambiguous msgids.

What exactly should be the po file format for the non-ambiguous msgids? For Qt/KDE it's "_: disambiguating comment\nmsgid", but if you follow the proposal in the gettext manual then it would be "disambiguating comment|msgid". And the bug report http://bugzilla.gnome.org/show_bug.cgi?id=97556 even thinks about yet another way and discusses either "[disambiguating comment]msgid" or "msgid[disambiguating comment]".

And what are the parameters for the respective q_gettext call? For Qt/KDE it accepts two strings, where one of them is the disambiguating comment and the other is the msgid. Under the gettext manual's proposal it would accept one string, just like the usual gettext call.

So if you keep the position that this problem should be solved by each GUI library on its own, then each library will invent its own format for both the msgid format in the po file and the parameter format of the q_gettext call. This will only increase the confusion over time.

Instead, if *you* as the gettext/libintl/libc project now introduce *one* solution for this, then this kind of format will be unified throughout the whole GNU translation community. Needless to say, this will also increase the chance that translators are going to handle this correctly, as opposed to having to adapt to each project's own disambiguation format.
This is why I think this is really important and should be solved on libintl/libc level.

> > Therefore I would like to ask you, the gettext developers: Are there
> > plans to include such a prefix_gettext() function into the gettext
> > library?
>
> There is no plan to include such a function in the libintl/libc library.
> The reason is simply that any project can write this function with 10
> lines of code.

Again, as I stated above: The problem is not the amount of code. The problem is that a standard format is needed. Really needed.

> However, the real limitation is on the xgettext side. xgettext currently
> can only extract "context" when it comes from a comment. Some other
> conventions, like
> _("msgid", "disambiguating comment")
> exist in other GUI toolkits (Qt), and we can talk about what can be done
> on this side.

If gettext agrees on a standard convention, then surely xgettext can provide an implementation for extracting these conventions. Personally I would prefer the proposal from the gettext manual: "disambiguating comment|msgid" and that's it. No need to change xgettext. No need even to change any GUI-creation tool like glade/libglade. However, a solution that keeps compatibility with the Qt/KDE folks would probably be even better.

> But first, can you please brief me on what a "context" or "disambiguating
> comment" can look like in practice?

Think of any English word that can be both a noun and a verb (e.g. "a file" and "to file"). Think of the fact that almost always, in at least some languages, the translation of the verb is [very] different from the translation of the noun (e.g. in German the noun is "Datei" and the verb is "ablegen"). Now think of a GUI button that is labelled with this word. Now think of a case where this button has the meaning of the verb, and another case where this button has the meaning of the noun (e.g. "File" meaning "to file something somewhere" as opposed to "File" meaning "do something with a file").
There you are -- the msgid in both cases is identical, but the msgstr should be different. Therefore we need a disambiguating addition in the msgids. In the example this can be as simple as (in gettext manual's format) "noun|File" and "verb|File", but you could also use the real meaning: "to file something somewhere|File" and so on. I hope you get the point.
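As a small illustration of the "noun|File" / "verb|File" case, the sketch below uses a toy in-memory catalog in place of a real .mo file. All names here (toy_gettext, toy_sgettext, the catalog itself) are invented for the example; only the strip-on-miss logic follows the gettext manual's proposal.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

struct entry { const char *msgid; const char *msgstr; };

/* Toy German catalog standing in for a .mo file (illustrative only). */
static const struct entry catalog[] = {
  { "noun|File", "Datei" },
  { "verb|File", "ablegen" },
};

/* Stand-in for gettext(): returns msgid itself when nothing is found. */
static const char *
toy_gettext (const char *msgid)
{
  for (size_t i = 0; i < sizeof catalog / sizeof catalog[0]; i++)
    if (strcmp (catalog[i].msgid, msgid) == 0)
      return catalog[i].msgstr;
  return msgid;
}

/* Look up msgid; on a miss, fall back to the text after the first '|'. */
const char *
toy_sgettext (const char *msgid)
{
  assert (msgid != NULL);

  const char *msgval = toy_gettext (msgid);
  if (msgval == msgid)
    {
      const char *bar = strchr (msgid, '|');
      if (bar != NULL)
        return bar + 1;
    }
  return msgval;
}
```

The two buttons get different translations because their msgids differ, while an msgid missing from the catalog still shows plain "File" in the UI rather than the prefixed string.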
The current implementation of this in glib has some limitations; for example, it doesn't work in case the translator translates the context as well. This has been put in bug 164373.
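One way a tool might flag that situation is sketched below. This is only a heuristic made up for illustration, not what glib or intltool actually does: for an msgid written as "context|message", a correct msgstr drops the context separator, so it should contain fewer '|' characters than the msgid. It is meant for Q_()-style msgids only.

```c
#include <assert.h>
#include <stddef.h>

/* Count the '|' characters in s. */
static size_t
count_bars (const char *s)
{
  size_t n = 0;
  for (; *s != '\0'; s++)
    if (*s == '|')
      n++;
  return n;
}

/* Hypothetical check: returns 1 if msgstr looks like the translator kept
 * (or translated) the "context|" prefix, 0 otherwise. Heuristic only. */
int
context_probably_translated (const char *msgid, const char *msgstr)
{
  assert (msgid != NULL && msgstr != NULL);

  size_t id_bars = count_bars (msgid);
  if (id_bars == 0 || msgstr[0] == '\0')
    return 0;                     /* no context, or not translated yet */
  return count_bars (msgstr) >= id_bars;
}
```

A message translated without any '|' where the original message part contained one would slip through, so a check like this could only supplement review by a human.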