GNOME Bugzilla – Bug 116236
Use ngettext for handling plurals in GNOME
Last modified: 2006-03-03 23:00:30 UTC
As mentioned in http://developer.gnome.org/doc/tutorials/gnome-i18n/developer.html#plurals, the common way of handling plurals is broken for many locales. One way to solve this is to use ngettext instead, as that document describes. A simple example of code using ngettext:
    g_printf (ngettext ("Found %d file.", "Found %d files.", nbr_of_files), nbr_of_files);
We should make all of GNOME use this where needed.
I think the consensus at the recent GUADEC was that it was alright to include ngettext calls, since common platforms like Linux and Solaris already support this, and #ifdefs can be used to catch the cases where ngettext is not supported (something about using *HAVE_NGETTEXT or something like that). I admit I'm not very knowledgeable about the details of this proposed solution, though. Jody, Havoc, could you please correct me and fill in the details?
Hmm, I would propose adding a g_ngettext() to glib, but I think the real problem is the po format, isn't it? Will an ngettext()-less gettext implementation understand the ngettext()-enhanced po files?
I think something like a P_() macro for ngettext () was already suggested at GUADEC, similar to _() and N_(). P as in plural or something like that. As for compatibility on the po format level, that's a very interesting issue, something we forgot at GUADEC. It shouldn't be hard to figure out; gnome-games (bug 106697) among others already use ngettext and have po files with ngettext syntax, so it should be trivial to test. In fact, I did so now, and it seems that msgfmt on an unpatched, unupdated Solaris 8 machine issues an error with such a file. So it seems we're still breaking compatibility here on the po file level for systems that don't support ngettext. But given the choice between keeping compatibility with unpatched, unupdated environments and never fixing issues, or occasionally moving forward, I know which one I prefer. ngettext isn't exactly a brand new thing either; it's been around for a few years now on Linux, so it's not like we would require bleeding edge stuff. It's also included in Solaris 9 and available as a Sun patch for Solaris 8, and mandated in the OpenI18N standard, so other compatible environments are expected to follow if they don't support it already.
Maybe the cleanest solution would be to make glib require ngettext() like it already requires gettext() so that the rest of the stack can depend on it being there.
Seems like a good idea. I'd like to see a full set of _(, N_(, L_( in glib so that we can stop having them pop up in various random places.
Ok, I put that in the glib bug 119790.
As mentioned in glib bug 119790, glib won't require this until GTK+ 2.6. But we should just bite the bullet and use this for GNOME 2.5 (http://lists.gnome.org/archives/gnome-i18n/2003-August/msg00127.html) anyway. This is severely needed but has been delayed so much in many cases that it's just tragic.
Another reason why it's OK to use ngettext: most of the tarballs already include generated MO files (so there's no need to generate them from PO files, which might cause problems with a non-GNU msgfmt if they contain entries like msgid_plural or msgstr[2]). The interesting thing here is that MO files are quite simple, and the addition of plural forms to them didn't break backward compatibility. So, if those MO files worked with other gettext implementations in the past, they'll also work in the future. So, doing
    #define ngettext(a,b,c) (a)
should be enough (along with a check for HAVE_NGETTEXT, and conditioning this definition on that) to make *tarballs* compile and install on any system with gettext (not necessarily GNU's) support.
However, this leaves the problem of compiling from CVS, and of compiling the few packages that don't include MO files in the tarball (I think Gnumeric is one of them, though there might be some in the Desktop/Developer platform too). One approach to solving this problem, besides using definitions as above, is to preprocess PO files and remove "offending" features (msgid_plural and msgstr[N] forms). The Perl program I'll attach below does this, and can be used in a pipe (it reads standard input and writes to standard output).
Concretely, this program replaces any occurrence of a msgid_plural line with an empty translation (msgstr ""), i.e. this string will be untranslated for those folks (which means we should *recommend* using GNU gettext for Gnome, but we won't require it); it also comments out all occurrences of "^msgstr\[[0-9]+\]" lines. The file generated this way would (should?) be compilable with any msgfmt out there. I've tested it on a couple of Serbian translations with plural forms, but any input is greatly desired (I don't have any gettext available other than GNU's, so please test it if you can). How does everyone else feel about this solution?
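A minimal sketch of the fallback described above, assuming a configure check defines HAVE_NGETTEXT when the platform provides ngettext. The helper name found_files_format is hypothetical. Note one deliberate difference: the one-line define in the comment always returns the singular string, while this sketch picks between the two English strings; that variant is an assumption of the sketch, not something proposed in the thread.

```c
#include <stdio.h>

#ifdef HAVE_NGETTEXT
/* The platform provides ngettext(); use the real one. */
#include <libintl.h>
#else
/* Fallback for gettext implementations without plural support.
 * No catalog lookup happens here: the message stays untranslated and
 * only the English singular/plural distinction is applied. */
#define ngettext(singular, plural, n) ((n) == 1 ? (singular) : (plural))
#endif

/* Hypothetical helper: returns the format string matching the count. */
static const char *found_files_format(unsigned long n)
{
    return ngettext("Found %lu file.", "Found %lu files.", n);
}

/* Hypothetical caller showing how the chosen format would be used. */
static void report_found_files(unsigned long n)
{
    printf(found_files_format(n), n);
    putchar('\n');
}
```

Calling report_found_files(3) would go through the real ngettext when HAVE_NGETTEXT is defined, and through the macro otherwise, so the calling code stays the same on both kinds of system.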
Created attachment 21476: Preprocess PO files for non-plural-forms capable msgfmt
Just to add some other thoughts (sorry for the spam). I think all of this can be integrated into the intltool autoconf macros, except maybe for the no-op ngettext definition, thus making it straightforward for build maintenance (not requiring every module to include all the same checks in configure.in/ac). The intltool automake/autoconf macros should define HAVE_NGETTEXT if it's present. Since these macros are also used to build MO files out of PO files, it would require a change from something like:
    msgfmt -o $(OUTPUTFILE)
to
    if HAVE_NGETTEXT; then
        msgfmt -o $(OUTPUTFILE) $(INPUTFILE)
    else
        pre-msgfmt.pl < $(INPUTFILE) | msgfmt -o $(OUTPUTFILE) -
    fi
I guess this simple Perl script could also be improved to make use of an environment variable HAVE_NGETTEXT and output the file without changes when it is set, which would make the above simply:
    HAVE_NGETTEXT=$(HAVE_NGETTEXT) pre-msgfmt.pl < $(INPUTFILE) | msgfmt -o $(OUTPUTFILE) -
Of course, all Gnome 2.6 packages with plural forms should require this "new and improved" intltool version. Since I'm not exactly best friends with Autoconf/Automake, I'll wait for others' comments before even trying to hack on it.
External bug this one depends on, regarding Evolution: http://bugzilla.ximian.com/show_bug.cgi?id=53464
How can we deal with plurals in XML files? gnopernicus, for example, will need to mark as translatable strings which include plural constructions, and which live in XML files. It's possible for us to do something with our XML parser so that it calls ngettext at runtime; for instance, we can include the 'format string' in the XML content:
    <_ngettext-format>%d items found.
      <some-element-that-evaluates-to-an-int/>
    </_ngettext-format>
but will the _ngettext-format element get properly pulled into the .PO files for translation in a way appropriate to pluralization? Or do we need something like:
    <ngettext-format _singular="%d item found." _plural="%d items found.">
      <some-element-that-evaluates-to-an-int/>
    </ngettext-format>
so that both strings get pulled into the .po files, as localized attributes?
That wouldn't be enough. The number of plural forms may be up to 4 (perhaps even more, but I don't know of any such language), and the way to determine which of them is used is expressed in the form of a C expression in PO files. So, the first step would be to introduce something like:
    <ngettext locale="C">
      <plural num="0">%d item</plural>
      <plural num="1">%d items</plural>
    </ngettext>
and to have translations of the following form integrated:
    <ngettext locale="sr@Latn">
      <plural num="0">%d stavka</plural>
      <plural num="1">%d stavke</plural>
      <plural num="2">%d stavki</plural>
    </ngettext>
This is just example syntax, but it needs to scale with the number of plural forms (i.e. for two strings we may get 1--4 or possibly more translations). The problem here is how to decide which form to use. The GNU gettext library parses a string of the form (Serbian example):
    "nplurals=3; plural = n%10==1 && n%100!=11 ? 0 : n%10>=2 && n%10<=4 && (n%100<10 || n%100>=20) ? 1 : 2;"
and evaluates "plural" as the index of the plural form. So, if you want to put plural forms inside XML files, we've got two choices: either hardcode the language algorithms, or (re)implement an entire C arithmetic parser and use gettext("") to get the header from the MO file and extract the "Plural-Forms" field. The number "n" is not needed in the XML file, because the string is chosen at display time, and the choice depends on the number to be used with it. So, the best way would probably be to keep this data in PO/MO files instead of putting it in XML, and to use gettext to find the needed string.
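To make the "hardcode language algorithms" option concrete, here is a sketch of the Serbian Plural-Forms expression quoted above written out as a plain C function. The function names are hypothetical, and the three strings are the example translations from this comment. Hardcoding one language like this is exactly what does not scale, which is why the comment argues for leaving the choice to gettext.

```c
#include <stddef.h>

/* Hypothetical helper: the Serbian rule from the Plural-Forms header
 * quoted in the comment above (nplurals=3), transcribed term by term. */
static unsigned int serbian_plural_index(unsigned long n)
{
    if (n % 10 == 1 && n % 100 != 11)
        return 0;   /* 1, 21, 31, ... but not 11 */
    if (n % 10 >= 2 && n % 10 <= 4 && (n % 100 < 10 || n % 100 >= 20))
        return 1;   /* 2-4, 22-24, ... but not 12-14 */
    return 2;       /* everything else, including 0 and 5-20 */
}

/* Hypothetical helper: picks one of the three example translations. */
static const char *serbian_item_format(unsigned long n)
{
    static const char *const forms[] = {
        "%d stavka",   /* plural index 0 */
        "%d stavke",   /* plural index 1 */
        "%d stavki",   /* plural index 2 */
    };
    return forms[serbian_plural_index(n)];
}
```

For example, 21 selects form 0 and 22 selects form 1, while 11 and 12 both fall through to form 2.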
Danilo, I do not understand your reply. I do of course understand that there are potentially more than 2 plural forms, but my question had to do with the extraction of format strings _suitable for passing to ngettext_. Since C code using ngettext only passes two marked strings in, it seems obvious to me that only two strings need to be marked in the 'C' locale (i.e. only two msgids are required). It is required for this data to be in XML; perhaps you could re-read my previous question. The issue has to do with:
* a means of extracting strings from XML in a way that allows ngettext-appropriate plural translation
Your suggestion above does not seem workable at all, since it presupposes that the XML.in file knows all the possible plural forms, which we know is not feasible.
Ah, I thought you wanted translations to be in the XML file as well (which is usually done using "xml:lang"). This would require adding functionality to intltool, because it handles extracting strings from everything apart from source code; perhaps it's best to discuss a long-term solution on intltool@freedesktop.org? OTOH, a short-term solution, and probably the most painless way to do this right away without depending on the latest intltool (which is not even created yet :), is to extract these strings into a .h file, and put that in po/POTFILES.in.
Danilo: intltool-extract already extracts translatable strings from XML files, including both element content and attributes. So if we use either technique I mention, the strings will get pulled into the .po files. The issue is, what's "special" about the way plural-form strings get listed in the PO files that allows ngettext-type translation? Do the translators just grep for "%d", or what? I think this may be technically feasible without changing intltool-extract, but I need more info about how the ngettext-type extraction and translation work (not the internals of ngettext, which I can find info about, but how translators find stuff in .po files that needs ngettext-appropriate translation). - Bill
They're completely special-cased in PO files, and intltool-extract would *have* to be extended to support it. And no, intltool-extract doesn't support it at this time. Instead of the "regular":
    msgid "Original string"
    msgstr "Translation of it"
a PO file contains something like:
    msgid "%d original string"
    msgid_plural "%d original strings"
    msgstr[0] ""
    msgstr[1] ""
    ...
As far as I could tell, intltool-extract is designed with only one form per message (it puts all messages in a hash/array in Perl, and constructs a PO file later on), which means small architectural changes would be needed (like allowing arrays/pairs to be keys in a hash as well, and treating them as plural forms). As for finding out about this, it's documented in "info gettext" as well, topic "PO Files" (use "m PO Files[RET]"): two completely separate styles of "items" are documented, one without and the other with plural forms. I hope it's now finally clear that this is *not* possible without changes to intltool, and that's why I'd really like this discussion to be moved to intltool@freedesktop.org.
Are you saying that the apps that already use ngettext have hand-edited .pot files?
No, intltool-update calls xgettext which can extract gettext and ngettext messages out of C files just fine.
Bill, look at my suggestion above to extract strings from XML files and put them in a C header (.h) file, where I implied using xgettext to extract ngettext calls. You, of course, wouldn't use this .h file anywhere in your code, except for getting the strings into the PO file, and that's why they would have to be the same as those you use in the XML files and pass to ngettext later on; in that case, you may choose whichever DTD suits you best. The other option, as I already said, is to extend intltool to support it (which should probably be done anyway, but it doesn't seem to solve your immediate problem).
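A rough sketch of what this could look like, assuming a system whose libintl.h declares ngettext (glibc does). All names are hypothetical: the first function stands in for the contents of the extraction-only .h file listed in po/POTFILES.in, and the other two show how an XML parser might hand the same two strings, read from the XML at runtime, to ngettext. The strings are the ones from Bill's earlier XML example.

```c
#include <libintl.h>
#include <stdio.h>

/* Extraction-only part (hypothetical): exists just so that xgettext sees
 * the singular/plural pair as one plural entry. It is never called.
 * The strings must match the XML attributes character for character. */
static inline void xml_plural_strings_for_xgettext(void)
{
    (void) ngettext("%d item found.", "%d items found.", 0);
}

/* Runtime part (hypothetical): look up the translated format for the
 * singular/plural pair that the parser read from the XML file. */
static const char *plural_format(const char *singular, const char *plural, int n)
{
    return ngettext(singular, plural, (unsigned long) n);
}

/* Formats the count into buf using whichever form ngettext selected. */
static int format_count(char *buf, size_t len,
                        const char *singular, const char *plural, int n)
{
    return snprintf(buf, len, plural_format(singular, plural, n), n);
}
```

With no message catalog installed, ngettext simply returns the singular string for n == 1 and the plural string otherwise, so the sketch degrades to plain English.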
intltool-update currently extracts marked-up strings to (temporary) .h files, then runs xgettext on them, it seems. The possibility remains open that intltool-update's current behavior can be leveraged to do without having to create new Makefile rules to create and update the .h files manually (and add the persistent .h files to the POTFILES.in). It might take a small tweak to intltool-update, not sure without reading the code more closely. The point I am investigating here is leveraging intltool's existing functionality.
As I am facing this problem of missing ngettext support in glib right now, I'd like to raise this bug once again. Maybe we should start by providing the relevant macros in <glib/gi18n.h>. Once they start to be used, all the other problems will be resolved quickly.
What macros? I don't think there are any "standard" macros for ngettext, or are there? In any case, glib requires ngettext support now, so feel free to use ngettext() wherever you need it.
Regarding comment #22: Mathias, just use ngettext() function call directly, you need no macros. Anyway, this was just a "container" bug for all the ngettext bugs we found when we finally started supporting ngettext(). Those missing instances are pretty rare now, and I believe we can close this bug (all dependants seem resolved). Matthias?
agreed