Bug 710142 - Add more impressive transliteration to GLib
Status: RESOLVED FIXED
Product: glib
Classification: Platform
Component: i18n
Version: unspecified
Hardware: Other All
Importance: Normal enhancement
Target Milestone: ---
Assigned To: gtkdev
QA Contact: gtkdev
Depends on:
Blocks: 724194
Reported: 2013-10-14 20:59 UTC by Allison Karlitskaya (desrt)
Modified: 2014-02-20 23:29 UTC
See Also:
GNOME target: ---
GNOME version: ---


Attachments
transliteration based on gconv (6.35 KB, patch)
2014-02-17 14:19 UTC, Allison Karlitskaya (desrt)
none
Add locale-sensitive ASCII transliteration API (50.30 KB, patch)
2014-02-17 18:20 UTC, Allison Karlitskaya (desrt)
reviewed
tests: test transliteration API (3.75 KB, patch)
2014-02-17 18:20 UTC, Allison Karlitskaya (desrt)
committed
g_str_tokenize_and_fold: do proper transliteration (2.09 KB, patch)
2014-02-17 18:20 UTC, Allison Karlitskaya (desrt)
committed
Add locale-sensitive ASCII transliteration API (50.76 KB, patch)
2014-02-19 04:05 UTC, Allison Karlitskaya (desrt)
committed

Description Allison Karlitskaya (desrt) 2013-10-14 20:59:25 UTC
See bug 709753.

g_str_tokenize_and_fold() does ASCII transliteration of strings for purposes of loose matching.  The API appears to support locale-specific transliteration (for example, translating "ö" -> "oe" for German locales) but this is not actually implemented.

iconv supports this sort of transliteration but it does it in whatever happens to be the current locale.  This is obviously not something that we can do from a library (unless we have xlocale, but even then we would not want to).

glibc contains the transliteration rules in its localedata.

glibc also exposes a gconv interface which seems more powerful (but I didn't have time to look into using it).

The documentation for g_str_tokenize_and_fold() is pretty good about making no promises about what sort of transliteration we do, so we could use gconv if we have it or do nothing if we don't.  We could also try to steal the tables out of glibc and implement a mapping function ourselves.
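
To make the "implement a mapping function ourselves" option concrete, here is a minimal sketch in plain C. It is not the code from any of the attached patches: the `struct rule` layout, the two tiny rule tables, and the name `sketch_to_ascii()` are all hypothetical, and the entries are illustrative rather than copied from glibc's localedata. It only shows the general idea of consulting locale-specific rules first and falling back to default rules.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical rule: one UTF-8 encoded character and its ASCII replacement. */
struct rule {
  const char *utf8;
  const char *ascii;
};

/* Illustrative entries only; a real table would be generated from locale data. */
static const struct rule default_rules[] = {
  { "\xc3\xa4", "a" },   /* ä */
  { "\xc3\xb6", "o" },   /* ö */
  { "\xc3\xbc", "u" },   /* ü */
  { "\xc3\x9f", "ss" },  /* ß */
};

static const struct rule german_rules[] = {
  { "\xc3\xa4", "ae" },  /* ä */
  { "\xc3\xb6", "oe" },  /* ö */
  { "\xc3\xbc", "ue" },  /* ü */
};

/* Return the replacement for the character at the start of s, or NULL.
 * On success, *consumed is set to the number of input bytes matched. */
static const char *
find_rule (const struct rule *rules, size_t n_rules,
           const char *s, size_t *consumed)
{
  for (size_t i = 0; i < n_rules; i++)
    {
      size_t len = strlen (rules[i].utf8);

      if (strncmp (s, rules[i].utf8, len) == 0)
        {
          *consumed = len;
          return rules[i].ascii;
        }
    }
  return NULL;
}

/* Transliterate valid UTF-8 `in` into `out` (capacity out_size, NUL included).
 * `language` selects extra rules; NULL or an unknown value uses the defaults.
 * Returns 0 on success, -1 if `out` is too small. */
static int
sketch_to_ascii (const char *in, const char *language,
                 char *out, size_t out_size)
{
  int german = language != NULL && strcmp (language, "de") == 0;
  size_t used = 0;

  assert (in != NULL && out != NULL && out_size > 0);

  while (*in != '\0')
    {
      const char *repl = NULL;
      size_t consumed = 1;
      char ascii[2] = { 0, 0 };
      size_t len;

      if ((unsigned char) *in < 0x80)
        {
          ascii[0] = *in;              /* already ASCII: copy through */
          repl = ascii;
        }
      else
        {
          if (german)
            repl = find_rule (german_rules,
                              sizeof german_rules / sizeof german_rules[0],
                              in, &consumed);
          if (repl == NULL)
            repl = find_rule (default_rules,
                              sizeof default_rules / sizeof default_rules[0],
                              in, &consumed);
          if (repl == NULL)
            {
              /* No rule: emit '?' and skip the whole multi-byte sequence. */
              repl = "?";
              consumed = 1;
              while (((unsigned char) in[consumed] & 0xc0) == 0x80)
                consumed++;
            }
        }

      len = strlen (repl);
      if (used + len + 1 > out_size)
        return -1;
      memcpy (out + used, repl, len);
      used += len;
      in += consumed;
    }

  out[used] = '\0';
  return 0;
}
```

With these tables, calling `sketch_to_ascii()` on "schön" with language "de" yields "schoen", while any other language value yields "schon". A linear scan is used only for brevity; a real table of several thousand entries would more likely be sorted and searched by bisection.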
Comment 1 Matthias Clasen 2013-10-16 18:05:23 UTC
(In reply to comment #0)

> iconv supports this sort of transliteration but it does it in whatever happens
> to be the current locale.  This is obviously not something that we can do from
> a library (unless we have xlocale, but even then we would not want to).

Why not ?

> 
> glibc contains the transliteration rules in its localedata.
> 
> glibc also exposes a gconv interface which seems more powerful (but I didn't
> have time to look into using it).

gconv.h doesn't look promising:

/* This header provides no interface for a user to the internals of
   the gconv implementation in the libc.  Therefore there is no use
   for these definitions beside for writing additional gconv modules.  */
Comment 2 Allison Karlitskaya (desrt) 2014-02-17 14:19:07 UTC
Created attachment 269407
transliteration based on gconv

Here's something I hacked up over the weekend -- it uses xlocale and nl_langinfo_l() to pull the per-locale transliteration information out of the locale, just like gconv does.

This would only ever work on glibc, and that assumes they don't change the API (which they might -- the header is slightly vague about whether this is "API" or not).

It's also much slower and much more complicated than another approach, which will be following soon...
Comment 3 Allison Karlitskaya (desrt) 2014-02-17 18:20:48 UTC
Created attachment 269446
Add locale-sensitive ASCII transliteration API

Add a new function, g_str_to_ascii() that does locale-dependent ASCII
transliteration of UTF-8 strings.
Comment 4 Allison Karlitskaya (desrt) 2014-02-17 18:20:52 UTC
Created attachment 269447
tests: test transliteration API

Add some tests for the new transliteration API.
Comment 5 Allison Karlitskaya (desrt) 2014-02-17 18:20:56 UTC
Created attachment 269448
g_str_tokenize_and_fold: do proper transliteration

g_str_tokenize_and_fold() can now do proper locale-sensitive
transliteration for ASCII alternatives.
Comment 6 Matthias Clasen 2014-02-17 21:19:39 UTC
Review of attachment 269446:

::: glib/gtranslit.c
@@ +225,3 @@
+  if (language_len == 0 || *next_char)
+    return default_item_id;
+

We do have code elsewhere in glib to parse locale ids - could it be reused here ?

@@ +298,3 @@
+ * g_str_to_ascii:
+ * @str: a string, in UTF-8
+ * @from_locale: (allow-none): the source locale, if known

Might be good to give an example of the kind of string you can pass here: does 'de' work ? 'de_DE' ? 'de_DE@euro' ? any of the above ?

@@ +313,3 @@
+ * If @from_locale is %NULL then the current locale is used.
+ *
+ * If you want a consistent transliteration, specify "C" for

'consistent' here means just 'independent of locale', correct ? even for C, we're still not guaranteeing stable transliteration across versions or platforms, right ?

@@ +333,3 @@
+    item_id = get_default_item_id ();
+
+  result = g_string_new (NULL);

should this be g_string_new (length) ?

@@ +345,3 @@
+
+          /* We only have characters <= 0xffff in the table */
+          if (c <= 0xffff)

Might be slightly nicer to avoid the conversion to ucs4 here, and just go by the utf8 sequence ? given that you are looking at the sequence below, anyway...
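
As a rough illustration of "going by the UTF-8 sequence": a code point at or below 0xffff is always encoded in at most three bytes, so the lead byte alone can answer the `c <= 0xffff` question without decoding to UCS-4. The sketch below is not taken from the patch; the helper names are hypothetical, and it assumes the input has already been validated as UTF-8.

```c
#include <assert.h>
#include <stddef.h>

/* Length in bytes of the UTF-8 sequence introduced by `lead`,
 * or 0 if `lead` cannot start a sequence (continuation or invalid byte). */
static size_t
utf8_sequence_length (unsigned char lead)
{
  if (lead < 0x80)
    return 1;                          /* U+0000..U+007F */
  if (lead >= 0xc2 && lead <= 0xdf)
    return 2;                          /* U+0080..U+07FF */
  if (lead >= 0xe0 && lead <= 0xef)
    return 3;                          /* U+0800..U+FFFF */
  if (lead >= 0xf0 && lead <= 0xf4)
    return 4;                          /* U+10000..U+10FFFF */
  return 0;
}

/* Non-zero if the character at the start of `s` is at or below U+FFFF,
 * judged only from its lead byte. Assumes `s` is valid UTF-8. */
static int
utf8_fits_in_16_bits (const char *s)
{
  size_t len;

  assert (s != NULL);

  len = utf8_sequence_length ((unsigned char) s[0]);
  return len >= 1 && len <= 3;
}
```

For example, "ö" (two bytes) and "€" (three bytes) pass the check, while U+1D400 MATHEMATICAL BOLD CAPITAL A (four bytes) does not.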
Comment 7 Matthias Clasen 2014-02-17 21:22:21 UTC
Review of attachment 269447:

::: glib/tests/strfuncs.c
@@ +1418,3 @@
+  g_free (out);
+
+  out = g_str_to_ascii ("ö", "doesnotexist");

Shouldn't nonsense locale names trigger a warning ?
Comment 8 Allison Karlitskaya (desrt) 2014-02-19 02:59:54 UTC
(In reply to comment #6)
> Review of attachment 269446:
> We do have code elsewhere in glib to parse locale ids - could it be reused here
> ?

This is g_get_locale_variants() but I believe it to be inappropriate for two reasons:

  1) it allocates a bunch of memory and I want this function to be fast

  2) I want to decompose the locale in the same order that the update script
     did it.  In particular, after the language, I consider the @modifier to
     be the most important thing.  The normal POSIX rules say that the country
     should come next, however.

     The reason I consider the modifier (variant) to be most important is
     that the @latin variants of some locales have substantial impact on the
     transliteration rules (whereas the country code typically doesn't).
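
One way to read the ordering described in point 2 is as a list of lookup keys, tried from most to least specific, with the @modifier outranking the country. The following plain C sketch is an assumption about that ordering, not the code from the patch; `locale_lookup_keys()`, `copy_part()` and `MAX_PART` are hypothetical names.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define MAX_PART 32

/* Copy the span [start, end) into dst (capacity MAX_PART), truncating if needed. */
static void
copy_part (char *dst, const char *start, const char *end)
{
  size_t len = (size_t) (end - start);

  if (len >= MAX_PART)
    len = MAX_PART - 1;
  memcpy (dst, start, len);
  dst[len] = '\0';
}

/* Split a POSIX locale name, language[_territory][.codeset][@modifier],
 * and write up to four lookup keys into `keys`, most specific first.
 * Keys containing the modifier come before keys containing only the territory.
 * Returns the number of keys written (0 if there is no language part). */
static int
locale_lookup_keys (const char *locale, char keys[4][3 * MAX_PART])
{
  char language[MAX_PART] = "";
  char territory[MAX_PART] = "";
  char modifier[MAX_PART] = "";
  const char *p = locale;
  const char *end;
  int n = 0;

  assert (locale != NULL);

  end = p + strcspn (p, "_.@");
  copy_part (language, p, end);
  p = end;

  if (*p == '_')
    {
      p++;
      end = p + strcspn (p, ".@");
      copy_part (territory, p, end);
      p = end;
    }

  if (*p == '.')
    {
      p++;
      p += strcspn (p, "@");           /* the codeset plays no role in the lookup */
    }

  if (*p == '@')
    copy_part (modifier, p + 1, p + 1 + strlen (p + 1));

  if (language[0] == '\0')
    return 0;

  if (territory[0] != '\0' && modifier[0] != '\0')
    snprintf (keys[n++], 3 * MAX_PART, "%s_%s@%s", language, territory, modifier);
  if (modifier[0] != '\0')
    snprintf (keys[n++], 3 * MAX_PART, "%s@%s", language, modifier);
  if (territory[0] != '\0')
    snprintf (keys[n++], 3 * MAX_PART, "%s_%s", language, territory);
  snprintf (keys[n++], 3 * MAX_PART, "%s", language);

  return n;
}
```

For "sr_RS.UTF-8@latin" this produces "sr_RS@latin", "sr@latin", "sr_RS", "sr" in that order, so a table that only has an "sr@latin" entry is matched before the plain country or language rules are considered. Writing into a caller-provided array also avoids the heap allocations mentioned in point 1.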

> > @@ +298,3 @@
> + * g_str_to_ascii:
> + * @str: a string, in UTF-8
> + * @from_locale: (allow-none): the source locale, if known
> 
> Might be good to give an example of the kind of string you can pass here: does
> 'de' work ? 'de_DE' ? 'de_DE@euro' ? any of the above ?

Any valid POSIX locale string should work.  I will add a note to the docs.


> @@ +313,3 @@
> + * If @from_locale is %NULL then the current locale is used.
> + *
> + * If you want a consistent transliteration, specify "C" for
> 
> 'consistent' here means just 'independent of locale', correct ? even for C,
> we're still not guaranteeing stable transliteration across versions or
> platforms, right ?

Right.  I'll clarify this.

> @@ +333,3 @@
> +    item_id = get_default_item_id ();
> +
> +  result = g_string_new (NULL);
> 
> should this be g_string_new (length) ?

g_string_new() takes a string to copy as the original value, or NULL (equivalent to "").

> 
> @@ +345,3 @@
> +
> +          /* We only have characters <= 0xffff in the table */
> +          if (c <= 0xffff)
> 
> Might be slightly nicer to avoid the conversion to ucs4 here, and just go by
> the utf8 sequence ? given that you are looking at the sequence below, anyway...

True.  I'll tweak that.

(In reply to comment #7)
> Review of attachment 269447:
> 
> ::: glib/tests/strfuncs.c
> @@ +1418,3 @@
> +  g_free (out);
> +
> +  out = g_str_to_ascii ("ö", "doesnotexist");
> 
> Shouldn't nonsense locale names trigger a warning ?

I don't think so.  I have no way to know that "doesnotexist" is not a valid name of a language -- and not all language codes are two letters (although I didn't see one over three, to be honest...).

I think a fallback to the standard rules is appropriate in this case.  It also means that people can say "POSIX" (which is a widely-accepted alias for "C").
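
The silent fallback can be sketched as follows, in plain C. This is only an illustration of the behaviour argued for above, not the patch itself: `rule_set_for_locale()` and the `known_languages` list are hypothetical, and the list is made up for the example.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical list of languages that have their own transliteration rules. */
static const char *const known_languages[] = { "de", "sr" };

/* Return the name of the rule set to use for `locale`.
 * Anything whose language part is not in the list, including "POSIX",
 * "C" and nonsense names, quietly maps to the default "C" rules. */
static const char *
rule_set_for_locale (const char *locale)
{
  size_t lang_len;

  assert (locale != NULL);

  /* The language part ends at the first '_', '.' or '@'. */
  lang_len = strcspn (locale, "_.@");

  for (size_t i = 0; i < sizeof known_languages / sizeof known_languages[0]; i++)
    {
      if (strlen (known_languages[i]) == lang_len &&
          strncmp (locale, known_languages[i], lang_len) == 0)
        return known_languages[i];
    }

  return "C";
}
```

Here "de_AT.UTF-8" selects the "de" rules, while "doesnotexist", "POSIX" and "deu" all select "C" without any warning being emitted.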
Comment 9 Allison Karlitskaya (desrt) 2014-02-19 03:59:37 UTC
Some notes about data size:

As the patch currently sits, the size is 6181 bytes of data.

Switching to full-width gunichar support would make that 6309 bytes, but there is nothing up there that I wouldn't have already blacklisted anyway by the same standards that I applied elsewhere (it's all of the mathematical bold/double-strike/italic/etc. forms).

Enabling Ethiopic would bump that to 11013 bytes, or 12133 bytes with full-width gunichar.

Enabling absolutely everything, including using full-width gunichar would use 20389 bytes.

Enabling everything below 0xffff and using guint16 would use 13509 bytes.

Enabling everything below 0xffff (with guint16) except Ethiopic would use 8645 bytes.

Of the Ethiopic languages which include the transliteration information, only Amharic has a GNOME translation, and it is at 2% and declining.  Looking at 'yum provides' for /usr/share/locale/am/ shows that GNOME programs are just about the only ones that install messages for this locale.



By comparison, the Unicode tables use 241718 bytes.

In light of this, my attempts to reduce the size of the data seem to be a bit misguided...

Thoughts?
Comment 10 Allison Karlitskaya (desrt) 2014-02-19 04:05:42 UTC
Created attachment 269659
Add locale-sensitive ASCII transliteration API

Add a new function, g_str_to_ascii() that does locale-dependent ASCII
transliteration of UTF-8 strings.

This function works off of an internal database.  We get the data out of
the localedata shipped with glibc, which seems to be just about the best
source of locale-sensitive transliteration information available
anywhere.

We include an update script with this commit that's not used by anything
at all -- it will just sit in git.  It is intended to be run manually
from time to time.
Comment 11 Matthias Clasen 2014-02-19 15:25:13 UTC
Review of attachment 269659:

::: glib/gtranslit.c
@@ +336,3 @@
+    item_id = get_default_item_id ();
+
+  result = g_string_new (NULL);

What I actually meant earlier is: this could be g_string_sized_new (strlen (str));
Comment 12 Matthias Clasen 2014-02-20 23:11:07 UTC
Review of attachment 269447:

sure
Comment 13 Matthias Clasen 2014-02-20 23:11:37 UTC
Review of attachment 269448:

ok
Comment 14 Matthias Clasen 2014-02-20 23:12:43 UTC
Review of attachment 269659:

looks fine to me
Comment 15 Allison Karlitskaya (desrt) 2014-02-20 23:29:14 UTC
Attachment 269447 pushed as d729176 - tests: test transliteration API
Attachment 269448 pushed as a8ea3dc - g_str_tokenize_and_fold: do proper transliteration
Attachment 269659 pushed as 941b897 - Add locale-sensitive ASCII transliteration API

Pushed.

As discussed on IRC, I removed the blacklist -- 20k of shared memory is probably no big deal.