GNOME Bugzilla – Bug 100456
unicode normalization doesn't handle hangul syllables
Last modified: 2011-02-18 16:07:18 UTC
g_unicode_canonical_decomposition does not decompose Hangul syllables.

UAX #15 (http://www.unicode.org/unicode/reports/tr15/) says "Canonical decomposition is the process of taking a string, recursively replacing composite characters using the Unicode canonical decomposition mappings (including the algorithmic Hangul canonical decomposition mappings, see Annex 10: Hangul), and putting the result in canonical order."

Here is a patch. The Hangul decomposition code is nearly verbatim from the sample implementation in UAX #15.

Index: glib/gunidecomp.c
===================================================================
RCS file: /cvs/gnome/glib/glib/gunidecomp.c,v
retrieving revision 1.13
diff -u -r1.13 gunidecomp.c
--- glib/gunidecomp.c  4 Dec 2002 01:27:43 -0000  1.13
+++ glib/gunidecomp.c  5 Dec 2002 18:35:36 -0000
@@ -128,6 +128,50 @@
   return NULL;
 }
 
+/* http://www.unicode.org/unicode/reports/tr15/#Hangul */
+static gunichar *
+hangul_decomposition (gunichar s, gsize *result_len)
+{
+#define SBase 0xAC00
+#define LBase 0x1100
+#define VBase 0x1161
+#define TBase 0x11A7
+#define LCount 19
+#define VCount 21
+#define TCount 28
+#define NCount (VCount * TCount)
+#define SCount (LCount * NCount)
+
+  gunichar *r = malloc (3 * sizeof (gunichar));
+  gint SIndex = s - SBase;
+
+  /* not a hangul syllable */
+  if (SIndex < 0 || SIndex >= SCount)
+    {
+      r[0] = s;
+      *result_len = 1;
+    }
+  else
+    {
+      gunichar L = LBase + SIndex / NCount;
+      gunichar V = VBase + (SIndex % NCount) / TCount;
+      gunichar T = TBase + SIndex % TCount;
+
+      r[0] = L;
+      r[1] = V;
+
+      if (T != TBase)
+        {
+          r[2] = T;
+          *result_len = 3;
+        }
+      else
+        *result_len = 2;
+    }
+
+  return r;
+}
+
 /**
  * g_unicode_canonical_decomposition:
  * @ch: a Unicode character.
@@ -142,10 +186,15 @@
 g_unicode_canonical_decomposition (gunichar ch,
                                    gsize *result_len)
 {
-  const guchar *decomp = find_decomposition (ch, FALSE);
+  const guchar *decomp;
   gunichar *r;
 
-  if (decomp)
+  /* Hangul syllable */
+  if (ch >= 0xac00 && ch <= 0xd7af)
+    {
+      r = hangul_decomposition (ch, result_len);
+    }
+  else if ((decomp = find_decomposition (ch, FALSE)) != NULL)
     {
       /* Found it.  */
       int i, len;

I tested the patch with the program below. The only difference between the old output and the new was the Hangul decomposition, and that part appeared to be correct, so I'm pretty confident that the patch is right. :)

#include <glib.h>

gint
main ()
{
  gunichar uc;
  gunichar *decomposition;
  gsize result_len;
  gint i;

  for (uc = 0; uc < 0x10ffff; uc++)
    {
      decomposition = g_unicode_canonical_decomposition (uc, &result_len);
      g_print ("U+%4.4X = U+%4.4X", uc, decomposition[0]);
      for (i = 1; i < result_len; i++)
        g_print (" + U+%4.4X", decomposition[i]);
      g_print ("\n");
      g_free (decomposition);
    }

  return 0;
}
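To make the arithmetic in the patch easier to check by hand, here is a minimal standalone sketch of the same decomposition that does not depend on GLib. The helper name `hangul_decompose`, the caller-supplied output buffer, and the use of `uint32_t` in place of `gunichar` are assumptions made for illustration only; the constants are the ones from the UAX #15 sample code quoted in the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Constants from the UAX #15 Hangul sample code. */
enum {
  S_BASE = 0xAC00, L_BASE = 0x1100, V_BASE = 0x1161, T_BASE = 0x11A7,
  L_COUNT = 19, V_COUNT = 21, T_COUNT = 28,
  N_COUNT = V_COUNT * T_COUNT,   /* 588 */
  S_COUNT = L_COUNT * N_COUNT    /* 11172 */
};

/* Hypothetical helper (not GLib API): writes the canonical decomposition
 * of `s` into `out` and returns the number of code points written (1-3). */
static size_t
hangul_decompose (uint32_t s, uint32_t out[3])
{
  assert (out != NULL);

  /* Not a precomposed Hangul syllable: the character maps to itself. */
  if (s < S_BASE || s >= S_BASE + S_COUNT)
    {
      out[0] = s;
      return 1;
    }

  uint32_t s_index = s - S_BASE;
  uint32_t t = T_BASE + s_index % T_COUNT;

  out[0] = L_BASE + s_index / N_COUNT;               /* leading consonant */
  out[1] = V_BASE + (s_index % N_COUNT) / T_COUNT;   /* vowel */

  /* A trailing index of 0 means the syllable has no final consonant. */
  if (t == T_BASE)
    return 2;

  out[2] = t;                                        /* trailing consonant */
  return 3;
}
```

For U+D55C the syllable index is 0x295C (10588), which gives L = U+1112, V = U+1161 and T = U+11AB; U+AC00 has no trailing consonant and decomposes to U+1100 U+1161.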
Looks plausible at first glance (please attach patches as attachments in the future, prevents mangling), but I don't have time to investigate in detail right now. Probably need equivalent handling for combine() in gunidecomp.c. We should also investigate adding a decomposition test case in glib/tests.
Created attachment 18915: full patch doing composition and decomposition
*** Bug 123156 has been marked as a duplicate of this bug. ***
I modified the patch somewhat and it is now used in GNU Libidn, thanks Noah! Complete modified patch at: http://savannah.gnu.org/cgi-bin/viewcvs/libidn/libidn/lib/nfkc.c.diff?r1=1.1&r2=1.2 I'm only using composition though, so I haven't tested decomposition, but if it passes the Unicode Inc test vectors, it is likely good. Thanks also to Owen for pointing me in the right direction. (I did search for 'hangul' in the BTS, but somehow I didn't find this bug...)
*** Bug 96314 has been marked as a duplicate of this bug. ***
Perhaps Jungshik Shin would want to review this patch (though I'm fine with it going in without review.) Some sort of extension of the automated test suite to test this would be nice, however.
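One cheap property an automated test could check is that every precomposed syllable survives a decompose/compose round trip. The sketch below only exercises the UAX #15 arithmetic on plain integers; it does not call GLib, and `hangul_round_trip_failures` is a hypothetical name. A real test in the GLib tree would presumably go through the library's own normalization entry points instead.

```c
#include <stdint.h>

/* Constants from the UAX #15 Hangul sample code. */
enum {
  S_BASE = 0xAC00, L_BASE = 0x1100, V_BASE = 0x1161, T_BASE = 0x11A7,
  L_COUNT = 19, V_COUNT = 21, T_COUNT = 28,
  N_COUNT = V_COUNT * T_COUNT,
  S_COUNT = L_COUNT * N_COUNT
};

/* Hypothetical check: decomposes every syllable in U+AC00..U+D7A3 into
 * L, V, T and recomposes it, returning how many syllables fail to
 * come back unchanged. */
static int
hangul_round_trip_failures (void)
{
  int failures = 0;

  for (uint32_t s = S_BASE; s < S_BASE + S_COUNT; s++)
    {
      uint32_t s_index = s - S_BASE;

      /* Decompose. T == T_BASE stands for "no trailing consonant". */
      uint32_t l = L_BASE + s_index / N_COUNT;
      uint32_t v = V_BASE + (s_index % N_COUNT) / T_COUNT;
      uint32_t t = T_BASE + s_index % T_COUNT;

      /* Recompose: L + V gives the LV syllable, then add the T offset. */
      uint32_t lv = S_BASE + ((l - L_BASE) * V_COUNT + (v - V_BASE)) * T_COUNT;
      uint32_t back = lv + (t - T_BASE);

      if (back != s)
        failures++;
    }

  return failures;
}
```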
As far as Unicode normalization goes, the patch looks fine. Unfortunately, the Hangul encoding model in general and the normalization in particular are broken, and they are frozen forever, so we can't fix them. [1] Therefore, the layout module needs to do an additional job instead of relying on this. [2] So, bug 96314 is not a dupe of this bug.

[1] http://i18nl10n.com/korean/jamocomp.html
    http://std.dkuug.dk/JTC1/SC22/WG20/docs/N954.PDF (full)
    http://std.dkuug.dk/JTC1/SC22/WG20/docs/N953.PDF (summary)
[2] ICU (in Jitterbug) has a rather 'generic' bug on this that was apparently filed to deal with Indic scripts but is applicable to Korean script as well.
This part of the patch should take care of testing:

Index: tests/unicode-normalize.c
===================================================================
RCS file: /cvs/gnome/glib/tests/unicode-normalize.c,v
retrieving revision 1.10
retrieving revision 1.11
diff -u -p -r1.10 -r1.11
--- tests/unicode-normalize.c  5 Aug 2003 03:41:34 -0000  1.10
+++ tests/unicode-normalize.c  4 Dec 2003 19:47:52 -0000  1.11
@@ -23,13 +23,6 @@ decode (const gchar *input)
           exit (1);
         }
 
-      /* FIXME: We don't handle the Hangul syllables */
-      if (ch >= 0xac00 && ch <= 0xd7ff) /* Hangul syllables */
-        {
-          g_string_free (result, TRUE);
-          return NULL;
-        }
-
       g_string_append_unichar (result, ch);
 
       while (input[offset] && input[offset] != ' ')

2003-12-04  Noah Levitt  <nlevitt@columbia.edu>

        * glib/gunidecomp.c: Add hangul composition and decomposition to
        unicode normalization. (#100456)

        * tests/unicode-normalize.c: Test hangul.