Bug 100456 – unicode normalization doesn't handle hangul syllables



Summary:	unicode normalization doesn't handle hangul syllables


Status:	RESOLVED FIXED

Product:	glib
Classification:	Platform
Component:	general
Version:	unspecified
Hardware:	Other All

Importance:	Normal normal
Target Milestone:	---
Assigned To:	gtkdev
QA Contact:	gtkdev

URL:
Whiteboard:

Duplicates:	123156
Depends on:
Blocks:

Reported:	2002-12-05 19:11 UTC by Noah Levitt
Modified:	2011-02-18 16:07 UTC

See Also:
GNOME target:	---
GNOME version:	---

Attachments
full patch doing composition and decomposition (5.63 KB, patch), 2003-08-05 04:45 UTC, Noah Levitt

Description Noah Levitt 2002-12-05 19:11:18 UTC

g_unicode_canonical_decomposition does not decompose Hangul syllables.

UAX #15 (http://www.unicode.org/unicode/reports/tr15/) says "Canonical
decomposition is the process of taking a string, recursively replacing
composite characters using the Unicode canonical decomposition mappings
(including the algorithmic Hangul canonical decomposition mappings, see
Annex 10: Hangul), and putting the result in canonical order."

Here is a patch. The Hangul decomposition code is nearly verbatim from the
sample implementation in UAX #15.

Index: glib/gunidecomp.c
===================================================================
RCS file: /cvs/gnome/glib/glib/gunidecomp.c,v
retrieving revision 1.13
diff -u -r1.13 gunidecomp.c
--- glib/gunidecomp.c   4 Dec 2002 01:27:43 -0000       1.13
+++ glib/gunidecomp.c   5 Dec 2002 18:35:36 -0000
@@ -128,6 +128,50 @@
   return NULL;
 }
 
+/* http://www.unicode.org/unicode/reports/tr15/#Hangul */
+static gunichar *
+hangul_decomposition (gunichar s, gsize *result_len)
+{
+#define SBase 0xAC00 
+#define LBase 0x1100 
+#define VBase 0x1161 
+#define TBase 0x11A7
+#define LCount 19 
+#define VCount 21
+#define TCount 28
+#define NCount (VCount * TCount)
+#define SCount (LCount * NCount)
+
+  gunichar *r = g_malloc (3 * sizeof (gunichar));
+  gint SIndex = s - SBase;
+
+  /* not a hangul syllable */
+  if (SIndex < 0 || SIndex >= SCount)
+    {
+      r[0] = s;
+      *result_len = 1;
+    }
+  else
+    {
+      gunichar L = LBase + SIndex / NCount;
+      gunichar V = VBase + (SIndex % NCount) / TCount;
+      gunichar T = TBase + SIndex % TCount;
+
+      r[0] = L;
+      r[1] = V;
+
+      if (T != TBase) 
+        {
+          r[2] = T;
+          *result_len = 3;
+        }
+      else
+        *result_len = 2;
+    }
+
+  return r;
+}
+
 /**
  * g_unicode_canonical_decomposition:
  * @ch: a Unicode character.
@@ -142,10 +186,15 @@
 g_unicode_canonical_decomposition (gunichar ch,
                                   gsize   *result_len)
 {
-  const guchar *decomp = find_decomposition (ch, FALSE);
+  const guchar *decomp;
   gunichar *r;
 
-  if (decomp)
+  /* Hangul syllable */
+  if (ch >= 0xac00 && ch <= 0xd7af)
+    {
+      r = hangul_decomposition (ch, result_len);
+    }
+  else if ((decomp = find_decomposition (ch, FALSE)) != NULL)
     {
       /* Found it.  */
       int i, len;



I tested the patch with the program below. The only difference between the
old output and the new was the Hangul decomposition, and that part appeared
to be correct, so I'm pretty confident that the patch is right. :)

#include <glib.h>

gint
main (void)
{
  gunichar uc;
  gunichar *decomposition;
  gsize result_len;
  gsize i;

  /* Walk every code point, including the last one (U+10FFFF). */
  for (uc = 0;  uc <= 0x10ffff;  uc++)
    {
      decomposition = g_unicode_canonical_decomposition (uc, &result_len);
      g_print ("U+%4.4X = U+%4.4X", uc, decomposition[0]);
      for (i = 1;  i < result_len;  i++)
        g_print (" + U+%4.4X", decomposition[i]);
      g_print ("\n");
      g_free (decomposition);
    }

  return 0;
}
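
For readers who want to check the arithmetic by hand, here is a minimal
standalone sketch of the same Hangul decomposition. It is not the patch
itself: it uses uint32_t instead of gunichar so that it builds without
GLib, and the name hangul_decompose and its caller-supplied output
buffer are assumptions made for this illustration only.

```c
#include <stddef.h>
#include <stdint.h>

/* Constants from the Hangul section of UAX #15 (same values as the patch). */
enum
{
  SBase = 0xAC00, LBase = 0x1100, VBase = 0x1161, TBase = 0x11A7,
  LCount = 19, VCount = 21, TCount = 28,
  NCount = VCount * TCount,     /* 588 */
  SCount = LCount * NCount      /* 11172 */
};

/* Hypothetical helper, not part of GLib: writes the decomposition of s
 * into out and returns how many code points were written (1, 2 or 3). */
size_t
hangul_decompose (uint32_t s, uint32_t out[3])
{
  /* Not a precomposed Hangul syllable: the character maps to itself. */
  if (s < SBase || s >= SBase + SCount)
    {
      out[0] = s;
      return 1;
    }

  uint32_t SIndex = s - SBase;
  uint32_t T = TBase + SIndex % TCount;

  out[0] = LBase + SIndex / NCount;             /* leading consonant */
  out[1] = VBase + (SIndex % NCount) / TCount;  /* vowel */

  /* T == TBase means "no trailing consonant". */
  if (T == TBase)
    return 2;

  out[2] = T;                                   /* trailing consonant */
  return 3;
}
```

For U+D55C the syllable index is 0xD55C - 0xAC00 = 10588, so the
function yields U+1112, U+1161 and U+11AB, while U+AC00 has no trailing
consonant and yields only U+1100 and U+1161.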

Comment 1 Owen Taylor 2002-12-09 16:52:30 UTC

Looks plausible at first glance (please attach patches as
attachments in the future, prevents mangling), but I don't 
have time to investigate in detail right now.

Probably need equivalent handling for combine() in 
gunidecomp.c. 

We should also investigate adding a decomposition test case 
in glib/tests.
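
The composition side mentioned above could look roughly like the sketch
below, which follows the Hangul composition rules in UAX #15 (L + V
gives an LV syllable, LV + T gives an LVT syllable). This is an
illustration under assumptions, not the code in gunidecomp.c: the name
hangul_compose_pair, the use of uint32_t instead of gunichar, and the
"return 0 when the pair does not compose" convention are all invented
here.

```c
#include <stdint.h>

/* Constants from the Hangul section of UAX #15. */
enum
{
  SBase = 0xAC00, LBase = 0x1100, VBase = 0x1161, TBase = 0x11A7,
  LCount = 19, VCount = 21, TCount = 28,
  NCount = VCount * TCount,     /* 588 */
  SCount = LCount * NCount      /* 11172 */
};

/* Hypothetical helper, not part of GLib: returns the composed syllable
 * for the pair (a, b), or 0 if the pair does not compose. */
uint32_t
hangul_compose_pair (uint32_t a, uint32_t b)
{
  /* Leading consonant + vowel -> LV syllable. */
  if (a >= LBase && a < LBase + LCount &&
      b >= VBase && b < VBase + VCount)
    return SBase + ((a - LBase) * VCount + (b - VBase)) * TCount;

  /* LV syllable + trailing consonant -> LVT syllable.  TBase itself is
   * excluded because it stands for "no trailing consonant". */
  if (a >= SBase && a < SBase + SCount &&
      (a - SBase) % TCount == 0 &&
      b > TBase && b < TBase + TCount)
    return a + (b - TBase);

  return 0;
}
```

With these rules U+1112 followed by U+1161 composes to U+D558, and
U+D558 followed by U+11AB composes to U+D55C. A syllable that already
has a trailing consonant, such as U+D55C, does not combine with another
one.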

Comment 2 Noah Levitt 2003-08-05 04:45:40 UTC

Created attachment 18915
full patch doing composition and decomposition

Comment 3 Owen Taylor 2003-09-25 02:14:15 UTC

*** Bug 123156 has been marked as a duplicate of this bug. ***

Comment 4 Simon Josefsson 2003-09-25 17:43:53 UTC

I modified the patch somewhat and it is now used in GNU Libidn, thanks
Noah!  Complete modified patch at:
http://savannah.gnu.org/cgi-bin/viewcvs/libidn/libidn/lib/nfkc.c.diff?r1=1.1&r2=1.2

I'm only using composition though, so I haven't tested decomposition,
but if it passes the Unicode Inc test vectors, it is likely good.

Thanks also to Owen for pointing me in the right direction.  (I did
search for 'hangul' in the BTS, but somehow I didn't find this bug...)

Comment 5 Changwoo Ryu 2003-11-14 11:42:34 UTC

*** Bug 96314 has been marked as a duplicate of this bug. ***

Comment 6 Owen Taylor 2003-12-04 17:27:15 UTC

Perhaps Jungshik Shin would want to review this patch (though
I'm fine with it going in without review.) Some sort of extension
of the automated test suite to test this would be nice, however.

Comment 7 Jungshik Shin 2003-12-04 18:21:11 UTC

As far as Unicode normalization goes, the patch looks fine. 

Unfortunately, the Hangul encoding model in general and its normalization
in particular are broken, and they are frozen forever, so we can't fix
them. [1] Therefore, the layout module needs to do additional work
instead of relying on this. [2] So, bug 96314 is not a dupe of this bug.

[1] http://i18nl10n.com/korean/jamocomp.html
    http://std.dkuug.dk/JTC1/SC22/WG20/docs/N954.PDF (full);
    http://std.dkuug.dk/JTC1/SC22/WG20/docs/N953.PDF (summary)

[2] ICU (in Jitterbug) has a rather 'generic' bug on this that was
apparently filed to deal with Indic scripts but is applicable to
Korean script as well.

Comment 8 Noah Levitt 2003-12-04 19:50:16 UTC

This part of the patch should take care of testing:

Index: tests/unicode-normalize.c
===================================================================
RCS file: /cvs/gnome/glib/tests/unicode-normalize.c,v
retrieving revision 1.10
retrieving revision 1.11
diff -u -p -r1.10 -r1.11
--- tests/unicode-normalize.c   5 Aug 2003 03:41:34 -0000       1.10
+++ tests/unicode-normalize.c   4 Dec 2003 19:47:52 -0000       1.11
@@ -23,13 +23,6 @@ decode (const gchar *input)
          exit (1);
        }
 
-      /* FIXME: We don't handle the Hangul syllables */
-      if (ch >= 0xac00 && ch <= 0xd7ff)  /* Hangul syllables */
-       {
-         g_string_free (result, TRUE);
-         return NULL;
-       }
-
       g_string_append_unichar (result, ch);
       
       while (input[offset] && input[offset] != ' ')


2003-12-04  Noah Levitt  <nlevitt@columbia.edu>

	* glib/gunidecomp.c: Add hangul composition and decomposition to
	unicode normalization. (#100456)

	* tests/unicode-normalize.c: Test hangul.