GNOME Bugzilla – Bug 790391
Found Korean Syllables Canonical Decomposition bug
Last modified: 2017-11-26 10:50:23 UTC
Created attachment 363718 [details] [review]
[PATCH] Fixed Korean Hangul Syllables Canonical Decomposition bug on GNOME-characters

I found a bug in the canonical decomposition of Korean Hangul syllables: Hangul syllables are not fully decomposed.

Expected: U+D4DB → <U+1111, U+1171, U+11B6> = full canonical decomposition result.
Result:   U+D4DB → <U+D4CC, U+11B6> = intermediate step only.

I tracked the bug down. Its root is in GNU libunistring, which GNOME Characters depends on: libunistring's canonical decomposition of Hangul syllables stops at the intermediate step.

According to the Unicode Standard, the Hangul Decomposition Algorithm directly decomposes precomposed Hangul syllable characters into a sequence of either two or three Hangul jamo characters.

The patch fixes GNU libunistring's Hangul Decomposition Algorithm (also known as the Korean alphabet decomposition algorithm).

Check the documentation:
The Unicode® Standard Version 10.0 – Core Specification, 3.12 Conjoining Jamo Behavior
http://www.unicode.org/versions/Unicode10.0.0/ch03.pdf
Unicode® Standard Annex #15 - UNICODE NORMALIZATION FORMS
http://unicode.org/reports/tr15/

A detailed explanation will follow this weekend. I will also report the Korean canonical decomposition bug to the GNU libunistring committers.
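For reference, the arithmetic behind the expected result can be sketched in C. This is only a minimal sketch of the direct (full) decomposition described in section 3.12 of the Unicode Standard, not the libunistring code; the function name hangul_decompose_full and the constant names are assumptions chosen for illustration.

```c
#include <stdint.h>

/* Constants from the Unicode Standard, section 3.12. */
enum {
  S_BASE = 0xAC00, L_BASE = 0x1100, V_BASE = 0x1161, T_BASE = 0x11A7,
  L_COUNT = 19, V_COUNT = 21, T_COUNT = 28,
  N_COUNT = V_COUNT * T_COUNT,   /* 588 */
  S_COUNT = L_COUNT * N_COUNT    /* 11172 */
};

/* Hypothetical helper: writes the full canonical decomposition of the
   precomposed Hangul syllable s into out[] and returns the number of
   jamo written (2 or 3), or 0 if s is not a precomposed syllable. */
static int
hangul_decompose_full (uint32_t s, uint32_t out[3])
{
  uint32_t s_index, t_index;

  if (s < S_BASE || s >= S_BASE + S_COUNT)
    return 0;

  s_index = s - S_BASE;
  out[0] = L_BASE + s_index / N_COUNT;              /* choseong */
  out[1] = V_BASE + (s_index % N_COUNT) / T_COUNT;  /* jungseong */
  t_index = s_index % T_COUNT;
  if (t_index == 0)
    return 2;                                       /* no jongseong */
  out[2] = T_BASE + t_index;                        /* jongseong */
  return 3;
}
```

For U+D4DB the syllable index is 0x28DB = 10459, which gives L index 17, V index 16 and T index 15, so the call returns 3 with <U+1111, U+1171, U+11B6>, the expected result. Removing only the T index from the syllable gives U+D4CC, the intermediate step that is currently shown.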
Created attachment 363721 [details]
Present canonical decomposition of Korean Hangul. This is the bug.
Created attachment 363723 [details]
Expected canonical decomposition of Korean Hangul.
Hangul elements are commonly referred to as jamo (자모/字母), meaning "alphabet". Korean has special terms for the jamo that are used to construct a Hangul syllable, depending on where in the syllable they appear:
- Choseong (초성/初聲) for the initial sound, usually a consonant
- Jungseong (중성/中聲) for the middle sound, usually a vowel
- Jongseong (종성/終聲) for the final sound, usually a consonant

Hangul syllables are the characters used to write contemporary Korean text.

ex1) Decomposition of the Hangul syllable U+AC00 '가'
jamo (자모/字母): ᄀ plus ᅡ
choseong (초성/初聲): ᄀ (code point: U+1100)
jungseong (중성/中聲): ᅡ (code point: U+1161)

Selected Hangul syllable: '가' (U+AC00)

Present canonical decomposition:
ᄀ U+1100 HANGUL CHOSEONG KIYEOK
Only this jamo is shown; 'ᅡ U+1161 HANGUL JUNGSEONG A' is hidden.

Expected canonical decomposition:
ᄀ U+1100 HANGUL CHOSEONG KIYEOK
ᅡ U+1161 HANGUL JUNGSEONG A
Hangul Choseong: ᄀ
Hangul Jungseong: ᅡ

ex2) Decomposition of the Hangul syllable U+AC01 '각'
jamo (자모/字母): ᄀ plus ᅡ plus ᆨ
choseong (초성/初聲): ᄀ (code point: U+1100)
jungseong (중성/中聲): ᅡ (code point: U+1161)
jongseong (종성/終聲): ᆨ (code point: U+11A8)

Selected Hangul syllable: '각' (U+AC01)

Present canonical decomposition:
가 U+AC00 HANGUL SYLLABLE GA
Only this syllable is shown, but it is an intermediate step; 'ᆨ U+11A8 HANGUL JONGSEONG KIYEOK' is hidden.

Expected canonical decomposition (full):
ᄀ U+1100 HANGUL CHOSEONG KIYEOK
ᅡ U+1161 HANGUL JUNGSEONG A
ᆨ U+11A8 HANGUL JONGSEONG KIYEOK
Hangul Choseong: ᄀ
Hangul Jungseong: ᅡ
Hangul Jongseong: ᆨ

References:
Unicode Normalization Forms
http://unicode.org/reports/tr15/
Unicode Normalization Forms, Hangul Decomposition and Composition
http://unicode.org/reports/tr15/#Hangul_Composition
Hangul Jamo (Range: U+1100-U+11FF)
http://www.unicode.org/charts/PDF/U1100.pdf
Hangul Syllables (Range: U+AC00-U+D7AF)
http://www.unicode.org/charts/PDF/UAC00.pdf
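The two examples can be cross-checked by going the other way, from jamo back to the syllable. A minimal sketch, assuming a hypothetical helper hangul_compose that applies the composition formula from section 3.12 of the Unicode Standard to the zero-based index of each jamo (the name and signature are illustrative only):

```c
#include <stdint.h>

/* Hypothetical helper: composes a precomposed Hangul syllable from the
   zero-based indexes of its jamo.  l_index is 0..18 (choseong),
   v_index is 0..20 (jungseong), t_index is 0..27 (jongseong), where
   t_index 0 means "no jongseong".  Returns 0 if an index is out of
   range. */
static uint32_t
hangul_compose (unsigned int l_index, unsigned int v_index,
                unsigned int t_index)
{
  if (l_index >= 19 || v_index >= 21 || t_index >= 28)
    return 0;
  return 0xAC00 + (l_index * 21 + v_index) * 28 + t_index;
}
```

Here ᄀ (U+1100) and ᅡ (U+1161) both have index 0, and ᆨ (U+11A8) has jongseong index 1 because the jongseong indexes are counted from U+11A7. So hangul_compose (0, 0, 0) gives U+AC00 '가' and hangul_compose (0, 0, 1) gives U+AC01 '각'.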
I also reported the bug to GNU libunistring. This is the bug report I posted there.

Hello,

My name is DaeHyun Sung (성대현, 成大鉉). I am Korean and a GNOME Foundation member in Korea. My mother tongue is Korean.

I found a bug in the canonical decomposition of Korean syllables in GNU libunistring. I first noticed the bug in GNOME Characters and then traced it to GNU libunistring, which GNOME Characters depends on.

libunistring/lib/uninorm/canonical-decomposition.c

          /* Hangul syllable.  See Unicode standard, chapter 3, section
             "Hangul Syllable Decomposition",  See also the clarification at
             <http://www.unicode.org/versions/Unicode5.1.0/>, section
             "Clarification of Hangul Jamo Handling".  */
    #if 1 /* Return the pairwise decomposition, not the full decomposition.  */
              decomposition[0] = 0xAC00 + uc - t; /* = 0xAC00 + (l * 21 + v) * 28; */
              decomposition[1] = 0x11A7 + t;
              return 2;
    #else
              unsigned int v, l;

              uc = uc / 28;
              v = uc % 21;
              l = uc / 21;

              decomposition[0] = 0x1100 + l;
              decomposition[1] = 0x1161 + v;
              decomposition[2] = 0x11A7 + t;
              return 3;
    #endif

I read the source comment 'See also the clarification at <http://www.unicode.org/versions/Unicode5.1.0/>, section "Clarification of Hangul Jamo Handling"'. That description is misleading for people who do not know Korean well.

The bug: Hangul syllables are not fully decomposed.

Expected: U+D4DB → <U+1111, U+1171, U+11B6> = full canonical decomposition result. Correct.
Result:   U+D4DB → <U+D4CC, U+11B6> = only the intermediate step. Incorrect.

The Unicode Standard Version 10.0 - Core Specification, chapter 3.12, Conjoining Jamo Behavior, says the following under Hangul Decomposition:

"The Hangul Decomposition Algorithm as specified above directly decomposes precomposed Hangul syllable characters into a sequence of either two or three Hangul jamo characters. The Hangul Decomposition Algorithm could also be expressed equivalently as a recursion of binary decompositions, as is the case for other non-Hangul characters. All LVT syllables would decompose into an LV syllable plus a T jamo. The LV syllables themselves would in turn decompose into an L jamo plus a V jamo. This approach can be used to produce somewhat more compact code than what is illustrated in this sample method."

The "#if 1" code performs a single binary step and does not recurse, so it cannot fully decompose Hangul syllables. If that code is kept, it has to be applied recursively. I therefore suggest removing the "#if 1" part and using the "#else" part instead. The "#if 1" part is not a full decomposition of Korean Hangul.

Explanation of Korean Hangul canonical decomposition

Hangul elements are commonly referred to as jamo (자모/字母), meaning "alphabet". Korean has special terms for the jamo that are used to construct a Hangul syllable, depending on where in the syllable they appear:
- Choseong (초성/初聲) for the initial sound, usually a consonant
- Jungseong (중성/中聲) for the middle sound, usually a vowel
- Jongseong (종성/終聲) for the final sound, usually a consonant

Hangul syllables are the characters used to write contemporary Korean text.

ex1) Decomposition of the Hangul syllable U+AC00 '가'
jamo (자모/字母): ᄀ plus ᅡ
choseong (초성/初聲): ᄀ (code point: U+1100)
jungseong (중성/中聲): ᅡ (code point: U+1161)

Selected Hangul syllable: '가' (U+AC00)

Present canonical decomposition:
ᄀ U+1100 HANGUL CHOSEONG KIYEOK
ᅡ U+1161 HANGUL JUNGSEONG A

Expected canonical decomposition:
ᄀ U+1100 HANGUL CHOSEONG KIYEOK
ᅡ U+1161 HANGUL JUNGSEONG A
Hangul Choseong: ᄀ
Hangul Jungseong: ᅡ

ex2) Decomposition of the Hangul syllable U+AC01 '각'
jamo (자모/字母): ᄀ plus ᅡ plus ᆨ
choseong (초성/初聲): ᄀ (code point: U+1100)
jungseong (중성/中聲): ᅡ (code point: U+1161)
jongseong (종성/終聲): ᆨ (code point: U+11A8)

Selected Hangul syllable: '각' (U+AC01)

Present canonical decomposition (an intermediate step):
가 U+AC00 HANGUL SYLLABLE GA
ᆨ U+11A8 HANGUL JONGSEONG KIYEOK

Expected canonical decomposition (full):
ᄀ U+1100 HANGUL CHOSEONG KIYEOK
ᅡ U+1161 HANGUL JUNGSEONG A
ᆨ U+11A8 HANGUL JONGSEONG KIYEOK
Hangul Choseong: ᄀ
Hangul Jungseong: ᅡ
Hangul Jongseong: ᆨ

---

I attached the diff files to the mail:
canonical-decomposition.c.diff -> libunistring/lib/uninorm/canonical-decomposition.c
test-canonical-decomposition.c.diff -> libunistring/tests/uninorm/test-canonical-decomposition.c

I also checked the Hangul decomposition in GNOME and KDE.

GNOME gucharmap, my suggestion:
https://bugzilla.gnome.org/show_bug.cgi?id=777829

GNOME gucharmap's Korean Hangul decomposition source code:
https://github.com/GNOME/gucharmap/blob/master/gucharmap/gucharmap-unicode-info.c

      else if (wc >= 0xac00 && wc <= 0xd7af)
        {
          /* compute hangul syllable name as per UAX #15 */
          gint SIndex = wc - SBase;
          gint LIndex, VIndex, TIndex;

          if (SIndex < 0 || SIndex >= SCount)
            return "";

          LIndex = SIndex / NCount;
          VIndex = (SIndex % NCount) / TCount;
          TIndex = SIndex % TCount;

          g_snprintf (buf, sizeof (buf), "HANGUL SYLLABLE %s%s%s",
                      JAMO_L_TABLE[LIndex], JAMO_V_TABLE[VIndex],
                      JAMO_T_TABLE[TIndex]);
          return buf;
        }

KDE kwidgetsaddons, kcharselect:
https://git.reviewboard.kde.org/r/129943/diff/1#index_header

Check the documentation:
The Unicode® Standard Version 10.0 – Core Specification, 3.12 Conjoining Jamo Behavior
http://www.unicode.org/versions/Unicode10.0.0/ch03.pdf
Unicode® Standard Annex #15 - UNICODE NORMALIZATION FORMS
http://unicode.org/reports/tr15/
Unicode Normalization Forms, Hangul Decomposition and Composition
http://unicode.org/reports/tr15/#Hangul_Composition
Hangul Jamo (Range: U+1100-U+11FF)
http://www.unicode.org/charts/PDF/U1100.pdf
Hangul Syllables (Range: U+AC00-U+D7AF)
http://www.unicode.org/charts/PDF/UAC00.pdf

Please check the mail as soon as possible. Thanks!

Sincerely,
DaeHyun Sung (성대현, 成大鉉)
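The "recursion of binary decompositions" quoted from the Unicode Standard in the report above can be sketched as follows. This is an illustration under assumptions, not the libunistring implementation: hangul_decompose_pairwise and hangul_decompose_recursive are hypothetical names, and the pairwise step only mirrors the idea of the "#if 1" branch.

```c
#include <stddef.h>
#include <stdint.h>

/* One binary (pairwise) step: LVT -> LV + T, LV -> L + V.
   Returns 2 on success, 0 if uc is not a precomposed Hangul syllable. */
static int
hangul_decompose_pairwise (uint32_t uc, uint32_t pair[2])
{
  uint32_t s_index, t;

  if (uc < 0xAC00 || uc >= 0xD7A4)
    return 0;
  s_index = uc - 0xAC00;
  t = s_index % 28;
  if (t != 0)
    {
      pair[0] = uc - t;                         /* the LV syllable */
      pair[1] = 0x11A7 + t;                     /* the T jamo */
    }
  else
    {
      pair[0] = 0x1100 + s_index / 588;         /* the L jamo */
      pair[1] = 0x1161 + (s_index % 588) / 28;  /* the V jamo */
    }
  return 2;
}

/* Applies the pairwise step recursively to the head of each pair until
   only jamo remain.  out must have room for 3 code points.  Returns the
   number of jamo written, or 0 if uc is not a precomposed syllable. */
static size_t
hangul_decompose_recursive (uint32_t uc, uint32_t out[3])
{
  uint32_t pair[2];
  size_t n;

  if (hangul_decompose_pairwise (uc, pair) == 0)
    return 0;
  /* Decompose the head first: an LV head expands to L + V. */
  n = hangul_decompose_recursive (pair[0], out);
  if (n == 0)
    out[n++] = pair[0];   /* the head is already a jamo */
  out[n++] = pair[1];
  return n;
}
```

A single pairwise step on U+D4DB stops at <U+D4CC, U+11B6>, which is the reported result. The recursive wrapper continues with U+D4CC and ends at <U+1111, U+1171, U+11B6>. A caller that uses only the pairwise function therefore has to do this recursion itself.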
Comment on attachment 363718 [details] [review]
[PATCH] Fixed Korean Hangul Syllables Canonical Decomposition bug on GNOME-characters

>From 9d06c21c687c09336d3daf9814f0eadfc31e6868 Mon Sep 17 00:00:00 2001
>From: DaeHyun Sung <sungdh86+git@gmail.com>
>Date: Thu, 16 Nov 2017 01:57:05 +0900
>Subject: [PATCH] Fixed Korean Hangul Syllables Canonical Decomposition bug on
> GNOME-characters
>MIME-Version: 1.0
>Content-Type: text/plain; charset=UTF-8
>Content-Transfer-Encoding: 8bit
>
>Not fully decompose Hangul Syllables.
>Expected: U+D4DB → <U+1111, U+1171, U+11B6> = Full canonical composition result.
>Result: U+D4DB → <U+D4CC,U+11B6> = intermediate step.
>
>tracked the Bug, The base of this bug exists in GNU libunistring.
>It's GNU libunistring Korean Hangul Syllables Canonical Decomposition bug.
>It also depends on GNU libunistring.
>
>The Hangul Decomposition Algorithm as specified above directly
>decomposes precomposed Hangul syllable characters into a sequence of either two or three Hangul jamo characters.
>
>I fixed GNU libunistring's Hangul Decomposition Algorithm as known as Korean Alphabet Decomposition algorithm.
>
>Check the documentation
>The Unicode® Standard Version 10.0 – Core Specification
>http://www.unicode.org/versions/Unicode10.0.0/ch03.pdf
>3.12 Conjoining Jamo Behavior
>Unicode® Standard Annex #15 - UNICODE NORMALIZATION FORMS
>http://unicode.org/reports/tr15/
>---
> gllib/uninorm/canonical-decomposition.c | 11 ++---------
> lib/gc.c                                |  9 +++++++--
> src/window.js                           |  3 ++-
> 3 files changed, 11 insertions(+), 12 deletions(-)
>
>diff --git a/gllib/uninorm/canonical-decomposition.c b/gllib/uninorm/canonical-decomposition.c
>index dfeea71..3862636 100644
>--- a/gllib/uninorm/canonical-decomposition.c
>+++ b/gllib/uninorm/canonical-decomposition.c
>@@ -1,6 +1,7 @@
> /* Canonical decomposition of Unicode characters.
>    Copyright (C) 2009-2017 Free Software Foundation, Inc.
>    Written by Bruno Haible <bruno@clisp.org>, 2009.
>+   Modified by DaeHyun Sung <sungdh86@gmail.com>, 2017.
>
>    This program is free software: you can redistribute it and/or modify it
>    under the terms of the GNU General Public License as published
>@@ -30,9 +31,7 @@ uc_canonical_decomposition (ucs4_t uc, ucs4_t *decomposition)
>   if (uc >= 0xAC00 && uc < 0xD7A4)
>     {
>       /* Hangul syllable. See Unicode standard, chapter 3, section
>-         "Hangul Syllable Decomposition", See also the clarification at
>-         <http://www.unicode.org/versions/Unicode5.1.0/>, section
>-         "Clarification of Hangul Jamo Handling". */
>+         "Hangul Syllable Decomposition"*/
>       unsigned int t;
>
>       uc -= 0xAC00;
>@@ -52,11 +51,6 @@ uc_canonical_decomposition (ucs4_t uc, ucs4_t *decomposition)
>         }
>       else
>         {
>-#if 1 /* Return the pairwise decomposition, not the full decomposition. */
>-          decomposition[0] = 0xAC00 + uc - t; /* = 0xAC00 + (l * 21 + v) * 28; */
>-          decomposition[1] = 0x11A7 + t;
>-          return 2;
>-#else
>           unsigned int v, l;
>
>           uc = uc / 28;
>@@ -67,7 +61,6 @@ uc_canonical_decomposition (ucs4_t uc, ucs4_t *decomposition)
>           decomposition[1] = 0x1161 + v;
>           decomposition[2] = 0x11A7 + t;
>           return 3;
>-#endif
>         }
>     }
>   else if (uc < 0x110000)
>diff --git a/lib/gc.c b/lib/gc.c
>index 46bb0df..e4992cc 100644
>--- a/lib/gc.c
>+++ b/lib/gc.c
>@@ -851,10 +851,15 @@ populate_related_characters (GcCharacterIter *iter)
>       decomposition_base = decomposition[0];
>       if (decomposition_base != iter->uc)
>         g_array_append_val (result, decomposition_base);
>-    }
>+      decomposition_base = decomposition[1];
>+      if (decomposition_base != iter->uc)
>+        g_array_append_val (result, decomposition_base);
>+      decomposition_base = decomposition[2];
>+      if (decomposition_base != iter->uc)
>+        g_array_append_val (result, decomposition_base);
>+    }
>   else
>     decomposition_base = iter->uc;
>-
>   script = uc_script (iter->uc);
>   if (script)
>     {
>diff --git a/src/window.js b/src/window.js
>index 10c51e0..a9a7cb3 100644
>--- a/src/window.js
>+++ b/src/window.js
>@@ -193,7 +193,8 @@ var MainWindow = new Lang.Class({
>               { artists: [ 'Allan Day <allanpday@gmail.com>',
>                            'Jakub Steiner <jimmac@gmail.com>' ],
>                 authors: [ 'Daiki Ueno <dueno@src.gnome.org>',
>-                           'Giovanni Campagna <scampa.giovanni@gmail.com>' ],
>+                           'Giovanni Campagna <scampa.giovanni@gmail.com>',
>+                           'DaeHyun Sung <sungdh86@gmail.com>' ],
>                 // TRANSLATORS: put your names here, one name per line.
>                 translator_credits: _("translator-credits"),
>                 program_name: _("GNOME Characters"),
>--
>2.14.3
>
Created attachment 363987 [details] [review]
new patch

Edited patch file.
Review of attachment 363987 [details] [review]:

Thank you for the patches, but please use the Bugzilla patch status properly ("committed" means that the patch has already been pushed to the git repository, but this is not the case).

::: lib/gc.c
@@ +852,3 @@
       if (decomposition_base != iter->uc)
         g_array_append_val (result, decomposition_base);
+      decomposition_base = decomposition[1];

I have a couple of questions:

- Why did you remove the check of decomposition_length from the previous patch? Couldn't it lead to unbound array access?
- Now decomposition_base always points to the last character of a composed character; what if the character is a Latin composed character, e.g. á?

For the latter, I would suggest special-casing Hangul characters, since the current code assumes "base character + modifiers".
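To illustrate the two points of the review, here is a minimal sketch in plain C. It is not the GNOME Characters code: collect_decomposition_bases is a hypothetical helper, GLib is left out, and the caller is assumed to pass the decomposition together with its length. The loop never reads past that length, and only Hangul syllables contribute every element, while other characters keep the "base character + modifiers" assumption.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not the GNOME Characters code: given the
   canonical decomposition of uc (length elements), copies the code
   points that should be listed as related characters into out, which
   must have room for length elements.  Returns the number copied. */
static size_t
collect_decomposition_bases (uint32_t uc, const uint32_t *decomposition,
                             size_t length, uint32_t *out)
{
  /* Hangul syllables: every jamo counts as a base character.
     Anything else: only the first element is the base. */
  int is_hangul = (uc >= 0xAC00 && uc < 0xD7A4);
  size_t limit = is_hangul ? length : (length > 0 ? 1 : 0);
  size_t i, n = 0;

  for (i = 0; i < limit; i++)   /* bounded by length, never by a fixed 3 */
    if (decomposition[i] != uc)
      out[n++] = decomposition[i];
  return n;
}
```

With this shape, á (U+00E1, decomposition <U+0061, U+0301>) yields only U+0061, a two-jamo syllable such as U+AC00 yields two entries without touching a third array slot, and U+AC01 yields all three jamo.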
Answers to the two questions.

Q: Why did you remove the check of decomposition_length from the previous patch? Couldn't it lead to unbound array access?
A: When I deleted the check of decomposition_length, the app did not crash, so I left it out. After reading your message, I see that this was a mistake: it can lead to unbound array access.

Q: Now decomposition_base always points to the last character of a composed character; what if the character is a Latin composed character, e.g. á?
A: I did not consider Latin letters, because I am Korean and do not know much about Latin composed characters. I think Hangul characters should be implemented separately, as a special case.
Created attachment 363998 [details] [review]
Modified Fixed Korean Hangul Syllables Canonical Decomposition

Yesterday I submitted the Korean canonical decomposition bug report to GNU libunistring. This morning I received a mail from the GNU libunistring committer, Bruno Haible, and I agree with his opinion:
http://git.savannah.gnu.org/gitweb/?p=libunistring.git;a=commitdiff;h=4e49b798264d01433f64137fb525f507778fb781

Following Bruno Haible's opinion, I modified the canonical decomposition of Korean Hangul syllables in GNOME Characters. It is now implemented separately: Hangul characters are handled as a special case, apart from the other characters.

Please check my patch as soon as possible. Thanks for the committer's opinion!
Created attachment 363999 [details] [review]
libgc: Perform full canonical decomposition for Hangul syllables

Previously, the code finding related characters only took into account composed characters built from a base character and combining characters. However, Hangul syllables are composed of two or three Hangul jamo characters, all of which should be considered base characters.

For the implementation, uc_canonical_decomposition() is not capable of fully decomposing Hangul syllables. Instead of that function, this patch uses u32_normalize() with UNINORM_NFD, as suggested by Bruno Haible in:
https://lists.gnu.org/archive/html/bug-libunistring/2017-11/msg00002.html

--

I have slightly modified your patch based on Bruno's suggestion. Does it make sense to you?
Assuming silence means no objection, I am going to push it soon.
Attachment 363999 [details] pushed as 70e5e05 - libgc: Perform full canonical decomposition for Hangul syllables
Hmm. Because I have been overworked at my job, I only read these messages late.

The change in libunistring and the change in GNOME Characters make sense to me. I checked the GNU libunistring patch based on Bruno's suggestion.

I also read CJKV Information Processing, written by Ken Lunde.

CJKV Information Processing, p. 170:
"More details about how Normalization of hangul syllables is handled, including some useful historical information, can be found online. The complexity of Normalization is clearly beyond the scope of this book, and I encourage you to explore Unicode resources if what is presented here does not satisfy your needs."

Links given in CJKV Information Processing, 2nd Edition:
Hangul Conjoining Jamo Rendering
http://www.i18nl10n.com/korean/jamo.html
http://www.unicode.org/charts/normalization/

I ran the changed code and confirmed that it produces the expected results. Thanks!