Bug 673532 - g_utf8_normalize(...,G_NORMALIZE_DEFAULT) problem
Status: RESOLVED FIXED
Product: pango
Classification: Platform
Component: general
Version: unspecified
Hardware: Other All
Importance: Normal normal
Target Milestone: ---
Assigned To: pango-maint
pango-maint
Depends on:
Blocks:
Reported: 2012-04-04 20:59 UTC by Morten Welinder
Modified: 2012-08-25 19:32 UTC
See Also:
GNOME target: ---
GNOME version: ---

Description Morten Welinder 2012-04-04 20:59:10 UTC
See bug 673447 for background.

This string:

WWW: [휴가 가-- (오--)]
       0 | ed 9c b4 ea b0 80 20 ea b0 80 2d 2d 20 28 ec 98 | ..........--.(..
      10 | a4 2d 2d 29 XX XX XX XX XX XX XX XX XX XX XX XX | .--)************

when sent through g_utf8_normalize(...,G_NORMALIZE_DEFAULT) becomes

XXX: [휴가 가-- (오--)]
       0 | e1 84 92 e1 85 b2 e1 84 80 e1 85 a1 20 e1 84 80 | ................
      10 | e1 85 a1 2d 2d 20 28 e1 84 8b e1 85 a9 2d 2d 29 | ...--.(......--)

(Note: in Mozilla these strings appear the same; when pasted into, say,
gnome-shell they look different.)

g_utf8_normalize isn't supposed to change text contents, so the two strings
should always look the same.  I don't know if I should blame glib or
pango+deps.

Tentatively blaming glib for no other reason than that it's first in the food
chain.
Comment 1 Dan Winship 2012-04-04 21:36:36 UTC
g_utf8_normalize() is converting the pre-composed hangul characters into their constituent jamo, which is correct for G_NORMALIZE_NFD aka G_NORMALIZE_DEFAULT, so this isn't glib's fault.
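
To make the decomposition concrete, here is a minimal Python sketch that reproduces the two byte dumps from the description. It is not part of the original report: it uses Python's standard unicodedata module as a stand-in for g_utf8_normalize(), on the assumption that both implement the same Unicode NFD algorithm, and the helper name utf8_hex is made up for illustration.

```python
import unicodedata

# The string from the report, with precomposed Hangul syllables.
original = "\ud734\uac00 \uac00-- (\uc624--)"

# NFD splits each precomposed syllable into its constituent jamo.
decomposed = unicodedata.normalize("NFD", original)


def utf8_hex(text):
    """Return the UTF-8 bytes of text as space-separated hex (hypothetical helper)."""
    return text.encode("utf-8").hex(" ")


print(utf8_hex(original))
# ed 9c b4 ea b0 80 20 ea b0 80 2d 2d 20 28 ec 98 a4 2d 2d 29
print(utf8_hex(decomposed))
# e1 84 92 e1 85 b2 e1 84 80 e1 85 a1 20 e1 84 80 e1 85 a1 2d 2d 20 28 e1 84 8b e1 85 a9 2d 2d 29

# The two forms are canonically equivalent: recomposing restores the input.
assert unicodedata.normalize("NFC", decomposed) == original
```

Since NFC recovers the original string exactly, the two forms carry the same text and differ only in encoding, which is why a renderer is expected to draw them alike.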

As I understand it, in theory pango ought to render the two strings the same, or at least very similarly, so the fact that the second string looks ugly in gnome-terminal and gedit may mean this is pango's fault. (Although gnumeric seems to have an extra bug on top of that, where the jamo aren't even getting visually recombined.)

At any rate, it probably makes more sense for gnumeric to normalize to NFC rather than NFD anyway. Using NFD means that replacing "e" with "a" would also replace "é" with "á", etc, which is weird.
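
A small Python sketch of that side effect, for illustration only. The sample text is invented, and unicodedata is again assumed to behave like glib's normalization:

```python
import unicodedata

# Invented sample text with precomposed "e acute" (U+00E9).
text = "caf\u00e9 et th\u00e9"

nfd = unicodedata.normalize("NFD", text)  # U+00E9 becomes "e" + U+0301
nfc = unicodedata.normalize("NFC", text)  # U+00E9 stays one code point

# Replacing in NFD also hits the base letter of every accented "e".
replaced_nfd = unicodedata.normalize("NFC", nfd.replace("e", "a"))

# Replacing in NFC only touches the bare letter "e".
replaced_nfc = nfc.replace("e", "a")

print(replaced_nfd)  # "e acute" has turned into "a acute"
print(replaced_nfc)  # "e acute" is left alone
```

In the NFD case the combining accent survives the replacement and reattaches to the new base letter, so "é" ends up as "á".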
Comment 2 Morten Welinder 2012-04-04 23:51:50 UTC
What you call weird was actually intended behaviour, :-/  Whether it makes
sense is probably language-dependent.

Neither Unicode nor glib provides a really good normalization mode for
search-and-replace.  If I do s/f/F/ I would have expected a change even
for U+FB01 (that rules out NFC and NFD), but no change for 2^5 (that rules
out NFKC and NFKD).
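
The trade-off can be checked with a short Python sketch. It assumes that "2^5" above stands for a digit followed by SUPERSCRIPT FIVE (U+2075), and it uses unicodedata in place of glib; the sample strings are illustrative.

```python
import unicodedata

ligature = "\ufb01le"  # LATIN SMALL LIGATURE FI followed by "le"
power = "2\u2075"      # assumed meaning of "2^5": "2" + SUPERSCRIPT FIVE

results = {}
for form in ("NFC", "NFD", "NFKC", "NFKD"):
    results[form] = (
        unicodedata.normalize(form, ligature),
        unicodedata.normalize(form, power),
    )

for form, (lig, pw) in results.items():
    # "f" in lig  -> s/f/F/ would match inside the ligature
    # pw == power -> the superscript survived normalization
    print(form, "f" in lig, pw == power)
```

The canonical forms (NFC, NFD) keep the superscript but hide the "f" inside the ligature; the compatibility forms (NFKC, NFKD) expose the "f" but flatten the superscript into a plain "5". No single form gives both behaviours.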

Tossing to pango for an opinion on rendering of the two strings.

PANGO: the claim is that the two strings from the initial report should
render identically, or some close approximation thereof.  How does that
claim look from where you are?
Comment 3 Behdad Esfahbod 2012-04-05 17:41:44 UTC
Pango's to blame.  In some not-so-distant future, Pango will also use harfbuzz-ng, and hence deal with this the same way that Firefox is doing...
Comment 4 Behdad Esfahbod 2012-08-18 17:14:50 UTC
We've merged HarfBuzz, which should improve this.  But not fully until we also adapt the itemizer...
Comment 5 Behdad Esfahbod 2012-08-25 19:32:50 UTC
Closing.  I'm tracking normalization in itemizer in a separate bug.