GNOME Bugzilla – Bug 673532
g_utf8_normalize(...,G_NORMALIZE_DEFAULT) problem
Last modified: 2012-08-25 19:32:50 UTC
See bug 673447 for background. This string:

WWW: [휴가 가-- (오--)]
 0 | ed 9c b4 ea b0 80 20 ea b0 80 2d 2d 20 28 ec 98 | ..........--.(..
10 | a4 2d 2d 29 XX XX XX XX XX XX XX XX XX XX XX XX | .--)************

when sent through g_utf8_normalize(...,G_NORMALIZE_DEFAULT) becomes

XXX: [휴가 가-- (오--)]
 0 | e1 84 92 e1 85 b2 e1 84 80 e1 85 a1 20 e1 84 80 | ................
10 | e1 85 a1 2d 2d 20 28 e1 84 8b e1 85 a9 2d 2d 29 | ...--.(......--)

(Note: in Mozilla these strings appear the same; when pasted into, say, a gnome-shell they look different.)

g_utf8_normalize isn't supposed to change text contents, so the two strings should always look the same. I don't know if I should blame glib or pango+deps. Tentatively blaming glib for no other reason than it's first in the food chain.
g_utf8_normalize() is converting the pre-composed Hangul characters into their constituent jamo, which is correct for G_NORMALIZE_NFD aka G_NORMALIZE_DEFAULT, so this isn't glib's fault. As I understand it, in theory pango ought to render the two strings the same, or at least very similarly, so the fact that the second string looks ugly in gnome-terminal and gedit may mean this is pango's fault. (Although gnumeric seems to have an extra bug on top of that, where the jamo aren't even getting visually recombined.) At any rate, it probably makes more sense for gnumeric to normalize to NFC rather than NFD anyway. Using NFD means that replacing "e" with "a" would also replace "é" with "á", etc., which is weird.
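The search-and-replace effect described here can be sketched as follows. This is only an illustration using Python's unicodedata module, not gnumeric's actual code; the function name replace_in_form is hypothetical.

```python
import unicodedata


def replace_in_form(text, old, new, form):
    """Normalize text to the given form, replace, then recompose to NFC."""
    normalized = unicodedata.normalize(form, text)
    return unicodedata.normalize("NFC", normalized.replace(old, new))


word = "caf\u00e9"  # "café" with a precomposed é (U+00E9)

# In NFC the é is a single code point, so searching for "e" skips it.
nfc_result = replace_in_form(word, "e", "a", "NFC")

# In NFD the é is "e" + U+0301 (combining acute), so the base letter
# is replaced and the accent ends up on the "a".
nfd_result = replace_in_form(word, "e", "a", "NFD")

print(nfc_result)
print(nfd_result)
```

Under NFC the word comes back unchanged, while under NFD the accented letter is rewritten as well.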
What you call weird was actually intended behaviour, :-/ It's probably language dependent on whether it makes sense. Neither Unicode nor glib provides a really good normalization mode for search-and-replace. If I do s/f/F/ I would have expected a change even for U+FB01 (that rules out NFC and NFD), but no change for 2^5 (that rules out NFKC and NFKD).

Tossing to pango for an opinion on rendering of the two strings.

PANGO: the claim is that the two strings from the initial report should render identically, or some close approximation thereof. How does that claim look from where you are?
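The trade-off between the four normalization forms can be checked with a small sketch. It uses Python's unicodedata module rather than glib, and it assumes that the 2^5 in the comment stands for "2" followed by SUPERSCRIPT FIVE (U+2075); the dictionary names are made up for this illustration.

```python
import unicodedata

samples = {
    "fi ligature": "\ufb01",   # U+FB01 LATIN SMALL LIGATURE FI
    "superscript": "2\u2075",  # assumed reading of "2^5"
}

# Normalize every sample under each of the four forms.
results = {
    form: {name: unicodedata.normalize(form, text) for name, text in samples.items()}
    for form in ("NFC", "NFD", "NFKC", "NFKD")
}

for form, by_name in results.items():
    for name, text in by_name.items():
        code_points = " ".join(f"U+{ord(ch):04X}" for ch in text)
        print(f"{form:5} {name:12} {code_points}")
```

The canonical forms (NFC, NFD) leave both samples alone, so s/f/F/ would miss the ligature. The compatibility forms (NFKC, NFKD) expand the ligature to "fi" but also flatten the superscript into a plain "5", which is the unwanted change mentioned in the comment.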
Pango's to blame. In some not-so-distant future, Pango will also use harfbuzz-ng, and hence deal with this the same way that Firefox is doing...
We've merged HarfBuzz, which should improve this. But not fully until we also adapt the itemizer...
Closing. I'm tracking normalization in itemizer in a separate bug.