Bug 673532 – g_utf8_normalize(...,G_NORMALIZE_DEFAULT) problem


Summary:	g_utf8_normalize(...,G_NORMALIZE_DEFAULT) problem


Status:	RESOLVED FIXED

Product:	pango
Classification:	Platform
Component:	general
Version:	unspecified
Hardware:	Other All

Importance:	Normal normal
Target Milestone:	---
Assigned To:	pango-maint
QA Contact:	pango-maint

URL:
Whiteboard:

Depends on:
Blocks:

Reported:	2012-04-04 20:59 UTC by Morten Welinder
Modified:	2012-08-25 19:32 UTC

See Also:
GNOME target:	---
GNOME version:	---

Description Morten Welinder 2012-04-04 20:59:10 UTC

See bug 673447 for background.

This string:

WWW: [휴가 가-- (오--)]
       0 | ed 9c b4 ea b0 80 20 ea b0 80 2d 2d 20 28 ec 98 | ..........--.(..
      10 | a4 2d 2d 29 XX XX XX XX XX XX XX XX XX XX XX XX | .--)************

when sent through g_utf8_normalize(...,G_NORMALIZE_DEFAULT) becomes

XXX: [휴가 가-- (오--)]
       0 | e1 84 92 e1 85 b2 e1 84 80 e1 85 a1 20 e1 84 80 | ................
      10 | e1 85 a1 2d 2d 20 28 e1 84 8b e1 85 a9 2d 2d 29 | ...--.(......--)

(Note: in Mozilla these strings appear the same; when pasted into, say,
gnome-shell they look different.)

g_utf8_normalize isn't supposed to change text contents, so the two strings
should always look the same.  I don't know if I should blame glib or
pango+deps.

Tentatively blaming glib for no other reason than it's first in the food
chain.

Comment 1 Dan Winship 2012-04-04 21:36:36 UTC

g_utf8_normalize() is converting the pre-composed hangul characters into their constituent jamo, which is correct for G_NORMALIZE_NFD aka G_NORMALIZE_DEFAULT, so this isn't glib's fault.
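
As an illustration of the decomposition described above, here is a minimal sketch that reproduces the byte sequences from the report. It uses Python's unicodedata module as a stand-in for g_utf8_normalize(), on the assumption that both implement the same Unicode normalization forms; the variable names are made up for this example.

```python
import unicodedata

# The string from the report, written with precomposed hangul syllables
# (U+D734, U+AC00, U+C624), matching the first hex dump.
composed = "\uD734\uAC00 \uAC00-- (\uC624--)"

# NFD splits each precomposed syllable into its constituent jamo.
decomposed = unicodedata.normalize("NFD", composed)

# NFC puts the jamo back together into precomposed syllables.
recomposed = unicodedata.normalize("NFC", decomposed)


def hex_dump(text):
    """Return the UTF-8 bytes of text as space-separated hex pairs."""
    return " ".join(f"{b:02x}" for b in text.encode("utf-8"))


print(hex_dump(composed))
print(hex_dump(decomposed))
print(recomposed == composed)
```

The second line printed should match the bytes in the second hex dump of the description, and the round trip through NFC should give back the original string.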

As I understand it, in theory pango ought to render the two strings the same, or at least very similarly, so the fact that the second string looks ugly in gnome-terminal and gedit may mean this is pango's fault. (Although gnumeric seems to have an extra bug on top of that, where the jamo aren't even getting visually recombined.)

At any rate, it probably makes more sense for gnumeric to normalize to NFC rather than NFD anyway. Using NFD means that replacing "e" with "a" would also replace "é" with "á", etc., which is weird.
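
The difference between the two forms for search-and-replace can be sketched as follows. This is only an illustration in Python (unicodedata standing in for g_utf8_normalize()); replace_normalized is a hypothetical helper, not anything gnumeric actually contains.

```python
import unicodedata


def replace_normalized(text, old, new, form):
    """Normalize text to the given form, then do a plain substring replace."""
    return unicodedata.normalize(form, text).replace(old, new)


word = "caf\u00e9"  # "café" with é as a single precomposed code point

# Under NFD, é becomes "e" + U+0301 (combining acute), so the "e" matches
# and the accent ends up attached to the replacement letter.
nfd_result = replace_normalized(word, "e", "a", "NFD")

# Under NFC, é stays a single code point and does not match "e".
nfc_result = replace_normalized(word, "e", "a", "NFC")

print(unicodedata.normalize("NFC", nfd_result))
print(nfc_result)
```

With NFD the accented letter is rewritten along with the plain one; with NFC it is left alone.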

Comment 2 Morten Welinder 2012-04-04 23:51:50 UTC

What you call weird was actually intended behaviour. :-/  Whether it
makes sense is probably language dependent.

Neither Unicode nor glib provides a really good normalization mode for
search-and-replace.  If I do s/f/F/ I would have expected a change even
for U+FB01 (that rules out NFC and NFD), but no change for 2^5 (that rules
out NFKC and NFKD).
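
A small sketch of the trade-off just described, again using Python's unicodedata as a stand-in for glib. It assumes that "2^5" in the comment stands for the digit 2 followed by U+2075 (SUPERSCRIPT FIVE); that reading is a guess, and the sample strings are made up for illustration.

```python
import unicodedata

ligature = "\ufb01le"  # U+FB01 (the "fi" ligature) followed by "le"
power = "2\u2075"      # assumption: "2^5" means 2 followed by U+2075

results = {}
for form in ("NFC", "NFD", "NFKC", "NFKD"):
    results[form] = (
        unicodedata.normalize(form, ligature),
        unicodedata.normalize(form, power),
    )
    print(form, results[form])

# Only the compatibility forms expose an "f" for s/f/F/ to match...
matches_f = {form: "f" in pair[0] for form, pair in results.items()}
# ...but the same forms also flatten the superscript into a plain digit.
keeps_superscript = {form: "\u2075" in pair[1] for form, pair in results.items()}
```

No single form gives both an "f" to match in the ligature and an untouched superscript, which is the gap the comment points at.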

Tossing to pango for an opinion on rendering of the two strings.

PANGO: the claim is that the two strings from the initial report should
render identically, or some close approximation thereof.  How does that
claim look from where you are?

Comment 3 Behdad Esfahbod 2012-04-05 17:41:44 UTC

Pango's to blame.  In some not-so-distant future, Pango will also use harfbuzz-ng, and hence deal with this the same way that Firefox is doing...

Comment 4 Behdad Esfahbod 2012-08-18 17:14:50 UTC

We've merged HarfBuzz, which should improve this.  But not fully until we also adapt the itemizer...

Comment 5 Behdad Esfahbod 2012-08-25 19:32:50 UTC

Closing.  I'm tracking normalization in itemizer in a separate bug.