Bug 673447 – Search and replace mangles Korean




Summary:	Search and replace mangles Korean


Status:	RESOLVED FIXED

Product:	Gnumeric
Classification:	Applications
Component:	Analytics
Version:	1.10.x
Hardware:	Other Linux

Importance:	Normal normal
Target Milestone:	---
Assigned To:	Morten Welinder
QA Contact:	Jody Goldberg

URL:
Whiteboard:

Depends on:
Blocks:

Reported:	2012-04-03 17:06 UTC by Alex Stark
Modified:	2012-04-06 19:40 UTC

See Also:
GNOME target:	---
GNOME version:	---

Attachments
Korean mangled on replacement of, say, "-" with "X". (2.51 KB, application/x-gnumeric) 2012-04-03 17:06 UTC, Alex Stark

Description Alex Stark 2012-04-03 17:06:11 UTC

Created attachment 211238
Korean mangled on replacement of, say, "-" with "X".

In cells where match is found, characters are broken up and/or changed to superscript, etc.

Comment 1 Morten Welinder 2012-04-03 17:38:43 UTC

Confirmed.  (Even in ko_KR.UTF-8 locale.)

Comment 2 Morten Welinder 2012-04-03 18:01:00 UTC

I know nothing about Korean.

Here's what happens for cell A11:

1. We start with "휴가 가-- (오--)".
2. We normalize that to "휴가 가-- (오--)".
3. We replace - by X: "휴가 가XX (오XX)".

Note: the Korean characters in (1) and (2) look different in Gnumeric and
when I print them.  They look identical when I paste them in here!  If I take
the above and paste it back into a shell they look different again.

(No idea how it will look once I commit this comment to bugzilla and it
gives it back to me.)

It would appear that this is a matter of normalization, not search-and-replace
as such.
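
A minimal sketch of steps 1 to 3, using Python's unicodedata module as a stand-in for the glib call (an assumption for illustration; this is not Gnumeric code). The cell text is written with escapes so that the literal is unambiguously the precomposed form. It shows that the decomposed string differs in code points from the original while remaining canonically equivalent to it:

```python
import unicodedata

# Cell A11's text as precomposed Hangul syllables (NFC).
original = "\ud734\uac00 \uac00-- (\uc624--)"

# Assumption: NFD here plays the role of the normalization in step 2.
decomposed = unicodedata.normalize("NFD", original)

# The first syllable becomes two conjoining jamo code points.
print([hex(ord(c)) for c in decomposed[:2]])

# Replace "-" by "X" and back again on the decomposed text.
round_trip = decomposed.replace("-", "X").replace("X", "-")

print(round_trip == original)                                # False
print(unicodedata.normalize("NFC", round_trip) == original)  # True
```

Whether the two forms render identically is then up to the text layout stack, not the replacement itself.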

Comment 3 Andreas J. Guelzow 2012-04-03 19:25:32 UTC

I don't understand the issue. Are there truly UTF characters broken up? I see glyphs broken up, but I would expect that in any language where glyph rendering is context sensitive. Or are you suggesting replacement should not happen in the presence of combining characters?

Comment 4 Andreas J. Guelzow 2012-04-04 00:36:33 UTC

Let me give some examples in my locale en_US.UTF-8 which should be pretty much C:

enter flea in A1 and A2
use search and replace to change the l to x in A2 only
note that the shape of the f has changed (this may be font dependent, you need a font with ligatures)

enter the five character sequence  feU+0308ar where the third character is U+0308 in A1 and A2.
use search and replace to change the e to x in A2 only.
note that you essentially changed one part of the symbol. (I could have entered the eU+0308 as a single character U+00EB, in which case we would have had the same result.)

I would guess that what you observe in Korean is just a more involved example of this.
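
The second example can be sketched outside Gnumeric as follows. This assumes the replacement runs on decomposed (NFD) text, and the helper name replace_after_nfd is hypothetical. Both ways of entering the word then give the same result, with the diaeresis ending up on the x:

```python
import unicodedata

decomposed = "fe\u0308ar"   # f, e, U+0308 COMBINING DIAERESIS, a, r
precomposed = "f\u00ebar"   # f, U+00EB (e with diaeresis), a, r

def replace_after_nfd(text, old, new):
    """Hypothetical helper: decompose first, then replace code points."""
    return unicodedata.normalize("NFD", text).replace(old, new)

# Either input ends up as f, x, U+0308, a, r.
print(replace_after_nfd(decomposed, "e", "x") == "fx\u0308ar")   # True
print(replace_after_nfd(precomposed, "e", "x") == "fx\u0308ar")  # True

# Without decomposition the precomposed input has no plain "e" to match.
print(precomposed.replace("e", "x") == precomposed)              # True
```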

Comment 5 Morten Welinder 2012-04-04 20:50:25 UTC

Andreas, if I take A11 and do s/-/X/ followed by s/X/-/ I end up with a
string that visually looks different.  There are no Xs in that cell to
begin with so it's unpleasant to see any difference.

The reason we see a difference is that g_utf8_normalize(...,G_NORMALIZE_DEFAULT)
does something to the string that causes it to render differently.  Here's
the relevant part of the documentation:

 * The normalization mode %G_NORMALIZE_DEFAULT only
 * standardizes differences that do not affect the
 * text content, such as the above-mentioned accent
 * representation.

The difference in appearance suggests a bug somewhere in glib (if the
normalization is wrong) or pango+friends (if normalization is correct).

Comment 6 Morten Welinder 2012-04-04 20:59:38 UTC

Filed bug 673532 against glib to get their opinion.

Comment 7 Andreas J. Guelzow 2012-04-04 22:10:36 UTC

Do you have the byte sequence before and after the G_NORMALIZE_DEFAULT ?

Comment 8 Morten Welinder 2012-04-04 23:23:51 UTC

Andreas: see bug 673532.

Comment 9 Morten Welinder 2012-04-05 19:01:16 UTC

The word from bug 673532 is that this is pango's fault.

For Gnumeric that leaves the questions:

1. Should we do the substitution in NFD form [decomposed] or in
   NFC form [composed]?  Right now we do NFD.

   This comes down to this: should s/e/a/ change "é" to "á"?  There
   are perfectly good reasons either way.

2. Should we leave the result in NFD, NFC, or whatever comes out when
   we do the substitution?  Right now we do the whatever form.

   Right now we potentially leave a mess of forms.  That isn't supposed
   to matter, but leaving a normalized form if there is a substitution
   feels like a better way.  NFC, in light of this bug, is probably the
   result closest to what entry of text in a cell would yield.
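
The two questions can be made concrete with a small sketch. The function search_replace and its parameters match_form and result_form are hypothetical names chosen for illustration, not Gnumeric's API, and unlike the behaviour discussed here the sketch normalizes the result even when nothing matched:

```python
import unicodedata

def search_replace(text, old, new, match_form="NFD", result_form="NFC"):
    """Replace old by new in text.

    match_form:  normalization form used while matching (question 1).
    result_form: form of the returned text, or None to keep whatever
                 the substitution produced (question 2).
    """
    subject = unicodedata.normalize(match_form, text)
    result = subject.replace(old, new)
    if result_form is not None:
        result = unicodedata.normalize(result_form, result)
    return result

word = "caf\u00e9"  # ends in precomposed e with acute accent

# Matching in NFD: the base letter matches and the accent moves to "a".
print(search_replace(word, "e", "a", match_form="NFD") == "caf\u00e1")  # True

# Matching in NFC: there is no plain "e", so the word is unchanged.
print(search_replace(word, "e", "a", match_form="NFC") == word)         # True
```

Passing result_form=None corresponds to the "whatever form" behaviour, which returns decomposed text whenever matching was done in NFD.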

Comment 10 Andreas J. Guelzow 2012-04-05 19:16:52 UTC

re 1: Since there are perfectly good reasons to do it either way, I think we may just want to have an option...

In German for example ä is usually considered a different letter than a. On the other hand many accents are considered just modifications...

re 2: We should probably avoid changing the format, but what are we doing for things such as strlen? STRLEN counts characters, so the answer would depend on the representation and may change with normalization.
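
To illustrate the length concern: in the sketch below Python's len, which counts code points, stands in for a character-counting function. That is an assumption about how such a function counts, not a statement about Gnumeric's implementation.

```python
import unicodedata

text = "\u00e4"  # a with diaeresis, precomposed

nfc = unicodedata.normalize("NFC", text)
nfd = unicodedata.normalize("NFD", text)

# Same visible letter, different code point counts.
print(len(nfc))  # 1
print(len(nfd))  # 2
```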

Comment 11 Morten Welinder 2012-04-06 15:57:34 UTC

We could have an option (suitably hidden on the "advanced" tab), but
I suspect it may be overkill for a spreadsheet.

Re 2: once we have normalized (regardless of how) there is no going back.
That means STRLEN will change after substitution.  Too bad, I think.
However, STRLEN is a good reason for returning the result in NFC.

I notice another problem: we don't normalize the search text.  It should
be subjected to the same normalization as the subject text.
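
A short sketch of why the search text needs the same treatment, assuming the subject text has been decomposed (NFD) before matching. A search string typed in precomposed form does not match until it is decomposed as well:

```python
import unicodedata

def nfd(s):
    """Decompose a string to NFD."""
    return unicodedata.normalize("NFD", s)

subject = nfd("caf\u00e9 au lait")  # subject text after normalization
pattern = "\u00e9"                  # search text typed as one code point

print(pattern in subject)       # False: forms differ, so no match
print(nfd(pattern) in subject)  # True: same form on both sides
```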

Comment 12 Morten Welinder 2012-04-06 19:10:18 UTC

This problem has been fixed in the development version. The fix will be available in the next major software release. Thank you for your bug report.

We now normalize to NFC after search/replace changes a cell.

Comment 13 Andreas J. Guelzow 2012-04-06 19:40:55 UTC

Perhaps I am not looking at the right place, but it looks to me that all of these normalizations act only on the string content without adjusting any existing pango attribute lists. This could be the cause for bug #673663.