Bug 670232 – String comparison fails on Korean (international) characters

After an evaluation, GNOME has moved from Bugzilla to GitLab. Learn more about GitLab.
No new issues can be reported in GNOME Bugzilla anymore.
To report an issue in a GNOME project, go to GNOME GitLab.
Do not go to GNOME Gitlab for: Bluefish, Doxygen, GnuCash, GStreamer, java-gnome, LDTP, NetworkManager, Tomboy.

Bug 670232 - String comparison fails on Korean (international) characters


Summary:	String comparison fails on Korean (international) characters


Status:	RESOLVED FIXED

Product:	Gnumeric
Classification:	Applications
Component:	Analytics
Version:	1.10.x
Hardware:	Other Linux

Importance:	Normal major
Target Milestone:	---
Assigned To:	Andreas J. Guelzow
QA Contact:	Jody Goldberg

URL:
Whiteboard:

Depends on:	670403
Blocks:

Reported:	2012-02-16 16:48 UTC by Alex Stark
Modified:	2015-12-12 19:26 UTC

See Also:
GNOME target:	---
GNOME version:	---

Attachments
Second column should be almost all FALSE (2.13 KB, application/x-gnumeric) 2012-02-16 16:48 UTC, Alex Stark		Details
proposed patch (827 bytes, patch) 2012-02-17 04:39 UTC, Andreas J. Guelzow	none	Details \| Review
Two strings, one the decomposition of the other (110 bytes, text/html) 2012-02-17 06:27 UTC, Alex Stark		Details
Test of extended-char EXACT behaviour (2.09 KB, application/x-gnumeric) 2012-02-17 16:02 UTC, Alex Stark		Details
updated patch (2.46 KB, patch) 2012-02-19 01:51 UTC, Andreas J. Guelzow	committed	Details \| Review
sample test file (12.08 KB, application/gnumeric) 2012-02-19 02:13 UTC, Andreas J. Guelzow		Details

Description Alex Stark 2012-02-16 16:48:00 UTC

Created attachment 207790 [details]
Second column should be almost all FALSE

EXACT(,) returns TRUE when strings obviously do not match.

I was comparing two Hangeul (Korean character) cells.

Comment 1 Alex Stark 2012-02-16 18:37:21 UTC

(In reply to comment #0)

Note that the string match function works in Open Office Calc.

Comment 2 Morten Welinder 2012-02-16 18:41:56 UTC

We end up calling g_utf8_collate which claims the strings are the same.

For reference, what does the content of A3 and A4 mean?

Comment 3 Alex Stark 2012-02-16 22:28:28 UTC

Well, I guess that's where the bug comes from.

In this case I was creating a column (B) with flags to show that there were duplicates in another column (A).  The test case was a simplification.

The A column is really just a list of dictionary words / phrases in Korean.  There is nothing "special" about the words, and therefore nothing that a library utility function should not be able to handle correctly.

Thanks!

Comment 4 Alex Stark 2012-02-16 22:44:49 UTC

it is hard to see that g_utf8_collate would be incorrect.  It is pretty poor if different utf8 "strings" return 0.

Can you check that the conversion under g_unichar_to_utf8() is yielding different strings?

Comment 5 Andreas J. Guelzow 2012-02-17 04:28:29 UTC

Well, we really should not be using g_utf8_collate here. According to its documentation: "Compares two strings for ordering using the linguistically correct rules for the current locale." I doubt that, at least in my locale, an ordering for these strings is sensibly defined.

If I change the code of gnumeric_exact to
static GnmValue *
gnumeric_exact (GnmFuncEvalInfo *ei, GnmValue const * const *argv)
{
	char const *f = value_peek_string (argv[0]);
	char const *s = value_peek_string (argv[1]);

	gboolean val = (g_utf8_collate (f,s) == 0);
	
	return value_new_bool (val);
}

I see:

Breakpoint 1, gnumeric_exact (ei=0xbfffdd5c, argv=0xbfffdc80)
    at functions.c:212
212		char const *f = value_peek_string (argv[0]);
(gdb) n
213		char const *s = value_peek_string (argv[1]);
(gdb) 
215		gboolean val = (g_utf8_collate (f,s) == 0);
(gdb) 
217		return value_new_bool (val);
(gdb) p val
$1 = 1
(gdb) p f
$2 = 0x876c550 "회화"
(gdb) p s
$3 = 0x881ed30 "후식"

I really don't find this to be to unexpected.

Comment 6 Andreas J. Guelzow 2012-02-17 04:39:50 UTC

Created attachment 207832 [details] [review]
proposed patch

I think we should not use collate, since it is locale dependent, but simply normalize the string and the compare it for byte-wise equality, as implemented by the attached patch.

Comment 7 Alex Stark 2012-02-17 06:27:10 UTC

Created attachment 207834 [details]
Two strings, one the decomposition of the other

You guys are good.

The locale thing perhaps explains why your Korean, etc, users haven't had problems.

I think that G_NORMALIZE_DEFAULT is correct, as in the path.  Just to be absolutely sure, I am attaching a file with two strings on two lines that _might_ be considered equal under "Hangul Jamo" composition, but should not be EXACT equal.  I am really not sure, since I have no ready means of testing.  (The truth is that I am only guessing what Hangul Jamo composition actually is.)

Comment 8 Andreas J. Guelzow 2012-02-17 06:30:45 UTC

Alex, would you be able to insert those two strings from Comment #7 into a gnumeric file and attach that one. Transfer of text from an html file into Gnumeric could change things.

Comment 9 Andreas J. Guelzow 2012-02-17 06:34:56 UTC

If I copy paste those strings into Gnumeric they _look_ very different and even with the proposed patch, exact returns FALSE for them.

The decomposition referred to in that patch is unicode decomposition which affects how a text is decomposed into various accents. This would be very different from any decomposition used within a language. By using G_NORMALIZE_DEFAULT we also retain some formatting differences.

Comment 10 Morten Welinder 2012-02-17 14:24:45 UTC

The code in question is from 2002.

The patch to normalize instead of using collate looks good to me and is
probably the effect that was originally hoped for.

Note, that Excel docs say EXACT is the same as "=".  We should have a look
there too.

Comment 11 Morten Welinder 2012-02-17 14:33:56 UTC

Actually, maybe we need a utility function to do the comparison, because
the patch looks mildly expensive.

Something like

gboolean
gnm_compare_strings (const char *a, const char *b)
{
  while (*a == *b && *a > 0 && *a < 127)
    a++, b++;

  if (*a != *b)
    return FALSE;
  if (*a == 0)
    return TRUE;

  --the expensive thing from the patch--
}


This assumes that any combined character starts with non-ascii.  Is that
true?  If not, we may have to back up a character.

Comment 12 Andreas J. Guelzow 2012-02-17 15:36:07 UTC

in gnm_compare_strings
  if (*a != *b)
    return FALSE;
is surely not correct. If we differ on a non-ascii, we need to normalize before we can tell that these are really different strings.

We only need normalization if the strings differ somewhere.

I have no idea whether any combined character instance starts with non-ascii, but doubt that that is the case.

Comment 13 Alex Stark 2012-02-17 16:02:29 UTC

Created attachment 207862 [details]
Test of extended-char EXACT behaviour

May need to be tested on a machine whose LOCALE is not compatible with Korean (Hangul).

Col B should only be TRUE in one or two places: where there are two EXACTly equal strings in col A.

Comment 14 Alex Stark 2012-02-17 16:19:59 UTC

Depending, perhaps, on your LOCALE, g_utf8_collate() might not be so fast either.

In my limited understanding, testing the first character would not be sufficient.  However, it might be worth testing all chars<127.  In other words, first do a simple string match that returns (a) equal, (b) different, or (c) non-ASCII.

A routine that first does an ASCII-style match might be legitimate and fast because:
1) Non-ASCII strings typically have a non-ASCII character at their start, or very early.  (e.g. Perhaps a Chinese word in parens.)
2) Many cells would be ASCII even in, say, a Japanese LOCALE.

The main downside might be that simple extensions to ASCII, such as a European character set, might trigger the more expensive comparison after unnecessary work has been done.  In other words, (1) might not apply.

Issues beyond my understanding:
A) Is testing char<127 sufficient to ensure that no normalization might be necessary? and
B) If a simple string comparison discovers a non-ascii character, would it be sufficient to apply the normalized comparison to the trailing sub-string?

Why isn't there a normalized-as-required string match?

Comment 15 Andreas J. Guelzow 2012-02-19 01:51:16 UTC

Created attachment 207965 [details] [review]
updated patch

This updated patch should be fast for ASCII only strings.

Comment 16 Andreas J. Guelzow 2012-02-19 02:02:04 UTC

Comment on attachment 207965 [details] [review]
updated patch

I have committed this patch. The utility function gnm_compare_strings is currently in plugins/fn-string/functions.c until we figure out the best place to put it. I am keping this bug open since the same function probably needs to be used for equality testing of strings.

Comment 17 Andreas J. Guelzow 2012-02-19 02:12:03 UTC

I am not sure where to find the code that tests equality (=A1=A2) when A1 and A2 contains strings. Its behaviour needs to be fixed since it yields TRUE on obviously distinct strings.

Comment 18 Andreas J. Guelzow 2012-02-19 02:13:54 UTC

Created attachment 207966 [details]
sample test file

In the attached test file, D6 should be FALSE. It incorrectly identifies two clearly distinct strings as equal when testing with "=".

Comment 19 Morten Welinder 2012-02-19 03:18:29 UTC

value_compare_real

Comment 20 Morten Welinder 2012-02-19 17:34:39 UTC

I have filed bug 670403 against glib.  I don't think returning 0 is correct
for g_utf8_collate.

Our string value comparison needs to be a total ordering.  If we can't get
that from from g_utf8_collate, then we need to augment it or look elsewhere.

Comment 21 Morten Welinder 2012-02-19 17:45:01 UTC

The sheet from comment 18 shows three different {\aa} rendered.
A2 and A3 are ok-ish with A2 being better.  A4 is awful in its
ring placement -- it's kind of fallen a bit down on the right side.
But they're really all wrong in that the ring should *not* touch
the A.  *sigh*

Comment 22 Andreas J. Guelzow 2012-02-19 23:20:18 UTC

The exact appearance of the glyphs is a font issue. And the typical North American font designer probably doesn't even know how those letters should look like.

Comment 23 Andreas J. Guelzow 2012-02-20 04:10:40 UTC

So ="a"="A" is supposed to be true??

Comment 24 Morten Welinder 2012-02-20 13:32:45 UTC

> So ="a"="A" is supposed to be true??

From memory, no.  This agrees:

http://office.microsoft.com/en-us/excel-help/exact-HP005209081.aspx

Comment 25 Morten Welinder 2012-02-20 13:42:52 UTC

Hold that thought.  On closer inspection, that page talks about "==", not "=".
I didn't even know there was a "==".

I'll need to fix an Excel to have a look.

Comment 26 Alex Stark 2012-02-20 17:14:49 UTC

This may be a separate bug.

Using the test file (of comment 13), doing a search-and-replace of "-" with empty string does unpleasant things to some cell contents.

Comment 27 Andreas J. Guelzow 2012-02-20 20:37:41 UTC

Using g_utf8_collate seems to be useless when using characters that are not part of the collation sequence for the current locale. 

While EXACT is patched to do something remotely useful, evaluation of equality "=" will still consider strings consisting of such characters equal. The same is likely happening everywhere else were equality is tested.

We should probably have a look at   http://www.unicode.org/reports/tr10/  and the sample code given there for locale specific collation.

In the case of Hangul syllables even http://www.unicode.org/reports/tr10/ considers this a difficult and non-standardized problem. See  7.1.5 Hangul Collation in that document.

Comment 28 Andreas J. Guelzow 2012-02-21 04:06:21 UTC

We could either implement the unicode collation algorithm ourselves or use the ICU library from http://site.icu-project.org/

Using the ICU library would likely be the simplest.

Morten, any preference?

Comment 29 Alex Stark 2012-02-21 21:53:04 UTC

Are you totally sure you need full collation?

For sorting this makes sense, but for equality testing, much of the string interpretation is unnecessary.

(For sorting, the "alphabet" order is not even the same in, say, North Korea as in South Korea.  Jamo decomposition makes it even more difficult.  Therefore the sort order necessarily depends on the LOCALE.  I didn't see a means of setting the LOCALE for a document.)

Comment 30 Morten Welinder 2012-02-21 22:01:41 UTC

Excel has TRUE for ="a"="A".  My copy of Excel barfs at "==".

> Are you totally sure you need full collation?
For EXACT, no.  But for the larger problem, yes.

> Therefore the sort order necessarily depends on the LOCALE.
That's true even for Western European languages.

> I didn't see a means of setting the LOCALE for a document
We use the locale setting.  That's per-Gnumeric instance, not per
document.

Comment 31 Morten Welinder 2015-12-12 19:26:58 UTC

I believe this was actually fixed at the time, so this bug should have
been closed then.


This problem has been fixed in our software repository. The fix will go into the next software release. Once that release is available, you may want to check for a software upgrade provided by your Linux distribution.