GNOME Bugzilla – Bug 670232
String comparison fails on Korean (international) characters
Last modified: 2015-12-12 19:26:58 UTC
Created attachment 207790 [details] Second column should be almost all FALSE EXACT(,) returns TRUE when strings obviously do not match. I was comparing two Hangeul (Korean character) cells.
(In reply to comment #0) Note that the string match function works in Open Office Calc.
We end up calling g_utf8_collate which claims the strings are the same. For reference, what does the content of A3 and A4 mean?
Well, I guess that's where the bug comes from. In this case I was creating a column (B) with flags to show that there were duplicates in another column (A). The test case was a simplification. The A column is really just a list of dictionary words / phrases in Korean. There is nothing "special" about the words, and therefore nothing that a library utility function should not be able to handle correctly. Thanks!
it is hard to see that g_utf8_collate would be incorrect. It is pretty poor if different utf8 "strings" return 0. Can you check that the conversion under g_unichar_to_utf8() is yielding different strings?
Well, we really should not be using g_utf8_collate here. According to its documentation: "Compares two strings for ordering using the linguistically correct rules for the current locale." I doubt that, at least in my locale, an ordering for these strings is sensibly defined. If I change the code of gnumeric_exact to static GnmValue * gnumeric_exact (GnmFuncEvalInfo *ei, GnmValue const * const *argv) { char const *f = value_peek_string (argv[0]); char const *s = value_peek_string (argv[1]); gboolean val = (g_utf8_collate (f,s) == 0); return value_new_bool (val); } I see: Breakpoint 1, gnumeric_exact (ei=0xbfffdd5c, argv=0xbfffdc80) at functions.c:212 212 char const *f = value_peek_string (argv[0]); (gdb) n 213 char const *s = value_peek_string (argv[1]); (gdb) 215 gboolean val = (g_utf8_collate (f,s) == 0); (gdb) 217 return value_new_bool (val); (gdb) p val $1 = 1 (gdb) p f $2 = 0x876c550 "회화" (gdb) p s $3 = 0x881ed30 "후식" I really don't find this to be to unexpected.
Created attachment 207832 [details] [review] proposed patch I think we should not use collate, since it is locale dependent, but simply normalize the string and the compare it for byte-wise equality, as implemented by the attached patch.
Created attachment 207834 [details] Two strings, one the decomposition of the other You guys are good. The locale thing perhaps explains why your Korean, etc, users haven't had problems. I think that G_NORMALIZE_DEFAULT is correct, as in the path. Just to be absolutely sure, I am attaching a file with two strings on two lines that _might_ be considered equal under "Hangul Jamo" composition, but should not be EXACT equal. I am really not sure, since I have no ready means of testing. (The truth is that I am only guessing what Hangul Jamo composition actually is.)
Alex, would you be able to insert those two strings from Comment #7 into a gnumeric file and attach that one. Transfer of text from an html file into Gnumeric could change things.
If I copy paste those strings into Gnumeric they _look_ very different and even with the proposed patch, exact returns FALSE for them. The decomposition referred to in that patch is unicode decomposition which affects how a text is decomposed into various accents. This would be very different from any decomposition used within a language. By using G_NORMALIZE_DEFAULT we also retain some formatting differences.
The code in question is from 2002. The patch to normalize instead of using collate looks good to me and is probably the effect that was originally hoped for. Note, that Excel docs say EXACT is the same as "=". We should have a look there too.
Actually, maybe we need a utility function to do the comparison, because the patch looks mildly expensive. Something like gboolean gnm_compare_strings (const char *a, const char *b) { while (*a == *b && *a > 0 && *a < 127) a++, b++; if (*a != *b) return FALSE; if (*a == 0) return TRUE; --the expensive thing from the patch-- } This assumes that any combined character starts with non-ascii. Is that true? If not, we may have to back up a character.
in gnm_compare_strings if (*a != *b) return FALSE; is surely not correct. If we differ on a non-ascii, we need to normalize before we can tell that these are really different strings. We only need normalization if the strings differ somewhere. I have no idea whether any combined character instance starts with non-ascii, but doubt that that is the case.
Created attachment 207862 [details] Test of extended-char EXACT behaviour May need to be tested on a machine whose LOCALE is not compatible with Korean (Hangul). Col B should only be TRUE in one or two places: where there are two EXACTly equal strings in col A.
Depending, perhaps, on your LOCALE, g_utf8_collate() might not be so fast either. In my limited understanding, testing the first character would not be sufficient. However, it might be worth testing all chars<127. In other words, first do a simple string match that returns (a) equal, (b) different, or (c) non-ASCII. A routine that first does an ASCII-style match might be legitimate and fast because: 1) Non-ASCII strings typically have a non-ASCII character at their start, or very early. (e.g. Perhaps a Chinese word in parens.) 2) Many cells would be ASCII even in, say, a Japanese LOCALE. The main downside might be that simple extensions to ASCII, such as a European character set, might trigger the more expensive comparison after unnecessary work has been done. In other words, (1) might not apply. Issues beyond my understanding: A) Is testing char<127 sufficient to ensure that no normalization might be necessary? and B) If a simple string comparison discovers a non-ascii character, would it be sufficient to apply the normalized comparison to the trailing sub-string? Why isn't there a normalized-as-required string match?
Created attachment 207965 [details] [review] updated patch This updated patch should be fast for ASCII only strings.
Comment on attachment 207965 [details] [review] updated patch I have committed this patch. The utility function gnm_compare_strings is currently in plugins/fn-string/functions.c until we figure out the best place to put it. I am keping this bug open since the same function probably needs to be used for equality testing of strings.
I am not sure where to find the code that tests equality (=A1=A2) when A1 and A2 contains strings. Its behaviour needs to be fixed since it yields TRUE on obviously distinct strings.
Created attachment 207966 [details] sample test file In the attached test file, D6 should be FALSE. It incorrectly identifies two clearly distinct strings as equal when testing with "=".
value_compare_real
I have filed bug 670403 against glib. I don't think returning 0 is correct for g_utf8_collate. Our string value comparison needs to be a total ordering. If we can't get that from from g_utf8_collate, then we need to augment it or look elsewhere.
The sheet from comment 18 shows three different {\aa} rendered. A2 and A3 are ok-ish with A2 being better. A4 is awful in its ring placement -- it's kind of fallen a bit down on the right side. But they're really all wrong in that the ring should *not* touch the A. *sigh*
The exact appearance of the glyphs is a font issue. And the typical North American font designer probably doesn't even know how those letters should look like.
So ="a"="A" is supposed to be true??
> So ="a"="A" is supposed to be true?? From memory, no. This agrees: http://office.microsoft.com/en-us/excel-help/exact-HP005209081.aspx
Hold that thought. On closer inspection, that page talks about "==", not "=". I didn't even know there was a "==". I'll need to fix an Excel to have a look.
This may be a separate bug. Using the test file (of comment 13), doing a search-and-replace of "-" with empty string does unpleasant things to some cell contents.
Using g_utf8_collate seems to be useless when using characters that are not part of the collation sequence for the current locale. While EXACT is patched to do something remotely useful, evaluation of equality "=" will still consider strings consisting of such characters equal. The same is likely happening everywhere else were equality is tested. We should probably have a look at http://www.unicode.org/reports/tr10/ and the sample code given there for locale specific collation. In the case of Hangul syllables even http://www.unicode.org/reports/tr10/ considers this a difficult and non-standardized problem. See 7.1.5 Hangul Collation in that document.
We could either implement the unicode collation algorithm ourselves or use the ICU library from http://site.icu-project.org/ Using the ICU library would likely be the simplest. Morten, any preference?
Are you totally sure you need full collation? For sorting this makes sense, but for equality testing, much of the string interpretation is unnecessary. (For sorting, the "alphabet" order is not even the same in, say, North Korea as in South Korea. Jamo decomposition makes it even more difficult. Therefore the sort order necessarily depends on the LOCALE. I didn't see a means of setting the LOCALE for a document.)
Excel has TRUE for ="a"="A". My copy of Excel barfs at "==". > Are you totally sure you need full collation? For EXACT, no. But for the larger problem, yes. > Therefore the sort order necessarily depends on the LOCALE. That's true even for Western European languages. > I didn't see a means of setting the LOCALE for a document We use the locale setting. That's per-Gnumeric instance, not per document.
I believe this was actually fixed at the time, so this bug should have been closed then. This problem has been fixed in our software repository. The fix will go into the next software release. Once that release is available, you may want to check for a software upgrade provided by your Linux distribution.