GNOME Bugzilla – Bug 531403
g_utf8_collate broken on Mac
Last modified: 2008-06-17 18:10:20 UTC
g_utf8_collate is broken on Mac, apparently because wcscoll() and friends are broken. Attached patch makes glib use carbon API for collation and adds a new structure, GCollationKey, because I have no idea how to convert carbon collation key to a string (its format is not as described in UTS 10). Also attached is a patch for Gtk to use new API in filechooser. As result, collation test in glib doesn't fail, and GtkFileChooser sorts files like Finder does.
Created attachment 110357 [details] [review] Glib patch This adds GCollationKey structure and new API to deal with it; but implementation simply casts char* to that and back on platforms other than Mac, and uses g_utf8_collate_key*. On Mac, it uses Carbon API, and the structure actually contains Carbon collation key. g_utf8_collate_key and g_utf8_collate_key_for_filename are unchanged and broken on Mac.
Created attachment 110358 [details] [review] Gtk patch
How is wcscoll() and friends broken? Needless to say, I'm not a huge fan of this...
Sorry, it seems I forgot about the main thing. So, wcscoll() on Mac (10.4.11 on PPC) does not do what it should, it simply does what wcscmp() does. I tried it with en_US.UTF-8, en_US.ISO8859-1, en_US.ISO8859-15 (all listed in 'locale -a' output), it does same thing as strcmp(). And it's not only wcscoll, strcoll() does same thing, i.e. collation is broken, not specifically wide char API. GtkFileChooser sorts files as in "C" locale - ChangeLog before autogen.sh. Here's the test program I tried (I tried two, this and the same one using wide char API): #include <string.h> #include <stdio.h> #include <stdlib.h> #include <locale.h> static int cmp_strings (const void *p1, const void *p2) { const char * const *ps1 = p1; const char * const *ps2 = p2; return strcoll (*ps1, *ps2); } int main (void) { unsigned i; char *array[] = {"a", "ab", "B", "BA", "b"}; setlocale (LC_ALL, ""); qsort (array, 5, sizeof (char*), cmp_strings); for (i = 0; i < 5; ++i) puts (array[i]); return 0; } Result is: B, BA, a, ab, b.
So, basically that their libc collation rules are bogus.
Bug #333977 is the genesis here. Would be good to put this issue to bed once and for all. Either OS X has a different sort-order in its libSystem and gnome should obey it (so everything on Mac is self-consistent) which means changing the test expected result, or else glib should sort the same on all platforms and therefore need to implement something itself (or rely on some lib) instead of using some system function here.
Created attachment 111986 [details] [review] patch New patch, no new structures, instead convert Carbon collation key to a string as Behdad suggested. The result is not 100% equivalent to sorting with Carbon API, but it seems to be good enough: I sorted text from planet.gnome.org and Chinese and Russian wikipedias, and the difference was that Carbon compared some strings as equal while this new glib routine didn't; that happened only in cases like the following (first number is Carbon comparison, second is Glib's): '‘PackageKit’.' and 'PackageKit': 0 vs -1 'LinuxTag' and 'LinuxTag!': 0 vs -1 ''/org/bansheeproject/Banshee/PlayerEngine')' and '"org.bansheeproject.Banshee.PlayerEngine")': 0 vs -1 '"/org/bansheeproject/Banshee/PlayerEngine")' and ''/org/bansheeproject/Banshee/PlayerEngine')': 0 vs 1 '¾' and 'ответственности': 0 vs -1 '¾' and 'Порт-Вила': 0 vs -1 'Малвату-Маури,' and 'Малвату-Маури': 0 vs 1 '¾' and '¾': 0 vs 1 Note the last one. I guess that's because one of them is decomposed or something. Behdad, is it good enough? (It passes glib test, if that's worth anything)
The output is definitely good enough. Reviewing the patch now. To make it match Carbon I guess you just need to stop upon seeing a '\001' byte. At least that's what the docs said.
So, I think you don't even need the utf8 encoding. Just exchange 0 and 1 bytes and stop at the first (new!) nul byte!
(The patch is broken, it fails to zero-terminate the result) So, I tested string comparison instead of comparing keys, and the results are identical: namely, g_utf8_collate (which uses UCCompareTextDefault) produces same results as strcmp(g_utf8_collate_key()). Either UCCompareTextDefault doesn't use their collation key comparison routine or something else is wrong. Anyway, g_utf8_collate and g_utf8_collate_key from the patch agree, and filechooser seems to show files in right order, regardless why it happens. Also, sorting breaks if you stop at \001. Perhaps it doesn't break if you stop at the int16 or int32 (whatever is used there) with value of 1, no idea.
Created attachment 111990 [details] [review] patch Working patch (identical to previous one, except it terminates the result with zero byte).
The Apple docs say a 1 separate significant and insignificant fields. So most probably UCCompareTextDefault() is stopping upon seeing a 1, but for sorting you really need to compare all fields. So yeah, either your patch, or just replace 0 with 1. Your's safer, so lets go with it.
2008-06-02 Yevgen Muntyan <muntyan@tamu.edu> Bug 531403 – g_utf8_collate broken on Mac. * glib/gunicollate.c: (g_utf8_collate): use UCCompareTextDefault; (collate_key_to_string), (carbon_collate_key_with_collator), (carbon_collate_key), (carbon_collate_key_for_filename): new functions using Carbon API to get collate key for g_utf8_collate_key() and g_utf8_collate_key_for_filename(); (g_utf8_collate_key), (g_utf8_collate_key_for_filename): use those.
*** Bug 538088 has been marked as a duplicate of this bug. ***