GNOME Bugzilla – Bug 673047
gunicollate is broken on OS X (patch included!)
Last modified: 2017-03-09 03:38:03 UTC
Created attachment 210846 [details] patch to make Unicode collation work on OS X This is a problem that has been reported several times in the past, but was always misinterpreted. Starting with OS X 10.6 (it seems), glib has been doing Unicode collation incorrectly on OS X, resulting in bugs such as file lists being sorted incorrectly in GTK+. Someone at MacPorts misunderstood this as a 64-bit issue, and "worked around" it by disabling glib's use of OS X's collation routines; most Mac users of glib ended up using this patch, and so for years glib on OS X has been doing Unicode collation in the C locale, while the real bug remained undiscovered. Until now! The actual issue was that, starting in 10.6 (or so it seems), the output of UCGetCollationKey changed in such a way that glib's code did not convert the key correctly (assuming it ever did; I have not bothered tracking down an old version of OS X to check). The real problem here is that there is an API mismatch between glib and OS X: glib offers an API to generate a sort key, and guarantees that callers will get the correct results by sorting these keys with strcmp, while OS X offers an API to generate sort keys and another to compare them, so that the actual structure of the keys may change. I did a bit of reverse engineering and changed the code to work with the sort keys generated by UCGetCollationKey on 10.7; it should probably cover 10.6 too, but I'll have more information on that later. Ideally, though, the API mismatch should be eliminated, either by glib adding a key comparison function (which could then call through to UCCompareCollationKeys), or by contacting Apple and getting them to add an API that produces keys guaranteed to work with strcmp. If glib wants to keep supporting OS X 10.5, the old version of collate_key_to_string should probably be kept and conditionalized by OS X version. It would be nice if someone with more knowledge of autotools arcana could help set that up. Executive summary: Unicode collation with glib is broken on OS X, has been broken for a while, this patch fixes it.
Bug #612019 is (at least one of?) the reports of the problem, with links/comments about the MacPorts (non-)solution.
Update: I tested the patch on 10.6, and it solves the issue there as well.
Thanks, Daniel. Bug #644686 should be the same problem too.
Created attachment 212412 [details] [review] patch to make Unicode collation work on OS X 10.6 and above I updated my patch. It turns out that the same string, in the same program, does not always get the same key at runtime. In particular, the second entry of the key can change: for many strings it's usually 1, but sometimes it becomes 0. This caused intermittent failures in the collate test. The new patch works reliably.
Created attachment 212477 [details] [review] patch to fix Unicode collation on OS X After doing more testing on various systems, it turned out that the original patch only worked on 64-bit systems. This updated version works correctly on both 64- and 32-bit systems.
Kris, can you have a look at this please? I get reports on #gimp that this seems to nicely fix several issues.
I applied the patch from Comment #5 against glib-2.22.4 and it now fails one self-test in 64-bit mode on OS X 10.6: > 1d0 > < bla4 > 6a6 > > bla4 > Test failed: unexpected error when using g_utf8_collate_key() on ./collate/collate-2.in > FAIL: run-collate-tests.sh The full tests/collate.out is: > bla4 > bla001 > bla02 > bla03 > bla10 > bla100 > event.c > eventgenerator.c > event.h > file2.bla > file3.xx > file.c > file.txt Obviously I'm using a fairly old glib version, but figured it was worth a data-point about the proposed solution.
Without the patch, glib-2.34.0 passes run-collate-tests.sh on 10.6 in 32-bit mode and fails in 64-bit mode (as expected). With the patch, it fails in both modes with the same ./collate/collate-2.in blah4 situation.
Fixed in master and glib-2-34: commit 42be40a1f88a55ff7d333e94ac2119d2b1867e9a Author: Michael Natterer <mitch@gimp.org> Date: Fri Nov 16 18:48:09 2012 +0100 Bug 673047 - gunicollate is broken on OS X Apply slightly modified patch from Camillo Lugaresi which fixes gunicollate for OSX >= 10.6. It was totally hilariously broken for anyone on 10.6 and later, I dont know if it's now broken on 10.5, but better fix it for the vast majority of users. (cherry picked from commit ef8510be09a746dcbc8d470376688820b225c40f) glib/gunicollate.c | 29 ++++++++++++++--------------- 1 file changed, 14 insertions(+), 15 deletions(-)
*** Bug 612019 has been marked as a duplicate of this bug. ***
*** Bug 644686 has been marked as a duplicate of this bug. ***
This patch does not pass self-test on 10.6 in 64-bit mode (-arch x86_64, the same platform where the original code is broken). Using glib-2.34.2 with 42be40a1f88a55ff7d333e94ac2119d2b1867e9a applied: 1,2c1 < c < z --- > 223 5c4,5 < foo --- > c > eer34 7c7 < 223 --- > foo 9c9 < eer34 --- > z Test failed: unexpected error when using g_utf8_collate_key() on ./collate/collate-1.in FAIL: run-collate-tests.sh Passes on 32-bit (-arch i386), which the original code also did.
In addition, there is another earlier test-failure related to collation: TEST: collate... (pid=15855) /unicode/collate/0: OK /unicode/collate-key/0: ** GLib:ERROR:collate.c:71:do_collate: assertion failed (str == test->sorted[i]): ("c" == "223") FAIL on 64-bit but passes on 32-bit.
I tried make check but some earlier test failed. After some hesitation i pushed anyway, because it is obviously an improvement, even if untested :)
*** Bug 391461 has been marked as a duplicate of this bug. ***
Should we reopen the bug until the tests are fixed?
probably I guess... can someone summarize what is broken where at this point, and what we know about why?
glib-2.36.2 on OS X 10.8 fails self-test, with "bla4" misplaced.
*** Bug 719403 has been marked as a duplicate of this bug. ***
*** Bug 729155 has been marked as a duplicate of this bug. ***
2.40.0 built for 64-bit arch still fails: Test failed: unexpected error when using g_utf8_collate_key_for_filename() on ./collate/collate-2.in collate.out is: bla001 bla02 bla03 bla10 bla100 bla4 event.c event.h eventgenerator.c file.c file.txt file2.bla file3.xx Now it appears that "(text)(digits)" is treating the (digits) as regular characters instead of a unified numerical value? As before (Bug #612019), undef'ing HAVE_CARBON also does not fix it, it just makes the failure different (both g_utf8_collate_key() and g_utf8_collate_key_for_filename() fail for on ./collate/collate-1.in, giving results of: 223 GTK+ bar baz c eer34 er1 foo z but the expected output has "GTK+" between "foo" and "z". Just a difference of case-sensitivity? And g_utf8_collate_key() fails for ./collate/collate-2.in same as for with carbon ("blah4" misplaced).
Created attachment 307754 [details] [review] New fix. Tested. Current status and summary: 1) HEAD is currently unable to pass collation tests on any of my darwin platforms (ppc/i386/x86_64). 2) OS X locales have a different collation order than glibc locales. That is the reason of undef'ing HAVE_CARBON is also failing. As generating locales is not so easy (localedef seems to be broken/unmaintained), the sensible way IMHO is go on using Carbon for collation on OS X (as is currently done). 3) Carbon provides a function to compare collation keys. This is a different API approach as glib, which returns keys which can be compared with strcmp. 4) UCCollationKey to "strcmp'able key" conversion was broken, apparently because UCCollationKey internal structure had been wrongly reverse-engineered. I did some experiements and I think I have got the this right now (though a field has still some bits with unknown meaning). See code comments for details. Attached patch fixes this. Applying it makes all collation tests pass on platforms: powerpc-apple-darwin8.11.0 i386-apple-darwin9.8.0 x86_64-apple-darwin13.4.0