GNOME Bugzilla – Bug 55836
need locale-sensitive sorting for UTF-8 strings (g_utf8_strcoll?)
Last modified: 2011-02-18 15:47:37 UTC
There's code that needs to do strcasecmp on UTF-8 strings.
In fact, it seems that what we really need is a function that compares UTF-8 strings properly for sorting, more like strcoll than strcasecmp.
Hard problem. The cheat is to convert to the encoding of the locale, call strcoll(), and convert back, but this doesn't give an ordering for strings that can't be represented in the current locale. Very recent versions of GNU libc on Linux have some functions that allow locale operations in a non-current locale, so you might be able to use them to do strcoll() in de_DE.UTF-8 even if the current locale is de_DE.iso-8859-1. Or you could implement the Unicode Collation Algorithm: http://www.unicode.org/unicode/reports/tr10/ That is probably several weeks of work, not counting finding the correct tailoring data for interesting locales. g_utf8_strcmp() is just strcmp(): byte-wise comparison of UTF-8 sorts in code point order. g_utf8_strcasecmp() could be done by applying g_unichar_tolower() character by character.
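The remark that UTF-8 "has that property" can be checked directly: for well-formed UTF-8, a byte-wise comparison such as strcmp() gives the same ordering as comparing the decoded code points. The following is a minimal sketch in plain C, not GLib code; `utf8_next()`, `utf8_codepoint_cmp()` and `orders_agree()` are hypothetical names, and the decoder assumes valid input and does no validation.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: decode one code point from well-formed UTF-8
 * and advance *p past it. No validation is performed. */
static uint32_t utf8_next(const unsigned char **p)
{
    const unsigned char *s = *p;
    uint32_t c;
    int extra;

    if (s[0] < 0x80)      { c = s[0];        extra = 0; }
    else if (s[0] < 0xE0) { c = s[0] & 0x1F; extra = 1; }
    else if (s[0] < 0xF0) { c = s[0] & 0x0F; extra = 2; }
    else                  { c = s[0] & 0x07; extra = 3; }

    s++;
    while (extra-- > 0)
        c = (c << 6) | (*s++ & 0x3F);
    *p = s;
    return c;
}

/* Compare two UTF-8 strings code point by code point. */
static int utf8_codepoint_cmp(const char *a, const char *b)
{
    const unsigned char *pa = (const unsigned char *)a;
    const unsigned char *pb = (const unsigned char *)b;

    for (;;) {
        uint32_t ca = utf8_next(&pa);
        uint32_t cb = utf8_next(&pb);
        if (ca != cb)
            return ca < cb ? -1 : 1;
        if (ca == 0)            /* both strings ended together */
            return 0;
    }
}

/* Returns 1 if strcmp() and the code point comparison agree in sign. */
static int orders_agree(const char *a, const char *b)
{
    int x = strcmp(a, b);
    int y = utf8_codepoint_cmp(a, b);
    return (x < 0) == (y < 0) && (x > 0) == (y > 0);
}
```

This ordering is by code point only; it says nothing about what a given locale considers the correct sort order, which is the harder problem discussed here.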
It would be cruel to leave this as an exercise for the programmer. Programs that sort things should use a locale-sensitive sort like strcoll to make people in non-US countries happy. I think that coming up with a UTF-8 version of it is part of the price for switching to UTF-8. Maybe we can ship GNOME 2 without solving this problem. I don't know. If the current locale is de_DE.iso-8859-1, you can switch to de_DE.UTF-8 and do the strcoll call and switch back. So functions allowing locale operations aren't necessarily required. But I don't see how you'd know when you need to use a locale other than the current one and what locale to use. It would be nice to have g_utf8_strcasecmp, and it would be easy to code it, but I guess that doesn't really help with the problem.
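The "switch to another locale, call strcoll, and switch back" idea could look roughly like the sketch below, using only standard C. `strcoll_in_locale()` is a hypothetical name, and the caller is assumed to pass strings already encoded for the target locale and to know that the named locale exists. Note that setlocale() changes state for the whole process, so this is not safe in a threaded program.

```c
#include <locale.h>
#include <stdlib.h>
#include <string.h>

/* Compare a and b with strcoll() under the named locale, then restore
 * the previous LC_COLLATE setting. Returns 0 and stores the result in
 * *result on success, or -1 if the locale could not be selected.
 * Not thread-safe: setlocale() affects the whole process. */
static int strcoll_in_locale(const char *locale_name,
                             const char *a, const char *b, int *result)
{
    const char *cur = setlocale(LC_COLLATE, NULL);
    char *saved;
    size_t n;

    if (cur == NULL)
        return -1;

    /* Copy the name: a later setlocale() call may overwrite it. */
    n = strlen(cur) + 1;
    saved = malloc(n);
    if (saved == NULL)
        return -1;
    memcpy(saved, cur, n);

    if (setlocale(LC_COLLATE, locale_name) == NULL) {
        free(saved);
        return -1;
    }

    *result = strcoll(a, b);

    setlocale(LC_COLLATE, saved);   /* switch back */
    free(saved);
    return 0;
}
```

A call such as `strcoll_in_locale("de_DE.UTF-8", s1, s2, &r)` would only succeed on a system where that locale is installed, which is exactly the open question of knowing which locale to use.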
Well, there is not even a guarantee that there will be a corresponding UTF-8 locale on the system. The problem with switching and switching back is that the locale is application-wide, not thread-wide. (g_strtod() is buggy in this way.) The lack of a UTF-8 strcoll for GLib-2.0 is something we've been worrying about, but I don't see any easy resolution. Maybe we could do a hack and call it g_utf8_strcoll() for now: use strcoll() if both strings are convertible to the current locale, always sort a convertible string before a non-convertible one, and use strcmp() if neither string is convertible.
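The three-way rule proposed above can be sketched in standard C as follows. This is an illustration under stated assumptions, not the code that went into GLib: `utf8_to_locale()` and `utf8_strcoll_fallback()` are hypothetical names, the input is assumed to be well-formed UTF-8, wchar_t is assumed to hold Unicode code points, and the locale encoding is assumed to be stateless so that `MB_CUR_MAX` bytes per character is enough.

```c
#include <stdlib.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper: convert well-formed UTF-8 to a newly allocated
 * string in the current locale's encoding. Returns NULL if some
 * character cannot be represented there, or on allocation failure.
 * Assumes wchar_t holds Unicode code points. */
static char *utf8_to_locale(const char *utf8)
{
    const unsigned char *s = (const unsigned char *)utf8;
    size_t n = 0, max;
    wchar_t *w = malloc((strlen(utf8) + 1) * sizeof *w);
    char *out;

    if (w == NULL)
        return NULL;
    while (*s) {
        unsigned long c;
        int extra;
        if (*s < 0x80)      { c = *s;        extra = 0; }
        else if (*s < 0xE0) { c = *s & 0x1F; extra = 1; }
        else if (*s < 0xF0) { c = *s & 0x0F; extra = 2; }
        else                { c = *s & 0x07; extra = 3; }
        s++;
        while (extra-- > 0 && *s)
            c = (c << 6) | (*s++ & 0x3F);
        w[n++] = (wchar_t)c;
    }
    w[n] = L'\0';

    max = n * MB_CUR_MAX + 1;
    out = malloc(max);
    if (out != NULL && wcstombs(out, w, max) == (size_t)-1) {
        free(out);              /* not representable in this locale */
        out = NULL;
    }
    free(w);
    return out;
}

/* Sketch of the proposed ordering. */
static int utf8_strcoll_fallback(const char *a, const char *b)
{
    char *la = utf8_to_locale(a);
    char *lb = utf8_to_locale(b);
    int r;

    if (la && lb)   r = strcoll(la, lb);  /* both convertible */
    else if (la)    r = -1;               /* convertible sorts first */
    else if (lb)    r = 1;
    else            r = strcmp(a, b);     /* neither convertible */

    free(la);
    free(lb);
    return r;
}
```

The result is a consistent total order, so it can be replaced later by a proper collation implementation without changing the interface.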
I like your proposed simple hack to use the underlying strcoll. We can easily improve it compatibly later on. Should I write the code to do this and attach a patch, or is that a waste of time?
Created attachment 658: Simple attempt at writing fallback strcoll()
Another fallback technique for non-UTF-8 locales: if __STDC_ISO_10646__ is defined, convert to UCS-4, then use wcscoll().
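A rough sketch of that wide-character fallback, in standard C: when `__STDC_ISO_10646__` is defined, wchar_t values are ISO 10646 code points, so UTF-8 can be decoded straight into a wchar_t array and handed to wcscoll(). The names `utf8_to_wcs()` and `utf8_wcscoll()` are hypothetical, the decoder assumes well-formed input, and the strcmp() branch is only a placeholder for platforms without the macro.

```c
#include <stdlib.h>
#include <string.h>
#include <wchar.h>

#ifdef __STDC_ISO_10646__
/* Hypothetical helper: decode well-formed UTF-8 into a newly
 * allocated UCS-4 wchar_t string (NULL on allocation failure). */
static wchar_t *utf8_to_wcs(const char *utf8)
{
    const unsigned char *s = (const unsigned char *)utf8;
    /* One wchar_t per input byte is always enough. */
    wchar_t *w = malloc((strlen(utf8) + 1) * sizeof *w);
    size_t n = 0;

    if (w == NULL)
        return NULL;
    while (*s) {
        unsigned long c;
        int extra;
        if (*s < 0x80)      { c = *s;        extra = 0; }
        else if (*s < 0xE0) { c = *s & 0x1F; extra = 1; }
        else if (*s < 0xF0) { c = *s & 0x0F; extra = 2; }
        else                { c = *s & 0x07; extra = 3; }
        s++;
        while (extra-- > 0 && *s)
            c = (c << 6) | (*s++ & 0x3F);
        w[n++] = (wchar_t)c;
    }
    w[n] = L'\0';
    return w;
}
#endif

/* Collate two UTF-8 strings through wcscoll() where wchar_t is UCS-4;
 * otherwise fall back to plain byte comparison. */
static int utf8_wcscoll(const char *a, const char *b)
{
#ifdef __STDC_ISO_10646__
    wchar_t *wa = utf8_to_wcs(a);
    wchar_t *wb = utf8_to_wcs(b);
    int r = (wa && wb) ? wcscoll(wa, wb) : strcmp(a, b);

    free(wa);
    free(wb);
    return r;
#else
    return strcmp(a, b);
#endif
}
```

Unlike converting to the locale's multibyte encoding, this path does not fail for characters the locale charset cannot represent, although how the C library orders such characters is up to its collation tables.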
gint g_utf8_collate (const gchar *str1, const gchar *str2);
gchar *g_utf8_collate_key (const gchar *str, gssize len);
Now committed.
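The split into a compare function and a key function mirrors the standard C pair strcoll() and strxfrm(): a collation key is computed once per string, and keys are then compared with plain strcmp(), which is cheaper when sorting many strings. The sketch below shows that pattern with the standard C functions only; it is not the GLib implementation, and `make_sort_key()` and `compare_keys()` are hypothetical names.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: return a newly allocated sort key for s, built
 * with the standard strxfrm(). Keys compare with strcmp() the way the
 * original strings compare with strcoll(). The caller frees the key. */
static char *make_sort_key(const char *s)
{
    size_t need = strxfrm(NULL, s, 0);  /* key length without the NUL */
    char *key = malloc(need + 1);

    if (key != NULL)
        strxfrm(key, s, need + 1);
    return key;
}

/* qsort() callback for an array of precomputed keys. */
static int compare_keys(const void *a, const void *b)
{
    return strcmp(*(char *const *)a, *(char *const *)b);
}
```

With GLib the analogous usage would presumably be to call g_utf8_collate_key() once per string and sort on the returned keys, using g_utf8_collate() for one-off comparisons.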