GNOME Bugzilla – Bug 55852
Do we need anything between strcmp and g_utf8_strcoll for UTF-8?
Last modified: 2011-02-18 15:47:37 UTC
We've been discussing how it might be nice to have g_utf8_strcasecmp. But the Unicode standard describes a number of levels between strcmp and full "collation".

The report at http://unicode.org/unicode/reports/tr15/ describes normalization, with 4 forms: D, C, KD, and KC. Form C is supposed to be the rule used for URLs. This makes it clear that some applications that formerly used strcmp to compare strings might want to compare UTF-8 strings in a way that ignores differences that come only from how the string was typed and that would be invisible when the string is displayed. One example where this might come up is checking whether someone typed a password correctly.

The report at http://unicode.org/unicode/reports/tr21/#Caseless%20Matching says that caseless matching is done by case folding, which "is more than just conversion to lowercase". So a good implementation of a UTF-8 strcasecmp would not simply be based on conversion to lowercase. Sadly, there are four flavors of case folding, roughly summarized as "simple folding", "full folding", "simple folding handling dotted I", and "full folding handling dotted I".

I could imagine adding:
- a function that does form C normalization (g_utf8_str_normalize?)
- another that does form C normalization and full case folding (g_utf8_str_fold_case?), to be used where people might use g_strdown today
- another that does comparison with normalization (g_utf8_strnormcmp?), to be used in some places where strcmp is used today
- another that does comparison with normalization and case folding (g_utf8_strcasecmp?), to be used in some places where people use g_strcasecmp today
- perhaps explicit calls to normalize with any of the 4 algorithms (g_utf8_str_normalize_full?) and to case fold with any of the 4 algorithms (g_utf8_str_fold_case_full?)

An argument against doing any of this is that programs should instead use g_ascii_strdown, g_ascii_strcasecmp, and g_utf8_strcoll.
This might also be a waste of time -- we could just wait until we see real user problems and then go back and add these operations as needed to fix those problems. I hope having this bug report turns out to be useful. (I would have cc'd to trow@ximian
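To make the normalization concern above concrete, here is a minimal sketch in plain C showing that strcmp treats two spellings of the same displayed character as different strings. The two byte sequences are the standard UTF-8 encodings of "é" in precomposed and decomposed form; the helper name bytes_equal is hypothetical and not part of any library.

```c
#include <string.h>

/* "é" as the single precomposed code point U+00E9, encoded in UTF-8. */
static const char composed[] = "\xC3\xA9";

/* "é" as "e" followed by U+0301 COMBINING ACUTE ACCENT, encoded in UTF-8. */
static const char decomposed[] = "e\xCC\x81";

/* Hypothetical helper: returns 1 when the two strings are byte-identical,
 * which is all that strcmp can tell us. */
static int bytes_equal(const char *a, const char *b)
{
    return strcmp(a, b) == 0;
}
```

Calling bytes_equal(composed, decomposed) returns 0 even though both strings render as the same character, which is the situation a normalizing comparison is meant to handle.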
I think any time you are using human-readable text (names, subject lines, ...), you _should_ be using Unicode-sensitive functions rather than g_ascii_*. After all, ASCII covers a tiny subset of the world's languages. If you are parsing a config file or something, yes, then you should use g_ascii_*.

Looking over various documentation on the issue, one thing that comes to mind is that in many of the cases where people are currently using g_strdown(), the correct internationalized operation is to obtain a sort key [as with strxfrm] that ignores the differences you don't care about, which could be one or more of:
- normalization
- 3rd level differences (case)
- 2nd level differences (accents)

Using strdown() and then displaying the results to the user is usually a bad idea. More frequently it is a way of accelerating a strcasecmp(), or of providing a key to do a fast case-insensitive lookup in a hash table. This implies that while normalization (which should leave an identically displayed string) is a sensible operation to provide, simply not providing case folding might make sense. This seems to be (in a fairly quick look) what the ICU API provides.
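The sort-key idea mentioned above can be sketched with the standard C strxfrm function. This only illustrates the pattern (compute a key once, then compare keys with plain strcmp); which differences the key ignores depends entirely on the collation rules of the current locale, and in the default "C" locale it ignores nothing. The helper names make_sort_key and compare_by_key are hypothetical.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: returns a newly allocated sort key for str,
 * or NULL if allocation fails. The caller frees the result. */
static char *make_sort_key(const char *str)
{
    /* With a size of 0, strxfrm only reports the key length (without NUL). */
    size_t needed = strxfrm(NULL, str, 0);
    char *key = malloc(needed + 1);

    if (key == NULL)
        return NULL;
    strxfrm(key, str, needed + 1);
    return key;
}

/* Compares two strings through their keys. The sign of the result matches
 * strcoll(a, b) in the current locale. Returns 0 if a key could not be
 * allocated, so real code would need to report that case separately. */
static int compare_by_key(const char *a, const char *b)
{
    char *ka = make_sort_key(a);
    char *kb = make_sort_key(b);
    int result = 0;

    if (ka != NULL && kb != NULL)
        result = strcmp(ka, kb);
    free(ka);
    free(kb);
    return result;
}
```

In practice the key would be computed once per string and stored, for example as a hash table key, so that repeated comparisons avoid the cost of full collation.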
gchar *g_utf8_normalize (const gchar *str, gssize len, GNormalizeMode mode);
gchar *g_utf8_strup (const gchar *str, gssize len);
gchar *g_utf8_strdown (const gchar *str, gssize len);
gchar *g_utf8_casefold (const gchar *str, gssize len);

These provide a decent set of primitives for doing most operations that roughly correspond to strup, strdown, and strcasecmp in the non-internationalized case. The functions implement the algorithms from the corresponding Unicode technical reports (#15, #21).