Bug 55852 – Do we need anything between strcmp and g_utf8_strcoll for UTF-8?




Summary:	Do we need anything between strcmp and g_utf8_strcoll for UTF-8?


Status:	RESOLVED FIXED

Product:	glib
Classification:	Platform
Component:	general
Version:	1.3.x
Hardware:	Other Linux

Importance:	Normal normal
Target Milestone:	---
Assigned To:	gtkdev
QA Contact:	gtkdev

URL:
Whiteboard:

Depends on:
Blocks:

Reported:	2001-06-06 23:56 UTC by Darin Adler
Modified:	2011-02-18 15:47 UTC

See Also:
GNOME target:	---
GNOME version:	---

Description Darin Adler 2001-06-06 23:56:14 UTC

We've been discussing how it might be nice to have g_utf8_strcasecmp. But
the Unicode standard describes a number of levels between strcmp and full
"collation".

The report at http://unicode.org/unicode/reports/tr15/ describes
normalization, with 4 forms: D, C, KD, and KC. Form C is supposed to be the
rule used for URLs. This makes it clear that some applications that
formerly used strcmp to compare strings might want to compare UTF-8 strings
in a way that ignores differences that have to do with how the string was
typed and which would be invisible when the string was displayed. An
example where this might come up could be when checking if someone typed a
password correctly.
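
To make the normalization concern concrete, here is a minimal sketch using only the standard C library. The two constants and the helper name same_bytes are made up for illustration; they spell the same visible character, once precomposed (as form C would) and once decomposed (as form D would), and a byte-wise comparison treats them as different strings.

```c
#include <assert.h>
#include <string.h>

/* U+00E9 LATIN SMALL LETTER E WITH ACUTE, precomposed (form C spelling). */
static const char E_ACUTE_COMPOSED[] = "\xc3\xa9";

/* U+0065 followed by U+0301 COMBINING ACUTE ACCENT (form D spelling). */
static const char E_ACUTE_DECOMPOSED[] = "e\xcc\x81";

/* Hypothetical helper: byte-for-byte equality, which is all strcmp offers. */
int same_bytes(const char *a, const char *b)
{
    assert(a != NULL && b != NULL);
    return strcmp(a, b) == 0;
}
```

Calling same_bytes(E_ACUTE_COMPOSED, E_ACUTE_DECOMPOSED) returns 0 even though both strings would be displayed identically, which is the situation a normalizing comparison would be meant to handle.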

The report at http://unicode.org/unicode/reports/tr21/#Caseless%20Matching
says that caseless matching is done by case folding which "is more than
just conversion to lowercase". So a good implementation of a UTF-8
strcasecmp would not simply be based on conversion to lowercase. Sadly,
there are four flavors of case folding, roughly summarized as "simple
folding", "full folding", "simple folding handling dotted I", and "full
folding handling dotted I".

I could imagine adding a function that does form C normalization
(g_utf8_str_normalize?), another that does form C normalization and full
case folding (g_utf8_str_fold_case?) to be used where people might use
g_strdown today, another that does comparison with normalization
(g_utf8_strnormcmp?) to be used in some places where strcmp is used today,
another that does comparison with normalization and case folding
(g_utf8_strcasecmp?) to be used in some places where people use
g_strcasecmp today, and perhaps explicit calls to normalize with any of the
4 algorithms (g_utf8_str_normalize_full?) and case fold with any of the 4
algorithms (g_utf8_str_fold_case_full?).

An argument against doing any of this is that programs
should instead use g_ascii_strdown, g_ascii_strcasecmp, and g_utf8_strcoll.

This might also be a waste of time -- we could just wait until we see real
user problems and then go back and add these operations as needed to fix
those problems.

I hope having this bug report turns out to be useful. (I would have cc'd to
trow@ximian

Comment 1 Owen Taylor 2001-06-24 19:13:15 UTC

I think any time you are using human readable text (names,
subject lines, ...), you _should_ be using unicode-sensitive
functions rather than g_ascii_*. After all, ascii covers
a tiny subset of the world's languages. If you are parsing
a config file or something, yes, then you should use g_ascii_*.

Looking over various documentation on the issue, one thing
that comes to mind is that in many of the cases where people
are currently using g_strdown(), the correct internationalized
operation is to obtain a sort key [as with strxfrm] that
ignores the differences you don't care about, which could
be one or more of:

 - normalization
 - 3rd level differences (case)
 - 2nd level differences (accents)

Using strdown() and then displaying the results to the user
is usually a bad idea - rather, it is more frequently a
way of accelerating a strcasecmp(), or of providing a key
to do a fast case-insensitive lookup in a hash table.
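
As a rough illustration of the sort-key idea, the sketch below builds a key with the standard strxfrm. The helper name make_sort_key is hypothetical. Note the limitation: plain strxfrm produces a key that preserves the full ordering of the current LC_COLLATE locale and has no way to ask for a weaker key that drops case or accent differences, so this only shows the "compute a key once, compare keys cheaply" pattern, not the level selection discussed above.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical helper: returns a newly allocated collation key for s in
 * the current LC_COLLATE locale, or NULL if allocation fails.
 * The caller frees the result with free().
 */
char *make_sort_key(const char *s)
{
    assert(s != NULL);

    /* With a size of 0, strxfrm only reports the length it would need. */
    size_t needed = strxfrm(NULL, s, 0) + 1;

    char *key = malloc(needed);
    if (key == NULL)
        return NULL;

    strxfrm(key, s, needed);
    return key;
}
```

Two keys produced this way can be compared with strcmp, and the result has the same sign as strcoll on the original strings. That is what makes a precomputed key usable for repeated comparisons or as a lookup key.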

This implies that while normalization (which should leave
an identically displayed string) is a sensible operation
to provide, just skipping case folding might make sense.
This seems to be (in a fairly quick look) what the ICU API
provides.

Comment 2 Owen Taylor 2001-07-07 02:44:35 UTC

gchar *g_utf8_normalize (const gchar   *str,
			 gssize         len,
			 GNormalizeMode mode);

gchar *g_utf8_strup   (const gchar *str,
		       gssize       len);
gchar *g_utf8_strdown (const gchar *str,
		       gssize       len);
gchar *g_utf8_casefold (const gchar *str,
			gssize       len);

These provide a decent set of primitives for doing most
operations that roughly correspond to strup, strdown,
strcasecmp in the non-internationalized case.

The functions implement the algorithms from the 
corresponding unicode technical reports (#15, #21)