Bug 55836 – need locale-sensitive sorting for UTF-8 strings (g_utf8_strcoll?)

Summary:	need locale-sensitive sorting for UTF-8 strings (g_utf8_strcoll?)


Status:	RESOLVED FIXED

Product:	glib
Classification:	Platform
Component:	general
Version:	1.3.x
Hardware:	Other Linux

Importance:	Normal normal
Target Milestone:	---
Assigned To:	gtkdev
QA Contact:	gtkdev

URL:
Whiteboard:

Depends on:
Blocks:	55837

Reported:	2001-06-06 18:34 UTC by Darin Adler
Modified:	2011-02-18 15:47 UTC

See Also:
GNOME target:	---
GNOME version:	---

Attachments
Simple attempt at writing fallback strcoll() (1009 bytes, text/plain) 2001-06-18 20:30 UTC, Owen Taylor

Description Darin Adler 2001-06-06 18:34:18 UTC

Comment 1 Darin Adler 2001-06-06 18:37:20 UTC

There's code that needs to do strcasecmp on UTF-8 strings.

Comment 2 Darin Adler 2001-06-06 18:41:40 UTC

In fact, it seems that what we really need is the function that
compares UTF-8 strings properly for sorting, more like strcoll than
strcasecmp.

Comment 3 Owen Taylor 2001-06-06 19:05:51 UTC

Hard problem - the cheat is to convert to the encoding
of the locale, strcoll() and convert back. But this doesn't
give an ordering for strings that can't be represented
in the current locale.

Very recent versions of GNU libc on Linux have some functions
that allow locale operations in a non-current locale, so you
might be able to use them to do strcoll() in de_DE.UTF-8 even
if the current locale is de_DE.iso-8859-1.
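
A minimal sketch of collating in a non-current locale. It uses newlocale(), strcoll_l() and freelocale(), which POSIX.1-2008 standardized later; whether these correspond to the glibc functions referred to here is an assumption. The helper name collate_in_locale is hypothetical, and the code assumes a glibc-style system where <locale.h> declares locale_t.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L   /* for newlocale(), strcoll_l(), freelocale() */
#endif
#include <locale.h>
#include <string.h>

/*
 * Hypothetical helper: compare a and b with the collation rules of the
 * named locale, leaving the process-wide locale untouched.
 * Returns 0 and stores a strcoll()-style result in *result on success;
 * returns -1 if the locale object could not be created (for example,
 * because that locale is not installed).
 */
int collate_in_locale(const char *locale_name,
                      const char *a, const char *b, int *result)
{
    locale_t loc = newlocale(LC_COLLATE_MASK | LC_CTYPE_MASK,
                             locale_name, (locale_t)0);
    if (loc == (locale_t)0)
        return -1;

    *result = strcoll_l(a, b, loc);
    freelocale(loc);
    return 0;
}
```

A caller would pass something like "de_DE.UTF-8" and must handle the -1 case, since nothing guarantees that such a locale exists on the system.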

Or you could implement:
 
 http://www.unicode.org/unicode/reports/tr10/

Probably several weeks of work, not counting finding the
correct tailoring data for interesting locales.
 
g_utf8_strcmp() is just g_strcmp() - UTF-8 has the property that
comparing bytes gives the same order as comparing code points.
g_utf8_strcasecmp() could be done by g_unichar_tolower()
character by character.
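
The byte-order property can be checked with a small sketch in plain C. The function names utf8_encode and byte_order_matches_codepoint_order are hypothetical, and the encoder does no validation (surrogates and values above U+10FFFF are not rejected). Note that this is code point order, not the locale-sensitive order this bug asks for.

```c
#include <stdint.h>
#include <string.h>

/* Encode one code point as a NUL-terminated UTF-8 string.
 * out must have room for 5 bytes. No validation is performed. */
void utf8_encode(uint32_t cp, char out[5])
{
    if (cp < 0x80) {
        out[0] = (char)cp;
        out[1] = '\0';
    } else if (cp < 0x800) {
        out[0] = (char)(0xC0 | (cp >> 6));
        out[1] = (char)(0x80 | (cp & 0x3F));
        out[2] = '\0';
    } else if (cp < 0x10000) {
        out[0] = (char)(0xE0 | (cp >> 12));
        out[1] = (char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (char)(0x80 | (cp & 0x3F));
        out[3] = '\0';
    } else {
        out[0] = (char)(0xF0 | (cp >> 18));
        out[1] = (char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (char)(0x80 | (cp & 0x3F));
        out[4] = '\0';
    }
}

/* Returns 1 if strcmp() on the UTF-8 encodings orders the two code
 * points the same way as comparing their numeric values, else 0. */
int byte_order_matches_codepoint_order(uint32_t cp_a, uint32_t cp_b)
{
    char a[5], b[5];
    utf8_encode(cp_a, a);
    utf8_encode(cp_b, b);

    int by_bytes = strcmp(a, b);   /* compares as unsigned char */
    int by_value = (cp_a > cp_b) - (cp_a < cp_b);
    return ((by_bytes > 0) - (by_bytes < 0)) == by_value;
}
```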

Comment 4 Darin Adler 2001-06-06 19:24:37 UTC

It would be cruel to leave this as an exercise for the programmer. Programs that sort things should use a locale-sensitive sort like strcoll to make people in non-US countries happy. I think that coming up with a UTF-8 version of it is part of the price for switching to UTF-8. Maybe we can ship GNOME 2 without solving this problem. I don't know.

If the current locale is de_DE.iso-8859-1, you can switch to de_DE.UTF-8 and do the strcoll call and switch back. So functions allowing locale operations aren't necessarily required. But I don't see how you'd know when you need to use a locale other than the current one and what locale to use.

It would be nice to have g_utf8_strcasecmp, and it would be easy to code it, but I guess that doesn't really help with the problem.

Comment 5 Darin Adler 2001-06-06 19:26:12 UTC

(crappy web browser, sorry)

It would be cruel to leave this as an exercise for the programmer.
Programs that sort things should use a locale-sensitive sort like
strcoll to make people in non-US countries happy. I think that coming
up with a UTF-8 version of it is part of the price for switching to
UTF-8. Maybe we can ship GNOME 2 without solving this problem. I don't
know.

If the current locale is de_DE.iso-8859-1, you can switch to
de_DE.UTF-8 and do the strcoll call and switch back. So functions
allowing locale operations aren't necessarily required. But I don't
see how you'd know when you need to use a locale other than the
current one and what locale to use.

It would be nice to have g_utf8_strcasecmp, and it would be easy to
code it, but I guess that doesn't really help with the problem.

Comment 6 Owen Taylor 2001-06-06 19:31:08 UTC

Well, there is not even a guarantee that there will be a corresponding
UTF-8 locale on the system.

The problem with switching and switching back is that the locale
is application-wide not thread-wide. (g_strtod() is buggy in
this way.)

The lack of a UTF-8 strcoll for GLib-2.0 is something we've been
worrying about, but I don't see any easy resolution - maybe
we could do a hack: strcoll() if the strings are both convertible
to the current locale, string convertible < string non-convertible
always, strcmp() if the strings are both not convertible, and
call that g_utf8_strcoll() for now.
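
The three-case rule could be sketched roughly as follows. This is not the attached patch or the code that was eventually committed; it is an illustration that assumes POSIX iconv() and nl_langinfo(CODESET) are available. The names to_locale_charset and utf8_collate_fallback are hypothetical, and any lossy conversion is treated as "not convertible".

```c
#include <iconv.h>
#include <langinfo.h>
#include <stdlib.h>
#include <string.h>

/* Convert UTF-8 to the charset of the current locale.
 * Returns a malloc'd string, or NULL if the string cannot be
 * represented exactly (or on any other failure). */
static char *to_locale_charset(const char *utf8)
{
    iconv_t cd = iconv_open(nl_langinfo(CODESET), "UTF-8");
    if (cd == (iconv_t)-1)
        return NULL;

    size_t in_left = strlen(utf8);
    size_t out_left = 4 * in_left + 16;   /* generous for common charsets */
    char *out = malloc(out_left + 1);
    char *in_p = (char *)utf8;            /* iconv() does not modify the input */
    char *out_p = out;

    /* A non-zero return means an error or an irreversible substitution. */
    int ok = out != NULL
          && iconv(cd, &in_p, &in_left, &out_p, &out_left) == 0
          && iconv(cd, NULL, NULL, &out_p, &out_left) == 0;
    iconv_close(cd);

    if (!ok) {
        free(out);
        return NULL;
    }
    *out_p = '\0';
    return out;
}

/* strcoll()-style comparison of two UTF-8 strings using the rule above. */
int utf8_collate_fallback(const char *a, const char *b)
{
    char *la = to_locale_charset(a);
    char *lb = to_locale_charset(b);
    int result;

    if (la != NULL && lb != NULL)
        result = strcoll(la, lb);   /* both convertible: locale order */
    else if (la != NULL)
        result = -1;                /* convertible sorts before non-convertible */
    else if (lb != NULL)
        result = 1;
    else
        result = strcmp(a, b);      /* neither convertible: code point order */

    free(la);
    free(lb);
    return result;
}
```

Because the result stays strcoll()-style, the ordering can later be replaced by a better one without changing callers.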

Comment 7 Darin Adler 2001-06-06 23:00:18 UTC

I like your proposed simple hack to use the underlying strcoll. We can
easily improve it compatibly later on. Should I write the code to do
this and attach a patch, or is that a waste of time?

Comment 8 Owen Taylor 2001-06-18 20:30:48 UTC

Created attachment 658
Simple attempt at writing fallback strcoll()

Comment 9 Owen Taylor 2001-06-24 21:13:43 UTC

Another fallback technique for non-UTF-8 locales:

 if __STDC_ISO_10646__ is defined, convert to ucs4, then
 use wcscoll.
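
A rough sketch of that technique in plain C, under the assumption that the input is well-formed UTF-8 (the decoder does not validate). The names utf8_to_wcs and utf8_collate_wcs are hypothetical. Where __STDC_ISO_10646__ is not defined, the sketch simply falls back to strcmp().

```c
#include <stdlib.h>
#include <string.h>
#include <wchar.h>

#ifdef __STDC_ISO_10646__
/* Decode UTF-8 into a malloc'd, 0-terminated wchar_t string.
 * Assumes well-formed input; malformed bytes are decoded loosely. */
static wchar_t *utf8_to_wcs(const char *s)
{
    wchar_t *out = malloc((strlen(s) + 1) * sizeof *out);
    size_t j = 0;

    if (out == NULL)
        return NULL;

    for (const unsigned char *p = (const unsigned char *)s; *p != 0; ) {
        unsigned long cp;
        int extra;

        if (*p < 0x80)      { cp = *p;        extra = 0; }
        else if (*p < 0xE0) { cp = *p & 0x1F; extra = 1; }
        else if (*p < 0xF0) { cp = *p & 0x0F; extra = 2; }
        else                { cp = *p & 0x07; extra = 3; }
        p++;

        /* Fold in continuation bytes; stops early at NUL or a bad byte. */
        while (extra-- > 0 && (*p & 0xC0) == 0x80)
            cp = (cp << 6) | (*p++ & 0x3F);

        out[j++] = (wchar_t)cp;
    }
    out[j] = L'\0';
    return out;
}
#endif

/* Collate two UTF-8 strings through wcscoll() when wchar_t holds
 * ISO 10646 code points; otherwise compare bytes. */
int utf8_collate_wcs(const char *a, const char *b)
{
#ifdef __STDC_ISO_10646__
    wchar_t *wa = utf8_to_wcs(a);
    wchar_t *wb = utf8_to_wcs(b);
    int result = (wa != NULL && wb != NULL) ? wcscoll(wa, wb) : strcmp(a, b);

    free(wa);
    free(wb);
    return result;
#else
    return strcmp(a, b);
#endif
}
```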

Comment 10 Owen Taylor 2001-07-07 02:41:48 UTC

gint   g_utf8_collate     (const gchar *str1,
			   const gchar *str2);
gchar *g_utf8_collate_key (const gchar *str,
			   gssize       len);

Now committed.
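
The pair of signatures follows a familiar pattern: one function compares two strings directly, the other produces a per-string key so that many comparisons (as in a sort) avoid repeating the expensive work. As an analogy only, the sketch below shows the same pattern with the standard C strcoll()/strxfrm() pair rather than the GLib functions. It assumes that collation keys are meant to be compared with strcmp(), which this thread does not state. The names make_collate_key and sort_by_collate_key are hypothetical.

```c
#include <stdlib.h>
#include <string.h>

/* Build a malloc'd sort key for s in the current locale. strcmp() on two
 * keys orders them the way strcoll() orders the original strings.
 * Returns NULL on allocation failure. */
char *make_collate_key(const char *s)
{
    size_t need = strxfrm(NULL, s, 0) + 1;   /* length query, plus NUL */
    char *key = malloc(need);

    if (key != NULL)
        strxfrm(key, s, need);
    return key;
}

struct keyed {
    char *key;          /* owned, from make_collate_key() */
    const char *text;   /* borrowed, the original string */
};

static int compare_keyed(const void *pa, const void *pb)
{
    const struct keyed *a = pa;
    const struct keyed *b = pb;
    return strcmp(a->key, b->key);
}

/* Sort items in place, computing each key once.
 * Returns 0 on success, -1 on allocation failure (items unchanged). */
int sort_by_collate_key(const char **items, size_t n)
{
    if (n == 0)
        return 0;

    struct keyed *tmp = calloc(n, sizeof *tmp);
    if (tmp == NULL)
        return -1;

    int ok = 1;
    for (size_t i = 0; i < n; i++) {
        tmp[i].text = items[i];
        tmp[i].key = make_collate_key(items[i]);
        if (tmp[i].key == NULL)
            ok = 0;
    }

    if (ok) {
        qsort(tmp, n, sizeof *tmp, compare_keyed);
        for (size_t i = 0; i < n; i++)
            items[i] = tmp[i].text;
    }

    for (size_t i = 0; i < n; i++)
        free(tmp[i].key);
    free(tmp);
    return ok ? 0 : -1;
}
```

Sorting n strings this way transforms each string once instead of once per comparison.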