Bug 423036 – [META] normalize strings for sorting, searching, comparison, filenames, etc.


Summary:	[META] normalize strings for sorting, searching, comparison, filenames, etc.


Status:	RESOLVED OBSOLETE

Product:	glib
Classification:	Platform
Component:	general
Version:	unspecified
Hardware:	Other All

Importance:	Normal normal
Target Milestone:	---
Assigned To:	gtkdev
QA Contact:	gtkdev

URL:
Whiteboard:

Depends on:	419894 421064 421253 421486 421678 421736 423045 423237 423242 423244 423245 423247 423257 423258 423260 423261 423264 423265 423268 423271 423272 423274 423282 424429 424800 424851
Blocks:

Reported:	2007-03-26 19:23 UTC by Denis Jacquerye
Modified:	2018-05-24 11:00 UTC

See Also:
GNOME target:	---
GNOME version:	---

Description Denis Jacquerye 2007-03-26 19:23:17 UTC

This is a metabug: it is not a bug in glib itself, but tracks applications that use glib without doing "the right thing" with strings.

Unicode defines canonically equivalent sequences of characters.
For example, these are equivalent:
ẹ́ <U+0065 LATIN SMALL LETTER E + U+0323 COMBINING DOT BELOW + U+0301 COMBINING ACUTE ACCENT>
ẹ́ <U+0065 LATIN SMALL LETTER E + U+0301 COMBINING ACUTE ACCENT + U+0323 COMBINING DOT BELOW>
ẹ́ <U+1EB9 LATIN SMALL LETTER E WITH DOT BELOW + U+0301 COMBINING ACUTE ACCENT>
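
As a quick check of the equivalence above, here is a minimal sketch that uses Python's standard unicodedata module as a stand-in for g_utf8_normalize(). The variable names are illustrative only and do not come from glib.

```python
import unicodedata

# The three spellings listed above, written with explicit escapes.
dot_then_acute = "e\u0323\u0301"   # e + dot below + acute
acute_then_dot = "e\u0301\u0323"   # e + acute + dot below
precomposed = "\u1eb9\u0301"       # e-with-dot-below + acute

spellings = [dot_then_acute, acute_then_dot, precomposed]

# As raw code point sequences, all three differ...
distinct_raw = len(set(spellings))

# ...but each normalization form maps them to one single sequence.
nfd_forms = {unicodedata.normalize("NFD", s) for s in spellings}
nfc_forms = {unicodedata.normalize("NFC", s) for s in spellings}

print(distinct_raw, len(nfd_forms), len(nfc_forms))  # prints: 3 1 1
```

Either form works for comparison, provided the same form is applied to every string being compared.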

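For sorting, g_utf8_collate() should be used instead of strcmp(), since it compares strings according to the collation rules of the current locale rather than by byte value.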

For comparison, e.g. for matching strings in a search, g_utf8_normalize() should be used before strcmp(). Either G_NORMALIZE_DEFAULT (= G_NORMALIZE_NFD) or G_NORMALIZE_DEFAULT_COMPOSE (= G_NORMALIZE_NFC) will do, as long as both strings are normalized with the same mode.
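
A minimal sketch of normalize-then-compare, again using Python's unicodedata.normalize() in the role that g_utf8_normalize() plays in glib. The helper names (normalized, same_text, contains) are made up for this example.

```python
import unicodedata

def normalized(text, form="NFC"):
    """Return text in the given normalization form (NFC or NFD)."""
    return unicodedata.normalize(form, text)

def same_text(a, b):
    """Compare two strings up to canonical equivalence."""
    return normalized(a) == normalized(b)

def contains(haystack, needle):
    """Substring search that ignores differences in canonical spelling."""
    return normalized(needle) in normalized(haystack)

decomposed = "cafe\u0301"    # 'e' followed by U+0301 COMBINING ACUTE ACCENT
precomposed = "caf\u00e9"    # U+00E9 LATIN SMALL LETTER E WITH ACUTE

print(decomposed == precomposed)            # False: the code points differ
print(same_text(decomposed, precomposed))   # True
print(contains("un " + decomposed + " noir", precomposed))  # True
```

Note that the choice of form affects substring search: with NFD a bare "e" would match inside a decomposed "é", while with NFC it would not. Which behaviour is wanted depends on the application.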

Applications should also normalize names before creating files, i.e. canonically equivalent filenames should be treated as one and the same filename.
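
One way an application might check for such a clash before creating a file is sketched below. This is only an illustration under assumptions: find_equivalent is a hypothetical helper, and the list of existing names is hard-coded here where a real application would read the target directory.

```python
import unicodedata

def find_equivalent(existing_names, candidate):
    """Return an existing name canonically equivalent to candidate, or None."""
    wanted = unicodedata.normalize("NFC", candidate)
    for name in existing_names:
        if unicodedata.normalize("NFC", name) == wanted:
            return name
    return None

# Hypothetical directory listing; the first name uses the decomposed spelling.
existing = ["re\u0301sume\u0301.txt", "notes.txt"]

clash = find_equivalent(existing, "r\u00e9sum\u00e9.txt")   # precomposed spelling
no_clash = find_equivalent(existing, "report.txt")

print(clash is not None, no_clash is None)  # prints: True True
```

When a clash is found, the application can reuse the existing name instead of creating a second file that looks identical to the user.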

Remember that the user doesn't care about byte values or character sequences. Input methods might produce one sequence or another; applications should handle the rest.

Comment 1 Morten Welinder 2007-03-26 19:44:57 UTC

For regular expressions (i.e., the new gregex stuff), see

    http://unicode.org/unicode/reports/tr18/

which basically states that a regular expression engine is allowed to punt
this to the caller but that it must document what it does.
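
To illustrate what "punting to the caller" means in practice, here is a small sketch using Python's re module, which does not apply canonical equivalence on its own. It is not GRegex code; search_literal is a hypothetical helper, and it only handles literal patterns.

```python
import re
import unicodedata

def search_literal(literal, text):
    """Search for a literal string after normalizing both sides to NFC."""
    nfc_literal = unicodedata.normalize("NFC", literal)
    nfc_text = unicodedata.normalize("NFC", text)
    # re.escape() keeps the literal from being read as regex syntax.
    return re.search(re.escape(nfc_literal), nfc_text)

pattern = "caf\u00e9"      # precomposed
subject = "cafe\u0301"     # decomposed

raw_match = re.search(pattern, subject)          # engine alone: no match
caller_match = search_literal(pattern, subject)  # caller normalizes first

print(raw_match is None, caller_match is not None)  # prints: True True
```

Normalizing a full regular expression, rather than a literal, is harder, because character classes and quantifiers can end up applying to individual combining marks.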

Comment 2 Morten Welinder 2007-03-27 13:24:47 UTC

How is backspace supposed to work if the cursor is right behind a
combining pair?

Comment 3 Denis Jacquerye 2007-03-27 13:45:01 UTC

MW: I can't speak for all scripts, but for the Latin script backspace currently works consistently in GNOME/GTK apps: 'é' (precomposed), 'é' (with combining diacritics), and 'ɛ́' are all handled the same way, and backspace deletes everything up to and including the base character. Only Mozilla-based applications don't work that way.
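
The behaviour described above can be approximated as follows. This is a simplified sketch, not how GTK implements it: it treats any character with a non-zero canonical combining class as part of the preceding base character, and backspace is a hypothetical function name.

```python
import unicodedata

def backspace(text):
    """Delete the last base character together with its combining marks."""
    end = len(text)
    # Skip trailing combining marks (non-zero canonical combining class).
    while end > 0 and unicodedata.combining(text[end - 1]) != 0:
        end -= 1
    # Then drop the base character itself, if there is one.
    return text[:max(end - 1, 0)]

print(backspace("a\u00e9") == "a")        # precomposed é
print(backspace("ae\u0301") == "a")       # e + combining acute
print(backspace("a\u025b\u0301") == "a")  # ɛ + combining acute
```

All three calls print True, so the three spellings behave identically from the user's point of view.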

Comment 4 Michael Chudobiak 2007-03-27 17:04:09 UTC

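It should be mentioned that you have to pay close attention to gnome-vfs escaping issues as well. That is, g_utf8_normalize() has to be run on unescaped URIs, not escaped URIs: in an escaped URI the non-ASCII characters appear only as %XX byte sequences, so normalization has nothing to work on.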

- Mike
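
The ordering issue can be seen in a short sketch. It uses Python's urllib.parse.unquote() and unicodedata.normalize() as stand-ins for the gnome-vfs unescaping and g_utf8_normalize() calls; the paths and the helper name normalized_uri_path are invented for the example.

```python
import unicodedata
from urllib.parse import unquote

def normalized_uri_path(escaped):
    """Unescape %XX sequences first, then normalize the resulting text."""
    return unicodedata.normalize("NFC", unquote(escaped))

precomposed = "/home/user/caf%C3%A9.txt"    # U+00E9 as UTF-8 bytes C3 A9
decomposed = "/home/user/cafe%CC%81.txt"    # 'e' + U+0301 (UTF-8 bytes CC 81)

# Normalizing the escaped text changes nothing: it is plain ASCII.
wrong_order = (unicodedata.normalize("NFC", precomposed)
               == unicodedata.normalize("NFC", decomposed))

# Unescaping first lets normalization see the real characters.
right_order = normalized_uri_path(precomposed) == normalized_uri_path(decomposed)

print(wrong_order, right_order)  # prints: False True
```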

Comment 5 GNOME Infrastructure Team 2018-05-24 11:00:46 UTC

-- GitLab Migration Automatic Message --

This bug has been migrated to GNOME's GitLab instance and has been closed from further activity.

You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.gnome.org/GNOME/glib/issues/88.