GNOME Bugzilla – Bug 423036
[META] normalize strings for sorting, searching, comparison, filenames, etc.
Last modified: 2018-05-24 11:00:46 UTC
This is a metabug. It is not a glib bug as such, but rather covers applications that use glib without doing "the right thing" regarding strings.

Unicode defines canonically equivalent sequences of characters. For example, these are equivalent:
ẹ́ <U+0065 LATIN SMALL LETTER E + U+0323 COMBINING DOT BELOW + U+0301 COMBINING ACUTE ACCENT>
ẹ́ <U+0065 LATIN SMALL LETTER E + U+0301 COMBINING ACUTE ACCENT + U+0323 COMBINING DOT BELOW>
ẹ́ <U+1EB9 LATIN SMALL LETTER E WITH DOT BELOW + U+0301 COMBINING ACUTE ACCENT>

For sorting, g_utf8_collate() should be used instead of strcmp(). For comparison, e.g. for matching strings in a search, g_utf8_normalize() should be applied to both strings before strcmp(), with either G_NORMALIZE_DEFAULT (= G_NORMALIZE_NFD) or G_NORMALIZE_DEFAULT_COMPOSE (= G_NORMALIZE_NFC), as long as the same mode is used on both sides.

Applications should also normalize before creating files, i.e. Unicode-equivalent filenames should be treated as one and the same filename. Remember that the user doesn't care about byte values or character sequences. Input methods might produce one sequence or another; applications should handle the rest.
For regular expressions (i.e., the new gregex stuff), see http://unicode.org/unicode/reports/tr18/ which basically states that a regular expression engine is allowed to punt this to the caller but that it must document what it does.
How is backspace supposed to work if the cursor is right behind a combining pair?
MW: I can't speak for all scripts, but for Latin script backspace currently works consistently in GNOME/GTK apps: 'é' (precomposed), 'é' (with combining diacritics), and 'ɛ́' are all handled the same way, with backspace deleting everything up to and including the base character. Only Mozilla-based stuff doesn't work that way.
It should be mentioned that you have to pay close attention to gnome-vfs escaping issues as well. That is, g_utf8_normalize has to be run on unescaped URIs, not escaped URIs. - Mike
-- GitLab Migration Automatic Message -- This bug has been migrated to GNOME's GitLab instance and has been closed from further activity. You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.gnome.org/GNOME/glib/issues/88.