GNOME Bugzilla – Bug 790229
Function to remove accents and other diacritic marks from utf8 string, needed for text search
Last modified: 2017-11-13 16:02:09 UTC
Apps doing text search on UTF-8 strings usually want to compare Unicode strings while being insensitive to diacritical marks (e.g. accents on letters). GLib should provide a function for that. This is currently needed for the GtkAppChooserWidget search function, bug 745128. There is a function being copied around to remove accents from UTF-8 strings, see: https://git.gnome.org/browse/gnome-control-center/tree/panels/common/cc-util.c#n40. By providing a GLib function we help developers and avoid code duplication.
Created attachment 363420: unicode: provide new g_utf8_unaccent() function. The function removes from the passed string all combining characters that belong to the Unicode General Category of Nonspacing Mark. This includes characters such as accents, diacritics, Hebrew points, Arabic vowel signs and Indic matras. The function makes it possible to compare Unicode strings while being insensitive to diacritical marks (e.g. accents on letters), which is usually a needed feature when doing text search.
My patch above is based on Alexander's function: https://git.gnome.org/browse/gnome-control-center/tree/panels/common/cc-util.c#n40 but expanded to remove any combining character in the Unicode General Category of Nonspacing Mark, whereas Alexander's function only removes a few ranges which cover most or all diacritics for Latin/European languages. As a result, the patch covers all languages rather than only Latin/European ones.
GLib already provides g_str_tokenize_and_fold() for basically exactly this purpose: searching and filtering. Use that together with g_str_match_string() and that should do what you want, I think. In addition to what the patch here does, it performs case folding, tokenisation and prefix matching on tokens. If that doesn’t do what you need, please reopen this bug report.
Even g_str_match_string() alone could be enough for the purpose of searching strings while ignoring combining marks, I assume; I didn't test it.