Bug 790229 - Function to remove accents and other diacritic marks from utf8 string, needed for text search
Status: RESOLVED WONTFIX
Product: glib
Classification: Platform
Component: i18n
Version: 2.53.x
Hardware: Other Linux
Importance: Normal enhancement
Target Milestone: ---
Assigned To: gtkdev
Depends on:
Blocks: 745128
 
 
Reported: 2017-11-12 06:52 UTC by Nelson Benitez
Modified: 2017-11-13 16:02 UTC
See Also:
GNOME target: ---
GNOME version: ---


Attachments
unicode: provide new g_utf8_unaccent() function (4.22 KB, patch)
2017-11-12 07:04 UTC, Nelson Benitez

Description Nelson Benitez 2017-11-12 06:52:34 UTC
Apps doing text search on UTF-8 strings usually want to compare Unicode strings
while ignoring diacritical marks (e.g. accents on letters). GLib should provide a function for that.

This is currently needed for GtkAppChooserWidget search function, bug 745128.

There is a function being copied around to remove accents from utf8 strings, see:
https://git.gnome.org/browse/gnome-control-center/tree/panels/common/cc-util.c#n40

Providing a GLib function would help developers and avoid code duplication.
Comment 1 Nelson Benitez 2017-11-12 07:04:43 UTC
Created attachment 363420
unicode: provide new g_utf8_unaccent() function

The function removes from the passed string all combining characters
that belong to the Unicode General Category Nonspacing Mark.
This includes characters such as accents, diacritics, Hebrew
points, Arabic vowel signs and Indic matras.

This makes it possible to compare Unicode strings while being
insensitive to diacritical marks (e.g. accents on letters),
which is usually needed when doing text search.
Comment 2 Nelson Benitez 2017-11-12 07:26:22 UTC
My patch above is based on Alexander's function:
https://git.gnome.org/browse/gnome-control-center/tree/panels/common/cc-util.c#n40

but expanded to remove any combining character in the Unicode General Category Nonspacing Mark, whereas Alexander's function removes only a few ranges that cover most or all diacritics for Latin/European languages. The patch therefore covers all languages.
Comment 3 Philip Withnall 2017-11-13 10:39:32 UTC
GLib already provides g_str_tokenize_and_fold() for essentially this purpose: searching and filtering. Use that together with g_str_match() and that should do what you want, I think. In addition to what the patch here does, it performs case folding, tokenisation and prefix matching on tokens.

If that doesn’t do what you need, please reopen this bug report.
Comment 4 Aleksander Morgado 2017-11-13 16:02:09 UTC
Even g_str_match_string() alone could be enough for searching strings while ignoring combining marks, I assume; I haven't tested it.