GNOME Bugzilla – Bug 648587
Disregard accents when searching
Last modified: 2012-12-12 16:42:42 UTC
In Spanish, accents make searches harder than necessary. If I want to open my Music folder ("Música" in Spanish) and start typing "Mu..." without the accent, the Shell won't find it. Most people I know don't bother with accents, so from their point of view the Shell's search "won't work". Please, can it disregard accents the same way it disregards capitalization?
I'd like to stress to our beloved but English-speaking developers ;-) that this is very important for many languages, as people often don't bother typing accents (this is how Google works anyway).

This is generally tricky to do correctly. Empathy has some code to do this (which is planned to become GtkLiveSearch): http://git.gnome.org/browse/empathy/tree/libempathy-gtk/empathy-live-search.c The Shell could probably reuse strip_utf8_string(). The idea is to strip the keywords when indexing them, then strip the text typed by the user, and match the stripped forms of both.

For a description of the terrible mess this is (which explains why it's worth reusing an existing, tested implementation): http://blogs.gnome.org/xclaesse/2010/06/07/need-help-with-non-latin-alphabet/
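A rough sketch of the matching idea in GLib terms (illustrative only, not the Empathy code; an unaccenting pass such as strip_utf8_string() would also be applied inside fold_for_search(), right after the casefold):

#include <glib.h>

/* Sketch: pre-fold both the indexed keyword and the text the user typed,
 * then do a plain prefix match on the folded forms. An accent-stripping
 * step would slot in after the casefold below. */
static char *
fold_for_search (const char *text)
{
  char *normalized = g_utf8_normalize (text, -1, G_NORMALIZE_ALL);
  char *folded = g_utf8_casefold (normalized, -1);

  g_free (normalized);
  return folded;
}

static gboolean
key_matches (const char *candidate, const char *typed)
{
  char *folded_candidate = fold_for_search (candidate);
  char *folded_typed = fold_for_search (typed);
  gboolean match = g_str_has_prefix (folded_candidate, folded_typed);

  g_free (folded_candidate);
  g_free (folded_typed);
  return match;
}

With an unaccenting step in place, key_matches ("Música", "mu") would return TRUE, which is exactly the behaviour asked for in this bug.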
I really do miss that. In French, for example, it's the Videos folder ("Vidéos" in French) that has an accent.
*** Bug 657714 has been marked as a duplicate of this bug. ***
Created attachment 195466 [details] [review] Added util function that strips accents This is used by contact search and should be easily implemented by other search providers as well.
Created attachment 195510 [details] [review] Added util function that strips accents This is used by contact search and should be easily implemented by other search providers as well.
Have you looked at any prior art for this? For example, what does Lucene do? I totally admit our current search system is only slightly less bad than the worst implementation of search conceivable, but before we bolt on more to it, it's probably worth looking at other code out there. Or for something more in GNOME, what does Evolution's search do? What does tracker do?
This code is adapted from that in Empathy, which is taken from Hildon which took it from E-D-S.
Review of attachment 195510 [details] [review]:

::: src/shell-contact-system.c
@@ +104,3 @@
     {
       const char *term = iter->data;
+      normalized_terms = g_slist_prepend (normalized_terms, shell_util_strip_utf8_string (shell_util_normalize_and_casefold (term)));

This leaks memory. Doesn't the new function also do normalize_and_casefold() internally anyway?

::: src/shell-util.c
@@ +81,3 @@
+    default:
+      ch = g_unichar_tolower (ch);
+      decomp = g_unicode_canonical_decomposition (ch, &dlen);

This function appears to be deprecated in GLib git in favor of g_unichar_fully_decompose().

::: src/shell-util.h
@@ +10,3 @@
 G_BEGIN_DECLS

+char *shell_util_strip_utf8_string (const char *string);

The term "strip" here is overloaded - to me it means "remove whitespace", as in g_strstrip() or Python string.strip(). How about shell_util_normalize_string_for_search()?
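As a rough illustration of the decomposition point (this is not the attached patch, and the helper name is made up), the per-character step in shell-util.c could be rewritten against the non-deprecated call roughly like this:

#include <glib.h>

/* Illustrative sketch only: lowercase the character, then take the first
 * code point of its canonical decomposition, using
 * g_unichar_fully_decompose() instead of the deprecated
 * g_unicode_canonical_decomposition(). The Hangul caveat raised later in
 * this bug still applies to this "first code point only" approach. */
static gunichar
fold_char_first_decomposition (gunichar ch)
{
  gunichar decomp[G_UNICHAR_MAX_DECOMPOSITION_LENGTH];
  gsize dlen;

  ch = g_unichar_tolower (ch);
  dlen = g_unichar_fully_decompose (ch, FALSE /* canonical, not compat */,
                                    decomp, G_UNICHAR_MAX_DECOMPOSITION_LENGTH);
  return dlen > 0 ? decomp[0] : ch;
}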
(In reply to comment #7)
> This code is adapted from that in Empathy, which is taken from Hildon which
> took it from E-D-S.

Ok, so there's some history here =) Clearly someone put some nontrivial work into it in the past, and they deserve at least a copyright notice. Just something like:

/* Copied from evolution-data-server/foo.c under the LGPL
 * Originally written by Jane Doe <jdoe@example.com>
 */

(Whether we actually try to share this function in libgnome-desktop or something is a different topic)
The patch is not removing only the combining diacritical marks. That patch is really applying NFD normalization and grabbing just *the first* Unicode point generated in the decomposition, which is usually OK for most cases, but not always. It will break e.g. Korean Hangul decomposition (where a single Hangul code point representing a syllable is decomposed into multiple 'Jamo' code points representing letters).

A better approach would be to (see the sketch after this list):
 * Apply a compatibility decomposition to the whole string (NFKD normalization).
 * Remove all combining diacritical marks, that is, all Unicode points within the following ranges:
     Basic range: [U+0300,U+036F]
     Supplement:  [U+1DC0,U+1DFF]
     For Symbols: [U+20D0,U+20FF]
     Half marks:  [U+FE20,U+FE2F]

That is what we do in Tracker and it works pretty well.
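A minimal sketch of that approach in plain GLib (this is neither the Tracker code nor the attached patch; the function name and structure are illustrative only):

#include <glib.h>
#include <string.h>

/* Sketch: NFKD-normalize the input, then drop every code point that falls
 * in the combining-diacritical-mark ranges listed above.
 * Caller frees the returned string. */
static char *
strip_combining_marks (const char *text)
{
  char *nfkd = g_utf8_normalize (text, -1, G_NORMALIZE_NFKD);
  GString *out;
  const char *p;

  if (nfkd == NULL)
    return NULL;

  out = g_string_sized_new (strlen (nfkd));

  for (p = nfkd; *p != '\0'; p = g_utf8_next_char (p))
    {
      gunichar ch = g_utf8_get_char (p);

      if ((ch >= 0x0300 && ch <= 0x036F) ||  /* basic combining marks */
          (ch >= 0x1DC0 && ch <= 0x1DFF) ||  /* supplement */
          (ch >= 0x20D0 && ch <= 0x20FF) ||  /* marks for symbols */
          (ch >= 0xFE20 && ch <= 0xFE2F))    /* half marks */
        continue;

      g_string_append_unichar (out, ch);
    }

  g_free (nfkd);
  return g_string_free (out, FALSE);
}

With this, "Música" becomes "Musica", while Hangul input is merely decomposed into conjoining Jamo rather than truncated; as long as the typed key goes through the same function (plus g_utf8_casefold() for case insensitivity), both sides stay comparable.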
For reference, this is the GLib-based implementation of the Tracker unaccenting mechanism (now removed from git master, as we prefer to depend on either libunistring or libicu): http://git.gnome.org/browse/tracker/tree/src/libtracker-fts/tracker-parser-glib.c?h=tracker-0.10#n162
Created attachment 231389 [details] [review] Just skip combining diacritical marks in search operations This patch introduces the unaccenting method implemented in Tracker, which will just remove all combining diacritical marks from the string once it has been NFKD-normalized.
Created attachment 231390 [details] [review] Update to the previous patch; just removes a g_print()
Review of attachment 231389 [details] [review]:

The code is a bit tricky... but it makes sense to me. Please commit without the debugging g_print().
Review of attachment 231390 [details] [review]:

Looks good, thanks.
This problem has been fixed in the development version. The fix will be available in the next major software release. Thank you for your bug report.