GNOME Bugzilla – Bug 648587
Disregard accents when searching
Last modified: 2012-12-12 16:42:42 UTC
In Spanish, accents make searches harder than necessary. If I want to open my Music folder ("Música" in Spanish) and start typing "Mu..." without the accent, the Shell won't find it. Most people I know don't bother with accents, so from their point of view the Shell's search "won't work". Please, can it disregard accents the same way it disregards capitalization?
I'd like to stress to our beloved but English-speaking developers ;-) that this is very important for many languages, as people often don't bother typing accents (this is how Google works anyway).

This is generally tricky to do correctly. Empathy has some code to do this (which is planned to become GtkLiveSearch): http://git.gnome.org/browse/empathy/tree/libempathy-gtk/empathy-live-search.c The Shell could probably reuse strip_utf8_string(). The idea is to strip the keywords when indexing them, then strip the text typed by the user, and match the stripped forms of both.

For a description of the terrible mess this is (which explains why it's worth reusing an existing, tested implementation): http://blogs.gnome.org/xclaesse/2010/06/07/need-help-with-non-latin-alphabet/
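A rough sketch of the matching idea in GLib terms (illustrative only, not the Empathy code; an unaccenting pass such as strip_utf8_string() would also be applied inside fold_for_search(), right after the casefold):

#include <glib.h>

/* Sketch: pre-fold both the indexed keyword and the text the user typed,
 * then do a plain prefix match on the folded forms. An accent-stripping
 * step would slot in after the casefold below. */
static char *
fold_for_search (const char *text)
{
  char *normalized = g_utf8_normalize (text, -1, G_NORMALIZE_ALL);
  char *folded = g_utf8_casefold (normalized, -1);

  g_free (normalized);
  return folded;
}

static gboolean
key_matches (const char *candidate, const char *typed)
{
  char *folded_candidate = fold_for_search (candidate);
  char *folded_typed = fold_for_search (typed);
  gboolean match = g_str_has_prefix (folded_candidate, folded_typed);

  g_free (folded_candidate);
  g_free (folded_typed);
  return match;
}

With an unaccenting step in place, key_matches ("Música", "mu") would return TRUE, which is exactly the behaviour asked for in this bug.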
I really do miss that. In French, for example, it's the Videos folder ("Vidéos" in French) that has an accent.
*** Bug 657714 has been marked as a duplicate of this bug. ***
Created attachment 195466 [details] [review] Added util function that strips accents This is used by contact search and should be easily implemented by other search providers as well.
Created attachment 195510 [details] [review] Added util function that strips accents This is used by contact search and should be easily implemented by other search providers as well.
Have you looked at any prior art for this? For example, what does Lucene do? I totally admit our current search system is only slightly less bad than the worst implementation of search conceivable, but before we bolt on more to it, it's probably worth looking at other code out there. Or for something more in GNOME, what does Evolution's search do? What does tracker do?
This code is adapted from that in Empathy, which is taken from Hildon which took it from E-D-S.
Review of attachment 195510 [details] [review]:

::: src/shell-contact-system.c
@@ +104,3 @@
     {
       const char *term = iter->data;
+      normalized_terms = g_slist_prepend (normalized_terms, shell_util_strip_utf8_string (shell_util_normalize_and_casefold (term)));

This leaks memory. Doesn't the new function also do normalize_and_casefold() internally anyway?

::: src/shell-util.c
@@ +81,3 @@
+    default:
+      ch = g_unichar_tolower (ch);
+      decomp = g_unicode_canonical_decomposition (ch, &dlen);

This function appears to be deprecated in GLib git in favor of g_unichar_fully_decompose().

::: src/shell-util.h
@@ +10,3 @@
 G_BEGIN_DECLS

+char *shell_util_strip_utf8_string (const char *string);

The term "strip" here is overloaded - to me it means "remove whitespace", as in g_strstrip() or Python string.strip(). How about shell_util_normalize_string_for_search()?
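As a rough illustration of the decomposition point (this is not the attached patch, and the helper name is made up), the per-character step in shell-util.c could be rewritten against the non-deprecated call roughly like this:

#include <glib.h>

/* Illustrative sketch only: lowercase the character, then take the first
 * code point of its canonical decomposition, using
 * g_unichar_fully_decompose() instead of the deprecated
 * g_unicode_canonical_decomposition(). The Hangul caveat raised later in
 * this bug still applies to this "first code point only" approach. */
static gunichar
fold_char_first_decomposition (gunichar ch)
{
  gunichar decomp[G_UNICHAR_MAX_DECOMPOSITION_LENGTH];
  gsize dlen;

  ch = g_unichar_tolower (ch);
  dlen = g_unichar_fully_decompose (ch, FALSE /* canonical, not compat */,
                                    decomp, G_UNICHAR_MAX_DECOMPOSITION_LENGTH);
  return dlen > 0 ? decomp[0] : ch;
}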
(In reply to comment #7)
> This code is adapted from that in Empathy, which is taken from Hildon which
> took it from E-D-S.

Ok, so there's some history here =) Clearly someone put some nontrivial work into it in the past, and they deserve at least a copyright notice. Just something like:

/* Copied from evolution-data-server/foo.c under the LGPL
 * Originally written by Jane Doe <jdoe@example.com>
 */

(Whether we actually try to share this function in libgnome-desktop or something is a different topic)
The patch is not removing only the combining diacritical marks. That patch is really applying NFD normalization and grabbing just *the first* Unicode point generated in the decomposition, which is usually OK for most cases, but not always. It will break e.g. Korean Hangul decomposition (where a single Hangul code point representing a syllable is decomposed into multiple 'Jamo' code points representing letters).

A better approach would be to (see the sketch after this list):
 * Apply a compatibility decomposition to the whole string (NFKD normalization).
 * Remove all combining diacritical marks, that is, all Unicode points within the following ranges:
     Basic range: [U+0300,U+036F]
     Supplement:  [U+1DC0,U+1DFF]
     For Symbols: [U+20D0,U+20FF]
     Half marks:  [U+FE20,U+FE2F]

That is what we do in Tracker and it works pretty well.
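A minimal sketch of that approach in plain GLib (this is neither the Tracker code nor the attached patch; the function name and structure are illustrative only):

#include <glib.h>
#include <string.h>

/* Sketch: NFKD-normalize the input, then drop every code point that falls
 * in the combining-diacritical-mark ranges listed above.
 * Caller frees the returned string. */
static char *
strip_combining_marks (const char *text)
{
  char *nfkd = g_utf8_normalize (text, -1, G_NORMALIZE_NFKD);
  GString *out;
  const char *p;

  if (nfkd == NULL)
    return NULL;

  out = g_string_sized_new (strlen (nfkd));

  for (p = nfkd; *p != '\0'; p = g_utf8_next_char (p))
    {
      gunichar ch = g_utf8_get_char (p);

      if ((ch >= 0x0300 && ch <= 0x036F) ||  /* basic combining marks */
          (ch >= 0x1DC0 && ch <= 0x1DFF) ||  /* supplement */
          (ch >= 0x20D0 && ch <= 0x20FF) ||  /* marks for symbols */
          (ch >= 0xFE20 && ch <= 0xFE2F))    /* half marks */
        continue;

      g_string_append_unichar (out, ch);
    }

  g_free (nfkd);
  return g_string_free (out, FALSE);
}

With this, "Música" becomes "Musica", while Hangul input is merely decomposed into conjoining Jamo rather than truncated; as long as the typed key goes through the same function (plus g_utf8_casefold() for case insensitivity), both sides stay comparable.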
For reference, this is the GLib-based implementation of the Tracker unaccenting mechanism (now removed from git master, as we prefer to depend on either libunistring or libicu): http://git.gnome.org/browse/tracker/tree/src/libtracker-fts/tracker-parser-glib.c?h=tracker-0.10#n162
Created attachment 231389 [details] [review] Just skip combining diacritical marks in search operations This patch introduces the unaccenting method implemented in Tracker, which will just remove all combining diacritical marks from the string once it has been NFKD-normalized.
Created attachment 231390 [details] [review] Update to the previous patch; just removes a g_print()
Review of attachment 231389 [details] [review]:

The code is a bit tricky... but it makes sense to me. Please commit without the debugging g_print().
Review of attachment 231390 [details] [review]:

Looks good, thanks.
This problem has been fixed in the development version. The fix will be available in the next major software release. Thank you for your bug report.