GNOME Bugzilla – Bug 722246
Normalize, casefold and unaccent strings when searching
Last modified: 2014-12-24 05:46:09 UTC
Searching for "podminky" does not return a document with "podmínky" in its title. It should. We should do something like this: https://git.gnome.org/browse/gnome-control-center/tree/panels/common/cc-util.c#n38 Our searches are of the form: fn:contains (fn:lower-case (str1), str2.toLowerCase()). For starters, we should casefold (tracker:case-fold and GLib.utf8_casefold) instead of converting to lowercase. Conversation from #tracker: 09:19 <martyn> 2. so using tracker:case-fold actually ends up with g_utf8_casefold() being called on the string... where as fn:lower-case will call u16_tolower() for libunistring, or you_strToLower for libicu 09:19 <martyn> so basically, results may not match if you use fn:lower-case 09:19 <martyn> depending on how those backends decide to "lower" strings 09:20 <martyn> in terms of efficiency, I have no data to go on sadly, it's a battle between glib and (libicu / libunistring) 09:20 <rishi> So, casefolding looks better. Having consistent and reliable results is probably better than a bit of speed. 09:21 <martyn> using JavaScripts toLowerCase() function may behave differently again, I don't know 09:21 <rishi> Especially since other modules appear to be using g_utf8_casefold. 09:21 <martyn> yea
Created attachment 266344 [details] [review]
Casefold strings instead of lowercasing when searching
(In reply to comment #0)
> Searching for "podminky" does not return a document with "podmínky" in its
> title. It should. We should do something like this:
> https://git.gnome.org/browse/gnome-control-center/tree/panels/common/cc-util.c#n38

LoL, you see where the original code comes from, right? :)

We do unaccenting, casefolding and normalization in Tracker already, but for full-text-searches only. I guess we could re-use that same logic during non-FTS searches.
Created attachment 266430 [details] [review]
Casefold strings instead of lowercasing when searching

Fix minor whitespace issue.
Created attachment 266431 [details] [review]
search, utils: Unaccent and normalize strings when searching

This depends on the patches from bug 722254.
Review of attachment 266430 [details] [review]:

Looks good to me, nice cleanup.
Review of attachment 266431 [details] [review]:

::: src/lib/gd-utils.c
@@ +238,3 @@
 }

+/* Copied from tracker/src/libtracker-fts/tracker-parser-glib.c under LGPLv2+

Could this be made public in tracker maybe if we're going to use it across the board? See other comment though.

::: src/search.js
@@ +66,3 @@
     getTerms: function() {
         let escaped_str = Tracker.sparql_escape_string(this._string);
+        let str = GdPrivate.normalize_casefold_and_unaccent(escaped_str);

I wonder if we couldn't just use the newly introduced g_str_tokenize_and_fold() instead of this piece of code. That'd allow us to avoid copy-pasting the function from Tracker entirely.
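For context, the copied helper boils down to roughly the following. This is a minimal sketch in plain GLib, an illustration rather than the verbatim Tracker code: decompose to NFKD so that accents become separate combining marks, drop the marks, then casefold:

#include <glib.h>

static gchar *
normalize_casefold_and_unaccent_sketch (const gchar *str)
{
  gchar *decomposed, *folded;
  GString *filtered;
  const gchar *p;

  /* NFKD splits "í" into "i" followed by U+0301 COMBINING ACUTE ACCENT */
  decomposed = g_utf8_normalize (str, -1, G_NORMALIZE_NFKD);
  if (decomposed == NULL)
    return NULL;

  filtered = g_string_new (NULL);

  for (p = decomposed; *p != '\0'; p = g_utf8_next_char (p))
    {
      gunichar ch = g_utf8_get_char (p);

      /* Drop combining marks, i.e. the decomposed accents */
      if (!g_unichar_ismark (ch))
        g_string_append_unichar (filtered, ch);
    }

  g_free (decomposed);

  /* Casefold (not lowercase) for locale-independent matching */
  folded = g_utf8_casefold (filtered->str, -1);
  g_string_free (filtered, TRUE);

  return folded;
}

With this, both "podmínky" and "podminky" reduce to "podminky", which is exactly what the bug asks for.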
(In reply to comment #6)
> Review of attachment 266431 [details] [review]:
>
> ::: src/lib/gd-utils.c
> @@ +238,3 @@
> }
>
> +/* Copied from tracker/src/libtracker-fts/tracker-parser-glib.c under LGPLv2+
>
> Could this be made public in tracker maybe if we're going to use it across the
> board? See other comment though.

This code used to be in Tracker as part of the GLib "backend" for dealing with Unicode. That was deleted because it is now mandatory to use either the libicu or libunistring "backend", because "GLib unicode support has performance issues".

> ::: src/search.js
> @@ +66,3 @@
>     getTerms: function() {
>         let escaped_str = Tracker.sparql_escape_string(this._string);
>         let str = GdPrivate.normalize_casefold_and_unaccent(escaped_str);
>
> I wonder if we couldn't just use the newly introduced g_str_tokenize_and_fold()
> instead of this piece of code.

I will take a look at it. Thanks for the idea.
> I wonder if we couldn't just use the newly introduced g_str_tokenize_and_fold()
> instead of this piece of code. That'd allow us to avoid copy-pasting the
> function from Tracker entirely.

Note that the new g_str_tokenize_and_fold() is somewhat special w.r.t. matching words with accents. This new method decomposes each word and then tries to look for matches; the two examples given in the doc show the issue:

* Looking for "fred" matches both "Frederic" and "Fréderic"
* Looking for "fréd" matches "Fréderic" but not "Frederic"

But also, not explicitly said in the doc:

* Looking for "frederic" matches "Frederic" but not "Fréderic"
* Looking for "fréderic" matches "Fréderic" but not "Frederic"

So in order to match a word, your search string needs to either be exactly equal to the final word with accent, OR equal up to just before the accent once the words get decomposed.

In contrast, Tracker's original search logic just removes the accents (combining diacritical marks, to be more accurate) from both strings once they are decomposed, which completely avoids the issue. Following the example, with Tracker's logic we get:

* Looking for "fred" matches both "Frederic" and "Fréderic"
* Looking for "fréd" matches both "Frederic" and "Fréderic"
* Looking for "frederic" matches both "Frederic" and "Fréderic"
* Looking for "fréderic" matches both "Frederic" and "Fréderic"
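A small test program makes those rules easy to check. This is a sketch; the results noted in the comments follow the documented behavior and may vary across GLib versions:

#include <glib.h>

int
main (void)
{
  gchar **tokens, **alternates;
  gint i;

  /* "Fréderic" folds to the token "fréderic";
   * its ASCII alternate is "frederic". */
  tokens = g_str_tokenize_and_fold ("Fréderic", NULL, &alternates);

  for (i = 0; tokens[i] != NULL; i++)
    g_print ("token: %s\n", tokens[i]);
  for (i = 0; alternates[i] != NULL; i++)
    g_print ("alternate: %s\n", alternates[i]);

  /* g_str_match_string() wraps this prefix matching; per the doc,
   * "fred" matches "Fréderic" but "fréd" does not match "Frederic". */
  g_print ("%d\n", g_str_match_string ("fred", "Fréderic", TRUE));
  g_print ("%d\n", g_str_match_string ("fréd", "Frederic", TRUE));

  g_strfreev (tokens);
  g_strfreev (alternates);

  return 0;
}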
> This code used to be in Tracker as part of the GLib "backend" for dealing with
> Unicode. That was deleted because it is now mandatory to use either the libicu
> or libunistring "backend" because "GLib unicode support has performance
> issues".

To clarify this: we removed the GLib backend for Unicode support because *our logic and implementation* of the backend code was pretty bad. Before introducing the libicu and libunistring backends, we had a word-breaking algorithm using GLib for ASCII text (looking for non-alphanumeric characters to break the words) together with a Pango-based word-breaking algorithm for non-ASCII text. This logic was completely messy and inconsistent, especially with strings containing both ASCII and non-ASCII. With the libicu and libunistring backends we now use the Unicode word-breaking algorithm for all kinds of text, which doesn't break words by looking for non-alphanumeric characters (that's not good, and g_str_tokenize_and_fold() actually kind of does that). The Unicode logic doesn't look for places to break words; it looks for places where *not to* break words, based on a locale-dependent set of rules which work pretty well, even for Asian scripts.
(In reply to comment #8)
> Note that the new g_str_tokenize_and_fold() is somewhat special w.r.t.
> matching words with accents. This new method decomposes each word and then
> tries to look for matches [...]
>
> So in order to match a word, your search string needs to either be exactly
> equal to the final word with accent, OR equal up to just before the accent
> once the words get decomposed.

Aaand... you can forget this. g_str_tokenize_and_fold() does actually remove combining diacritical marks, as the Tracker code did. It also removes everything that is non-ASCII after the NFKD decomposition, but that's another thing. If targeting Latin scripts, g_str_tokenize_and_fold() should do the job.
I pushed this to master finally. I will open a bug against GLib to see if we can share this new function.

Attachment 266431 [details] pushed as df39e35 - search, utils: Unaccent and normalize strings when searching
Oh, I missed the last comment by Aleksander. I will push another cleanup to use the GLib function then.
After trying to change the code to use the GLib function, it turns out it's not really suitable for our use case. What we need here is the full set of unaccented/normalized tokens. g_str_tokenize_and_fold() exposes that information through its ascii_alternates out parameter, but when accented and unaccented words are mixed, ascii_alternates only contains the unaccented forms of the words that had accents, not the full set, and since nothing records which words the alternates were derived from, we cannot rebuild the full set ourselves.
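Concretely, the mismatch looks like this (a sketch; the printed values follow the documented behavior of the ascii_alternates out parameter):

#include <glib.h>

int
main (void)
{
  gchar **tokens, **alternates;
  gint i;

  /* Mixed accented and unaccented words: only the accented token
   * produces an ASCII alternate, and nothing links each alternate
   * back to the token it was derived from. */
  tokens = g_str_tokenize_and_fold ("fréderic chopin", NULL, &alternates);

  for (i = 0; tokens[i] != NULL; i++)
    g_print ("token: %s\n", tokens[i]);         /* "fréderic", "chopin" */
  for (i = 0; alternates[i] != NULL; i++)
    g_print ("alternate: %s\n", alternates[i]); /* only "frederic" */

  g_strfreev (tokens);
  g_strfreev (alternates);

  return 0;
}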
Found a way to use the GLib function after talking to Ryan on IRC. This also makes the behavior better when the user manually types in an accent, compared to the previous patch.
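For the record, here is one way the two arrays can be used together, as a hedged sketch of the idea rather than the actual commit: treat every folded token and every ASCII alternate as a search term, so the resulting filter matches whether or not the user typed the accent. collect_search_terms() is a hypothetical helper name:

#include <glib.h>

/* Hypothetical helper, not the actual gnome-documents code: merge the
 * folded tokens and their ASCII alternates into one flat list of
 * search terms for building the SPARQL filter. */
static GPtrArray *
collect_search_terms (const gchar *user_string)
{
  GPtrArray *terms = g_ptr_array_new_with_free_func (g_free);
  gchar **tokens, **alternates;
  gint i;

  tokens = g_str_tokenize_and_fold (user_string, NULL, &alternates);

  for (i = 0; tokens[i] != NULL; i++)
    g_ptr_array_add (terms, g_strdup (tokens[i]));
  for (i = 0; alternates[i] != NULL; i++)
    g_ptr_array_add (terms, g_strdup (alternates[i]));

  g_strfreev (tokens);
  g_strfreev (alternates);

  return terms;
}

Each collected term would then be OR-ed into the fn:contains() filter, so a document matches if any of the accented or unaccented forms is present.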