GNOME Bugzilla – Bug 722246
Normalize, casefold and unaccent strings when searching
Last modified: 2014-12-24 05:46:09 UTC
Searching for "podminky" does not return a document with "podmínky" in its title. It should. We should do something like this: https://git.gnome.org/browse/gnome-control-center/tree/panels/common/cc-util.c#n38 Our searches are of the form: fn:contains (fn:lower-case (str1), str2.toLowerCase()). For starters, we should casefold (tracker:case-fold and GLib.utf8_casefold) instead of converting to lowercase. Conversation from #tracker: 09:19 <martyn> 2. so using tracker:case-fold actually ends up with g_utf8_casefold() being called on the string... where as fn:lower-case will call u16_tolower() for libunistring, or you_strToLower for libicu 09:19 <martyn> so basically, results may not match if you use fn:lower-case 09:19 <martyn> depending on how those backends decide to "lower" strings 09:20 <martyn> in terms of efficiency, I have no data to go on sadly, it's a battle between glib and (libicu / libunistring) 09:20 <rishi> So, casefolding looks better. Having consistent and reliable results is probably better than a bit of speed. 09:21 <martyn> using JavaScripts toLowerCase() function may behave differently again, I don't know 09:21 <rishi> Especially since other modules appear to be using g_utf8_casefold. 09:21 <martyn> yea
Created attachment 266344 [details] [review]
Casefold strings instead of lowercasing when searching
(In reply to comment #0)
> Searching for "podminky" does not return a document with "podmínky" in its
> title. It should. We should do something like this:
> https://git.gnome.org/browse/gnome-control-center/tree/panels/common/cc-util.c#n38

LoL, you see where the original code comes from, right? :)

We do unaccenting, casefolding and normalization in Tracker already, but for full-text-searches only. I guess we could re-use that same logic during non-FTS searches.
Created attachment 266430 [details] [review]
Casefold strings instead of lowercasing when searching

Fix minor whitespace issue.
Created attachment 266431 [details] [review]
search, utils: Unaccent and normalize strings when searching

This depends on the patches from bug 722254.
Review of attachment 266430 [details] [review]:

Looks good to me, nice cleanup.
Review of attachment 266431 [details] [review]:

::: src/lib/gd-utils.c
@@ +238,3 @@
 }

+/* Copied from tracker/src/libtracker-fts/tracker-parser-glib.c under LGPLv2+

Could this be made public in tracker maybe if we're going to use it across the board? See other comment though.

::: src/search.js
@@ +66,3 @@
     getTerms: function() {
         let escaped_str = Tracker.sparql_escape_string(this._string);
+        let str = GdPrivate.normalize_casefold_and_unaccent(escaped_str);

I wonder if we couldn't just use the newly introduced g_str_tokenize_and_fold() instead of this piece of code. That'd allow us to avoid copy-pasting the function from Tracker entirely.
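For context, the copied helper boils down to roughly the following. This is a minimal sketch in plain GLib, an illustration rather than the verbatim Tracker code: decompose to NFKD so that accents become separate combining marks, drop the marks, then casefold:

#include <glib.h>

static gchar *
normalize_casefold_and_unaccent_sketch (const gchar *str)
{
  gchar *decomposed, *folded;
  GString *filtered;
  const gchar *p;

  /* NFKD splits "í" into "i" followed by U+0301 COMBINING ACUTE ACCENT */
  decomposed = g_utf8_normalize (str, -1, G_NORMALIZE_NFKD);
  if (decomposed == NULL)
    return NULL;

  filtered = g_string_new (NULL);

  for (p = decomposed; *p != '\0'; p = g_utf8_next_char (p))
    {
      gunichar ch = g_utf8_get_char (p);

      /* Drop combining marks, i.e. the decomposed accents */
      if (!g_unichar_ismark (ch))
        g_string_append_unichar (filtered, ch);
    }

  g_free (decomposed);

  /* Casefold (not lowercase) for locale-independent matching */
  folded = g_utf8_casefold (filtered->str, -1);
  g_string_free (filtered, TRUE);

  return folded;
}

With this, both "podmínky" and "podminky" reduce to "podminky", which is exactly what the bug asks for.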
(In reply to comment #6)
> Review of attachment 266431 [details] [review]:
>
> ::: src/lib/gd-utils.c
> @@ +238,3 @@
> }
>
> +/* Copied from tracker/src/libtracker-fts/tracker-parser-glib.c under LGPLv2+
>
> Could this be made public in tracker maybe if we're going to use it across the
> board? See other comment though.

This code used to be in Tracker as part of the GLib "backend" for dealing with Unicode. That was deleted because it is now mandatory to use either the libicu or libunistring "backend", because "GLib unicode support has performance issues".

> ::: src/search.js
> @@ +66,3 @@
>     getTerms: function() {
>         let escaped_str = Tracker.sparql_escape_string(this._string);
>         let str = GdPrivate.normalize_casefold_and_unaccent(escaped_str);
>
> I wonder if we couldn't just use the newly introduced g_str_tokenize_and_fold()
> instead of this piece of code.

I will take a look at it. Thanks for the idea.
> I wonder if we couldn't just use the newly introduced g_str_tokenize_and_fold()
> instead of this piece of code. That'd allow us to avoid copy-pasting the
> function from Tracker entirely.

Note that the new g_str_tokenize_and_fold() is somewhat special w.r.t. matching words with accents. This new method decomposes each word and then tries to look for matches; the two examples given in the doc show the issue:

* Looking for "fred" matches both "Frederic" and "Fréderic"
* Looking for "fréd" matches "Fréderic" but not "Frederic"

But also, not explicitly said in the doc:

* Looking for "frederic" matches "Frederic" but not "Fréderic"
* Looking for "fréderic" matches "Fréderic" but not "Frederic"

So in order to match a word, your search string needs to either be exactly equal to the final word with accent, OR equal up to just before the accent once the words get decomposed.

In contrast, Tracker's original search logic just removes the accents (combining diacritical marks, to be more accurate) from both strings once they are decomposed, which completely avoids the issue. Following the example, with Tracker's logic we get:

* Looking for "fred" matches both "Frederic" and "Fréderic"
* Looking for "fréd" matches both "Frederic" and "Fréderic"
* Looking for "frederic" matches both "Frederic" and "Fréderic"
* Looking for "fréderic" matches both "Frederic" and "Fréderic"
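A small test program makes those rules easy to check. This is a sketch; the results noted in the comments follow the documented behavior and may vary across GLib versions:

#include <glib.h>

int
main (void)
{
  gchar **tokens, **alternates;
  gint i;

  /* "Fréderic" folds to the token "fréderic";
   * its ASCII alternate is "frederic". */
  tokens = g_str_tokenize_and_fold ("Fréderic", NULL, &alternates);

  for (i = 0; tokens[i] != NULL; i++)
    g_print ("token: %s\n", tokens[i]);
  for (i = 0; alternates[i] != NULL; i++)
    g_print ("alternate: %s\n", alternates[i]);

  /* g_str_match_string() wraps this prefix matching; per the doc,
   * "fred" matches "Fréderic" but "fréd" does not match "Frederic". */
  g_print ("%d\n", g_str_match_string ("fred", "Fréderic", TRUE));
  g_print ("%d\n", g_str_match_string ("fréd", "Frederic", TRUE));

  g_strfreev (tokens);
  g_strfreev (alternates);

  return 0;
}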
> This code used to be in Tracker as part of the GLib "backend" for dealing with
> Unicode. That was deleted because it is now mandatory to use either the libicu
> or libunistring "backend" because "GLib unicode support has performance
> issues".

To clarify this: we removed the GLib backend for Unicode support because *our logic and implementation* of the backend code was pretty bad. Before introducing the libicu and libunistring backends, we had a word-breaking algorithm using GLib for ASCII text (looking for non-alphanumeric characters to break the words) together with a Pango-based word-breaking algorithm for non-ASCII text. This logic was completely messy and inconsistent, especially with strings containing both ASCII and non-ASCII. With the libicu and libunistring backends we now use the Unicode word-breaking algorithm for all kinds of text, which doesn't break words by looking for non-alphanumeric characters (that's not good, and g_str_tokenize_and_fold() actually kind of does that). The Unicode logic doesn't look for places to break words; it looks for places where *not to* break words, based on a locale-dependent set of rules which work pretty well, even for Asian scripts.
(In reply to comment #8)
> Note that the new g_str_tokenize_and_fold() is somewhat special w.r.t.
> matching words with accents. This new method decomposes each word and then
> tries to look for matches [...]
>
> So in order to match a word, your search string needs to either be exactly
> equal to the final word with accent, OR equal up to just before the accent
> once the words get decomposed.

Aaand... you can forget this. g_str_tokenize_and_fold() does actually remove combining diacritical marks, as the Tracker code did. It also removes everything that is non-ASCII after the NFKD decomposition, but that's another thing. If targeting Latin scripts, g_str_tokenize_and_fold() should do the job.
I pushed this to master finally. I will open a bug against GLib to see if we can share this new function.

Attachment 266431 [details] pushed as df39e35 - search, utils: Unaccent and normalize strings when searching
Oh, I missed the last comment by Aleksander. I will push another cleanup to use the GLib function then.
After trying to change the code to use the GLib function, it turns out it's not really suitable for our use case. What we need here is the full set of unaccented/normalized tokens. g_str_tokenize_and_fold() exposes that information through its ascii_alternates out parameter, but when accented and unaccented words are mixed, ascii_alternates only contains the unaccented forms of the words that had accents, not the full set, and since nothing records which words the alternates were derived from, we cannot rebuild the full set ourselves.
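Concretely, the mismatch looks like this (a sketch; the printed values follow the documented behavior of the ascii_alternates out parameter):

#include <glib.h>

int
main (void)
{
  gchar **tokens, **alternates;
  gint i;

  /* Mixed accented and unaccented words: only the accented token
   * produces an ASCII alternate, and nothing links each alternate
   * back to the token it was derived from. */
  tokens = g_str_tokenize_and_fold ("fréderic chopin", NULL, &alternates);

  for (i = 0; tokens[i] != NULL; i++)
    g_print ("token: %s\n", tokens[i]);         /* "fréderic", "chopin" */
  for (i = 0; alternates[i] != NULL; i++)
    g_print ("alternate: %s\n", alternates[i]); /* only "frederic" */

  g_strfreev (tokens);
  g_strfreev (alternates);

  return 0;
}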
Found a way to use the GLib function after talking to Ryan on IRC. This also makes the behavior better when the user manually types in an accent, compared to the previous patch.
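For the record, here is one way the two arrays can be used together, as a hedged sketch of the idea rather than the actual commit: treat every folded token and every ASCII alternate as a search term, so the resulting filter matches whether or not the user typed the accent. collect_search_terms() is a hypothetical helper name:

#include <glib.h>

/* Hypothetical helper, not the actual gnome-documents code: merge the
 * folded tokens and their ASCII alternates into one flat list of
 * search terms for building the SPARQL filter. */
static GPtrArray *
collect_search_terms (const gchar *user_string)
{
  GPtrArray *terms = g_ptr_array_new_with_free_func (g_free);
  gchar **tokens, **alternates;
  gint i;

  tokens = g_str_tokenize_and_fold (user_string, NULL, &alternates);

  for (i = 0; tokens[i] != NULL; i++)
    g_ptr_array_add (terms, g_strdup (tokens[i]));
  for (i = 0; alternates[i] != NULL; i++)
    g_ptr_array_add (terms, g_strdup (alternates[i]));

  g_strfreev (tokens);
  g_strfreev (alternates);

  return terms;
}

Each collected term would then be OR-ed into the fn:contains() filter, so a document matches if any of the accented or unaccented forms is present.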