GNOME Bugzilla – Bug 713179
Allow user to override stemming language for search tokenizer
Last modified: 2020-09-13 14:28:10 UTC
---- Reported by chaz@yorba.org 2013-05-24 16:05:00 -0700 ----

Original Redmine bug id: 6999
Original URL: http://redmine.yorba.org/issues/6999
Searchable id: yorba-bug-6999
Original author: Charles Lindsay

Original description:

We detect the user's preferred languages based on locale, and use that to select an appropriate stemming algorithm (i.e. language for stemming) for the search tokenizer. However, we can't be right 100% of the time, and the user should have an option to tell us what stemming language they want to use.

Unfortunately, changing the stemming algorithm means rebuilding the entire search table, so we'll want to have a solid UI in place for search table upgrades, and we should maybe warn the user that it might take some time.

Related issues:
related to geary - 6956: choose correct search tokenizer stemmer for locale (Fixed)
related to geary - 7504: Highlighting problem with stemming (Fixed)

---- Additional Comments From geary-maint@gnome.bugs 2013-09-17 12:02:00 -0700 ----

### History

#### #1 Updated by Adam Dingle 2 months ago

Personally I find the stemming annoying. At the moment if I search for "eventful" I see matches for "event". If I wanted that, I would have typed it. If we do keep the stemming, I'd like some way to disable it, e.g. surrounding a term in quotes to search for it exactly.

#### #2 Updated by Jim Nelson 2 months ago

I find the stemmer to be too aggressive as well, Adam. However, the stemmer cannot be turned on or off depending on the search query due to SQLite's construction. The stemmer is used to stem both the indexed text as well as the query text in order for both to match. In other words, in the index:

'event' -> 'event'
'eventful' -> 'event'

If we use the stemmer for indexing but turn it off for a query, "eventful" will match nothing.

This is the trade-off we discussed internally when designing search: to stem or not to stem. On one hand, it seems too aggressive when "eventful" yields a hit for every use of the word "event".
On the other hand, it's nice when things like tense and plurals don't constrict search results.

Gmail appears to use a less aggressive stemmer than we have -- "considers" will find "consider" but not "considered", and "considered" will not find "consider". Thunderbird's stemmer seems to be remarkably like ours; "considers" matches "consider", "considered", and "considering".

It might be worthwhile searching for a more refined stemming algorithm, but I'm not sure I'm ready to rip out what we have quite yet.

--- Bug imported by chaz@yorba.org 2013-11-21 20:19 UTC ---

This bug was previously known as _bug_ 6999 at http://redmine.yorba.org/show_bug.cgi?id=6999

Unknown version " in product geary. Setting version to "!unspecified".
Unknown milestone "unknown in product geary. Setting to default milestone for this product, "---".
Setting qa contact to the default for this product. This bug either had no qa contact or an invalid one.
Resolution set on an open status. Dropping resolution
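Comment #2's point, that the same stemmer has to run over both the indexed text and the query, can be illustrated with a minimal Python sketch. It is a toy model only: the `STEMS` table, `build_index`, and `search` are hypothetical names made up for illustration, and nothing here reflects Geary's or SQLite's actual code.

```python
# Toy illustration only: a hand-written stem table stands in for the real
# stemmer. These mappings are assumptions taken from the example in comment #2.
STEMS = {"event": "event", "eventful": "event", "events": "event"}


def stem(word):
    """Return the stem from the toy table, or the lowercased word if unknown."""
    w = word.lower()
    return STEMS.get(w, w)


def build_index(docs):
    """Map each stemmed token to the set of document ids that contain it."""
    index = {}
    for doc_id, text in docs.items():
        for token in text.split():
            index.setdefault(stem(token), set()).add(doc_id)
    return index


def search(index, term, stem_query=True):
    """Look up a single term, optionally skipping the stemmer for the query."""
    key = stem(term) if stem_query else term.lower()
    return index.get(key, set())


docs = {1: "an eventful week", 2: "the main event"}
index = build_index(docs)

print(search(index, "eventful"))                    # stemmed query: both documents
print(search(index, "eventful", stem_query=False))  # unstemmed query: nothing
```

Because only the stem "event" is stored, an unstemmed query for "eventful" has no key to hit, which is the trade-off described above.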
Here's another example: I just searched for "party" and it matched a message containing the word "particular". This should definitely not match. Should we create a separate bug for improving the default English stemming algorithm? That seems different from this bug's ostensible subject ("allow user to specify stemming algorithm").
Are you searching for "party" with the quotes, or party without quotes?
Sorry - that was unclear. There were no quotes in my search. I should have written that I searched for [party] (as we used to at Google to avoid this particular ambiguity).
And the follow-up question: if you add quotes, does it still find the email with particular?
Aha - it does not. Good to know I can use quotes as a workaround here.
Ah, great! I suspect what's going on is "party" probably stems to an initial substring of what "particular" stems to. Since we turn unquoted strings into prefix matches, you're matching both. Using quotes turns off the prefix matching, so you have to match the whole (stemmed) word. With event/eventful, they apparently stem to the same string, so quotes do no good there. I think in the future if we discover a bug in the stemming algorithm, it would be a separate issue from letting the user select the stemming algorithm or disable it. Anyway, glad I could at least offer a workaround in the short term.
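The explanation above (unquoted terms become prefix matches on the stemmed word, quoted terms must match the whole stemmed word) can be sketched in a few lines of Python. The stems in `STEMS` are assumptions chosen to mirror what the comment suspects; the real stemmer's output may differ, and `matches` is a hypothetical helper, not Geary's query code.

```python
# Hypothetical stems mirroring the suspicion in the comment above; the real
# stemmer may produce different strings.
STEMS = {
    "party": "parti",
    "particular": "particular",
    "event": "event",
    "eventful": "event",
}


def stem(word):
    """Return the stem from the toy table, or the lowercased word if unknown."""
    w = word.lower()
    return STEMS.get(w, w)


def matches(query, indexed_word):
    """Quoted queries need the whole stemmed word; unquoted ones match by prefix."""
    quoted = len(query) >= 2 and query[0] == '"' and query[-1] == '"'
    term = stem(query.strip('"'))
    target = stem(indexed_word)
    return target == term if quoted else target.startswith(term)


print(matches('party', 'particular'))    # True: "parti" is a prefix of "particular"
print(matches('"party"', 'particular'))  # False: whole stemmed words differ
print(matches('"eventful"', 'event'))    # True: both stem to "event"
```

Under these assumed stems, quoting helps for party/particular but not for event/eventful, matching what was observed in the thread.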
I've ticketed improving the stemmer at bug #720361.
According to this wiki page: https://wiki.gnome.org/Apps/Geary/FullTextSearchStrategy the user is allowed to specify the stemming algorithm via GSettings. But this should be documented in the user manual (at least mention this possibility and link to the above wiki page). Mike, should we set this bug as a Documentation bug?
Re-reading the original description, this seems to mostly cover choosing the stemmer language, and (maybe?) providing a UI for the ensuing database upgrade. So I'm inclined to keep this open, since that is different from the broader stemmer algorithm, for which we now do have a setting. I've updated the subject to better reflect this. I'd be pretty happy to make this a requires-restart option tbh, since we probably want the accounts to be closed for the duration anyway. This would allow re-using the existing DB upgrade infrastructure and its UI. If nothing else, if the setting changes while running, we could pop up a dialog saying a restart is needed.
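As a rough sketch of the requires-restart idea in the comment above, the decision could look like the following Python. Everything here is hypothetical (the function names, the returned action strings, and the idea of recording the language the index was built with); it is not taken from Geary and may not resemble the fix that was eventually merged.

```python
def needs_reindex(configured_lang, indexed_lang):
    """True when the configured stemmer language differs from the one the
    search table was built with (both values are assumed to be plain names)."""
    return configured_lang.strip().lower() != indexed_lang.strip().lower()


def action_for_setting(configured_lang, indexed_lang, app_running):
    """Pick one of three hypothetical actions for a stemmer-language setting."""
    if not needs_reindex(configured_lang, indexed_lang):
        return "none"
    # While running, only tell the user a restart is needed; the rebuild
    # itself would happen at startup, when accounts are still closed.
    return "prompt-restart" if app_running else "rebuild-index"


print(action_for_setting("english", "english", app_running=True))
print(action_for_setting("german", "english", app_running=True))
print(action_for_setting("german", "english", app_running=False))
```

Deferring the rebuild to startup is what would let it reuse an existing upgrade path, as the comment suggests.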
Fix for this landing in https://gitlab.gnome.org/GNOME/geary/-/merge_requests/580