Bug 713179 - Allow user to override stemming language for search tokenizer
Status: RESOLVED FIXED
Product: geary
Classification: Other
Component: engine
Version: unspecified
Hardware: Other All
Importance: Normal enhancement
Target Milestone: ---
Assigned To: Geary Maintainers
QA Contact: Geary Maintainers
Depends on:
Blocks:
 
 
Reported: 2013-05-24 11:05 UTC by Charles Lindsay
Modified: 2020-09-13 14:28 UTC
See Also:
GNOME target: ---
GNOME version: ---



Description Charles Lindsay 2013-11-21 20:18:59 UTC


---- Reported by chaz@yorba.org 2013-05-24 16:05:00 -0700 ----

Original Redmine bug id: 6999
Original URL: http://redmine.yorba.org/issues/6999
Searchable id: yorba-bug-6999
Original author: Charles Lindsay
Original description:

We detect the user's preferred languages based on locale, and use that to
select an appropriate stemming algorithm (i.e. language for stemming) for the
search tokenizer. However, we can't be right 100% of the time, and the user
should have an option to tell us what stemming language they want to use.
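A minimal sketch of the selection logic described above, assuming a hypothetical language-to-stemmer table and an optional user override. The names `STEMMER_FOR_LANGUAGE`, `language_code`, and `choose_stemmer`, and the English fallback, are illustrative and not taken from Geary's code.

```python
from typing import Optional, Sequence

# Hypothetical, illustrative subset: ISO 639-1 language code -> stemmer language.
STEMMER_FOR_LANGUAGE = {
    "da": "danish",
    "de": "german",
    "en": "english",
    "es": "spanish",
    "fr": "french",
    "it": "italian",
}


def language_code(locale_name: str) -> str:
    """Extract the language part of a POSIX locale name such as 'de_DE.UTF-8'."""
    for separator in ("_", ".", "@"):
        locale_name = locale_name.split(separator, 1)[0]
    return locale_name.lower()


def choose_stemmer(
    preferred_locales: Sequence[str],
    override: Optional[str] = None,
    fallback: str = "english",  # assumption: fall back to English
) -> str:
    """Pick a stemmer language, letting an explicit user choice win."""
    if override is not None:
        if override not in STEMMER_FOR_LANGUAGE.values():
            raise ValueError(f"unsupported stemmer language: {override!r}")
        return override
    # Otherwise use the first preferred locale we have a stemmer for.
    for locale_name in preferred_locales:
        stemmer = STEMMER_FOR_LANGUAGE.get(language_code(locale_name))
        if stemmer is not None:
            return stemmer
    return fallback


print(choose_stemmer(["de_DE.UTF-8", "en_US"]))        # locale-based guess
print(choose_stemmer(["de_DE.UTF-8"], override="french"))  # user override wins
```

The override is checked first so that a wrong locale-based guess can always be corrected by the user.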

Unfortunately, changing the stemming algorithm means rebuilding the entire
search table, so we'll want to have a solid UI in place for search table
upgrades, and we should maybe warn the user that it might take some time.
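One way to detect that a rebuild is needed is to record the stemmer language next to the index and compare it on startup. The sketch below assumes a plain SQLite token table rather than Geary's actual search schema; the `meta` and `tokens` tables and the `ensure_index` helper are hypothetical.

```python
import sqlite3
from typing import Callable, Dict


def ensure_index(
    db: sqlite3.Connection,
    messages: Dict[int, str],
    stemmer_language: str,
    stem: Callable[[str], str],
) -> bool:
    """Rebuild the token table if it was built with a different stemmer.

    Returns True when a rebuild happened, False when the index was current.
    """
    db.execute("CREATE TABLE IF NOT EXISTS meta (key TEXT PRIMARY KEY, value TEXT)")
    db.execute("CREATE TABLE IF NOT EXISTS tokens (token TEXT, message_id INTEGER)")

    row = db.execute("SELECT value FROM meta WHERE key = 'stemmer'").fetchone()
    if row is not None and row[0] == stemmer_language:
        return False  # index already matches the configured language

    # Every stored token depends on the stemmer, so the whole table is redone.
    with db:  # one transaction: commit on success, roll back on error
        db.execute("DELETE FROM tokens")
        db.executemany(
            "INSERT INTO tokens VALUES (?, ?)",
            [
                (stem(word), message_id)
                for message_id, text in messages.items()
                for word in text.split()
            ],
        )
        db.execute(
            "INSERT OR REPLACE INTO meta VALUES ('stemmer', ?)",
            (stemmer_language,),
        )
    return True


db = sqlite3.connect(":memory:")
messages = {1: "an eventful week"}
# str.lower stands in for a real stemming function here.
print(ensure_index(db, messages, "english", str.lower))  # first run builds the index
print(ensure_index(db, messages, "english", str.lower))  # unchanged language, no rebuild
```

The return value gives the caller a hook for warning the user before or during a long rebuild.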

Related issues:
related to geary - 6956: choose correct search tokenizer stemmer for
locale (Fixed)
related to geary - 7504: Highlighting problem with stemming (Fixed)



---- Additional Comments From geary-maint@gnome.bugs 2013-09-17 12:02:00 -0700 ----

### History

#### #1

Updated by Adam Dingle 2 months ago

Personally I find the stemming annoying. At the moment if I search for
"eventful" I see matches for "event". If I wanted that, I would have typed it.
If we do keep the stemming, I'd like some way to disable it, e.g. surrounding
a term in quotes to search for it exactly.

#### #2

Updated by Jim Nelson 2 months ago

I find the stemmer to be too aggressive as well, Adam. However, the stemmer
cannot be turned on or off depending on the search query due to SQLite's
construction. The stemmer is used to stem both the indexed text as well as the
query text in order for both to match. In other words, in the index:

'event' -> 'event'

'eventful' -> 'event'

If we use the stemmer for indexing but turn it off for a query, "eventful"
will match nothing.
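The mismatch can be reproduced with a toy inverted index. This is only a sketch: the stems are hard-coded to mirror the example above, and none of it reflects how SQLite's tokenizer is actually implemented.

```python
from collections import defaultdict
from typing import Dict, List, Set

# Hard-coded stems for illustration only, mirroring the example above.
TOY_STEMS = {"event": "event", "eventful": "event"}


def stem(word: str) -> str:
    """Toy stemmer: look the word up, otherwise return it lowercased."""
    word = word.lower()
    return TOY_STEMS.get(word, word)


def build_index(docs: Dict[int, str]) -> Dict[str, Set[int]]:
    """Index every document under the stemmed form of each word."""
    index: Dict[str, Set[int]] = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.split():
            index[stem(word)].add(doc_id)
    return index


def search(index: Dict[str, Set[int]], term: str, stem_query: bool) -> List[int]:
    """Look up a single term, optionally stemming it first."""
    key = stem(term) if stem_query else term.lower()
    return sorted(index.get(key, set()))


docs = {1: "an eventful week", 2: "the main event"}
index = build_index(docs)

print(search(index, "eventful", stem_query=True))   # [1, 2]
print(search(index, "eventful", stem_query=False))  # []
```

Because only the stem "event" is stored, an unstemmed query for "eventful" has nothing to match against, even in the document that contains that exact word.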

This is the trade-off we discussed internally when designing search: to stem
or not to stem. On one hand, it seems too aggressive when "eventful" yields
a hit for every use of the word "event". On the other hand, it's nice when
things like tense and plurals don't constrict search results.

Gmail appears to use a less aggressive stemmer than we have -- "considers"
will find "consider" but not "considered", and "considered" will not find
"consider". Thunderbird's stemmer seems to be remarkably like ours;
"considers" matches "consider", "considered", and "considering".

It might be worthwhile searching for a more refined stemming algorithm, but
I'm not sure I'm ready to rip out what we have quite yet.



--- Bug imported by chaz@yorba.org 2013-11-21 20:19 UTC  ---

This bug was previously known as _bug_ 6999 at http://redmine.yorba.org/show_bug.cgi?id=6999

Unknown version " in product geary. 
   Setting version to "!unspecified".
Unknown milestone "unknown in product geary. 
   Setting to default milestone for this product, "---".
Setting qa contact to the default for this product.
   This bug either had no qa contact or an invalid one.
Resolution set on an open status.
   Dropping resolution 

Comment 1 Adam Dingle 2013-12-12 21:53:36 UTC
Here's another example: I just searched for "party" and it matched a message containing the word "particular".  This should definitely not match.

Should we create a separate bug for improving the default English stemming algorithm?  That seems different from this bug's ostensible subject ("allow user to specify stemming algorithm").
Comment 2 Charles Lindsay 2013-12-12 22:10:05 UTC
Are you searching for "party" with the quotes, or party without quotes?
Comment 3 Adam Dingle 2013-12-12 22:11:22 UTC
Sorry - that was unclear.  There were no quotes in my search.  I should have written that I searched for [party] (as we used to at Google to avoid this particular ambiguity).
Comment 4 Charles Lindsay 2013-12-12 22:13:36 UTC
And the follow-up question: if you add quotes, does it still find the email with particular?
Comment 5 Adam Dingle 2013-12-12 22:17:15 UTC
Aha - it does not.  Good to know I can use quotes as a workaround here.
Comment 6 Charles Lindsay 2013-12-12 22:23:25 UTC
Ah, great!  I suspect what's going on is "party" probably stems to an initial substring of what "particular" stems to.  Since we turn unquoted strings into prefix matches, you're matching both.  Using quotes turns off the prefix matching, so you have to match the whole (stemmed) word.

With event/eventful, they apparently stem to the same string, so quotes do no good there.
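A small sketch of that behaviour, assuming the stems guessed at above ("party" stemming to a prefix of what "particular" stems to, and "event"/"eventful" sharing a stem). The stem table and the `matches` helper are hypothetical and hard-coded for illustration.

```python
# Assumed stems, hard-coded for illustration; a real stemmer may differ.
ASSUMED_STEMS = {
    "party": "parti",
    "particular": "particular",
    "event": "event",
    "eventful": "event",
}


def stem(word: str) -> str:
    """Toy stemmer: look the word up, otherwise return it lowercased."""
    word = word.lower()
    return ASSUMED_STEMS.get(word, word)


def matches(query: str, indexed_word: str) -> bool:
    """Unquoted queries are prefix matches; quoted queries match the whole stem."""
    quoted = len(query) >= 2 and query[0] == query[-1] == '"'
    term = stem(query.strip('"'))
    token = stem(indexed_word)
    return token == term if quoted else token.startswith(term)


print(matches('party', 'particular'))    # True: "parti" is a prefix of "particular"
print(matches('"party"', 'particular'))  # False: whole stems differ
print(matches('"eventful"', 'event'))    # True: both stem to "event"
```

Quoting only disables the prefix step; the stemming itself still applies, which is why it helps for party/particular but not for event/eventful.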

I think in the future if we discover a bug in the stemming algorithm, it would be a separate issue from letting the user select the stemming algorithm or disable it.

Anyway, glad I could at least offer a workaround in the short term.
Comment 7 Jim Nelson 2013-12-12 22:41:06 UTC
I've ticketed improving the stemmer at bug #720361.
Comment 8 Federico Bruni 2017-12-02 10:39:57 UTC
According to this wiki page:
https://wiki.gnome.org/Apps/Geary/FullTextSearchStrategy

the user is allowed to specify the stemming algorithm via GSettings. But this should be documented in the user manual (at least mention this possibility and link to the above wiki page).

Mike, should we set this bug as a Documentation bug?
Comment 9 Michael Gratton 2017-12-04 01:44:56 UTC
Re-reading the original description, this seems to mostly cover choosing the stemmer language, and (maybe?) providing a UI for the ensuing database upgrade. So I'm inclined to keep this open, since that is different from the broader stemmer algorithm, for which we now do have a setting. I've updated the subject to better reflect this.

I'd be pretty happy to make this a requires-restart option tbh, since we probably want the accounts to be closed for the duration anyway. This would allow re-using the existing DB upgrade infrastructure and UI for it. If nothing else, if the setting changes while running, we could pop up a dialog saying a restart is needed.
Comment 10 Michael Gratton 2020-09-13 14:28:10 UTC
Fix for this landing in https://gitlab.gnome.org/GNOME/geary/-/merge_requests/580