GNOME Bugzilla – Bug 720361
Improved search experience
Last modified: 2014-12-17 19:55:21 UTC
As discussed in bug #713179, the stemming algorithm Geary uses for search seems too aggressive:

party -> particular
eventful -> event (including "EventMask")

I'm certain we could come up with other examples, but the broader question here is how to make the search algorithm more intuitive without sacrificing stemming and its benefits. As pointed out in the other ticket, this can be worked around to a degree by quoting the term, but we shouldn't expect the user to know this when searching.

Some possibilities:

(1) Improve the stemmer algorithm. A big task, and of course, there are multiple stemmers for various languages. Even if we limited this job to English, it's not a trivial task, and worse, if done wrong it could cause issues elsewhere.

(2) Search for an improved algorithm. If it hasn't been ported to an SQLite module, we'd presumably have to do that as well.

(3) Tweak our search parameters. Right now Geary uses prefix matching for all terms. It may be that we can tweak how we're searching, or score results to give preference to exact matches over near matches, or some other heuristic to dampen the aggressiveness of the current stemmer.
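The "score results" idea in (3) could be sketched roughly as follows. This is a toy Python illustration, not Geary's code: the `TOY_STEMS` table, the `score` and `rank` helpers, and the 3/2/1 weights are all made up for the example, and words missing from the table are assumed to stem to themselves.

```python
# Toy ranking sketch: exact matches outrank stem matches, which outrank
# prefix-only matches. Stem table and weights are illustrative assumptions.
TOY_STEMS = {
    "party": "parti",
    "parties": "parti",
    "parting": "part",
}

def toy_stem(word):
    """Look up a hard-coded stem; unknown words stem to themselves."""
    word = word.lower()
    return TOY_STEMS.get(word, word)

def score(query, word):
    """3 = exact match, 2 = same stem, 1 = prefix hit on the stem, 0 = no match."""
    q, w = query.lower(), word.lower()
    if q == w:
        return 3
    if toy_stem(q) == toy_stem(w):
        return 2
    if toy_stem(w).startswith(toy_stem(q)):
        return 1
    return 0

def rank(query, words):
    """Drop non-matches and sort the rest best-first (ties keep input order)."""
    hits = [w for w in words if score(query, w) > 0]
    return sorted(hits, key=lambda w: score(query, w), reverse=True)

print(rank("party", ["particular", "parties", "party", "parting"]))
```

With this ordering a near match such as "particular" is still returned, but it sorts below "party" and "parties" rather than competing with them.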
I'm leaving this on the table for 0.6, although there's a lot of discussion and investigation to be done before we can consider implementation.
This ticket is conflating a number of issues. Let me try to clear up some confusing points here:

> party -> particular

This has nothing to do with the stemmer. The problem here is our aggressive use of prefix matching. Party stems to "parti" (that's what http://www.ittc.ku.edu/~bluo/eecs767sp10/stemmer.php says, although we're using the updated Porter stemmer, not the original). Particular stems to itself (at least in the original algorithm). The only reason it matches is that we turn unquoted words into prefix matches, which means it looks for "parti*" and finds particular. Only (3) would help this case.

I suppose you could make a case for making "party" not stem to "parti", but that's by the stemmer's design and you'd break other cases if you changed it: "parties" and "partying" also stem to "parti", whereas "parting" stems to just "part", for example.

> eventful -> event (including "EventMask")

"Eventful" and "event" both stem to "event". This is a case where (1) or (2) might help. "EventMask" stems to "EventMask". Again, it's only matching because we turn unquoted words into prefix matches, and thus (3) is the only thing that would help here.

If the idea is to improve the general search experience, we should make the title reflect that, not call out the stemming algorithm in particular, which doesn't seem to be the biggest issue here.

I suspect that we've only scratched the surface of some corner cases we'd like to fix. Without a vast amount of resources (real-world email corpus, real-world searches, a correlation of expected results for various search inputs, etc.) I don't think we'll be able to get very far, and my suspicion is that for every "fix" we implement, we may lessen the search experience in other ways. I'm not saying the search is perfect by any means, but I am saying I don't think we have the resources to attack most of the remaining issues in any meaningful way.
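To make the stemmer/prefix distinction concrete, here is a toy Python sketch of how stemming followed by prefix matching produces these hits. It does not use Geary's code or SQLite. The `STEMS` table is hard-coded from the values quoted in this comment; lowercasing tokens, treating unlisted words as their own stem, and treating a quoted term as a plain exact comparison are simplifying assumptions of the sketch.

```python
# Stems as quoted in this ticket. Anything not listed stems to itself,
# which is an assumption of this sketch, not real Porter stemmer output.
STEMS = {
    "party": "parti",
    "parties": "parti",
    "partying": "parti",
    "parting": "part",
    "eventful": "event",
}

def stem(word):
    """Lowercase the word and return its hard-coded stem."""
    word = word.lower()
    return STEMS.get(word, word)

def matches(term, indexed_word, quoted=False):
    """Unquoted terms are searched as stem + '*'; quoted terms must match exactly."""
    if quoted:
        return term.lower() == indexed_word.lower()
    return stem(indexed_word).startswith(stem(term))

print(matches("party", "particular"))               # prefix hit on "parti*"
print(matches("party", "particular", quoted=True))  # no hit when quoted
print(matches("eventful", "EventMask"))             # prefix hit on "event*"
```

In this model "party" never matches "parting", because "parting" stems to "part", which does not begin with "parti". The unexpected hits come from the trailing wildcard, not from the stems themselves.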
I've updated the title to better reflect my intent for filing this ticket.

Regarding party/particular, I would say that has *something* to do with the stemmer since, if the stemmer was not in the mix, "particular" wouldn't match at all. But do we fix the stemmer, fix how we're using it, or (somehow) correct the results we're getting from it? That's what I'm trying to suss out here. Are we satisfied with the current results? If not, what could we do to improve them?

Regarding EventMask, it was something I noticed when I attempted to reproduce the problem for this ticket and thought it was worth noting.

I included the three bullet points above not to suggest they stand on equal footing as lines of attack, but to flesh out the problem as I understand it. Ultimately I believe we need to improve the search experience. I lean toward #3 at this moment. I do not believe #1 is something to undertake lightly, and #2 I doubt exists unless it's truly cutting-edge work, and then I would question adopting it.
> if the stemmer was not in the mix, "particular" wouldn't match at all

It's worth noting that it does match as you're typing party until you hit the y, stemmer or no.

> But do we fix the stemmer

My point is that the only case for the stemmer returning unexpected results I've seen is the event/eventful situation. In other words, I don't think there's a compelling case to be made that the stemmer itself needs fixing.

> or (somehow) correct the results we're getting from it?

Do you mean the results as in what the stemmer returns for a given input, or what we're actually pulling out of SQLite? Again, I want to make sure we're talking about the same thing. In this case, I suspect you don't actually mean the stemmer, but the black box that is the current SQLite search table.

> I included the three bullet points above not to suggest they stand on equal
> footing as lines of attack, but to flesh out the problem as I understand it.

My point was that for any number of individual examples we can enumerate, I suspect any fix we attempted would break other cases, unless we have the data and test cases to be sure it doesn't, in practice. I don't see anything indicating that the way we're doing search is fundamentally wrong. In other words, I think given our current resources, putting up with a small handful of unexpected matches is far preferable to special-casing ourselves to death.
I've pushed a new search algorithm that attempts to curb overstemming via a configurable heuristic. If the user wants to control how search works, they can change this heuristic in GSettings. (There are no plans to make this available via the UI.)

More information can be found at:
https://wiki.gnome.org/Apps/Geary/FullTextSearchStrategy

Pushed to master, commit 533ab75
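The actual rules are documented on the wiki page above and are not reproduced here. Purely as an illustration of what a configurable anti-overstemming heuristic can look like, the Python sketch below falls back to the literal term whenever stemming would strip too much of it. The `StemPolicy` class, the `effective_term` helper, the parameter names `min_term_length` and `max_stem_shrink`, and their default values are all hypothetical and not taken from Geary.

```python
from dataclasses import dataclass

@dataclass
class StemPolicy:
    # Both knobs are hypothetical stand-ins for a user-configurable setting.
    min_term_length: int = 4  # terms shorter than this are never stemmed
    max_stem_shrink: int = 2  # most characters stemming is allowed to remove

def effective_term(term, stem, policy):
    """Return the stem if the policy allows it; otherwise keep the term as typed."""
    if len(term) < policy.min_term_length:
        return term
    if len(term) - len(stem) > policy.max_stem_shrink:
        return term
    return stem

default = StemPolicy()
print(effective_term("eventful", "event", default))  # stem removes 3 chars, so the term is kept
print(effective_term("parties", "parti", default))   # stem removes 2 chars, so the stem is used
```

Setting `max_stem_shrink=0` in this sketch turns stemming off entirely, while larger values accept progressively more aggressive stems.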
Jim, thanks so much - this is great! I appreciate both the flexible options for users and the detailed documentation. I'll experiment with various settings - I'll almost certainly use EXACT or CONSERVATIVE, but am not sure which.
The nice thing about this approach is we have some flexibility with tweaking parameters. If you have suggestions, I'm all ears.