GNOME Bugzilla – Bug 353534
Suggestions in Beagle
Last modified: 2018-07-03 09:52:30 UTC
As I saw that somebody found my suggestions code, Lukas, here it is. ;)

Background: I did a quick and dirty port of the Spell Checker code found in Lucene SVN for another project, and hooked it up to Beagle. While this is nice to have, the way it's currently hooked into Beagle's indexing isn't exactly a masterpiece. Here are some things somebody needs to fix before we can use it:
1) The n-grams we generate come from the stemmed tokens in the index. This also applies to the word we try to match against, which will produce some weird suggestions if your word gets stemmed.
2) Updating the n-gram index takes a long time. We need to profile the spell checker and see if there is a problem.
3) One n-gram index is created per "ordinary index". This is so far from optimal it's not even funny, in terms of speed, disk usage and suggestion relevancy. When we move to having a single index (or multiple indexes shared between the backends) we'll hopefully be able to solve this better.
4) I had to revert to indexing a batch of documents in RAM and merging it to disk after the whole batch is done, which causes the index to be optimized every time we get an indexing request.

That's it. Go to work, I'm off to school. ;)
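For anyone who wants the general shape of the approach without reading the patch: the attachment is a C# port, but the upstream Java contrib code it derives from is used roughly as in the sketch below. The index paths and the "Text" field name are placeholders, not Beagle's actual layout; this is only meant to show why the n-grams are built from already-stemmed terms (issue 1 above).

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.spell.LuceneDictionary;
import org.apache.lucene.search.spell.SpellChecker;
import org.apache.lucene.store.FSDirectory;

public class BuildSpellingIndex {
    public static void main(String[] args) throws Exception {
        // Paths are illustrative only.
        FSDirectory mainIndex = FSDirectory.getDirectory("/path/to/MainIndex", false);
        FSDirectory spellIndex = FSDirectory.getDirectory("/path/to/SpellingIndex", true);

        IndexReader reader = IndexReader.open(mainIndex);
        SpellChecker checker = new SpellChecker(spellIndex);

        // The dictionary is just the set of terms in one field of the main
        // index. Because that field was analyzed (and therefore stemmed) at
        // index time, the n-grams are built from stems, not surface words.
        checker.indexDictionary(new LuceneDictionary(reader, "Text"));

        String[] suggestions = checker.suggestSimilar("asembly", 5);
        for (String s : suggestions)
            System.out.println(s);   // likely prints stems such as "assembl"

        reader.close();
    }
}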
Created attachment 71895 [details] [review] Patches to Beagle
Created attachment 71896 [details] Port of SpellChecker from Lucene SVN
First things first: DAMN YAY AWESOME!

Issues:
1) Static indexes and this aren't playing well, and the code doesn't create the needed SpellingIndex directory unless the index is wiped first. We should make this all automagical and transparent (i.e. flush the index for the user).
2) Any chance we can make the suggestions more apparent? They aren't immediately noticeable, and more importantly, we should try to make them 'clickable'.

I'll add more once I get everything indexed :)
1) Yeah, there are two options: either bump the main Lucene index version, which will trigger a full reindex for all backends, or detect when there is no spelling index and regenerate it from the on-disk index the first time.
2) Of course, but I feel we need to address the issues I raised in the first comment before anything else.
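A minimal sketch of the second option, again against the upstream Java API: if the SpellingIndex directory is missing (e.g. on an index created before this feature existed), rebuild it once from the terms already on disk instead of forcing a full reindex. The directory names and the "Text" field are assumptions, not Beagle's actual layout.

import java.io.File;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.spell.LuceneDictionary;
import org.apache.lucene.search.spell.SpellChecker;
import org.apache.lucene.store.FSDirectory;

public class SpellingIndexCheck {
    // Rebuild the spelling index lazily if it does not exist yet.
    static void ensureSpellingIndex(File backendDir) throws Exception {
        File spellDir = new File(backendDir, "SpellingIndex");
        if (spellDir.exists())
            return;                       // already built, nothing to do

        IndexReader reader = IndexReader.open(
                FSDirectory.getDirectory(new File(backendDir, "PrimaryIndex"), false));
        SpellChecker checker = new SpellChecker(
                FSDirectory.getDirectory(spellDir, true));
        checker.indexDictionary(new LuceneDictionary(reader, "Text"));
        reader.close();
    }
}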
Created attachment 72008 [details] Screenshot for those who haven't seen it
After the 1.9 Lucene.Net merges, will this still work?
Can we just use a FuzzyTermEnum to enumerate all similar terms in the index? That would be faster and would require no extra index. The one problem I don't quite see how to solve (with both the FuzzyTermEnum and SpellChecker approaches) is how to report the correct word when the query is stemmed. I wrote a small FuzzyTermEnum program, and if I feed it "asembly" it tells me "assembl" is the correct word in the index. I have no clue how to add the 'y' back at the end.
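For reference, the kind of small test program described here looks roughly like this in the Java Lucene API of that era; the index path and field name are placeholders.

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.FuzzyTermEnum;

public class FuzzyTest {
    public static void main(String[] args) throws Exception {
        IndexReader reader = IndexReader.open("/path/to/PrimaryIndex");
        // Enumerates all indexed terms within the default edit-distance
        // similarity of the (possibly misspelled) query word.
        FuzzyTermEnum fuzzy = new FuzzyTermEnum(reader, new Term("Text", "asembly"));
        do {
            Term t = fuzzy.term();
            if (t == null)
                break;
            System.out.println(t.text());   // prints stems like "assembl"
        } while (fuzzy.next());
        fuzzy.close();
        reader.close();
    }
}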
Yeah, doing a fuzzy term enum is probably the best way to go about this, considering that building an n-gram index will be prohibitively expensive. To recover the suffixes that get stemmed off, we'd probably have to store the suffix somewhere and reattach it when reporting the suggestion to the user, or perhaps use the TextCache to pull a matching word.
Fuzzy term enum is extremely simple (5 lines of code). But I am not sure about the method of "unstemming" the word. The stemmed word could have a totally different suffix than one that fits the best match. The best match could be a name or some abbreviation ... anything. It might not even be a valid word! Pulling from the text cache will not work; it's too large to grep through.
You would probably need to modify the analyzer to get at the unstemmed version of the word. The potential for invalid words is definitely there, but since both would stem to the same thing anyway, there's no real way for us to tell; it would be interesting to see how often this happens in practice. As for the text cache, it seems reasonable to me. Snippeting works by grepping the text cache; this would be similar, except that there would likely be more than one (but fewer than all) documents to peruse.
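A sketch of the "pull a matching word from the text cache" idea: scan the cached plain text of a hit document for a surface word that analyzes (i.e. stems) to the suggested term, using the same Analyzer the index uses. The plain-file reading below stands in for Beagle's actual TextCache access, and "Text" is a placeholder field name.

import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import java.io.IOException;
import java.io.StringReader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenStream;

public class Unstem {
    // Run a single word through the analyzer and return its first token
    // (its stemmed form), or the word itself if it produces no token.
    static String analyzeOneWord(Analyzer analyzer, String word) throws IOException {
        TokenStream ts = analyzer.tokenStream("Text", new StringReader(word));
        Token t = ts.next();
        ts.close();
        return t == null ? word : t.termText();
    }

    // Return the first surface form in the cached text whose stem matches
    // the suggested term, e.g. "assembly" for "assembl".
    static String surfaceFormFor(Analyzer analyzer, String suggestedStem,
                                 File cachedText) throws IOException {
        BufferedReader in = new BufferedReader(new FileReader(cachedText));
        try {
            String line;
            while ((line = in.readLine()) != null)
                for (String word : line.split("\\W+"))
                    if (analyzeOneWord(analyzer, word).equals(suggestedStem))
                        return word;
        } finally {
            in.close();
        }
        return suggestedStem;                 // give up: show the raw stem
    }
}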
(In reply to comment #10)
> As for the text cache, it seems reasonable to me. Snippeting works by grepping
> the text cache; this would be similar, except that there would likely be more
> than one (but fewer than all) documents to peruse.

Hmm... I was always thinking of worst-case behaviour; maybe on average you only need to grep a few documents. I am more worried about the tens of thousands of files to open and read (there are text cache files for all backends).
A lousy implementation of this has been committed in r4293. Working on it. :-)
Beagle is no longer under active development; its last code changes were in early 2011. Its codebase has been archived (see bug 796735): https://gitlab.gnome.org/Archive/beagle/commits/master "tracker" is an available alternative. Closing this report as WONTFIX as part of Bugzilla housekeeping to reflect reality. Please feel free to reopen this ticket (or rather transfer the project to GNOME GitLab, as GNOME Bugzilla is deprecated) if anyone takes responsibility for active development again.