GNOME Bugzilla – Bug 353534
Suggestions in Beagle
Last modified: 2018-07-03 09:52:30 UTC
As I saw that somebody found my suggestions code, Lukas, here it is. ;)

Background: I did a quick and dirty port of the Spell Checker code found in Lucene SVN for another project, and hooked it up to Beagle. While this is nice to have, the way it's currently hooked into Beagle's indexing isn't exactly a masterpiece. Here are some things somebody needs to fix before we can use it:
1) The n-grams we generate come from the stemmed tokens in the index. This also applies to the word we try to match against, which will produce some weird suggestions if your word gets stemmed.
2) Updating the n-gram index takes a long time. We need to profile the spell checker and see if there is a problem.
3) One n-gram index is created per "ordinary index". This is so far from optimal it's not even funny, in terms of speed, disk usage and suggestion relevancy. When we move to having a single index (or multiple indexes shared between the backends) we'll hopefully be able to solve this better.
4) I had to revert to indexing a batch of documents in RAM and merging it to disk after the whole batch is done, which causes the index to be optimized every time we get an indexing request.

That's it. Go to work, I'm off to school. ;)
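For anyone who wants the general shape of the approach without reading the patch: the attachment is a C# port, but the upstream Java contrib code it derives from is used roughly as in the sketch below. The index paths and the "Text" field name are placeholders, not Beagle's actual layout; this is only meant to show why the n-grams are built from already-stemmed terms (issue 1 above).

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.spell.LuceneDictionary;
import org.apache.lucene.search.spell.SpellChecker;
import org.apache.lucene.store.FSDirectory;

public class BuildSpellingIndex {
    public static void main(String[] args) throws Exception {
        // Paths are illustrative only.
        FSDirectory mainIndex = FSDirectory.getDirectory("/path/to/MainIndex", false);
        FSDirectory spellIndex = FSDirectory.getDirectory("/path/to/SpellingIndex", true);

        IndexReader reader = IndexReader.open(mainIndex);
        SpellChecker checker = new SpellChecker(spellIndex);

        // The dictionary is just the set of terms in one field of the main
        // index. Because that field was analyzed (and therefore stemmed) at
        // index time, the n-grams are built from stems, not surface words.
        checker.indexDictionary(new LuceneDictionary(reader, "Text"));

        String[] suggestions = checker.suggestSimilar("asembly", 5);
        for (String s : suggestions)
            System.out.println(s);   // likely prints stems such as "assembl"

        reader.close();
    }
}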
Created attachment 71895 [details] [review] Patches to Beagle
Created attachment 71896 [details] Port of SpellChecker from Lucene SVN
First things first: DAMN YAY AWESOME!

Issues:
1) Static indexes and this aren't playing well, and the code doesn't create the needed SpellingIndex directory unless the index is wiped first. We should make this all automagical and transparent (i.e. flush the index for the user).
2) Any chance we can make the suggestions more apparent? They aren't immediately noticeable, and more importantly, we should try to make them 'clickable'.

I'll add more once I get everything indexed :)
1) Yeah, there are two options: either bump the main Lucene index version, which will trigger a full reindex for all backends, or detect when there is no spelling index and regenerate it from the on-disk index the first time.
2) Of course, but I feel we need to address the issues I raised in the first comment before anything else.
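A minimal sketch of the second option, again against the upstream Java API: if the SpellingIndex directory is missing (e.g. on an index created before this feature existed), rebuild it once from the terms already on disk instead of forcing a full reindex. The directory names and the "Text" field are assumptions, not Beagle's actual layout.

import java.io.File;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.spell.LuceneDictionary;
import org.apache.lucene.search.spell.SpellChecker;
import org.apache.lucene.store.FSDirectory;

public class SpellingIndexCheck {
    // Rebuild the spelling index lazily if it does not exist yet.
    static void ensureSpellingIndex(File backendDir) throws Exception {
        File spellDir = new File(backendDir, "SpellingIndex");
        if (spellDir.exists())
            return;                       // already built, nothing to do

        IndexReader reader = IndexReader.open(
                FSDirectory.getDirectory(new File(backendDir, "PrimaryIndex"), false));
        SpellChecker checker = new SpellChecker(
                FSDirectory.getDirectory(spellDir, true));
        checker.indexDictionary(new LuceneDictionary(reader, "Text"));
        reader.close();
    }
}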
Created attachment 72008 [details] Screenshot for those who haven't seen it
After the 1.9 Lucene.Net merges, will this still work?
Can we just use a FuzzyTermEnum to enumerate all similar terms in the index? That would be faster and would require no extra index. The one problem I don't quite see how to solve (with both the FuzzyTermEnum and SpellChecker approaches) is how to report the correct word when the query is stemmed. I wrote a small FuzzyTermEnum program, and if I feed it "asembly" it tells me "assembl" is the correct word in the index. I have no clue how to add the 'y' back at the end.
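For reference, the kind of small test program described here looks roughly like this in the Java Lucene API of that era; the index path and field name are placeholders.

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.FuzzyTermEnum;

public class FuzzyTest {
    public static void main(String[] args) throws Exception {
        IndexReader reader = IndexReader.open("/path/to/PrimaryIndex");
        // Enumerates all indexed terms within the default edit-distance
        // similarity of the (possibly misspelled) query word.
        FuzzyTermEnum fuzzy = new FuzzyTermEnum(reader, new Term("Text", "asembly"));
        do {
            Term t = fuzzy.term();
            if (t == null)
                break;
            System.out.println(t.text());   // prints stems like "assembl"
        } while (fuzzy.next());
        fuzzy.close();
        reader.close();
    }
}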
Yeah, doing a fuzzy term enum is probably the best way to go about this, considering that building an n-gram index will be prohibitively expensive. To recover the suffixes that get stemmed off, we'd probably have to store the suffix somewhere and reattach it when reporting the suggestion to the user, or perhaps use the TextCache to pull a matching word.
Fuzzy term enum is extremely simple (5 lines of code). But I am not sure about the method of "unstemming" the word. The stemmed word could have a totally different suffix than one that fits the best match. The best match could be a name or some abbreviation ... anything. It might not even be a valid word! Pulling from the text cache will not work; it's too large to grep through.
You would probably need to modify the analyzer to get at the unstemmed version of the word. The potential for invalid words is definitely there, but since both would stem to the same thing anyway, there's no real way for us to tell; it would be interesting to see how often this happens in practice. As for the text cache, it seems reasonable to me. Snippeting works by grepping the text cache; this would be similar, except that there would likely be more than one (but fewer than all) documents to peruse.
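A sketch of the "pull a matching word from the text cache" idea: scan the cached plain text of a hit document for a surface word that analyzes (i.e. stems) to the suggested term, using the same Analyzer the index uses. The plain-file reading below stands in for Beagle's actual TextCache access, and "Text" is a placeholder field name.

import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import java.io.IOException;
import java.io.StringReader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenStream;

public class Unstem {
    // Run a single word through the analyzer and return its first token
    // (its stemmed form), or the word itself if it produces no token.
    static String analyzeOneWord(Analyzer analyzer, String word) throws IOException {
        TokenStream ts = analyzer.tokenStream("Text", new StringReader(word));
        Token t = ts.next();
        ts.close();
        return t == null ? word : t.termText();
    }

    // Return the first surface form in the cached text whose stem matches
    // the suggested term, e.g. "assembly" for "assembl".
    static String surfaceFormFor(Analyzer analyzer, String suggestedStem,
                                 File cachedText) throws IOException {
        BufferedReader in = new BufferedReader(new FileReader(cachedText));
        try {
            String line;
            while ((line = in.readLine()) != null)
                for (String word : line.split("\\W+"))
                    if (analyzeOneWord(analyzer, word).equals(suggestedStem))
                        return word;
        } finally {
            in.close();
        }
        return suggestedStem;                 // give up: show the raw stem
    }
}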
(In reply to comment #10)
> As for the text cache, it seems reasonable to me. Snippeting works by grepping
> the text cache; this would be similar, except that there would likely be more
> than one (but fewer than all) documents to peruse.

Hmm... I was always thinking of worst-case behaviour; maybe on average you only need to grep a few documents. I am more worried about the tens of thousands of files to open and read (there are text cache files for all backends).
A lousy implementation of this has been committed in r4293. Working on it. :-)
Beagle is no longer under active development; its last code changes were in early 2011. Its codebase has been archived (see bug 796735): https://gitlab.gnome.org/Archive/beagle/commits/master "tracker" is an available alternative. Closing this report as WONTFIX as part of Bugzilla housekeeping to reflect reality. Please feel free to reopen this ticket (or rather transfer the project to GNOME GitLab, as GNOME Bugzilla is deprecated) if anyone takes responsibility for active development again.