GNOME Bugzilla – Bug 341797
Ignore stop words when generating search results
Last modified: 2006-07-24 11:52:08 UTC
Yelp 2.14.1, Ubuntu Dapper

To make searches both faster and more relevant, each locale should have a list of stop words, which are ignored when choosing search results (though they should still be highlighted wherever they appear in the results that are returned). The list of stop words should be maintained by translators for each locale.

The ideal stop words for a help viewer will be a bit different from those for a Web search engine, because of the different types of things people ask for in each. For starters, in English locales, Yelp probably should ignore the words "a about an are as at be broke broken by can can't dialog dialogue do doesn't/doesnt don't/dont explain for from has have help how i in is it item me my of on or tell that the thing this to what where who will with won't/wont why work working works", and the suffix "'s".

For example:
"What's the Connect to Server menu item for?" finds {connect, server, menu}
"How do I customize the clock?" finds {customize, clock}
"All my windows have disappeared" finds {all, windows, disappeared}
"This thing is broken" returns the generic "No results found" message.

(Implementation note: In the future, when keywords are implemented for help pages, these should be included for searching even if they're stop words. For example, searching iTunes Help for "what is this thing?" returns the page "Overview: Learn what iTunes is and what you can do with it" and no other.)
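The filtering proposed above can be sketched as follows. This is only an illustration in Python (Yelp itself is written in C); the function name `search_terms` and the tokenising rules (split on whitespace, trim surrounding punctuation, drop a trailing "'s") are assumptions made for the sketch, not part of the report.

```python
import string

# English stop words proposed in this report. Pairs written as "x/y" in the
# report (e.g. "don't/dont") are listed here as two separate words.
STOP_WORDS = frozenset(
    "a about an are as at be broke broken by can can't dialog dialogue do "
    "doesn't doesnt don't dont explain for from has have help how i in is it "
    "item me my of on or tell that the thing this to what where who will with "
    "won't wont why work working works".split()
)


def search_terms(query):
    """Return the lowercased query words that are not stop words, in order."""
    terms = []
    for raw in query.lower().split():
        # Trim leading/trailing punctuation such as "?" but keep inner apostrophes.
        word = raw.strip(string.punctuation)
        # Drop the "'s" suffix, so "what's" is treated as "what".
        if word.endswith("'s"):
            word = word[:-2]
        if word and word not in STOP_WORDS and word not in terms:
            terms.append(word)
    return terms
```

With these rules, the four example queries above yield the term sets listed; an empty result corresponds to the "No results found" case.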
Moving to search component
Created attachment 66937
Improved search

A work-in-progress patch to once again try improving searching. List of things it does:
1. Re-enables searching of the GNOME user guide (don't ask)
2. Adds a list of "stop words" that aren't used in the search
3. Only returns results containing all the words in a query (e.g. for "CD burning", it only returns results with both "CD" and "burning" somewhere in the text; currently, results with either CD or burning or both are returned)
4. Attempts to find the best snippet associated with the query
5. Strips any '?'s and "'s" from words in the search terms

This isn't really usable yet, but to give an idea: searching for "cd burning" will now return 2 results (on Ubuntu Dapper): the desktop guide and the Ubuntu desktop guide (both pointing to the section on how to burn CDs and DVDs). Not sure this is exactly what is wanted, but it's an improvement, I think. Searching for "How do I burn CDs?" should return the same 2 results (but for some reason doesn't just now).

The stop words are stored as a colon-separated list so they will be translatable.

I'll extend the work some more when I get the chance. Some things I'd like are: a list of common suffixes (like "es" or "s" or "ing" in English) that can be removed or checked against during the search (and is translatable), which should improve the results a bit further. I'd also like to clean up the search framework a bit (it's still really messy just now).

Sorry for the rambling. Just putting the patch here so I don't forget about it and for people to try if they so desire (more feedback is always good ;) )
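As a rough illustration of points 2 and 3 (a colon-separated stop-word list and "all words must match"), here is a minimal Python sketch. It is not the code from the attachment: the names `parse_stop_words`, `matches_all` and `search`, the shortened stop-word string, and the in-memory `documents` mapping are all assumptions made for the example.

```python
import re

# Hypothetical, shortened stop-word string; the real list is in the patch.
STOP_WORD_STRING = "a:about:an:do:how:i:the:to"


def parse_stop_words(colon_list):
    """Split a colon-separated stop-word string into a set of lowercase words."""
    return {w.strip().lower() for w in colon_list.split(":") if w.strip()}


def matches_all(terms, text):
    """True only if there is at least one term and every term is a word in text."""
    words = set(re.findall(r"[a-z0-9']+", text.lower()))
    return bool(terms) and all(term in words for term in terms)


def search(terms, documents):
    """Return titles of documents containing every term (AND rather than OR)."""
    return [title for title, text in documents.items() if matches_all(terms, text)]
```

With AND matching, a document that mentions only "CD" is dropped for the query "CD burning", which is the stricter behaviour the patch describes.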
Awesome -- thanks for doing this. That "CD burning" produces such good results now is an excellent sign.

On further thought, I think the list should include "get", "gets", "got", "make", "makes", "not", and "when".

The list also needs advice for translators. Something like:
/*
Do not translate this string directly. These are colon-separated words that
aren't useful for choosing search results; they will be different for each
language. Include pronouns, articles, very common verbs and prepositions,
words from question structures like "tell me about" and "how do I", and
words for functional states like "not", "work", and "broken".
*/

In some languages (perhaps even including English), the forms like "-es" and "-ing" won't be reliably programmable, so it may be better to handle them using hidden keywords in individual help pages instead.
(In reply to comment #3)
> Awesome -- thanks for doing this. That "CD burning" produces such good results
> now is an excellent sign. On further thought, I think the list should include
> "get", "gets", "got", "make", "makes", "not", and "when".

Cool. I'll add them to the list. I'm quite concerned that Yelp with the patch is too strict. It misses a few manuals that might be useful in this case (e.g. the gthumb manual contains a section on burning an image collection to CD).

> The list also needs advice for translators. Something like:
> /*
> Do not translate this string directly. These are colon-separated words that
> aren't useful for choosing search results; they will be different for each
> language. Include pronouns, articles, very common verbs and prepositions,
> words from question structures like "tell me about" and "how do I", and
> words for functional states like "not", "work", and "broken".
> */

As I said, this is still a work in progress. I was going to add a comment, but wanted it to appear in the po file. I'm not sure how to do this (whether comments appear or not), but I'll definitely add a comment or something like that before committing.

> In some languages (perhaps even including English), the forms like "-es" and
> "-ing" won't be reliably programmable, so it may be better to handle them using
> hidden keywords in individual help pages instead.

Not sure of the best way of doing this properly. My basic thought was to do something similar to the stop words. Currently, the search checks whether the found word has spaces / punctuation around it: if it does, it assumes it's a hit; if not, then it's part of another word and can be ignored. I was going to extend this slightly and instead check for space / punctuation and the defined suffixes, which may result in a few extra hits.

I don't really know if it's worth it, but I figure people would rather see a few results matching the word "talk" to the term "talking" in the text, and maybe get a few more hits (that might not be as relevant), than miss all the hits for "talking" altogether.

I'm also building a list of test queries that should cover a reasonable selection of what people might search for, which is currently quite short. If people have ideas for search terms to include in the list, I'd love to add them to help improve search further.
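The boundary check with suffixes described above might look roughly like this. It is a Python sketch under assumptions, not the patch's C code: `SUFFIXES`, `is_boundary` and `find_hit` are hypothetical names, and the suffix list is just the "s", "es", "ing" examples mentioned earlier in this thread.

```python
# Hypothetical English suffix list; "" stands for "no suffix at all".
SUFFIXES = ("", "s", "es", "ing")


def is_boundary(text, index):
    """True if index is outside the text or points at a non-alphanumeric char."""
    return index < 0 or index >= len(text) or not text[index].isalnum()


def find_hit(term, text, suffixes=SUFFIXES):
    """Return the offset of the first occurrence of term that starts at a word
    boundary and is followed by an allowed suffix plus a boundary, else -1."""
    haystack = text.lower()
    term = term.lower()
    start = haystack.find(term)
    while start != -1:
        end = start + len(term)
        if is_boundary(haystack, start - 1):
            for suffix in suffixes:
                # Accept "talk" + "ing" only if the word really ends there.
                if haystack.startswith(suffix, end) and is_boundary(
                    haystack, end + len(suffix)
                ):
                    return start
        start = haystack.find(term, start + 1)
    return -1
```

Here "talk" hits "talking" but not "stalker" or "talkative", which is the trade-off described: a few extra hits in exchange for not missing inflected forms.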
It's true that searching for "cd burning" currently returns gthumb's relevant section, but as far as I can tell, that's only because the help is out of date. If it referred to "the CD/DVD Creator window", instead of "the burn:/// location with Nautilus", would it show up at all? (I don't know how to test this because I don't know where/how any word index/cache is stored.) (Anyway, keywords are needed for synonyms, not just for manual stemming.)

A comment should end up in a PO file if it comes immediately before the source code line that uses the string. (Sorry for not making that explicit.)

A really nifty way to collect likely search strings would be to collect real ones. Add code that is off (or perhaps pointed at gnomesupport.org) by default, but that distributors can turn on, to end search results with a "Get more help at {Name of Vendor's Support Site}" link -- with distributors logging the resulting search strings, as well as returning results. This would also make the help a whole lot more useful and convenient, even before any improvements in results! Vendors wouldn't even need to share private data about searches, as resulting improvements should be self-evident when reported as bugs. But all that probably belongs in a separate bug report ...
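To illustrate the point about translator comments, here is a small sketch of a comment placed immediately before the line that uses the string. It uses Python's standard `gettext` module rather than Yelp's C sources, and the shortened stop-word string is hypothetical. GNU xgettext copies such comments into the PO file when run with its `--add-comments` option; whether Yelp's build passes that option is an assumption here.

```python
from gettext import gettext as _

# Translators: do not translate this string directly. These are
# colon-separated words that aren't useful for choosing search results;
# they will be different for each language.
STOP_WORD_STRING = _("a:about:an:do:how:i:the:to")

# With no translation catalog installed, gettext returns the string unchanged.
stop_words = set(STOP_WORD_STRING.split(":"))
```

The comment has to sit directly above the `_()` call; a comment separated from it by other code is not attached to the message.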
(In reply to comment #5)
> It's true that searching for "cd burning" currently returns gthumb's relevant
> section, but as far as I can tell, that's only because the help is out of date.
> If it referred to "the CD/DVD Creator window", instead of "the burn:///
> location with Nautilus", would it show up at all? (I don't know how to test
> this because I don't know where/how any word index/cache is stored.)
> (Anyway, keywords are needed for synonyms, not just for manual stemming.)

The searching isn't using any word indexes. It literally searches each document for each search term (that isn't an ignore word). So, even though gthumb's manual is out of date, it really should be showing up in the search results.

I did play around with the idea of a word list, stored in one of Yelp's secret hidden places, but it would need either to be refreshed each time scrollkeeper was updated with a new document, or to be a separate file for each document. The first would probably take a huge amount of time at startup (and when originally done, there wasn't Brent's super-cool caching system, so Yelp didn't know if scrollkeeper was updated or not). The second might be reasonable, but would create lots of files in the Yelp hidden directory, which is not exactly ideal. At the time of implementing, I figured the quickest way would be to parse each document on the fly - it really doesn't take a long time to do and means there are snippets with each result. By far the longest time used during the search is in the info page searching part.

Anyway, I'm off on a tangent. I might revisit these ideas when the branch for super-cool new stuff is created.

> A comment should end up in a PO file if it comes immediately before the source
> code line that uses the string. (Sorry for not making that explicit.)

Cool. I didn't know if it would. I'll add a comment in the next patch I create.

> A really nifty way to collect likely search strings would be to collect real
> ones. Add code that is off (or perhaps pointed at gnomesupport.org) by default,
> but that distributors can turn on, to end search results with a "Get more help
> at {Name of Vendor's Support Site}" link -- with distributors logging the
> resulting search strings, as well as returning results. This would also make
> the help a whole lot more useful and convenient, even before any improvements
> in results! Vendors wouldn't even need to share private data about searches, as
> resulting improvements should be self-evident when reported as bugs. But all
> that probably belongs in a separate bug report ...

I was thinking more short-term: a list of phrases or queries that people might search for, which I'd manually go through, checking which pages should be returned by the search. I'd then use the list to try to improve the search parameters during this cycle (and possibly build on it next cycle).
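The index-free approach described above (scan every document for every term at query time, keeping a snippet for each hit) can be sketched like this. It is a simplified Python illustration, not Yelp's code: `snippet`, `scan`, the fixed 30-character context width, and the in-memory `documents` mapping are all assumptions.

```python
def snippet(text, term, width=30):
    """Return up to `width` characters of context on each side of the first
    case-insensitive occurrence of term, or None if the term is absent."""
    pos = text.lower().find(term.lower())
    if pos == -1:
        return None
    start = max(0, pos - width)
    end = min(len(text), pos + len(term) + width)
    return text[start:end].strip()


def scan(documents, terms):
    """Scan each document for each term with no prebuilt index.

    Returns a mapping of title -> snippets for documents containing all terms.
    """
    results = {}
    for title, text in documents.items():
        found = [s for s in (snippet(text, t) for t in terms) if s is not None]
        if terms and len(found) == len(terms):
            results[title] = found
    return results
```

Nothing is cached between searches, so there is no index to refresh when a new document is installed; the cost is re-reading every document for every query.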
Reported bug 344843 on the vendor-collected lists of searches. I cannot think of any other source of real help-related searches: Web searches such as those listed by SearchSpy <http://www.dogpile.com/info.dogpl/searchspy/results.htm?filter=1> are rarely tech-related.
Created attachment 68663
Updated patch

To prove I'm still working on this ;)

The patch is still a prototype. It adds a list of prefixes / suffixes. I know this isn't ideal, but it does produce some decent results (e.g. "how do I burn a CD?" now returns 6 results instead of 2 - including the GnomeBaker manual!!! as well as the gthumb manual).

In this version, the sections linked to are screwed up and the snippets need some lovin' (I've been meaning to sort them out better for a while). I suspect the time taken to process a search is starting to increase too much, and I'll look at ways of bringing it back down again. Also, there are (on some occasions) seemingly irrelevant results returned, but I have a plan to get rid of them again.

As I said, still a prototype. Just to show I'm still working on it. Don't judge it too harshly.
Created attachment 69185
One more time

Updated the patch once more. Approaching final form. A huge number of changes in here. Highlights are (from what I remember):
* Fix links from last patch
* If the search is entirely ignore words, run them all through
* Improve searching in HTML docs (see bug #347819 for details, which I'll make depend on this one)
* Rewrite snippet parsing. Catches more keywords
* Don't display index terms / keywords in snippets. Instead, rely on the title of the section
* Handle prefixes and suffixes better (still isn't an ideal solution, but does allow burn and burning to pick up the same result)
* Fix score mechanism. It was broken for a while. Works better now.
* Speed up the search a bit by moving some code around.

I'm sure there is more.
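The second highlight (a query made up entirely of ignore words is searched as-is instead of returning nothing) can be illustrated with a short Python sketch. The name `effective_terms` and the small `IGNORE_WORDS` set are hypothetical and only stand in for the patch's real list.

```python
# Hypothetical subset of the ignore-word list, for illustration only.
IGNORE_WORDS = {"how", "do", "i", "this", "thing", "is", "broken", "the", "a"}


def effective_terms(words, ignore_words=IGNORE_WORDS):
    """Drop ignore words, unless that would leave nothing to search for."""
    kept = [w for w in words if w.lower() not in ignore_words]
    # Fall back to the full query when every word was an ignore word.
    return kept if kept else list(words)
```

Under this rule, "This thing is broken" is searched with all four words rather than producing the generic "No results found" message suggested in the original report.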
Patch has been committed. Closing.

2006-07-23  Don Scorgie  <dscorgie@cvs.gnome.org>

	* src/yelp-search-pager.c:
	Large basic search update:
	- Common words are ignored (how, do etc.)
	- Common suffixes and prefixes are checked
	- Fix problems with searching in some docbook docs
	- Better scoring mechanism
	- Slightly improved matching algorithm
	- Fix searching HTML docs
	- Better snippet highlighting
	- Clean sections of code slightly
	Bugs fixed: #341797, #341800, #347819 (work around)
	Partially fixed: #335962
Thank you!