GNOME Bugzilla – Bug 668728
Support full text search
Last modified: 2016-01-13 16:17:52 UTC
I have no idea how the search works (is it looking in metadata only?), but it just doesn't work for me. For instance, I type "bank" to find my bank documents. Nothing. If I do the same in tracker-needle, I get useful results.
(In reply to comment #0) > I have no idea how the search works (is it looking in metadata only?), but it > just doesn't work for me. > > For instance, I type "bank" to find my bank documents. Nothing. > > If I do the same in tracker-needle, I get useful results. Search looks in title and author; it might be that tracker-needle also does full text search inside the doc contents, which we don't support yet.
Tracker-needle does indeed also search in the full text of the files. Is searching full text in gnome-documents in scope for 3.6? As gnome-documents is already using tracker, I would imagine extending for fts search should not be a big change.
Bump for this bug. I see that it is still not resolved in 3.7.91 even though it should be just adding "fts:match" to the SPARQL query. Why is this bug still unresolved? Is there some patch to achieve it? By the way, the functionality of "Tracker Search" extension [1] is more accurate to tracker-needle and faster than searching with the gnome-documents search provider. I do not see a clear reason for this. [1] https://extensions.gnome.org/extension/284/tracker-search/
Pedro: Because nobody has written the code yet. Will you? Patches are welcome.
I tried but the problem seems deeper and my potential fix will only satisfy some users and no developer. The problem with "fts:match" is that it does not perform substring match, just prefix match. This is not a big problem for typical users but, even for them, there are some use-cases which are not covered. However, I could not figure out why the current approach does not find files with some words and tracker-needle (or Tracker Search extension) does. If I find more time to spend on this I could try other alternatives.
Created attachment 240908 [details] [review] Support FTS patch
(In reply to comment #6) > Created an attachment (id=240908) [details] [review] > Support FTS patch The attached patch implements the fts feature on gnome-documents. As discussed on IRC with Cosimo Cecchi, even prefix searching is better than no full text searching at all. Please, note that the patch depends on the resolution of the tracker bug https://bugzilla.gnome.org/show_bug.cgi?id=697517
Review of attachment 240908 [details] [review]: Rafael, thanks for the patch. Other than the comment below, doesn't this patch also unconditionally apply the FTS matching, causing potentially less results to be displayed? For example: I have a file titled "Cosimo's Document"; regardless of its contents, right now this will show up if I search for "simo", but AFAICS it won't after your patch, as the fts:match inside the WHERE clause will discard all results for which "simo" is not a word-prefix match for any of the FTS-indexed properties of the document. Some possible solutions: - query both with FTS and non-FTS in an UNION, and rely on DISTINCT to remove duplicate results for us - decide to drop non word-prefix matching completely, i.e. make the above the expected behavior. I think this option is less worse than it sounds - matching inside words is not an usual search engine behavior What do you think? ::: src/query.js @@ +100,3 @@ + // fts:match needs at least 3 characters to work properly + if (terms.length > 0) + part += ' ?urn fts:match \"%s\" . '.format(terms); I think this would need to go inside the if ((flags & QueryFlags.UNFILTERED) == 0) condition in any case.
(In reply to comment #8) > Review of attachment 240908 [details] [review]: > > Rafael, thanks for the patch. > > Other than the comment below, doesn't this patch also unconditionally apply the > FTS matching, causing potentially less results to be displayed? > For example: I have a file titled "Cosimo's Document"; regardless of its > contents, right now this will show up if I search for "simo", but AFAICS it > won't after your patch, as the fts:match inside the WHERE clause will discard > all results for which "simo" is not a word-prefix match for any of the > FTS-indexed properties of the document. Exactly, suffix matching stops working. > Some possible solutions: > - query both with FTS and non-FTS in an UNION, and rely on DISTINCT to remove > duplicate results for us > - decide to drop non word-prefix matching completely, i.e. make the above the > expected behavior. I think this option is less worse than it sounds - matching > inside words is not an usual search engine behavior > > What do you think? The second approach makes more sense to me. For example, if I want to find a document which contains the word "football" I won't enter "ball", I'll actually try "football". Anyway, I'll send an updated patch when you decide what's the best approach in this case. > ::: src/query.js > @@ +100,3 @@ > + // fts:match needs at least 3 characters to work properly > + if (terms.length > 0) > + part += ' ?urn fts:match \"%s\" . '.format(terms); > > I think this would need to go inside the if ((flags & QueryFlags.UNFILTERED) == > 0) condition in any case. I thought UNFILTERED was the match All in all documents, but it doesn't seem to be the case. So you are right, it should go inside the if.
Created attachment 241706 [details] [review] Support FTS patch v2 v2: Fix the issue appointed about the fts:match going under the if
*** Bug 672690 has been marked as a duplicate of this bug. ***
Review of attachment 241706 [details] [review]: So, I tried this patch (with the wildcard modification suggested below) and I didn't really like the result: - Searching seems to be quite a bit slower. Right now we're assembling results with an UNION of all the WHERE clauses for each search type, and with this patch we do an fts:match in each of those (which is I guess the reason it is slower); it might be possible to rearrange the query so that only one fts:match is applied globally. - We lose results that match the author string. For example, I have a lot of Google documents that show "cosimo.cecchi" in the author field. Searching for "cosimo" or "cecchi" won't display those anymore after your patch (not even if I explicitly select the Author match from the filter dropdown). I believe this is because, despite nco:fullname being an FTS-indexed property, it's a property of the nco:Contact and not of one of the resource types we're querying for. This is not a very easy one to fix: we could change how matching works for authors in general, and require that the Author filter is set in order for those results to be shown (and then you would disable FTS from the query when those conditions are met), but I am not a big fan of that. Another approach is to first do a FTS query for all the authors that match the string, and add in the query all documents for which the author is part of that set. - We also get a lot of extra results by default for which the connection with the search string isn't immediately obvious. I think those should be somehow ranked lower, or even under a separate header, if they only match by content. So overall a different approach that I'm kinda leaning towards instead is to make content search an explicit opt-in when you query, and not enabled by default. What do you think? ::: src/query.js @@ +102,3 @@ + // fts:match needs at least 3 characters to work properly + if (terms.length > 0) + part += ' ?urn fts:match \"%s\" . '.format(terms); This should be "%s*" instead, or you should append a wildcard to each of the search terms, as Owen did in the patch at https://bugzilla.gnome.org/show_bug.cgi?id=672690#c3
(In reply to comment #12) > Review of attachment 241706 [details] [review]: > > So, I tried this patch (with the wildcard modification suggested below) and I > didn't really like the result: > - Searching seems to be quite a bit slower. Right now we're assembling results > with an UNION of all the WHERE clauses for each search type, and with this > patch we do an fts:match in each of those (which is I guess the reason it is > slower); it might be possible to rearrange the query so that only one fts:match > is applied globally. It was not the only reason: results were not being filtered because I was returning '(true)' as the filter for content search. The issue is resolved in the new attached patch (v3). > - We also get a lot of extra results by default for which the connection with > the search string isn't immediately obvious. I think those should be somehow > ranked lower, or even under a separate header, if they only match by content. I think it can be solved by ordering the results by fts:rank too. It was added in the new patch. > So overall a different approach that I'm kinda leaning towards instead is to > make content search an explicit opt-in when you query, and not enabled by > default. What do you think? Hmm, although it is simple to implement (and I did it in the new patch), I'm not a big fan of this approach. It seems bad in the usability viewpoint, because when searching for All you are actually not searching for all the fields. I think some users might find the search inconsistent. But at least there is a way now to search inside the files. > ::: src/query.js > @@ +102,3 @@ > + // fts:match needs at least 3 characters to work properly > + if (terms.length > 0) > + part += ' ?urn fts:match \"%s\" . '.format(terms); > > This should be "%s*" instead, or you should append a wildcard to each of the > search terms, as Owen did in the patch at > https://bugzilla.gnome.org/show_bug.cgi?id=672690#c3 You're right. Fixed.
Created attachment 242096 [details] [review] Support FTS patch v3 In this new patch we only search by content when explicitly selected by the user. Otherwise, the search keeps its current behaviour. Another improvements are to ignore match filters when doing FTS and ordering the results by fts:rank besides modification time.
Any problem with Rafael latest patch (apart from the fact it might not apply anymore ? ;)
(In reply to Frederic Crozat from comment #15) > Any problem with Rafael latest patch (apart from the fact it might not apply > anymore ? ;) Thanks for pinging. I will take a look unless Cosimo beats me to it.
Review of attachment 242096 [details] [review]: The rest of the code looks fine to me, but would need testing. ::: src/query.js @@ +98,3 @@ + if (this._context.searchMatchManager.activeItemIsContent()) { + let fts = this._context.searchMatchManager.getFts(); + whereSparql += '?urn fts:match \"%s\" '.format(fts); Wouldn't that break if I had quotes in my search terms.
I'll try rebasing and testing the patch again.
Created attachment 318249 [details] [review] Support full text search. Here's an attempt to rebase the patch on top of master. I haven't tested it yet.
Created attachment 318254 [details] [review] Support full text search I introduced a syntax error while rebasing.
Created attachment 318333 [details] [review] Support full text search
Attached is an improved patch. In it I fixed the search behaviour so that it's consistent: when searching for All, fts results are also included. Searching for a specific matching criteria (author, title, content) also works as expected. About quotes in the search, the getTerms method from SearchController handles the escaping of them, so they are just ignored. A future improvement to fts in gnome-documents is to allow searching for sentences. Tracker supports sentence searching if the terms are enclosed in escaped quotes [1]. For that feature to be added to gnome-documents, the terms parsing needs to be changed so that we can identify which terms are part of a sentence and properly escape them in the tracker query. [1] https://wiki.gnome.org/Projects/Tracker/Documentation/Examples/SPARQL/FTS
Review of attachment 318333 [details] [review]: Thanks for the patch, Rafael. I am still trying to fully understand it, but a few quick and minor comments. ::: src/query.js @@ +107,1 @@ + let activeItem = this._context.searchMatchManager.getActiveItem(); Nitpick: How about calling it something like 'activeMatch' instead? Because every manager class derived from BaseManager calls its members items. @@ +150,3 @@ + whereParts.push(part); + })); + } So, this bit is the same as the previous loop, except we attach the ftsWhere to the other WHERE clauses, right? It might be nicer to refactor this into a separate function. We can pass '' as the ftsWhere to achieve the previous behaviour.
Review of attachment 318333 [details] [review]: ::: src/query.js @@ +134,3 @@ + // Note that the filter string needs to be slightly different for the + // fts to work properly + if (ftsWhere.length) { It might be more robust to do: if (ftsWhere.length || activeMatch.id != Search.SearchMatchStock.CONTENT) { ... } Consider the case where we have only selected 'content' from the dropdown but not entered any text. The query won't have any WHERE clause at all. Currently, it is not a big issue because we don't switch to the search view when only a search-match has been selected without typing any text (see src/embed.js). Whether it is a feature or a bug, I don't know. As the person who wrote the "multiple views" code, all I can say is that I forgot to wire _onSearchChanged to SearchMatchManager. :)
(In reply to Debarshi Ray from comment #24) > Review of attachment 318333 [details] [review] [review]: > It might be more robust to do: > if (ftsWhere.length || activeMatch.id != Search.SearchMatchStock.CONTENT) { > ... > } Sorry, I meant: if (ftsWhere.length || activeMatch.id == Search.SearchMatchStock.CONTENT) { ... }
(In reply to Debarshi Ray from comment #23) > Review of attachment 318333 [details] [review] [review]: > > Thanks for the patch, Rafael. I am still trying to fully understand it, but > a few quick and minor comments. > > ::: src/query.js > @@ +107,1 @@ > + let activeItem = this._context.searchMatchManager.getActiveItem(); > > Nitpick: How about calling it something like 'activeMatch' instead? Because > every manager class derived from BaseManager calls its members items. sure, I will change it. > @@ +150,3 @@ > + whereParts.push(part); > + })); > + } > > So, this bit is the same as the previous loop, except we attach the ftsWhere > to the other WHERE clauses, right? It might be nicer to refactor this into a > separate function. We can pass '' as the ftsWhere to achieve the previous > behaviour. It's not exactly the same. Besides the ftsWhere, it also passes 'false' to the _buildFilterString, so the MatchTypes filters are not included (when doing fts, I can't filter by author/title; it has to be in a separate block so that they are all in an UNION). This is the structure of the query: SELECT DISTINCT * WHERE { { searchType1 FILTERS(matchType, sourceType, categType) } UNION { searchchType2 query FILTERS(matchType, sourceType, categType) } ... UNION { searchType1 . fts:match <term> FILTERS(sourceType, categType) } UNION { searchType2 . fts:match <term> FILTERS(sourceType, categType) } ... Note that both the query (there is an appended fts:match) and the FILTERS have to change in order to cover all search scenarios (ALL, CONTENT, AUTHOR, TITLE). If I'm looking only in the content, I have to skip the non-fts query parts. If not searching for All neither Content, I have to skip the fts query parts. That's why I left two forEach loops in there. It looked cleaner to my eyes.
(In reply to Debarshi Ray from comment #25) > (In reply to Debarshi Ray from comment #24) > > Review of attachment 318333 [details] [review] [review] [review]: > > It might be more robust to do: > > if (ftsWhere.length || activeMatch.id != Search.SearchMatchStock.CONTENT) { > > ... > > } > > Sorry, I meant: > if (ftsWhere.length || activeMatch.id == Search.SearchMatchStock.CONTENT) { > ... > } Ok. I'll add that. I have to also add the condition for SearchMatchStock.ALL or there will be no fts when All is selected.
(In reply to Rafael Fonseca from comment #27) > (In reply to Debarshi Ray from comment #25) > > (In reply to Debarshi Ray from comment #24) > > > Review of attachment 318333 [details] [review] [review] [review] [review]: > > > It might be more robust to do: > > > if (ftsWhere.length || activeMatch.id != Search.SearchMatchStock.CONTENT) { > > > ... > > > } > > > > Sorry, I meant: > > if (ftsWhere.length || activeMatch.id == Search.SearchMatchStock.CONTENT) { > > ... > > } > > Ok. I'll add that. I have to also add the condition for SearchMatchStock.ALL > or there will be no fts when All is selected. ALL already works fine, I think. :)
Anyway, I am quite happy with the direction of this patch. I can see Cosimo's point (in comment 12) about search being slow because we are attaching fts:match to every search-type, but I didn't really feel any slowness. One (micro?) optimization that I can think of is to skip COLLECTIONS. nie:title is the only useful property that they have, and even if it is marked with tracker:fulltextIndexed, I doubt anyone would expect CONTENT searching to hit those. Cosimo?
Created attachment 318371 [details] [review] Support full text search Updated patch addressing raised points.
Review of attachment 318371 [details] [review]: Thanks, Rafael. Looks good to me. I pushed it to master after adding the bug URL to the commit message and making the following tiny changes. ::: src/query.js @@ +157,3 @@ let step = Search.OFFSET_STEP; + tailSparql += 'ORDER BY DESC(fts:rank(?urn)) '; // fts relevance ordering This comment seems redundant. I think it is obvious what we are doing. ::: src/search.js @@ +366,3 @@ + return '?urn fts:match \'%s\' . '.format(ftsterms.join(' ')); + + }, The newline got mixed up. :)
Created attachment 318977 [details] [review] query, search: Add full-text search I hope I didn't mess anything up because I had a strange feeling of déjà vu. I did double check it.
My apologies that it took 3 years to land this. Thank you for all your hard work and patience!
(In reply to Debarshi Ray from comment #33) > My apologies that it took 3 years to land this. Thank you for all your hard > work and patience! No worries. Let me know in case of any problems.