Bug 776395 – Indexed files missing from search results

Summary:	Indexed files missing from search results


Status:	RESOLVED OBSOLETE

Product:	tracker
Classification:	Core
Component:	Search Tool
Version:	unspecified
Hardware:	Other Linux

Importance:	High critical
Target Milestone:	---
Assigned To:	tracker-search-tool
QA Contact:	tracker-search-tool

URL:
Whiteboard:

Depends on:
Blocks:

Reported:	2016-12-22 14:06 UTC by jc
Modified:	2021-05-26 22:24 UTC

See Also:
GNOME target:	---
GNOME version:	---

Attachments
Screenshot of failed search against an indexed file (122.86 KB, image/png), 2016-12-22 14:07 UTC, jc
HACK: print tokenizing info (1.37 KB, patch), 2017-01-07 21:57 UTC, Carlos Garnacho
Tokens (854.84 KB, application/x-zip-compressed), 2017-01-08 14:58 UTC, jc

Description jc 2016-12-22 14:06:29 UTC

For some reason, some files are correctly indexed but do not appear in the search results.

See the attached screenshot:

- The settings show the indexed paths.
- I search for a term, with a limit of 100 files.
- The expected file (and around ten others) does not show up in the results.
- Checking the index manually for this file shows that indexing worked as expected.

So there seems to be a problem in the output given by tracker search (and needle).

(Note that I am using Recoll as a fallback, and it has no problem finding the same file.)

Comment 1 jc 2016-12-22 14:07:23 UTC

Created attachment 342388
Screenshot of failed search against an indexed file

Comment 2 jc 2016-12-22 16:01:44 UTC

I searched for several other terms taken at random from these documents.

The result: these files are entirely invisible to full-text search.

Comment 3 Carlos Garnacho 2016-12-22 19:43:20 UTC

I'm curious about what the full-text search table knows about these files. To find out, we first need to get the file's URN:

$ tracker info /path/to/file | head -n 2

This should give you the urn:uuid:$UUID string, which is a unique identifier for the given file. Next, check the FTS table directly:

$ tracker sql -q "select * from fts5 where rowid = (select ID from Resource where uri='urn:uuid:...')"

That should give you the contents of the FTS table for that particular file. Among the many columns, for documents I'd normally expect to see the filename, the document text content and the title.
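
The two steps above can be chained in a small script. This is a minimal sketch: the function names extract_urn and fts_query_for are hypothetical helpers, and it assumes the urn:uuid:... string appears literally in the "tracker info" output, as described above.

```shell
#!/bin/sh
# Print the first urn:uuid:... identifier found on stdin.
# Assumption: the URN appears literally in the `tracker info` output.
extract_urn() {
    grep -o 'urn:uuid:[0-9A-Fa-f-]*' | head -n 1
}

# Build the FTS lookup query shown above for the URN given as $1.
fts_query_for() {
    printf "select * from fts5 where rowid = (select ID from Resource where uri='%s')\n" "$1"
}

# Usage (needs the tracker CLI, so it is left commented out here):
#   urn=$(tracker info /path/to/file | extract_urn)
#   tracker sql -q "$(fts_query_for "$urn")"
```

Keeping the query in one helper avoids copying the URN into the SQL string by hand.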

FWIW, I think this was a one-time thing that has persisted in the database. Most probably, resetting the file:

$ tracker reset -f /path/to/file

Or rebuilding the FTS tokenization data from scratch:

$ tracker sql -q "INSERT INTO fts5(fts5) VALUES ('rebuild')"

will get the issue fixed. It would be nice to do some forensics first, though.

Comment 4 jc 2016-12-22 21:38:11 UTC

Yes, there are many columns including filename, title and text:

XMCO-ActuSecu-42-Securite_Imprimantes.pdf | (null) | (null) | (null) | (null) | (null) | (null) | (null) | (null) | (null) | (null) | (null) | (null) | (null) | (null) | (null) | (null) | (null) | (null) | (null) | (null) | (null) | (null) | (null) | (null) | (null) | (null) | (null) | (null) | (null) | (null) | (null) | (null) | (null) | (null) | (null) | (null) | (null) | (null) | (null) | (null) | (null) | (null) | actu
sécu 42
l’ACTUSÉCU est un magazine numérique rédigé et édité par les consultants du cabinet de conseil XMCO
NOVEMBRE 2015
[ ...rest of content ...] | (null) | (null) | (null) | (null) | (null)

Neither of the two commands fixed the issue...

Comment 5 Carlos Garnacho 2016-12-23 00:12:46 UTC

Does it also happen with the newly created file if you copy the document? In that case I'd appreciate it if you sent a specimen to my email address... I can dispose of the file after checking the problem if you wish.

Comment 6 jc 2016-12-23 07:09:32 UTC

So, the same issue happens if I copy the file to test.pdf, for instance.

No problem with sending you the file, as it is non-confidential, public information.

Comment 7 jc 2017-01-07 09:37:57 UTC

Did you receive my PDF?
Even if you cannot fix it, I am interested to know whether it got indexed in your environment or not.

Comment 8 Carlos Garnacho 2017-01-07 15:52:35 UTC

Sorry... I did receive it, and it indeed seems correctly indexed and searchable here (modulo the d' contractions). I think this can only be down to:

1) Tokenizer backend issues: Tracker has libicu and libunistring backends. libicu is by far the more popular across distros, and it just occurred to me that it's the one I've tested. Maybe yours uses libunistring by default?

2) Other locale-dependent issues: things like unaccenting, word stemming and stop words are locale-dependent to some extent. Do you use fr_FR.UTF-8 or anything more exotic?

I will recompile with libunistring and double check things though.

Comment 9 jc 2017-01-07 16:10:04 UTC

Hey, no problem at all, thank you. Now I know that I have to try harder.

1) It is libicu, I verified it with strace:

%  strace tracker search -l 100 --disable-snippets "hacking+team" 2>&1 | grep -i libicu     
open("/usr/lib64/tracker-1.0/libicui18n.so.57", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
open("/lib64/libicui18n.so.57", O_RDONLY|O_CLOEXEC) = 3
open("/usr/lib64/tracker-1.0/libicuuc.so.57", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
open("/lib64/libicuuc.so.57", O_RDONLY|O_CLOEXEC) = 3
open("/usr/lib64/tracker-1.0/libicudata.so.57", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
open("/lib64/libicudata.so.57", O_RDONLY|O_CLOEXEC) = 3
%  strace tracker search -l 100 --disable-snippets "hacking+team" 2>&1 | grep -i libunitring 
% 

2) It is fr_FR.UTF-8:

%  echo $LANG
fr_FR.UTF-8
%  

I will run more tests with that file on different distros and let you know if it works somewhere.

Comment 10 jc 2017-01-07 16:14:51 UTC

Concerning (1), I just realized that it is not finding the lib (I need to wake up).

So, testing now with symbolic links from /usr/lib64. I probably need to re-index?
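
One way to read the strace output from comment 9: each ENOENT line is a failed lookup in a single search directory, while a line ending in a file descriptor such as "= 3" means the library was opened from that path. The sketch below filters strace output down to the successful opens. The function name is hypothetical, and the patterns assume the open()/openat() line format shown in this report. Note that a grep pattern has to be spelled "libunistring" to match that backend.

```shell
#!/bin/sh
# Print the tokenizer-related libraries that were actually opened.
# Reads strace output on stdin.
#   "... = -1 ENOENT (...)"  -> failed probe of one directory, dropped
#   "... = <fd>"             -> successful open, path is printed
loaded_tokenizer_libs() {
    grep -E 'libicu|libunistring' \
        | grep -E '\) = [0-9]+$' \
        | sed -E 's/.*"([^"]+)".*/\1/'
}

# Usage (needs strace and the tracker CLI, so it is left commented out):
#   strace -f -e trace=open,openat tracker search -l 100 "hacking" 2>&1 \
#       | loaded_tokenizer_libs
```

If this prints paths under /lib64 for libicu, the library was found through the normal search path, and the earlier ENOENT lines for the tracker-1.0 directory are only the first directory tried.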

Comment 11 jc 2017-01-07 16:54:07 UTC

After more testing:

- on my main machine after reindexing the file,
- on a fresh Fedora 25 machine,
- on a fresh Ubuntu 16.04 system (note that here it seems to be compiled with libunistring)

The results are consistent:

- the file is indexed and you can search for single words inside it, for instance "hacking"
- searches with several terms fail, like "hacking team".

Could something be wrong with the way the space is encoded?

Comment 12 Carlos Garnacho 2017-01-07 21:56:32 UTC

That's weird, because word breaks are detected in libicu/libunistring-specific ways, so I wouldn't expect both to be broken in exactly the same way. The workings are:

On the indexed content:

* Tracker registers a custom tokenizer for the FTS table:
  https://git.gnome.org/browse/tracker/tree/src/libtracker-fts/tracker-fts-tokenizer.c#n391

* When text in the FTS table is processed, the tokenizer is called with the full text in the specific row/column:
  https://git.gnome.org/browse/tracker/tree/src/libtracker-fts/tracker-fts-tokenizer.c#n88
  
* The tokenizing function splits by word break (discarding stop words on the way), and calls back into the given FTS function to register each token individually:
  https://git.gnome.org/browse/tracker/tree/src/libtracker-fts/tracker-fts-tokenizer.c#n132

* The tracker_parser_next() function used to iterate through the extracted tokens is the one with libicu/libunistring implementations, and looks for word breaks in library specific ways:
  https://git.gnome.org/browse/tracker/tree/src/libtracker-common/tracker-parser-libicu.c#n429
  https://git.gnome.org/browse/tracker/tree/src/libtracker-common/tracker-parser-libunistring.c#n333


On the search terms:

* Exactly the same tokenizing function is used to process the given search terms, so sqlite does in the end compare individual tokens that are split and preprocessed the same way. There's only one exception here: we don't bother to filter stop words from the given search terms:
  https://git.gnome.org/browse/tracker/tree/src/libtracker-fts/tracker-fts-tokenizer.c#n105
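
The asymmetry between the two paths can be sketched in a few lines of shell. This is only an illustration, not Tracker's tokenizer: it assumes ASCII text, splits on non-alphanumeric characters instead of using libicu/libunistring word breaks, and uses a made-up stop-word list. All function names are hypothetical.

```shell
#!/bin/sh
# Hypothetical stop-word list, for illustration only.
STOP_WORDS="the a an of and"

# Split stdin into lowercase tokens, one per line (ASCII-only word breaks).
tokenize() {
    tr '[:upper:]' '[:lower:]' | tr -cs '[:alnum:]' '\n' | grep -v '^$'
}

# Index-time path: tokenize, then drop stop words.
index_tokens() {
    tokenize | while IFS= read -r tok; do
        case " $STOP_WORDS " in
            *" $tok "*) ;;                  # stop word: not registered
            *) printf '%s\n' "$tok" ;;
        esac
    done
}

# Query-time path: same tokenizer, stop words are kept.
query_tokens() {
    tokenize
}

# Example:
#   printf 'The Hacking Team' | index_tokens   # hacking, team
#   printf 'The Hacking Team' | query_tokens   # the, hacking, team
```

Comparing the two token lists for the same phrase is the kind of check the debug patch mentioned below is meant to make possible against the real tokenizer.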


Sadly, there's barely any debug output in this process, as it happens *a lot*. I'm attaching a hack/patch that might help figure out what FTS sees during this process. Just applying the patch and searching with the compiled "tracker search" CLI command will print info about how text is tokenized in both the search terms and the indexed content.

I do hope this helps pinpoint the problem. In the worst case, it will point further down to sqlite/fts5 if everything looks correct here...

Comment 13 Carlos Garnacho 2017-01-07 21:57:07 UTC

Created attachment 343104
HACK: print tokenizing info

Comment 14 jc 2017-01-08 14:58:09 UTC

Created attachment 343122
Tokens

Here is the output, although I have no idea how to interpret it.

Comment 15 L Holland 2017-10-18 17:39:24 UTC

FWIW I think I may be seeing the same or a closely related issue. I have numerous PDFs for which FTS only returns results for some of the text. My results seem similar to jc's: indexing appears to have occurred correctly, and running the SQL query provided above returns a row that looks to contain the full text of the PDF in question, and certainly contains words for which 'tracker search' fails to return the document.

I'm on Arch Linux (well, Antergos, but I doubt the difference is relevant to this issue), using GNOME 3.26 under Wayland with GDM.

I'm in an en_GB locale, so accented characters etc. *shouldn't* be a major factor. One thing I've noticed is that some search terms *do* work for these documents. It's hard to be completely sure, but it appears that only the first part of the document is properly searchable: search terms that only appear earlier in the file generate results, but those from later on do not.

Comment 16 Daniel Nicolai 2019-01-11 12:53:47 UTC

FWIW I also have the issue as reported in comment 15 by L Holland. Tracker search does not return results for many words that are indexed correctly. I also straced the command as in comment 9, which returned no output for libunitring and the following output for libicu:

openat(AT_FDCWD, "/usr/lib64/tracker-2.0/libicui18n.so.62", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, "/lib64/libicui18n.so.62", O_RDONLY|O_CLOEXEC) = 3
openat(AT_FDCWD, "/usr/lib64/tracker-2.0/libicuuc.so.62", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, "/lib64/libicuuc.so.62", O_RDONLY|O_CLOEXEC) = 3
openat(AT_FDCWD, "/usr/lib64/tracker-2.0/libicudata.so.62", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, "/lib64/libicudata.so.62", O_RDONLY|O_CLOEXEC) = 3


I created symbolic links from the files in the /usr/lib64/tracker-2.0/ directory to the corresponding files in the /usr/lib64/ directory and then reindexed the file but it still does not work.

Indeed, it seems as if tracker only finds words at the beginning of the nie:plainTextContent.

I am on Fedora 29 with GNOME 3.30.2, and locale.conf says LANG="en_US.UTF-8"
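
The observation in comments 15 and 16, that only the beginning of the text seems searchable, could be probed by sampling words at regular positions in the extracted text and searching for each one. The sketch below only does the sampling. The function name and the content.txt file are hypothetical, and it assumes the plain text has been saved to a file first, for example from the FTS row queried in comment 3.

```shell
#!/bin/sh
# Print "<word position> <word>" for every Nth word of the text on stdin.
# $1 is the sampling step N (default 500). ASCII-only word splitting.
sample_words() {
    tr -cs '[:alnum:]' '\n' \
        | grep -v '^$' \
        | awk -v step="${1:-500}" '(NR - 1) % step == 0 { print NR, $0 }'
}

# Usage (needs the tracker CLI, so it is left commented out):
#   sample_words 500 < content.txt | while read -r pos word; do
#       echo "== word $pos: $word"
#       tracker search -l 100 --disable-snippets "$word"
#   done
```

If the document stops appearing in the results beyond some word position, that position gives a rough idea of where the searchable part of the text ends.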

Comment 17 Sam Thursfield 2021-05-26 22:24:33 UTC

GNOME is going to shut down bugzilla.gnome.org in favor of gitlab.gnome.org.
As part of that, we are mass-closing older open tickets in bugzilla.gnome.org
which have not seen updates for a longer time (resources are unfortunately
quite limited so not every ticket can get handled).

If you can still reproduce the situation described in this ticket in a recent
and supported software version, then please follow
  https://wiki.gnome.org/GettingInTouch/BugReportingGuidelines
and create a new enhancement request ticket at
  https://gitlab.gnome.org/GNOME/tracker/-/issues/

Thank you for your understanding and your help.