GNOME Bugzilla – Bug 776395
Indexed files missing from search results
Last modified: 2021-05-26 22:24:33 UTC
For some reason, some files are correctly indexed but not shown in the search results. See the enclosed screenshot:
- The settings show the indexed paths.
- I search for a term, with a limit of 100 files.
- The expected file (and around ten others) does not show up in the results.
- Checking the indexing of this file manually shows that it worked as expected.
So there seems to be a problem in the output given by tracker search (and needle). (Note that I am using Recoll as a fallback solution, and it has no problem finding the same file.)
Created attachment 342388 [details] Screenshot of failed search against an indexed file
I searched for several other terms taken at random from the affected documents, with the same result: these files appear to be entirely invisible to full-text search.
I'm curious what the full-text search table knows about these files. To find out, we first need the file's URN:

  $ tracker info /path/to/file | head -n 2

This should give you the urn:uuid:$UUID string, which is a unique identifier for the given file. Next, check the FTS table directly:

  $ tracker sql -q "select * from fts5 where rowid = (select ID from Resource where uri='urn:uuid:...')"

That should give you the contents of the FTS table for that particular file. Among the many columns, for documents I'd normally expect the filename, the document text content and the title.

FWIW, I think this was a one-time glitch that has persisted in the database. Most probably, either resetting the file:

  $ tracker reset -f /path/to/file

or rebuilding the FTS tokenization data from scratch:

  $ tracker sql -q "INSERT INTO fts5(fts5) VALUES ('rebuild')"

will get the issue fixed. It would be nice to do some forensics, though.
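For anyone who wants to repeat the two lookup steps on several files, here is a minimal Python sketch. It only wraps the `tracker info` and `tracker sql -q` commands quoted in this comment. The helper names (`extract_urn`, `build_fts_query`, `inspect_fts_row`) are made up for illustration, and the sketch assumes the URN appears somewhere in the `tracker info` output as a plain `urn:uuid:...` string.

```python
import re
import subprocess

# Assumption: the URN is printed as a plain "urn:uuid:<hex and dashes>" string.
URN_RE = re.compile(r"urn:uuid:[0-9a-fA-F-]+")


def extract_urn(info_output):
    """Return the first urn:uuid:... string in `tracker info` output, or None."""
    match = URN_RE.search(info_output)
    return match.group(0) if match else None


def build_fts_query(urn):
    """Build the FTS lookup quoted in this comment for one resource URN."""
    if not URN_RE.fullmatch(urn):
        # Refuse anything that is not a bare URN before embedding it in SQL.
        raise ValueError("not a urn:uuid string: %r" % (urn,))
    return ("select * from fts5 where rowid = "
            "(select ID from Resource where uri='%s')" % urn)


def inspect_fts_row(path):
    """Run both steps for one file and return the raw FTS row text.

    Requires the `tracker` CLI on PATH; raises if either command fails.
    """
    info = subprocess.run(["tracker", "info", path],
                          capture_output=True, text=True, check=True).stdout
    urn = extract_urn(info)
    if urn is None:
        raise LookupError("no urn:uuid found for %s" % path)
    result = subprocess.run(["tracker", "sql", "-q", build_fts_query(urn)],
                            capture_output=True, text=True, check=True)
    return result.stdout
```

Calling `print(inspect_fts_row("/path/to/file"))` should print the same row as the manual commands, provided the assumptions above hold on your Tracker version.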
Yes, there are many columns including filename, title and text: XMCO-ActuSecu-42-Securite_Imprimantes.pdf | (null) | (null) | (null) | (null) | (null) | (null) | (null) | (null) | (null) | (null) | (null) | (null) | (null) | (null) | (null) | (null) | (null) | (null) | (null) | (null) | (null) | (null) | (null) | (null) | (null) | (null) | (null) | (null) | (null) | (null) | (null) | (null) | (null) | (null) | (null) | (null) | (null) | (null) | (null) | (null) | (null) | (null) | actu sécu 42 l’ACTUSÉCU est un magazine numérique rédigé et édité par les consultants du cabinet de conseil XMCO NOVEMBRE 2015 [ ...rest of content ...] | (null) | (null) | (null) | (null) | (null) None of the two commands fixed the issue...
Does it also happen with a newly created file if you copy the document? In that case I'd appreciate it if you sent a specimen to my email address... I can dispose of the file after checking the problem if you wish.
So, the same issue happens if I copy the file to test.pdf, for instance. No problem with sending you the file as it is non confidential, public information.
Did you receive my PDF? Even if you cannot fix it, I am interested to know whether it got indexed in your environment or not.
Sorry... I did receive it, and it indeed seems correctly indexed and searchable here (modulo the d' contractions). I think this can only be down to:

1) Tokenizer backend issues: Tracker has libicu and libunistring backends. libicu is by far the more popular across distros, and it just occurred to me that it's the one I've tested. Maybe yours uses libunistring by default?

2) Other locale-dependent issues: things like unaccenting, word stemming and stop words are locale dependent to some extent. Do you use fr_FR.UTF-8 or anything more exotic?

I will recompile with libunistring and double check things though.
Hey, no problem at all, thank you. Now I know that I have to try harder.

1) It is libicu; I verified it with strace:

  % strace tracker search -l 100 --disable-snippets "hacking+team" 2>&1 | grep -i libicu
  open("/usr/lib64/tracker-1.0/libicui18n.so.57", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
  open("/lib64/libicui18n.so.57", O_RDONLY|O_CLOEXEC) = 3
  open("/usr/lib64/tracker-1.0/libicuuc.so.57", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
  open("/lib64/libicuuc.so.57", O_RDONLY|O_CLOEXEC) = 3
  open("/usr/lib64/tracker-1.0/libicudata.so.57", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
  open("/lib64/libicudata.so.57", O_RDONLY|O_CLOEXEC) = 3
  % strace tracker search -l 100 --disable-snippets "hacking+team" 2>&1 | grep -i libunitring
  %

2) It is fr_FR.UTF-8:

  % echo $LANG
  fr_FR.UTF-8
  %

I will run more tests with different distros and that file, and let you know if it works somewhere.
Concerning (1), just realized that it is not finding the lib (I need to wake up). So, testing now with symbolic links from /usr/lib64. I probably need to re-index?
After more testing:
- on my main machine, after reindexing the file,
- on a fresh Fedora 25 machine,
- on a fresh Ubuntu 16.04 system (note that here it seems to be compiled with libunistring),

the result is consistent:
- the file is indexed and you can search for single words inside it, for instance "hacking";
- searching with several terms fails, for instance "hacking team".

Isn't there something wrong with the way the space is encoded?
That's weird, because word breaks are found in libicu/libunistring-specific ways, so I wouldn't expect both to be broken in exactly the same way. The workings are:

On the indexed content:

* Tracker registers a custom tokenizer for the FTS table:
  https://git.gnome.org/browse/tracker/tree/src/libtracker-fts/tracker-fts-tokenizer.c#n391
* When text in the FTS table is processed, the tokenizer is called with the full text in the specific row/column:
  https://git.gnome.org/browse/tracker/tree/src/libtracker-fts/tracker-fts-tokenizer.c#n88
* The tokenizing function splits by word break (discarding stop words on the way), and calls back into the given FTS function to register each token individually:
  https://git.gnome.org/browse/tracker/tree/src/libtracker-fts/tracker-fts-tokenizer.c#n132
* The tracker_parser_next() function used to iterate through the extracted tokens is the one with libicu/libunistring implementations, and looks for word breaks in library-specific ways:
  https://git.gnome.org/browse/tracker/tree/src/libtracker-common/tracker-parser-libicu.c#n429
  https://git.gnome.org/browse/tracker/tree/src/libtracker-common/tracker-parser-libunistring.c#n333

On the search terms:

* Exactly the same tokenizing function is used to process the given search terms, so sqlite in the end compares individual tokens that are split and preprocessed the same way. There's only one exception here: we don't bother to filter stop words from the given search terms:
  https://git.gnome.org/browse/tracker/tree/src/libtracker-fts/tracker-fts-tokenizer.c#n105

Sadly, there's barely any debug output in this process, as it happens *a lot*. I'm attaching a hack/patch that might help figure out what FTS sees during this process. Just applying the patch and searching with the compiled "tracker search" CLI command will print info about how text is tokenized in both the search terms and the indexed content.

I do hope this helps pinpoint the problem. In the worst case it will point further down to sqlite/fts5, if everything looks correct here...
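As a rough illustration of the flow described in this comment (one tokenizer for both the indexed text and the search terms, with stop words dropped only on the indexing side), here is a toy Python model. It is not Tracker's code: the real tokenizer finds word breaks through libicu or libunistring, whereas this sketch splits on a `\w+` regex, and the stop word list, the unaccenting step and all function names are assumptions made up for the example.

```python
import re
import unicodedata

# Hypothetical stop word list; Tracker's real lists are locale dependent.
STOP_WORDS = frozenset({"a", "an", "the", "of", "and"})


def _normalize(word):
    """Case-fold and strip combining accents (a crude stand-in for unaccenting)."""
    decomposed = unicodedata.normalize("NFKD", word.casefold())
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def tokenize(text, drop_stop_words):
    """Split text into normalized tokens; optionally discard stop words."""
    tokens = [_normalize(word) for word in re.findall(r"\w+", text)]
    if drop_stop_words:
        tokens = [tok for tok in tokens if tok not in STOP_WORDS]
    return tokens


def build_index(documents):
    """Indexing side: map each document name to its token set, minus stop words."""
    return {name: set(tokenize(text, drop_stop_words=True))
            for name, text in documents.items()}


def search(index, query):
    """Query side: same tokenizer, stop words kept; every term must match."""
    terms = tokenize(query, drop_stop_words=False)
    return sorted(name for name, tokens in index.items()
                  if terms and all(term in tokens for term in terms))
```

In this toy model a two-word query such as "hacking team" matches as long as both tokens were registered for the document, so a space in the query is not special in itself. It also shows one side effect of the asymmetry: a query term that is a stop word can never match here. Whether either point is relevant to the behaviour reported in this bug is not established in the thread.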
Created attachment 343104 [details] [review] HACK: print tokenizing info
Created attachment 343122 [details] Tokens
Here is the output, although I have no idea how to interpret it.
FWIW I think I may be seeing the same or a closely related issue. I have numerous PDFs for which FTS only returns results for some of the text. My results seem similar to JC's: indexing appears to have occurred correctly, and running the SQL query provided above returns a row that looks to contain the full text of the PDF in question, and certainly contains words for which 'tracker search' fails to return the document in question.

I'm on Arch Linux (well, Antergos, but I doubt the difference is relevant to this issue), using GNOME 3.26 under Wayland with GDM. I'm in an en_GB locale, so accented characters etc. *shouldn't* be a major factor.

One thing that I've noticed is that some search terms *do* work for these documents. It's hard to be completely sure, but it appears that only the first part of the document is properly searchable: search terms that only appear earlier in the file generate results, but those from later on do not.
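One way to test the "only the first part is searchable" impression more systematically is to walk through the document's words in order and note the first one that search fails to find. The Python sketch below does that. The function names are hypothetical, and `tracker_finds` simply shells out to the `tracker search -l 100 --disable-snippets` invocation used earlier in this bug and checks whether the file name appears in the output, which is an assumption about the output format. A word that is also common in many other files could push the target file past the 100-result limit, so treat a miss on a common word with care.

```python
import re
import subprocess


def distinct_words(text, min_length=5):
    """Return words in order of first appearance, skipping short ones.

    Short words are skipped because they are more likely to be stop words.
    """
    seen = set()
    ordered = []
    for word in re.findall(r"\w+", text.lower()):
        if len(word) >= min_length and word not in seen:
            seen.add(word)
            ordered.append(word)
    return ordered


def first_miss(words, is_searchable):
    """Return the index of the first word the search fails on, or None."""
    for position, word in enumerate(words):
        if not is_searchable(word):
            return position
    return None


def tracker_finds(term, filename):
    """Assumed check: does `tracker search` list `filename` for `term`?"""
    result = subprocess.run(
        ["tracker", "search", "-l", "100", "--disable-snippets", term],
        capture_output=True, text=True)
    return filename in result.stdout
```

With the extracted text of a document in `text`, something like `first_miss(distinct_words(text), lambda w: tracker_finds(w, "test.pdf"))` gives the position of the first unsearchable word. If that position is similar across affected files, it would support the idea of a cutoff somewhere in the indexed content.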
FWIW I also have the issue as reported in comment 15 by L Holland. Tracker search does not return results for many words that are indexed correctly. I also straced the command as in comment 9, which returned no output for libunitring and the following output for libicu:

  openat(AT_FDCWD, "/usr/lib64/tracker-2.0/libicui18n.so.62", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
  openat(AT_FDCWD, "/lib64/libicui18n.so.62", O_RDONLY|O_CLOEXEC) = 3
  openat(AT_FDCWD, "/usr/lib64/tracker-2.0/libicuuc.so.62", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
  openat(AT_FDCWD, "/lib64/libicuuc.so.62", O_RDONLY|O_CLOEXEC) = 3
  openat(AT_FDCWD, "/usr/lib64/tracker-2.0/libicudata.so.62", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
  openat(AT_FDCWD, "/lib64/libicudata.so.62", O_RDONLY|O_CLOEXEC) = 3

I created symbolic links from the files in the /usr/lib64/tracker-2.0/ directory to the corresponding files in the /usr/lib64/ directory and then reindexed the file, but it still does not work. Indeed, it seems as if tracker only finds words at the beginning of the nie:plainTextContent.

I am on Fedora 29 with GNOME 3.30.2, and locale.conf says LANG="en_US.UTF-8".
GNOME is going to shut down bugzilla.gnome.org in favor of gitlab.gnome.org. As part of that, we are mass-closing older open tickets in bugzilla.gnome.org which have not seen updates for a longer time (resources are unfortunately quite limited so not every ticket can get handled). If you can still reproduce the situation described in this ticket in a recent and supported software version, then please follow https://wiki.gnome.org/GettingInTouch/BugReportingGuidelines and create a new enhancement request ticket at https://gitlab.gnome.org/GNOME/tracker/-/issues/ Thank you for your understanding and your help.