GNOME Bugzilla – Bug 790284
Tracker should not exclude folders including .git by default
Last modified: 2021-05-26 22:26:13 UTC
This has already been raised several times, e.g. in bug #775949. garnacho and I had a long debugging session on IRC spanning multiple days, hence this bug.

The problem lies in Tracker excluding, by default, any directory that contains a .git subfolder. My setup includes:

* my $HOME folder contains a .git repo. This is to keep config files like my .emacs and stuff in .config synchronized across multiple machines.
* several folders under Documents/ contain LaTeX-generated PDFs. The LaTeX sources are under version control. I was wondering why these PDFs were not showing up in GNOME Documents and such.

What would happen, since my $HOME was excluded, was that *all* subfolders and files were removed at start, even from folders such as Video/, Music/, Documents/, etc. This was causing a full reindex upon each login. Given I have tens of thousands of files, for about 3TB of data, the whole process thrashed my disk for 1 hour at each login, rendering the UI unresponsive at times.

I wonder if excluding indexing sources this way is really a good idea. It prevents legitimate files (e.g. PDFs generated from LaTeX sources inside Documents/) from being found. It also causes unwanted and poorly documented behavior for subfolders.

...and the user can still download a tar.gz file of the sources and unpack it inside Downloads/, without a .git directory inside, which would then be indexed nevertheless. A checkout of GCC, for instance, would use .svn and not .git. So you cannot easily catch all cases this way no matter what.

I would rather say that we need an XDG folder for source files (this was added in xdg-user-dirs 0.0.4): https://www.freedesktop.org/wiki/Software/xdg-user-dirs/, and just index everything in the other XDG folders by default.

Spotlight (Mac OS X) and Windows Search don't try to be clever this way: if something is under an indexed directory, it gets indexed and that's it. If you want to exclude stuff, you have to do it by hand.

At the very least, if a folder is directly specified in the list of those to be indexed, do it even if it contains a .git folder. That would have solved the problem for me that I have .git in my $HOME (not the one for PDFs under Documents, though).
Created attachment 363494
tests: Add test for ignored content in configured folders

There are some extreme cases where content filters in one configured root result in deletes that leak through nested configured roots. Add a test case to catch this situation.
Created attachment 363495
libtracker-miner: Avoid triggering content filters on configured roots

Folders being configured as indexing roots should win over any filter that might apply. The basename-based filters correctly skip configured roots already, so do the same with the directory content filter.

The practical side effect is that .git folders are now allowed on the directories configured in tracker-miner-fs (homedir and XDG dirs most usually). Tracker tries to stay out of source code trees, which are a source of pointless grinding, but there are legit use cases to have these folders under git management:

- User setups to bring in essential files across machines
- Collections managed through git-annex

Those are worth handling, even if the question also applies to folders found recursively and the .git heuristic proves limited.
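
For illustration only, the rule in this commit message can be sketched in a few lines of Python. This is not the libtracker-miner code (which is written in C); the function name, the `configured_roots` parameter and the marker set are hypothetical stand-ins. The two markers are the values named in this thread for ignored-directories-with-content.

```python
from pathlib import Path

# Marker names taken from this thread; the set itself is an assumption
# about how the filter is configured.
CONTENT_MARKERS = {".git", ".trackerignore"}


def is_ignored_by_content(directory, configured_roots):
    """Return True if `directory` should be skipped because of what it contains.

    `configured_roots` is the set of folders explicitly configured for
    indexing (hypothetical parameter, e.g. $HOME and the XDG dirs).
    """
    directory = Path(directory)
    if directory in configured_roots:
        # Configured roots win over the directory content filter.
        return False
    # Any other directory is skipped as soon as it holds one of the markers.
    return any((directory / marker).exists() for marker in CONTENT_MARKERS)
```

With `configured_roots = {home}`, a `.git` directly inside `home` no longer excludes it, while a nested folder such as `home/Documents/thesis` that contains `.git` is still skipped. That matches the "not the one for PDFs under Documents" caveat in comment #0.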
Thanks Matteo for the bug, you raise fine points :). You're right, the .git filter indeed proves too wide a shot, and insufficient in some other cases... I actually sprinkle some untarred kernel trees when I want to stress test something in Tracker :P.

There's some reasons why I think it's good to cut down code trees if possible:

- The obvious grinding involved in having those indexed
- The regular tracker-miner-fs startup case (with files already indexed and up-to-date) is O(n) in the number of directories.
- Many inotify handles make things progressively worse, too...
- Many text files involve work tokenizing those for full-text search, and disk space in the database.
- A bigger database results in more seeks and slower response times.
- Last but not least, Tracker as-is would be a rather bland code search tool, IMHO it's ok not to pretend to be one :). A proper one could be developed on top of Tracker libraries, but that's out of scope.

The kernel tree case dutifully explodes those, IIRC I just need 2 of them to run out of inotify handles, the database grows into several GBs while it usually stays at some hundreds megs, and there's close to 100% chance that any 1-2 words search term I come up with has a match in kernel files. Git trees have the added drawback that "git checkout" can result in a massive number of file ops left for tracker-miner-fs/tracker-extract to handle.

Sticking to the minimum here, I think it would be good to at least cut down on directory monitors and the plain text content to insert in the database, even if files end up indexed. For the latter we could have a more specific mimetype filter for the tracker-extract text module than text/* so we skip source code files. However, adding monitors is currently an all-or-nothing option, so there's not much we can do without the help of filters. Or perhaps the other way around: shrug those off, and make it easy for the user to ignore specific folders.

The other value in ignored-directories-with-content is the more obscure .trackerignore, which is basically an easter egg to be able to keep Tracker away from certain folders. Perhaps we can implement/announce it more properly as a "tracker config ignore <folder>" subcommand.

(In reply to Matteo Settenvini from comment #0)
> What would happen, since my $HOME was excluded, was that *all* subfolders
> and files were removed at start, even from folders such as Video/, Music/,
> Documents/, etc. This was causing a full reindex upon each login. Given I
> have tens of thousands of files, for about 3TB of data, the whole process
> thrashed my disk for 1 hour at each login, rendering the UI unresponsive at
> times.

There is one additional problem here. tracker-miner-fs should consider configured folders as standalone entities, the recursive delete breaks this invariant. A more correct behavior would have been deleting all content from $HOME, but preserving the one from XDG folders.

> At the very least, if a folder is directly specified in the list of those to
> be indexed, do it even if it contains a .git folder. That would have solved
> the problem for me that I have .git in my $HOME (not the one for PDFs under
> Documents, though).

The patches attached so far address this, I intend to make a release today with this minimal handling in.
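
To make the "standalone entities" invariant concrete, here is a rough Python sketch of which indexed paths would be dropped when one configured root gets excluded. It only illustrates the behavior described above, under the assumption that the index can be treated as a flat list of paths; `paths_to_delete` and its parameters are hypothetical names, not Tracker API.

```python
from pathlib import PurePosixPath


def _is_under(path, root):
    """True if `path` is `root` itself or lies somewhere below it."""
    return path == root or root in path.parents


def paths_to_delete(indexed, excluded_root, configured_roots):
    """Select the indexed paths to drop when `excluded_root` becomes ignored.

    Content that belongs to another configured root nested inside
    `excluded_root` is preserved, since each configured root is treated
    as a standalone entity.
    """
    excluded_root = PurePosixPath(excluded_root)
    roots = [PurePosixPath(r) for r in configured_roots]
    # Configured roots that live strictly inside the excluded one.
    nested_roots = [r for r in roots if excluded_root in r.parents]

    doomed = []
    for raw in indexed:
        path = PurePosixPath(raw)
        if not _is_under(path, excluded_root):
            continue
        if any(_is_under(path, nested) for nested in nested_roots):
            continue  # owned by a nested configured root, keep it
        doomed.append(str(path))
    return doomed
```

With roots `/home/u`, `/home/u/Documents` and `/home/u/Music`, excluding `/home/u` would drop `/home/u/.emacs` but keep everything under `Documents/` and `Music/`, so no full reindex of the XDG folders would be needed.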
(In reply to Carlos Garnacho from comment #3)
> There's some reasons why I think it's good to cut down code trees if
> possible:
>
> [...]
>
> The kernel tree case dutifully explodes those, IIRC I just need 2 of them to
> run out of inotify handles, the database grows into several GBs while it
> usually stays at some hundreds megs, and there's close to 100% chance that
> any 1-2 words search term I come up with has a match in kernel files. Git
> trees have the added drawback that "git checkout" can result in a massive
> number of file ops left for tracker-miner-fs/tracker-extract to handle.

It appears, then, that the problem lies more in the size of these trees than in their nature as git checkouts. A more generic approach, viable both for checkouts and for unpacked plain source tarballs, could be to use different heuristics to detect whether a folder should be excluded.

Unfortunately, the general case is rather unsolvable: detecting the source root via the presence of a .git folder, a CMakeLists.txt, a configure.ac, a .svn folder, a Makefile, a Cargo.toml file... soon risks being too lax, or excluding useful things (of course .git folders and .svn folders themselves should still be skipped). I am not aware of good heuristics to decide whether a folder corresponds to a project or not. It would probably only be safe to let the user decide, and to exclude a list of extensions known to be associated with source files.

A different approach might be to use a mixed set of rules: exclude a folder only if it is not explicitly listed among the directories to index, it contains a .git or .svn folder, and at the same time it contains (recursively) more than, say, 100 files or 20 folders. This would still allow small projects to be indexed, which are usually those inside Documents/. It is probably a slightly better compromise than the current one.

> There is one additional problem here. tracker-miner-fs should consider
> configured folders as standalone entities, the recursive delete breaks this
> invariant. A more correct behavior would have been deleting all content from
> $HOME, but preserving the one from XDG folders.

Yes, that was surprising to me. Also, some more logging at a higher debug verbosity saying why some folders are ignored would not hurt.

> > At the very least, if a folder is directly specified in the list of those to
> > be indexed, do it even if it contains a .git folder. That would have solved
> > the problem for me that I have .git in my $HOME (not the one for PDFs under
> > Documents, though).
>
> The patches attached so far address this, I intend to make a release today
> with this minimal handling in.

Thanks for that!
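
As a rough sketch of the mixed rule proposed in this comment (not anything Tracker implements), the check could look like the Python below. The 100-file and 20-folder thresholds are the "say" values from the comment; `should_exclude`, `exceeds_size` and `configured_roots` are hypothetical names chosen for the example.

```python
import os
from pathlib import Path

VCS_MARKERS = (".git", ".svn")
MAX_FILES = 100    # example threshold from the comment above
MAX_FOLDERS = 20   # example threshold from the comment above


def exceeds_size(directory):
    """Count files and folders recursively, stopping once a limit is passed."""
    files = folders = 0
    for _root, dirnames, filenames in os.walk(directory):
        # Skip the VCS metadata itself so it does not inflate the counts.
        dirnames[:] = [d for d in dirnames if d not in VCS_MARKERS]
        folders += len(dirnames)
        files += len(filenames)
        if files > MAX_FILES or folders > MAX_FOLDERS:
            return True
    return False


def should_exclude(directory, configured_roots):
    """Apply the mixed rule: marker present AND tree is large AND not a root."""
    directory = Path(directory)
    if directory in configured_roots:
        return False  # explicitly listed folders are always indexed
    if not any((directory / marker).is_dir() for marker in VCS_MARKERS):
        return False  # no .git or .svn folder: not treated as a source tree
    return exceeds_size(directory)
```

A small LaTeX project under Documents/ with a `.git` folder and a handful of files would still be indexed, while a kernel-sized checkout would be skipped. The recursive count has a cost of its own, which is one reason to stop walking as soon as a limit is exceeded.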
Attachment 363494 pushed as e810f4c - tests: Add test for ignored content in configured folders
Attachment 363495 pushed as 3e040c7 - libtracker-miner: Avoid triggering content filters on configured roots
GNOME is going to shut down bugzilla.gnome.org in favor of gitlab.gnome.org. As part of that, we are mass-closing older open tickets in bugzilla.gnome.org which have not seen updates for a longer time (resources are unfortunately quite limited so not every ticket can get handled). If you can still reproduce the situation described in this ticket in a recent and supported software version, then please follow https://wiki.gnome.org/GettingInTouch/BugReportingGuidelines and create a new enhancement request ticket at https://gitlab.gnome.org/GNOME/tracker/-/issues/ Thank you for your understanding and your help.