GNOME Bugzilla – Bug 169222
Remove files from index when a FSQ root is removed
Last modified: 2018-07-03 09:52:46 UTC
Beagle should filter out hits that are in .noindex directories. This situation arises when you add a .noindex file to a directory after Beagle has already indexed it (for example because it produces too many irrelevant hits, like a directory of dictionary files lying around somewhere in your downloads tree). Currently, Beagle doesn't remove those files from the index.

Jon Trowbridge on this issue in #dashboard:

19:32:21 < uws> Will beagle delete stuff from the index after adding a .noindex file to a directory after that directory was indexed?
19:45:03 < trow> uws: No, it won't. We really should filter out hits on .noindexed files at query-time... that would cause us to queue up a delete for that index item.
19:45:20 < uws> trow: "should"
19:45:26 < uws> trow: That means it's not implemented yet :(
19:45:35 < trow> uws: Exactly. Could you file a bug about that?
19:45:45 < trow> uws: It wouldn't really be that hard to do.
19:45:54 < trow> uws: I'll just forget about it if it isn't in bugzilla.
19:46:06 < uws> trow: But it involves querying the filesystem for each hit...
19:47:10 < trow> We already have to query file filesystem for every (filesystem) hit, because we don't want to show hits for files that no longer exist. And we cache a lot of the .noindex information in memory, so I think we can do the check pretty efficiently.

Thanks.
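For illustration only, here is a rough sketch of the query-time check described in the chat above. It is written in Python rather than Beagle's C#, and every name in it (dir_is_noindexed, filter_hits, queue_delete) is hypothetical. It also assumes that a .noindex file applies to the whole subtree below it, which the report does not state.

```python
import os
from functools import lru_cache


@lru_cache(maxsize=None)
def dir_is_noindexed(directory: str) -> bool:
    """True if `directory` or any ancestor contains a .noindex file.

    Assumption: .noindex applies recursively. Results are cached in
    memory so repeated hits in the same directory cost one lookup.
    """
    if os.path.exists(os.path.join(directory, ".noindex")):
        return True
    parent = os.path.dirname(directory)
    if parent == directory:  # reached the top, nothing found
        return False
    return dir_is_noindexed(parent)


def filter_hits(hit_paths, queue_delete):
    """Yield only the hits that may be shown.

    Hits whose file is gone or sits under a .noindex directory are
    passed to `queue_delete` (a hypothetical callback) instead.
    """
    for path in hit_paths:
        if not os.path.exists(path) or dir_is_noindexed(os.path.dirname(path)):
            queue_delete(path)
        else:
            yield path
```

Note that a cache like this goes stale in exactly the case this bug is about (a .noindex file added after indexing), so it would have to be invalidated when such a file appears, for example with dir_is_noindexed.cache_clear().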
I've had a go at this. The fix is relatively simple but it highlights a problem with our caching. Will investigate more sometime soon!
Just to update this bug with progress: Fredrik and I are working on a config file for configuring beagle, and on dropping the .noindex / .neverindex functionality altogether to reduce complexity.
Once you get the .noindex stuff revamped, feel free to tackle this if you have any extra time.
There are a number of situations that FSQ needs to handle appropriately. Here's a list of things I've come up with so far.

- We need to be able to remove roots on-the-fly.
- If a root is removed we need to forget all the indexes on that root. (Or do we just modify HitIsValid to check that a hit is within a root? I think the former...)
- If we ignore a pattern, we need to forget all the indexes on affected files. (Or, again, do we do this through HitIsValid? Not sure what is best here..)
- If we unignore a pattern, we need to recrawl stuff. How much? Everything?
- If a root is added, we need sanity checks, like:
  1. does it have EAs
  2. is it inside another root
  3. is another root inside it
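A minimal sketch of sanity checks 2 and 3 (root inside another root, another root inside the new one), in Python for illustration rather than Beagle's C#. The function name check_new_root is hypothetical, and the EA check is left out because it depends on the filesystem.

```python
from pathlib import Path


def check_new_root(new_root: str, existing_roots: list[str]) -> list[str]:
    """Return a list of problems with adding `new_root`; empty means OK."""
    new = Path(new_root).resolve()
    problems = []
    for root in existing_roots:
        old = Path(root).resolve()
        if new == old:
            problems.append(f"{new} is already a root")
        elif old in new.parents:
            # The new root lies somewhere below an existing one.
            problems.append(f"{new} is inside existing root {old}")
        elif new in old.parents:
            # An existing root lies somewhere below the new one.
            problems.append(f"existing root {old} is inside {new}")
    return problems
```

Comparing resolved path components rather than raw string prefixes avoids treating /home/u/docs2 as being inside /home/u/docs.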
When removing a root, or adding an ignore pattern/path, we should immediately drop all indexes on matching hits. (This may be easiest to achieve by firing a special query, and letting HitIsValid do the hard work) When removing an ignore pattern, we should mark the entire tree as dirty. When removing an ignore path, we should mark the affected path as dirty.
Created attachment 48990 [details] [review] Root dropping

When we drop a root, immediately remove it from the indexes and remove the file attributes.
With regard to the root dropping patch, we decided that flooding the scheduler is a bad idea; we should create some sort of periodic optimization routine instead:

<trow> yeah, just some regularly-scheduled index maintenance
<dsd> so basically... go over every file in the index, check HitIsValid, remove if not valid
<trow> Yeah, something like that.
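The maintenance pass described in that exchange could look roughly like the sketch below. This is Python for illustration, not Beagle code: hit_is_valid is only a stand-in for HitIsValid, and the index is assumed to be a plain dict keyed by path. Working in batches is one way to avoid flooding the scheduler, since the caller decides when to run the next batch.

```python
def hit_is_valid(path, roots, is_excluded, exists):
    """Stand-in for HitIsValid: under a root, not excluded, still on disk."""
    under_root = any(
        path == r or path.startswith(r.rstrip("/") + "/") for r in roots
    )
    return under_root and not is_excluded(path) and exists(path)


def validate_contents(index, roots, is_excluded, exists, batch_size=100):
    """Walk the index in batches and remove entries that are no longer valid.

    Implemented as a generator: after each batch it yields the number of
    entries removed, so a scheduler can interleave other work.
    """
    paths = list(index)  # snapshot, so entries can be deleted while walking
    for start in range(0, len(paths), batch_size):
        removed = 0
        for path in paths[start:start + batch_size]:
            # The entry may have been removed elsewhere between batches.
            if path in index and not hit_is_valid(path, roots, is_excluded, exists):
                del index[path]
                removed += 1
        yield removed
```

A single pass like this would cover root removal, newly excluded paths, and newly excluded patterns, because all three show up as entries that fail the validity check.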
Created attachment 49336 [details] [review] Exclude pattern dropping

- When we drop an exclude pattern, we need to recrawl the entire FS tree to pick up those files that we previously ignored
- DirectoryPrivate.SetAllToUnknown_Unlocked needs to consider the situation where there are no children
- When examining directory children, if ScanOne_Unlocked finds that we already know about the child, it should check the state of the child to see if it needs a scan anyway.
- FSM.SetAllToUnknown should fire off a scan request so that everything gets rescanned/recrawled

The end result of this is that FSM.SetAllToUnknown now does the right thing, rather than not doing much at all. This means that inotify queue overflows will now be handled correctly.
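To make the two directory-model points above concrete, here is a small Python sketch. It is only loosely modeled on the description in this comment and is not the code from the attachment: the Dir class, the two states, and scan_one are all hypothetical simplifications.

```python
from dataclasses import dataclass, field

CLEAN, UNKNOWN = "clean", "unknown"


@dataclass
class Dir:
    name: str
    state: str = CLEAN
    children: dict = field(default_factory=dict)

    def set_all_to_unknown(self):
        """Mark this directory and every descendant as needing a scan."""
        self.state = UNKNOWN  # applies even when there are no children
        for child in self.children.values():
            child.set_all_to_unknown()


def scan_one(directory, names_on_disk, scan_queue):
    """Examine a directory's children and queue those that need scanning."""
    for name in names_on_disk:
        child = directory.children.get(name)
        if child is None:
            # Newly discovered child: start it in the unknown state.
            child = directory.children[name] = Dir(name, UNKNOWN)
        # A child we already know about is not skipped outright;
        # its state decides whether it gets scanned anyway.
        if child.state != CLEAN:
            scan_queue.append(child)
    directory.state = CLEAN
```

With this shape, marking the whole tree unknown and then scanning from the top causes every directory to be revisited, which is the behaviour wanted after an exclude pattern is dropped or an inotify queue overflows.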
Created attachment 50021 [details] [review] HitIsValid on entire index

ValidateContents should be invoked periodically when beagled isn't busy. This is untested. I'm not sure how to decide when ValidateContents should be invoked. Something like this combats root dropping, exclude path adding, and exclude pattern adding. (against branch)
Created attachment 50581 [details] [review] Recrawl all directories on exclude pattern removal
Created attachment 50586 [details] [review] Root dropping fix

Fix the DirectoryModel.FullName exception that appeared when you queried for hits on a removed root.
Created attachment 50598 [details] [review] Forget about newly excluded paths
Created attachment 50622 [details] [review] Expire inotify watches when we drop a root or add an exclude path
Created attachment 51529 [details] [review] New watch dropping stuff

Revamp of earlier work which only takes effect on exclude/remove (not rename)
What's the state of this bug?
Need to restart my efforts now that FSQ has become less of a moving target. Here's my list of cases we need to account for. Some are handled already, some are not.

1. Add exclude pattern
   Recursively drop matching internal directory references. Wait for ValidateContents to remove from index.
2. Remove exclude pattern
   Mark entire fs as dirty, recrawl.
3. Add exclude path
   Recursively drop internal directory references. Wait for ValidateContents to remove from index.
4. Remove exclude path
   Insert back into internal structure and mark for crawling.
5. Add root
   Add to internal structure and crawl.
6. Delete root
   Recursively drop internal directory structure. Wait for ValidateContents to remove from index.
7. Should not allow addition of root-inside-root
8. Should handle new root which is parent of existing
Does anyone have any idea of the status of any of this? It looks pretty old...
Reopening this bug; not sure why it was closed, but I just ran into it today. When roots are deleted, the files underneath the root are not removed from the index.
*** Bug 405317 has been marked as a duplicate of this bug. ***
Comment #16 is a pretty accurate summary of the state of things. I'm changing the summary of the bug to "Remove files from index when a FSQ root is removed".
Marking 405317 as a duplicate in #19 leaves out the issue in beagle-search with the wrong calculation of the total matches. If an item is excluded, it disappears almost instantly from beagle-search, but even after implementing #16/3, the update of the total number of matches depends on the time it takes to drop the directory references.
This bug is quite old and perhaps obsolete. If so, please close this, maintainers.
This is still an issue, unfortunately.
Beagle is not under active development anymore and had its last code changes in early 2011. Its codebase has been archived (see bug 796735): https://gitlab.gnome.org/Archive/beagle/commits/master "tracker" is an available alternative. Closing this report as WONTFIX as part of Bugzilla Housekeeping to reflect reality. Please feel free to reopen this ticket (or rather transfer the project to GNOME Gitlab, as GNOME Bugzilla is deprecated) if anyone takes the responsibility for active development again.