After an evaluation, GNOME has moved from Bugzilla to GitLab. Learn more about GitLab.
No new issues can be reported in GNOME Bugzilla anymore.
To report an issue in a GNOME project, go to GNOME GitLab.
Do not go to GNOME Gitlab for: Bluefish, Doxygen, GnuCash, GStreamer, java-gnome, LDTP, NetworkManager, Tomboy.
Bug 741852 - Process subdirectories only when contents are available
Process subdirectories only when contents are available
Status: RESOLVED FIXED
Product: tracker
Classification: Core
Component: Miners
unspecified
Other Windows
: Normal normal
: ---
Assigned To: tracker-general
tracker-general
Depends on:
Blocks:
 
 
Reported: 2014-12-22 12:34 UTC by Philip Van Hoof
Modified: 2015-03-17 20:46 UTC
See Also:
GNOME target: ---
GNOME version: ---


Attachments
Patch that fixes this flaw (3.97 KB, patch)
2014-12-22 12:34 UTC, Philip Van Hoof
reviewed Details | Review

Description Philip Van Hoof 2014-12-22 12:34:53 UTC
Created attachment 293175 [details] [review]
Patch that fixes this flaw

The leveled notification has a bug when dealing with directories at the bottom of the crawler search results. They were being processed for deleted files when their contents were not available because the crawler never searched any deeper.

The file_system here isn't the actual filesystem, but a truncated representation and even when the file_system and the store query are at equal depth, the unpatched logic is wrong for subdirectories at that depth.

This patch adjusts the traversal function so that those directories are only scanned when it is their turn as the current root.

Notes: the branch being proposed here is a response to an earlier review but the max_depth parameter passing did not fix the original flaw being patched with this patch. https://git.gnome.org/browse/tracker/log/?h=crawler-max-depth

The max depth is in the call that creates the TrackerFileSystem. It's a bit ambiguous, the file_system is not a file system interface, it's just an in-memory represenation of a directory tree the directory scanning happens earlier.
Comment 1 Martyn Russell 2014-12-27 16:50:43 UTC
Hi Philip, thanks for the patch. One thing I wanted to check with you first... did you see Sam's patch which is somewhat related to this. See:

https://git.gnome.org/browse/tracker/commit/?id=7789afdc2ebe60e39d5931cfcc3aec60d0f4b971
Comment 2 Philip Van Hoof 2014-12-28 21:02:53 UTC
Saw the patch. Not sure how increasing the amount of subdirs to scan improves the problem as explained in this bug.
Comment 3 Martyn Russell 2014-12-30 12:40:18 UTC
Comment on attachment 293175 [details] [review]
Patch that fixes this flaw

This patch looks right to me. I think we already discussed this on the mailing list though:

https://mail.gnome.org/archives/tracker-list/2014-July/msg00000.html

Comments from Carlos in particular:
"""
I see how spurious ::file-deleted might happen, the TrackerFileSystem caches data for directories, so if a reindex happens on a directory that happened to preserve data, tracker_file_system_traverse() would recurse deep in the old tree while the crawling phase only updated the most direct children at that time.

I however think a better fix would be to add a max_depth parameter to tracker_file_system_traverse(), and pass the same depth that's given on tracker_crawler_start() (and implicitly performed in queries). AFAICS the approach in the patch would still fail for files deeper than MAX_DEPTH, as those won't be yet in the pending_files array. TrackerFileSystem is private, so API changes are just fine there.
"""

Now, we've added the max_depth as per my comment in relation to a fix for that work in comment #1.

So I'm wondering if:

a) this patch is still needed to fix the bug?
b) Carlos has any further comment here?

I got the impression that Carlos didn't want this patch and that the max_depth approach was what we agreed. Of course if the bug isn't fixed without this patch, we need to reconsider our approach a bit here.

***

I personally think the patch *seems* fine, but I do have a small concern about adding directories to the array like this as being a bit of a hack.

Carlos, care to comment?
Comment 4 Carlos Garnacho 2015-03-17 20:46:23 UTC
Thanks for the patch, I've finally went for a backwards fix, I think it's best to keep the ::file-* emission unchanged, handled always in the batch prior to processing that pending_dir itself. I've pushed commit c5425cd5a which excludes the dirs in pending_dirs from the sparql query, which is the same to practical effects.