After an evaluation, GNOME has moved from Bugzilla to GitLab. Learn more about GitLab.
No new issues can be reported in GNOME Bugzilla anymore.
To report an issue in a GNOME project, go to GNOME GitLab.
Do not go to GNOME Gitlab for: Bluefish, Doxygen, GnuCash, GStreamer, java-gnome, LDTP, NetworkManager, Tomboy.
Bug 678986 - item_move() creates invalid SPARQL if source file does not exist
item_move() creates invalid SPARQL if source file does not exist
Status: RESOLVED FIXED
Product: tracker
Classification: Core
Component: Miners
0.14.x
Other Linux
: Normal normal
: ---
Assigned To: tracker-general
Jamie McCracken
: 740920 741583 (view as bug list)
Depends on:
Blocks: 739235
 
 
Reported: 2012-06-27 14:54 UTC by Sam Thursfield
Modified: 2016-06-04 23:10 UTC
See Also:
GNOME target: ---
GNOME version: ---


Attachments
[PATCH 1/3] miner-fs: Fix tracking file move (1.47 KB, patch)
2014-10-27 09:43 UTC, Martin Kampas
none Details | Review
[PATCH 1/3] miner-fs: Fix tracking file move (1.52 KB, patch)
2014-10-27 09:57 UTC, Martin Kampas
committed Details | Review
[PATCH 2/3] miner-fs: Remove unused argument is_new (1.45 KB, patch)
2014-10-27 09:58 UTC, Martin Kampas
committed Details | Review
[PATCH 3/3] miner-fs: Fix handling files moved soon after creating (6.09 KB, patch)
2014-10-27 10:00 UTC, Martin Kampas
committed Details | Review

Description Sam Thursfield 2012-06-27 14:54:55 UTC
I produced this bug by renaming 1000 files on a USB stick while they were being crawled, from *.mp3 to *.mp3.hidden

In item_move(), 'source_iri' is NULL, which looks like a valid state but causes lots of bad SPARQL to be generated further down.

extract from miner-fs log:

(tracker-miner-fs:28781): Tracker-CRITICAL **: Could not execute sparql: Subject `(null)' is not in domain `nfo:FileDataObject' of property `nfo:fileName'

(tracker-miner-fs:28781): Tracker-CRITICAL **:   (Sparql buffer) Error in task 20 of the array-update: Subject `(null)' is not in domain `nfo:FileDataObject' of property `nfo:fileName'
(tracker-miner-fs:28781): Tracker-DEBUG:     Affected file: file:///media/0EE0-3D06/copies/2AM%20-%20%EC%9D%B4%EB%85%B8%EB%9E%98%20(Single)%20-%2001%20-%20%EC%9D%B4%EB%85%B8%EB%9E%98.99.mp3.hidden
(tracker-miner-fs:28781): Tracker-DEBUG:     Sparql: DELETE {   ?urn a rdfs:Resource } WHERE {  ?urn nie:url "file:///media/0EE0-3D06/copies/2AM%20-%20%EC%9D%B4%EB%85%B8%EB%9E%98%20(Single)%20-%2001%20-%20%EC%9D%B4%EB%85%B8%EB%9E%98.99.mp3.hidden" }DELETE {   <(null)> nfo:fileName ?f ;        nie:url ?u ;        nie:isStoredAs ?s ;        nfo:belongsToContainer ?b} WHERE {   <(null)> nfo:fileName ?f ;        nie:url ?u ;        nie:isStoredAs ?s ;        nfo:belongsToContainer ?b} INSERT INTO <(null)> {  <(null)> nfo:fileName "2AM - 이노래 (Single) - 01 - 이노래.99.mp3.hidden" ;        nie:url "file:///media/0EE0-3D06/copies/2AM%20-%20%EC%9D%B4%EB%85%B8%EB%9E%98%20(Single)%20-%2001%20-%20%EC%9D%B4%EB%85%B8%EB%9E%98.99.mp3.hidden" ;        nie:isStoredAs <(null)> ;        nfo:belongsToContainer "urn:uuid:d3b32ada-30c3-c4cc-55b7-1cc6a6a56d0a"}

(tracker-miner-fs:28781): Tracker-CRITICAL **: Could not execute sparql: Subject `(null)' is not in domain `nfo:FileDataObject' of property `nfo:fileName'

(tracker-miner-fs:28781): Tracker-CRITICAL **:   (Sparql buffer) Error in task 21 of the array-update: Subject `(null)' is not in domain `nfo:FileDataObject' of property `nfo:fileName'
(tracker-miner-fs:28781): Tracker-DEBUG:     Affected file: file:///media/0EE0-3D06/copies/2AM%20-%20%EC%9D%B4%EB%85%B8%EB%9E%98%20(Single)%20-%2001%20-%20%EC%9D%B4%EB%85%B8%EB%9E%98.9.mp3.hidden
(tracker-miner-fs:28781): Tracker-DEBUG:     Sparql: DELETE {   ?urn a rdfs:Resource } WHERE {  ?urn nie:url "file:///media/0EE0-3D06/copies/2AM%20-%20%EC%9D%B4%EB%85%B8%EB%9E%98%20(Single)%20-%2001%20-%20%EC%9D%B4%EB%85%B8%EB%9E%98.9.mp3.hidden" }DELETE {   <(null)> nfo:fileName ?f ;        nie:url ?u ;        nie:isStoredAs ?s ;        nfo:belongsToContainer ?b} WHERE {   <(null)> nfo:fileName ?f ;        nie:url ?u ;        nie:isStoredAs ?s ;        nfo:belongsToContainer ?b} INSERT INTO <(null)> {  <(null)> nfo:fileName "2AM - 이노래 (Single) - 01 - 이노래.9.mp3.hidden" ;        nie:url "file:///media/0EE0-3D06/copies/2AM%20-%20%EC%9D%B4%EB%85%B8%EB%9E%98%20(Single)%20-%2001%20-%20%EC%9D%B4%EB%85%B8%EB%9E%98.9.mp3.hidden" ;        nie:isStoredAs <(null)> ;        nfo:belongsToContainer "urn:uuid:d3b32ada-30c3-c4cc-55b7-1cc6a6a56d0a"}
Comment 1 Martyn Russell 2012-06-28 08:44:11 UTC
(In reply to comment #0)
> I produced this bug by renaming 1000 files on a USB stick while they were being
> crawled, from *.mp3 to *.mp3.hidden
> 
> In item_move(), 'source_iri' is NULL, which looks like a valid state but causes
> lots of bad SPARQL to be generated further down.

It should be a valid state yea.
 
> extract from miner-fs log:
> 
> (tracker-miner-fs:28781): Tracker-CRITICAL **: Could not execute sparql:
> Subject `(null)' is not in domain `nfo:FileDataObject' of property
> `nfo:fileName'

Hmm, can you use gdb with G_DEBUG=fatal_warnings and figure where that's happening? Is it in item_move()? Or perhaps show the SPARQL here?
 
> (tracker-miner-fs:28781): Tracker-CRITICAL **:   (Sparql buffer) Error in task
> 20 of the array-update: Subject `(null)' is not in domain `nfo:FileDataObject'
> of property `nfo:fileName'

[snip]

I've not seen these errors before either. You're not seeing these with any of your recent patches or branches not in master yet are you?
Comment 2 Sam Thursfield 2012-06-28 08:55:57 UTC
The bug is in the item_move() function itself, it doesn't check for source_iri==NULL before building the SPARQL, and that is where the (null) comes from here:

(tracker-miner-fs:28781): Tracker-DEBUG:     Sparql:
DELETE {
   <(null)> nfo:fileName ?f ;
        nie:url ?u ;       
        nie:isStoredAs ?s ;
        nfo:belongsToContainer ?b
} WHERE {
   <(null)> nfo:fileName ?f ;
        nie:url ?u ;
        nie:isStoredAs ?s ;
        nfo:belongsToContainer ?b
}
INSERT INTO <(null)> {
   <(null)> nfo:fileName "<...>.mp3.hidden" ;
        nie:url "file:///<...>.mp3.hidden";
        nie:isStoredAs <(null)> ;
        nfo:belongsToContainer "urn:uuid:d3b32ada-30c3-c4cc-55b7-1cc6a6a56d0a"
}
Comment 3 Martyn Russell 2012-06-28 09:11:33 UTC
This actually makes sense. The source IRI was never added so the lookup fails and so it's NULL. So we're moving a file we didn't yet index to another location.

Hmm, this sounds like a bug in the way we process our queues.

The order is supposed to be:

 - Crawl
 - Monitor events

And for the monitor events, we do those in this order:

 - DELETED
 - CREATED
 - UPDATED
 - MOVED

So you would expect the CREATED event for the file to happen if it was created before moved. But in this case, it looks like our monitor event queues are being processed before the crawled queues...
Comment 4 Martin Kampas 2014-10-27 09:43:29 UTC
Created attachment 289391 [details] [review]
[PATCH 1/3] miner-fs: Fix tracking file move

Identified by MinerCrawlTest::test_07_move_from_monitored_to_monitored (300-miner-basic-ops.py)

In item_move() it fails to get source_iri, does not check it's validity and uses it in the DELETE expression of the SPARQL query constructed there.

Broken since d836f00 (libtracker-miner: Store iri transiently as GFile qdata) - tracker_file_notifier_get_file_iri() is added 'force' argument and (wrapped with lookup_file_urn) passed force=FALSE from item_move().  This call then fails for regular files because only directories are cached once crawling has completed as stated in the comment in finish_current_directory() in libtracker-miner/tracker-file-notifier.c.
Comment 5 Martin Kampas 2014-10-27 09:57:01 UTC
Created attachment 289394 [details] [review]
[PATCH 1/3] miner-fs: Fix tracking file move

Forgot to mention link to this BZ entry.. added now
Comment 6 Martin Kampas 2014-10-27 09:58:22 UTC
Created attachment 289395 [details] [review]
[PATCH 2/3] miner-fs: Remove unused argument is_new

Just a minor polishing prior to the next change.
Comment 7 Martin Kampas 2014-10-27 10:00:12 UTC
Created attachment 289397 [details] [review]
[PATCH 3/3] miner-fs: Fix handling files moved soon after creating

Noticed this when executing functional tests for write-back:

(tracker-miner-fs:21288): Tracker-CRITICAL **: Could not execute sparql:
 Subject `(null)' is not in domain `nfo:FileDataObject' of property
 `nfo:fileName'
 
There are more ways to get into this. One example:

copy ("file.txt", ".temp_XYZ.file.txt")
 - received G_FILE_MONITOR_EVENT_CREATED (".temp.file.txt")
 - received G_FILE_MONITOR_EVENT_CHANGED (".temp.file.txt")
 - received G_FILE_MONITOR_EVENT_CHANGES_DONE_HINT (".temp.file.txt")
modify (".temp_XYZ.file.txt")
 - received G_FILE_MONITOR_EVENT_CHANGED (".temp.file.txt")
 - received G_FILE_MONITOR_EVENT_CHANGES_DONE_HINT (".temp.file.txt")
mv (".temp_XYZ.file.txt", "file.txt")
 - received G_FILE_MONITOR_EVENT_MOVED (".temp.file.txt", "file.txt")
 - emitted  ITEM_MOVED (".temp.file.txt", "file.txt")
 
It was already handled in item_move() in past, but removed with eef0e7f
(libtracker-miner: Remove useless code) after previously misidentified
as useless in scope of ee58e67 (libtracker-miner: Add compat layer for
tracker_miner_fs_directory_*)

The comment from ee58e67 """FIXME: This situation shouldn't happen from
a TrackerFileNotifier event""" simply cannot be satisfied: no way to get
".temp.file.txt" indexed before ITEM_MOVED is processed - the file
disappears too fast.

This commit extends TrackerMonitor with one more case which can be
handled there though.
Comment 8 Martin Kampas 2014-10-27 10:05:24 UTC
Note that the three patches are for v1.1.4
Comment 9 Martyn Russell 2014-10-28 10:04:42 UTC
Comment on attachment 289394 [details] [review]
[PATCH 1/3] miner-fs: Fix tracking file move

Really nice catch. Your patch makes sense. Please go ahead and commit this!
Comment 10 Martyn Russell 2014-10-28 10:05:56 UTC
Comment on attachment 289395 [details] [review]
[PATCH 2/3] miner-fs: Remove unused argument is_new

Ha, I've been meaning to fix this for years, I keep forgetting to do it. Thanks.
Comment 11 Martyn Russell 2014-10-28 10:26:19 UTC
Comment on attachment 289397 [details] [review]
[PATCH 3/3] miner-fs: Fix handling files moved soon after creating

>From 35c25ac7d655f7d37f5a707a3a8313b05eec81bf Mon Sep 17 00:00:00 2001
>From: Martin Kampas <martin.kampas@tieto.com>
>Date: Mon, 20 Oct 2014 13:57:42 +0200
>Subject: [PATCH 3/3] miner-fs: Fix handling files moved soon after creating
>
>Noticed this when executing functional tests for write-back:
>
>(tracker-miner-fs:21288): Tracker-CRITICAL **: Could not execute sparql:
> Subject `(null)' is not in domain `nfo:FileDataObject' of property
> `nfo:fileName'
>
>There are more ways to get into this. One example:
>
>copy ("file.txt", ".temp_XYZ.file.txt")
> - received G_FILE_MONITOR_EVENT_CREATED (".temp.file.txt")
> - received G_FILE_MONITOR_EVENT_CHANGED (".temp.file.txt")
> - received G_FILE_MONITOR_EVENT_CHANGES_DONE_HINT (".temp.file.txt")
>modify (".temp_XYZ.file.txt")
> - received G_FILE_MONITOR_EVENT_CHANGED (".temp.file.txt")
> - received G_FILE_MONITOR_EVENT_CHANGES_DONE_HINT (".temp.file.txt")
>mv (".temp_XYZ.file.txt", "file.txt")
> - received G_FILE_MONITOR_EVENT_MOVED (".temp.file.txt", "file.txt")
> - emitted  ITEM_MOVED (".temp.file.txt", "file.txt")
>
>It was already handled in item_move() in past, but removed with eef0e7f
>(libtracker-miner: Remove useless code) after previously misidentified
>as useless in scope of ee58e67 (libtracker-miner: Add compat layer for
>tracker_miner_fs_directory_*)
>
>The comment from ee58e67 """FIXME: This situation shouldn't happen from
>a TrackerFileNotifier event""" simply cannot be satisfied: no way to get
>".temp.file.txt" indexed before ITEM_MOVED is processed - the file
>disappears too fast.
>
>This commit extends TrackerMonitor with one more case which can be
>handled there though.

OK. Thanks for the detail, this really helps us understand the problem!

>@@ -1140,6 +1141,10 @@ monitor_item_moved_cb (TrackerMonitor *monitor,
> 
> 		/* If the (parent) directory is in
> 		 * the filesystem, file is stored
>+		 *
>+		 * Notice! not definitely valid for non directories! As a result we may
>+		 * emit FILE_MOVED when we should emit FILE_CREATED instead.
>+		 * Let TrackerMinerFS handle it, see item_move().

True.

>diff --git a/src/libtracker-miner/tracker-miner-fs.c b/src/libtracker-miner/tracker-miner-fs.c
>index 9e2e239..4cdc7bf 100644
>--- a/src/libtracker-miner/tracker-miner-fs.c
>+++ b/src/libtracker-miner/tracker-miner-fs.c
>@@ -1510,7 +1510,7 @@ item_add_or_update (TrackerMinerFS *fs,
> 	 * created, its meta data might already be in the store
> 	 * (possibly inserted by other application) - in such a case
> 	 * we have to UPDATE, not INSERT. */
>-	urn = lookup_file_urn (fs, file, FALSE);
>+	urn = lookup_file_urn (fs, file, TRUE);

This concerns me a little. One of the efficiencies that Carlos added was to not ask the DB for every created/updated event from the file system. I suspect that's why we use FALSE here. Paradoxically, I notice the existing comment in the code which contradicts my point :)

Carlos, can you give your $2 on this before agreeing to the patch?
 
>@@ -1890,6 +1890,38 @@ item_move (TrackerMinerFS *fs,
> 		return retval;
> 	}
> 
>+	if (!source_exists) {
>+		/* There are more ways to get into this. One example:
>+		 *
>+		 * copy ("file.txt", ".temp_XYZ.file.txt")
>+		 *  - received G_FILE_MONITOR_EVENT_CREATED (".temp.file.txt")
>+		 *  - received G_FILE_MONITOR_EVENT_CHANGED (".temp.file.txt")
>+		 *  - received G_FILE_MONITOR_EVENT_CHANGES_DONE_HINT (".temp.file.txt")
>+		 * modify (".temp_XYZ.file.txt")
>+		 *  - received G_FILE_MONITOR_EVENT_CHANGED (".temp.file.txt")
>+		 *  - received G_FILE_MONITOR_EVENT_CHANGES_DONE_HINT (".temp.file.txt")
>+		 * mv (".temp_XYZ.file.txt", "file.txt")
>+		 *  - received G_FILE_MONITOR_EVENT_MOVED (".temp.file.txt", "file.txt")
>+		 *  - emitted  ITEM_MOVED (".temp.file.txt", "file.txt")
>+		 *
>+		 * No way to get ".temp.file.txt" indexed before ITEM_MOVED is processed - the file
>+		 * disappears too fast.
>+		 */
>+
>+		gboolean retval = TRUE;
>+
>+		g_debug ("Source file '%s' not found in store to move, indexing '%s' from scratch", source_uri, uri);
>+
>+		retval = item_add_or_update (fs, file,
>+		                             G_PRIORITY_DEFAULT);
>+
>+		g_free (source_uri);
>+		g_free (uri);
>+		g_object_unref (file_info);
>+
>+		return retval;
>+	}
>+

Looks right to me!

>diff --git a/src/libtracker-miner/tracker-monitor.c b/src/libtracker-miner/tracker-monitor.c
>index 327882f..01a955e 100644
>--- a/src/libtracker-miner/tracker-monitor.c
>+++ b/src/libtracker-miner/tracker-monitor.c
>@@ -1106,6 +1106,30 @@ monitor_event_file_moved (TrackerMonitor *monitor,
> 		g_hash_table_remove (monitor->priv->pre_update, src_file);
> 
> 		/* And keep on notifying the MOVED event */
>+	} else if (monitor->priv->tree &&
>+	           !tracker_indexing_tree_file_is_indexable (monitor->priv->tree,
>+	                                                     src_file,
>+	                                                     G_FILE_TYPE_REGULAR)) {
>+		gchar *src_uri, *dst_uri;
>+
>+		src_uri = g_file_get_uri (src_file);
>+		dst_uri = g_file_get_uri (dst_file);
>+
>+		g_debug ("File 'A' not indexable - processing MOVE(A, B) as CHANGED(B) for '%s->%s'",
>+		         src_uri,
>+		         dst_uri);
>+
>+		g_hash_table_replace (monitor->priv->pre_update,
>+		                      g_object_ref (dst_file),
>+		                      event_data_new (dst_file,
>+		                                      NULL,
>+		                                      FALSE,
>+		                                      G_FILE_MONITOR_EVENT_CHANGED));
>+
>+		g_free (src_uri);
>+		g_free (dst_uri);
>+
>+		return;

I think this is right. I wonder why you go for CHANGED, doesn't CREATED make more sense?
Comment 12 Carlos Garnacho 2014-10-28 11:02:51 UTC
Good catch, although I'm not too sure on the approach... TrackerFileNotifier is the object meant to deal with these inconsistencies, and emit the correct high-level signals. I would expect it to emit ::file-created or ::file-updated in this case. If these inconsistencies are leaking to the TrackerMinerFS, it's then a bug in TrackerFileNotifier.

See monitor_item_moved_cb() in tracker-file-notifier.c, that's where move events are possibly transformed to ::file-created or ::file-deleted depending on source/dest file state in Tracker.
Comment 13 Martin Kampas 2014-10-29 09:07:40 UTC
I tried to fix this at TrackerMonitor/TrackerFileNotifier level but I never felt it 100% reliable and when I compared this with the CREATED case, which does too fail at TrackerMinerFS level when a file is (re)moved soon after it is created, I rather chose to make the higher level more robust.
Comment 14 Martin Kampas 2014-10-29 09:22:18 UTC
(In reply to comment #11)
> I think this is right. I wonder why you go for CHANGED, doesn't CREATED make
> more sense?

See the existing comment couple of lines above in the source file:

"""In case (a), note that B may already exist before, so instead of a CREATED we should be issuing an UPDATED instead... we don't do it as at the end we don't really mind, and we save a call to g_file_query_exists()."""

but later another comment says - and the commented code really does:

"""It is UPDATED instead of CREATED because the destination file could probably exist, and mistakenly reporting a CREATED event there can trick the monitor event compression [...]"""

My addition handles similar case so I also emit CHANGED instead of CREATED.
Comment 15 Carlos Garnacho 2014-10-29 13:30:47 UTC
(In reply to comment #13)
> I tried to fix this at TrackerMonitor/TrackerFileNotifier level but I never
> felt it 100% reliable and when I compared this with the CREATED case, which
> does too fail at TrackerMinerFS level when a file is (re)moved soon after it is
> created, I rather chose to make the higher level more robust.

Of course the higher level code has to deal with the occasional !g_file_query_exists() at the time of processing the file, but I expect that happening to be corrected by later TrackerFileNotifier signals, and possibly already queued events.

From your commit log in the 1st patch:
> with lookup_file_urn) passed force=FALSE from item_move().  This call then
> fails for regular files because only directories are cached once crawling has
> completed as stated in the comment in finish_current_directory() in
> libtracker-miner/tracker-file-notifier.c.

Correctly pointed out, that call is done at that point because 1) long term, we only maintain the directory tree in memory for monitoring and memory saving purposes, and 2) because at that time all GFiles that needed signaling already did so. The latter means that non-directory GFiles will be kept alive by the extra reference in the processing queue, calls like tracker_file_system_get_property() are meant to be valid in this state (forget_files() only unsets a "managed" flag and unref()s, maybe to 0, and freeing fully, or maybe not, see comment around tracker-file-system.c:70)

I suggest extending this logic to files signaled through monitor events in TrackerFileNotifier, other monitor events than move() at least canonicalize the GFiles on the TrackerFileSystem before signalling, all monitor events should be doing that, and querying/interning the data. The files then may be forgotten after emitting the TrackerFileNotifier signals. New checks around iri availability might also be added.

From your commit log in the 3rd patch:
> The comment from ee58e67 """FIXME: This situation shouldn't happen from
> a TrackerFileNotifier event""" simply cannot be satisfied: no way to get
> ".temp.file.txt" indexed before ITEM_MOVED is processed - the file
> disappears too fast.

If we go as suggested above, AFAICS this can only happen 1) in the case you fixed in TrackerMonitor, that part looks great to me :), and 2) if you s/copy/mv/ in the first step of that monitor event sequence, and it all happens within the TrackerMonitor coalesce time, it comes out as a move(a->a), this should be probably translated into an update too.
Comment 16 Martyn Russell 2014-10-29 16:36:13 UTC
(In reply to comment #15)
> (In reply to comment #13)
> > I tried to fix this at TrackerMonitor/TrackerFileNotifier level but I never
> > felt it 100% reliable and when I compared this with the CREATED case, which
> > does too fail at TrackerMinerFS level when a file is (re)moved soon after it is
> > created, I rather chose to make the higher level more robust.
> 
> Of course the higher level code has to deal with the occasional
> !g_file_query_exists() at the time of processing the file, but I expect that
> happening to be corrected by later TrackerFileNotifier signals, and possibly
> already queued events.

I should add, with the new TrackerDataProvider work, we should only query in certain circumstances. We can't always depend on this API. So the lower down the stack we do this the better.
Comment 17 Martin Kampas 2014-10-30 08:25:02 UTC
Then also the other test for source_exists[1] should be removed and source_exists==TRUE asserted in item_move(), right?

[1] https://git.gnome.org/browse/tracker/tree/src/libtracker-miner/tracker-miner-fs.c?id=1.1.4#n1880
Comment 18 Carlos Garnacho 2014-10-30 11:39:10 UTC
(In reply to comment #17)
> Then also the other test for source_exists[1] should be removed and
> source_exists==TRUE asserted in item_move(), right?
> 
> [1]
> https://git.gnome.org/browse/tracker/tree/src/libtracker-miner/tracker-miner-fs.c?id=1.1.4#n1880

Yeah, I do think so, that might just come out of TrackerFileNotifier as the right signal too if we have the iri at hand before emitting. 

Thanks for looking into this!
Comment 19 Martin Kampas 2014-10-31 07:15:28 UTC
This is one of the sequences we are trying to fix:

1. inotified "A created" - waiting for "changes done"
2. inotified "A changes done" - can expire "A created" now
3. (on timeout) "A created" expired and is propagated to miner_fs and enqueued
4. (on idle) "A created" is dequeued in miner_fs and file processing task started
5. inotified file "moved A->B", event propagated to miner_fs in one go and enqueued
6. (on idle) "moved A->B" reenqueued (QUEUE_WAIT) in miner_fs because of the pending task from step 4
7. file processing task from step 4 finished with error - file A disappeared too fast
8. (on idle) "moved A->B" dequeued in miner_fs and processing fail because A has not been recorded in DB

How would you suggest for prevent this from happening below TrackerMinerFS level?
Comment 20 Carlos Garnacho 2014-10-31 11:05:46 UTC
(In reply to comment #19)
> This is one of the sequences we are trying to fix:
> 
> How would you suggest for prevent this from happening below TrackerMinerFS
> level?

This is one of the cases there checking for "file not found" in TrackerMinerFS is unavoidable. IMO the sequence could be handled as:

> 1. inotified "A created" - waiting for "changes done"
> 2. inotified "A changes done" - can expire "A created" now
> 3. (on timeout) "A created" expired and is propagated to miner_fs and enqueued
> 4. (on idle) "A created" is dequeued in miner_fs and file processing task
> started
> 5. inotified file "moved A->B", event propagated to miner_fs in one go and
> enqueued

5b. TrackerFileNotifier checks the iri for both A and B. As A doesn't exist yet as known to Tracker, ::file-created or ::file-updated is emitted for B, depending on its previous existence.

Being pedantic here, at that point A might have been processed, extracted info for, and in the flush/commit queue, but just not yet in the store. Emitting ::file-deleted for A might be in place too. But let's assume A isn't processed yet and B didn't previously exist for the rest of the sequence.

> 6. (on idle) "moved A->B" reenqueued (QUEUE_WAIT) in miner_fs because of the
> pending task from step 4

6b. This comes our as a "B created" event, and put in such queue

> 7. file processing task from step 4 finished with error - file A disappeared
> too fast

Fair enough

> 8. (on idle) "moved A->B" dequeued in miner_fs and processing fail because A
> has not been recorded in DB

8b. The "B created" event is eventually processed, and B is (hopefully) still there to be indexed. The final state matches reality.

One of the initial intentions when doing TrackerFileNotifier is that each individual signal should take Tracker's representation of the filesystem from one coherent state to another (or just no-op/error out), and the net effect of all signals after the queue is fully processed leaves Tracker representing the real filesystem state. This is not fully achieved as you see...
Comment 21 Martin Kampas 2014-11-03 07:57:27 UTC
(In reply to comment #20)
> One of the initial intentions when doing TrackerFileNotifier is that each
> individual signal should take Tracker's representation of the filesystem from
> one coherent state to another [...]

IIUC this is more true with the current TrackerFileNotifier implementation than if we try to compensate for TrackerMinerFS "slowness" at TrackerFileNotifier level by emiting

    ::file-deleted(A) + ::file-created(B)

instead of

    ::file-created(A) + ::file-moved(A, B).

Note that emiting ::file-deleted(A) would be a must, not just an option. Otherwise A could stay in the db accidentally:

> Being pedantic here, at that point A might have been processed, extracted
> info for, and in the flush/commit queue, but just not yet in the store. [...]

IIUC TrackerMinerFS is the origin of this issue - processing cannot take zero time. Other issues similar to this one are already handled at TrackerMinerFS - with the QUEUE_WAIT mechanism. Let's keep this issue encapsulated as well. Or am I wrong? :)
Comment 22 Martyn Russell 2014-12-02 10:46:31 UTC
*** Bug 740920 has been marked as a duplicate of this bug. ***
Comment 23 Cosimo Cecchi 2014-12-16 16:23:15 UTC
*** Bug 741583 has been marked as a duplicate of this bug. ***
Comment 24 Cosimo Cecchi 2014-12-20 15:34:29 UTC
Can we please get at least the first patch committed to master and tracker-1.2 branches? It fixes a pretty major bug with file moves/renames.
Comment 25 Carlos Garnacho 2014-12-20 16:52:21 UTC
(In reply to comment #24)
> Can we please get at least the first patch committed to master and tracker-1.2
> branches? It fixes a pretty major bug with file moves/renames.

Yes, I agree. Talked along the lines with Martyn on Thurday in #tracker, but forgot to comment here. I'm just pushing these 2 patches in the relevant branches.

Fixing the bug at hand has higher priority than internal design or cleanups, I'll pile those in my TODO instead... I had the slight hope to turn these competent patches into future-proof contributions, no luck this time :).

As for the last patch itself, I share the concerns in:
(In reply to comment #11)
> >--- a/src/libtracker-miner/tracker-miner-fs.c
> >+++ b/src/libtracker-miner/tracker-miner-fs.c
> >@@ -1510,7 +1510,7 @@ item_add_or_update (TrackerMinerFS *fs,
> > 	 * created, its meta data might already be in the store
> > 	 * (possibly inserted by other application) - in such a case
> > 	 * we have to UPDATE, not INSERT. */
> >-	urn = lookup_file_urn (fs, file, FALSE);
> >+	urn = lookup_file_urn (fs, file, TRUE);
> 
> This concerns me a little. One of the efficiencies that Carlos added was to not
> ask the DB for every created/updated event from the file system. I suspect
> that's why we use FALSE here. Paradoxically, I notice the existing comment in
> the code which contradicts my point :)

This change seems somewhat unrelated to the problem at hand, AFAICS, this just means an extra queries returning 0 elements on all happenings of ::item-move that end up in item_add_or_update(), and OTOH means this extra query is done en-masse in eg. initial indexes of volumes, which is something to be avoided if we know it's superfluous.
Comment 26 Carlos Garnacho 2016-06-04 23:10:33 UTC
The item_move() changes in the third patch are convincing, the rest looks all superfluous, pushed a patch modified accordingly.