After an evaluation, GNOME has moved from Bugzilla to GitLab. Learn more about GitLab.
No new issues can be reported in GNOME Bugzilla anymore.
To report an issue in a GNOME project, go to GNOME GitLab.
Do not go to GNOME Gitlab for: Bluefish, Doxygen, GnuCash, GStreamer, java-gnome, LDTP, NetworkManager, Tomboy.
Bug 719802 - Add a way for applications to set priorities on tracker crawler
Add a way for applications to set priorities on tracker crawler
Status: RESOLVED FIXED
Product: tracker
Classification: Core
Component: Miners
unspecified
Other Linux
: Normal normal
: ---
Assigned To: tracker-general
: 715194 719306 (view as bug list)
Depends on:
Blocks:
 
 
Reported: 2013-12-03 20:56 UTC by Xavier Claessens
Modified: 2014-03-17 16:34 UTC
See Also:
GNOME target: ---
GNOME version: ---


Attachments
TrackerDecorator: Make _prepend_ids and _delete_ids singular (7.36 KB, patch)
2014-02-05 21:32 UTC, Xavier Claessens
committed Details | Review
TrackerDecorator: Avoid code duplication (5.71 KB, patch)
2014-02-05 21:32 UTC, Xavier Claessens
committed Details | Review
TrackerDecorator: Fix a compilator warning (2.09 KB, patch)
2014-02-05 21:32 UTC, Xavier Claessens
committed Details | Review
tracker_decorator_next(): Queue tasks when a query is needed (9.33 KB, patch)
2014-02-05 21:32 UTC, Xavier Claessens
committed Details | Review
TrackerDecoratorInfo: register as boxed type (4.85 KB, patch)
2014-02-05 21:32 UTC, Xavier Claessens
committed Details | Review
tracker-extract: Fix TrackerConfig not handling change notification (6.90 KB, patch)
2014-02-05 21:32 UTC, Xavier Claessens
committed Details | Review
TrackerFileSystem: Fix warning when registering a property twice (1.11 KB, patch)
2014-02-05 21:32 UTC, Xavier Claessens
committed Details | Review
TrackerExtractDecorator: Fix pause/resume (4.44 KB, patch)
2014-02-05 21:32 UTC, Xavier Claessens
committed Details | Review
TrackerExtractDecorator: Add priority dbus interface (13.49 KB, patch)
2014-02-05 21:32 UTC, Xavier Claessens
committed Details | Review
TrackerDecorator: Query the rdf:type for each element in the queue (6.35 KB, patch)
2014-02-05 21:32 UTC, Xavier Claessens
committed Details | Review
TrackerDecorator: Replace the elem_queue with a GSequence (6.97 KB, patch)
2014-02-05 21:32 UTC, Xavier Claessens
committed Details | Review
TrackerDecorator: Sort elem_queue to extract important files first (5.38 KB, patch)
2014-02-05 21:33 UTC, Xavier Claessens
committed Details | Review
TrackerExtractDecorator: Add ClearRdfTypes DBus method (3.18 KB, patch)
2014-02-05 21:33 UTC, Xavier Claessens
committed Details | Review
tracker-extract: add "wait-for-miner-fs" setting (5.09 KB, patch)
2014-02-05 21:33 UTC, Xavier Claessens
committed Details | Review
Add TrackerExtractController to pause/resume the miner (15.16 KB, patch)
2014-02-05 21:33 UTC, Xavier Claessens
committed Details | Review

Description Xavier Claessens 2013-12-03 20:56:52 UTC
Use case: I plug an external storage with thousands of media files, and open a music application to play music from the storage I just plugged.

In that case, I want my music player to display all the music tracker found as soon as possible. That means that all other files should be temporally ignored from the external storage. It also means that listing the files without meta-data is enough at first and the UI will refresh itself later when more info gets extracted.

What I suggest:

1) Add a DBus interface that applications can use to tell tracker about their priority. It should support setting mime type filters, so music app can tell to ignore all files that are not "audio/*". Tracker will keep a mapping like:

{ "app1_bus_name": "audio/*", "app2_bus_name": "image/jpeg", ...}

and watch name owner to remove the filter if the app leaves. Of course the DBus API should have a way for the app to explicitly remove its filter.

2) When the crawler see a file, if it does not match current mime type filter, it is ignored. However a flag is set on the directory GFile telling that this folder needs to be re-crawled later when tracker is done crawling all the rest.

3) TrackerMinerFiles should skip the embedded metadata extracting if the filter map is not empty, but instead write a property on the nfo:FileDataObject saying that a 2nd step is needed to get all information. It will also set the flag on the directory GFile saying it needs to be re-crawled later.

4) The query in TrackerFileNotifier::sparql_file_query_start() should be modified to also get the property we wrote in point (3). Like that when it crawls a folder, it can know which files are missing the 2nd step of the extractor (getting embedded metadata).

5) When tracker is done crawling the whole file system, it must do a 2nd pass on directories that have been flagged as needing re-crawling (some of its files have been ignored, or only partly extracted). For the 2nd pass it will ignore filters.

6) when an external device is plugged, if tracker is currently on his 2nd pass, it is cancelled and start the 1st pass (taking filters into account) on the external media first. It will resume its 2nd pass when the 1st pass is done on the external device.
Comment 1 Xavier Claessens 2013-12-03 20:59:09 UTC
*** Bug 719306 has been marked as a duplicate of this bug. ***
Comment 2 Xavier Claessens 2013-12-03 20:59:52 UTC
*** Bug 715194 has been marked as a duplicate of this bug. ***
Comment 3 Xavier Claessens 2013-12-11 23:53:11 UTC
I started the work here: http://cgit.collabora.com/git/user/xclaesse/tracker.git/log/?h=multi-pass

I did not test the code at all yet, so it's not meant to be fully functional, but it gives an idea of the direction I'm taking. If people who knows tracker's internal have better ideas of where to hook my code, now please tell me!
Comment 4 Xavier Claessens 2013-12-23 23:59:05 UTC
Branch updated with another way of doing it, and it actually work now :)
Comment 5 Martyn Russell 2014-01-06 15:25:47 UTC
OK, I've had a chance to review this work... I have to say: it's quite impressive. The code quality is really good, coding style is spot on and technically I think the approach is what many people have been asking for over time. So thank you for making a start here Xavier ;)

So on to the review...

1. QUEUES:

a) Adding a separate queue for the multipass stuff is an interesting approach. I think it makes sense for the most part, but what worries me is how it works with existing queues. For example, the CREATED/UPDATED queues. In addition to this, what if the file is deleted while it's in the queue, do we make sure those files are not left around in the queue and are cleared up properly? Normally we would do this from the DELETED queue.

b) There is this work that Sam Thursfield did here which re-queues files that failed (usually due to extractor timeout or system load etc) and you can see this with the item_enqueue() API calls. I think this should be integrated somehow. One thing to consider is that you don't need a config to "retry" files that failed and we do have some internal counter (i.e. not in the DB) here to handle this. I've always thought this could be handled with more finesse and I think your approach here could be that way because you at least keep track of passes in the database in cases where we might have crashes or shutdown before completing an index.

2. The 'tracker:pass' property definitely needs some description or comments in the ontology file to describe it. Note that all the ontology properties are documented online and people can look them up, so they must be clear. Without reading the code, I don't think the property is obvious right now. Other classes/properties use: 'rdfs:comment' for this. Actually, many of the 'tracker' properties miss this I gathered :) Finally, when updating ANY ontology file, you must update the date at the top of the file so we know to update the database for people when they install new versions of tracker. This is how we do database schema migrations. See the nao:lastModified ... bit at the top.

3. Regarding the 'tracker:pass' property. I couldn't tell but it looks like this is a permanent property for all data objects, i.e. it's not deleted when a file is entirely up to date right? Our policy is usually to remove ALL properties and data we don't need to store to keep the database as lean as possible. Consider - what happens if the config switches from multi-pass enabled to disabled too - do we remove all these properties? I would think so at least. Thoughts?

4. Nice catch on the item_add_or_update() function with the unused parameter... this function has been around for years. Good that you noticed this! :)

5. The priv->stopped logic you removed in indexing_tree_directory_added() must remain sadly. I liked what you did by moving the common code to a function to start the crawling when pending directories are added to, but the starting and stopping part can not be implicit. There are 2 reasons for this. a) if the miner is paused, it shouldn't resume just because a directory was added in a shell or program. b) if the miner is being shutdown (another reason for priv->stopped = TRUE) new directories shouldn't be processed because they delay the time it takes for devices or desktops to poweroff or reboot, etc. I think adding the code back into your new helper function (add_pending_index_root()) should fix this.

6. Nice catch with:

   http://cgit.collabora.com/git/user/xclaesse/tracker.git/commit/?h=multi-pass&id=6c1fe55578c4a5c8b3c665073e8610c2828b4a9b


Other than that, I think it's a lot further along than I was expecting. Thanks for the hard work so far Xavier ;)
Comment 6 Xavier Claessens 2014-02-05 21:32:30 UTC
Created attachment 268217 [details] [review]
TrackerDecorator: Make _prepend_ids and _delete_ids singular

It is easier to pass ids one by one instead of creating a GArray.
Also it will be needed when passing more info about the id in
upcoming commit.
Comment 7 Xavier Claessens 2014-02-05 21:32:34 UTC
Created attachment 268218 [details] [review]
TrackerDecorator: Avoid code duplication

query_append_rdf_type_filter() was copy/pasted in 2 modules,
better share one internal function.
Comment 8 Xavier Claessens 2014-02-05 21:32:36 UTC
Created attachment 268219 [details] [review]
TrackerDecorator: Fix a compilator warning

gcc was thinking that strings could be used uninitialized.
This fix the warning and simplify the code using a GPtrArray
which is more natural for string arrays.
Comment 9 Xavier Claessens 2014-02-05 21:32:39 UTC
Created attachment 268220 [details] [review]
tracker_decorator_next(): Queue tasks when a query is needed

This avoid double query when calling tracker_decorator_next()
multiple times consecutively.
Comment 10 Xavier Claessens 2014-02-05 21:32:42 UTC
Created attachment 268221 [details] [review]
TrackerDecoratorInfo: register as boxed type

Introspection needs a boxed type to work. Also this fix ref-counting issue,
tracker_decorator_next_finish() returns the ref but caller couldn't unref it
because the unref function wasn't public. The other solution would be to make
_next_finish() transfer none and remove the destroy func in
g_task_return_pointer(), but that's unusual for finish functions.
Comment 11 Xavier Claessens 2014-02-05 21:32:45 UTC
Created attachment 268222 [details] [review]
tracker-extract: Fix TrackerConfig not handling change notification
Comment 12 Xavier Claessens 2014-02-05 21:32:48 UTC
Created attachment 268223 [details] [review]
TrackerFileSystem: Fix warning when registering a property twice

Use g_hash_table_contains() instead of _lookup() for the case
where the destroy function is NULL.
Comment 13 Xavier Claessens 2014-02-05 21:32:50 UTC
Created attachment 268224 [details] [review]
TrackerExtractDecorator: Fix pause/resume

It should not get the next file when the miner is paused/stopped,
or if there are no items available anymore.

This also ensure we never have more than MAX_EXTRACTING_FILES
(currently one) extraction operation running in parallel.
Comment 14 Xavier Claessens 2014-02-05 21:32:53 UTC
Created attachment 268225 [details] [review]
TrackerExtractDecorator: Add priority dbus interface

That dbus interface lets application tell tracker-extract which
rdf:type must be extracted in priority. For example, when the user
interacts with the music player, they want to give the priority to
"nfo:Audio" files.

This commit only introduces the DBus machinary, not the actual
implementation.
Comment 15 Xavier Claessens 2014-02-05 21:32:55 UTC
Created attachment 268226 [details] [review]
TrackerDecorator: Query the rdf:type for each element in the queue

The type will be used in upcoming commits to sort the elem_queue by
priority.
Comment 16 Xavier Claessens 2014-02-05 21:32:58 UTC
Created attachment 268228 [details] [review]
TrackerDecorator: Replace the elem_queue with a GSequence

This will allow to sort elements by priority.
Comment 17 Xavier Claessens 2014-02-05 21:33:01 UTC
Created attachment 268229 [details] [review]
TrackerDecorator: Sort elem_queue to extract important files first

There are 2 things affecting an element position in the queue:
prepend (typically files from an USB key the user plugged),
and prior (the file type matches what applications needs first, e.g.
nfo:Audio).

The queue is sorted like that:
1) prepend and prior files
2) prior files
3) prepend files
4) the rest

This garantees that if I open the Music app and plug an USB key, I'll
get the MP3s from that key first, which is probably want the user
wants.
Comment 18 Xavier Claessens 2014-02-05 21:33:04 UTC
Created attachment 268230 [details] [review]
TrackerExtractDecorator: Add ClearRdfTypes DBus method

Also handle SetRdfTypes with empty strv the same as ClearRdfTypes.
Comment 19 Xavier Claessens 2014-02-05 21:33:06 UTC
Created attachment 268231 [details] [review]
tracker-extract: add "wait-for-miner-fs" setting
Comment 20 Xavier Claessens 2014-02-05 21:33:09 UTC
Created attachment 268232 [details] [review]
Add TrackerExtractController to pause/resume the miner

It listen to "wait-for-miner-fs" option and check the status
of tracker-miner-fs to pause/resume the extractor.
Comment 21 Xavier Claessens 2014-02-05 21:36:44 UTC
Martyn: Thanks for your review, unfortunately master went to another direction so I had to rewrite my branch in a completely different way. The new patches are now attached and ready for review.

Branch with fixes to prepare for the priority code:
http://cgit.collabora.com/git/user/xclaesse/tracker.git/log/?h=misc

Branch that actually implement the priority idea:
http://cgit.collabora.com/git/user/xclaesse/tracker.git/log/?h=priority
Comment 22 Martyn Russell 2014-02-06 10:32:49 UTC
Comment on attachment 268217 [details] [review]
TrackerDecorator: Make _prepend_ids and _delete_ids singular

Looks good to me.
Comment 23 Martyn Russell 2014-02-06 10:35:20 UTC
Comment on attachment 268218 [details] [review]
TrackerDecorator: Avoid code duplication

looks good to me
Comment 24 Martyn Russell 2014-02-06 10:46:40 UTC
Comment on attachment 268219 [details] [review]
TrackerDecorator: Fix a compilator warning

Looks fine to me. I've always wondered what advantage GPtrArray vs GArray has - looks (from the code) like GArray is actually smaller in memory footprint internally, but other than that, not much difference.
Comment 25 Martyn Russell 2014-02-06 10:52:15 UTC
Comment on attachment 268221 [details] [review]
TrackerDecoratorInfo: register as boxed type

>@@ -1194,7 +1200,8 @@ tracker_decorator_next (TrackerDecorator    *decorator,
>  * tracker_decorator_next() to return the result of the task be it an
>  * error or not.
>  *
>- * Returns: (transfer full): (boxed): a #TrackerDecoratorInfo on success or #NULL on error.
>+ * Returns: (transfer full) (boxed): a #TrackerDecoratorInfo on success or
>+ *  #NULL on error. Free with tracker_decorator_info_unref().

I have a suspicion the gtkdoc tools will complain without the extra ":" between "(transfer full)" and "(boxed)".

The rest looks fine. Thanks Xavier!
Comment 26 Martyn Russell 2014-02-06 10:59:59 UTC
Comment on attachment 268222 [details] [review]
tracker-extract: Fix TrackerConfig not handling change notification

>From 029b1d648d29b48d8ccd742fcaa58444c357af45 Mon Sep 17 00:00:00 2001
>From: Xavier Claessens <xavier.claessens@collabora.co.uk>
>Date: Wed, 5 Feb 2014 16:10:06 -0500
>Subject: [PATCH] tracker-extract: Fix TrackerConfig not handling change
> notification
>
>https://bugzilla.gnome.org/show_bug.cgi?id=719802
>---
> src/tracker-extract/tracker-config.c | 90 +++++++++++-------------------------
> src/tracker-extract/tracker-config.h |  7 +--
> 2 files changed, 29 insertions(+), 68 deletions(-)
>
>diff --git a/src/tracker-extract/tracker-config.c b/src/tracker-extract/tracker-config.c
>index e3e2a4d..69bf841 100644
>--- a/src/tracker-extract/tracker-config.c
>+++ b/src/tracker-extract/tracker-config.c
>@@ -113,25 +113,21 @@ config_set_property (GObject      *object,
>                      const GValue *value,
>                      GParamSpec   *pspec)
> {
>+	TrackerConfig *config = TRACKER_CONFIG (object);
>+
> 	switch (param_id) {
>+	/* General */
>+	/* NOTE: We handle these because we have to be able
>+	 * to save these based on command line overrides.
>+	 */

Thanks for commenting here.
The general policy is that command line overrides do not get saved though, they're a session only config. Saved configs should remain.
The reason we have this is so that (for example), if you want to debug you can run with -v 3 but you don't want that to be saved and from then on always used during normal operation.

> 	case PROP_VERBOSITY:
>-		g_settings_set_enum (G_SETTINGS (object), "verbosity",
>-		                     g_value_get_enum (value));
>+		tracker_config_set_verbosity (config, g_value_get_enum (value));
> 		break;
> 
>+	/* We don't care about the others... we don't save anyway. */
> 	case PROP_SCHED_IDLE:
>-		g_settings_set_enum (G_SETTINGS (object), "sched-idle",
>-		                     g_value_get_enum (value));
>-		break;
>-
> 	case PROP_MAX_BYTES:
>-		g_settings_set_int (G_SETTINGS (object), "max-bytes",
>-		                    g_value_get_int (value));
>-		break;
>-
> 	case PROP_MAX_MEDIA_ART_WIDTH:
>-		g_settings_set_int (G_SETTINGS (object), "max-media-art-width",
>-		                    g_value_get_int (value));
> 		break;

These are important.
Because when we use g_object_set* (...), we expect the settings to be updated and IIRC, we don't bind the settings to do this automatically.
NOTE, this depends on the binary I think (e.g. miner-fs and extractor might be different), and there are reasons why we don't bind which should be documented in the code.

Put simply, this is all a bit of a mess. We want to be able to do the following:

1. Have real time config updates via GSettings
2. Have a GObject related to those GSettings
3. Be able to use a config file instead of GSettings (which adds problems)
4. Save config files (when using config files), if they don't exist.
5. Be able to adjust some configs before they're saved, e.g. some locations are special - like &DESKTOP; which is an alias.
... I may have forgotten one or two :)

The config code has had a bunch of iterations, so it's important to know all the use cases.

Will review again when I know what we're fixing here exactly and that the above cases have been considered/tested. It may well be the case that I just did the fixes for miner-fs and now you're doing them for tracker-extract, but I distinctly remember not using _bind() there...

Thanks again Xavier!
Comment 27 Martyn Russell 2014-02-06 11:01:43 UTC
Comment on attachment 268223 [details] [review]
TrackerFileSystem: Fix warning when registering a property twice

Thanks Xavier
Comment 28 Martyn Russell 2014-02-06 11:04:53 UTC
Comment on attachment 268224 [details] [review]
TrackerExtractDecorator: Fix pause/resume

> static void
>@@ -236,7 +256,9 @@ tracker_extract_decorator_paused (TrackerMiner *miner)
> 
> 	priv = TRACKER_EXTRACT_DECORATOR (miner)->priv;
> 	g_debug ("Decorator paused\n");
>-	g_timer_stop (priv->timer);
>+
>+	if (priv->timer)
>+		g_timer_stop (priv->timer);

Hmm, priv->timer should always be set, the check here shouldn't be necessary. Sounds like init/term code needs fixing...
Comment 29 Martyn Russell 2014-02-06 11:10:39 UTC
Xavier, I will finish the review later.

I noticed you have two branches too which makes things easier :)

I've reviewed the patches so far for the

  http://cgit.collabora.com/git/user/xclaesse/tracker.git/log/?h=misc

branch. I suggest we get those through first and I will start the next round of review after that.

Also, you don't have to post all the patches here, but thank you for doing that, makes review a bit easier at times.

When you're done with the misc branch (though I would rename that ;) I will merge to master and continue with your other branch.

Thanks for the work so far!
Comment 30 Xavier Claessens 2014-02-06 18:05:31 UTC
(In reply to comment #25)
> (From update of attachment 268221 [details] [review])
> >@@ -1194,7 +1200,8 @@ tracker_decorator_next (TrackerDecorator    *decorator,
> >  * tracker_decorator_next() to return the result of the task be it an
> >  * error or not.
> >  *
> >- * Returns: (transfer full): (boxed): a #TrackerDecoratorInfo on success or #NULL on error.
> >+ * Returns: (transfer full) (boxed): a #TrackerDecoratorInfo on success or
> >+ *  #NULL on error. Free with tracker_decorator_info_unref().
> 
> I have a suspicion the gtkdoc tools will complain without the extra ":" between
> "(transfer full)" and "(boxed)".

It's fine, I've never added ':' between each annotation in tp-glib. Actually I though the extra ':' would cause issues.

Thanks for your review, I've pushed the patches that are good so far.
Comment 31 Xavier Claessens 2014-02-06 18:30:37 UTC
About the config: I changed it to act like the one in miner/fs/. I've grepped and those settings are never set from the code, except for 'verbosity'. Note that g_settings_apply() is never called so it will never be stored anyway. That's a bug for the migration code AFAIK.

I did that change because that's how I need my 'wait-for-miner-fs' option to work, that I add in a later commit.

It is indeed more complex that I initially though, so take your time to double check if my patch is correct :-)
Comment 32 Xavier Claessens 2014-02-06 18:39:17 UTC
(In reply to comment #28)
> (From update of attachment 268224 [details] [review])
> > static void
> >@@ -236,7 +256,9 @@ tracker_extract_decorator_paused (TrackerMiner *miner)
> > 
> > 	priv = TRACKER_EXTRACT_DECORATOR (miner)->priv;
> > 	g_debug ("Decorator paused\n");
> >-	g_timer_stop (priv->timer);
> >+
> >+	if (priv->timer)
> >+		g_timer_stop (priv->timer);
> 
> Hmm, priv->timer should always be set, the check here shouldn't be necessary.
> Sounds like init/term code needs fixing...

It is actually a bit more complex: We want to calculate the time between tracker_extract_decorator_items_available() and tracker_extract_decorator_finished() but not the time spent while paused. So the timer is created in _items_available() and destroyed in _finished(). But pause/resume could be called when there are no items available, so timer would be NULL. Also note that items can become available while the miner is paused, that's why I do g_timer_new() followed by g_timer_stop() in that case.

That said, thinking about it again, it could be simplified by creating the timer in decoractor_init() and destroy in it in finalize(), then just reset it in items_available(). Do you prefer that ?
Comment 33 Martyn Russell 2014-03-17 16:34:19 UTC
This problem has been fixed in the development version. The fix will be available in the next major software release. Thank you for your bug report.