Bug 767472 – Add TrackerResource and rewrite all extractors to use that instead of TrackerSparqlBuilder

After an evaluation, GNOME has moved from Bugzilla to GitLab. Learn more about GitLab.
No new issues can be reported in GNOME Bugzilla anymore.
To report an issue in a GNOME project, go to GNOME GitLab.
Do not go to GNOME Gitlab for: Bluefish, Doxygen, GnuCash, GStreamer, java-gnome, LDTP, NetworkManager, Tomboy.

Bug 767472 - Add TrackerResource and rewrite all extractors to use that instead of TrackerSparqlBuilder


Summary:	Add TrackerResource and rewrite all extractors to use that instead of Tracker...


Status:	RESOLVED FIXED

Product:	tracker
Classification:	Core
Component:	General
Version:	unspecified
Hardware:	Other All

Importance:	Normal normal
Target Milestone:	---
Assigned To:	tracker-general
QA Contact:	tracker-general

URL:
Whiteboard:

Depends on:
Blocks:

Reported:	2016-06-09 20:08 UTC by Sam Thursfield
Modified:	2016-09-03 10:33 UTC

See Also:
GNOME target:	---
GNOME version:	---

Attachments
Bump GLib dependency to 2.44 (722 bytes, patch) 2016-06-09 20:08 UTC, Sam Thursfield	none	Details \| Review
libtracker-sparql: Add TRACKER_TYPE_URI (4.30 KB, patch) 2016-06-09 20:08 UTC, Sam Thursfield	none	Details \| Review
libtracker-sparql: Add TrackerNamespaceManager (14.04 KB, patch) 2016-06-09 20:08 UTC, Sam Thursfield	none	Details \| Review
libtracker-sparql: Add TrackerResource class (50.88 KB, patch) 2016-06-09 20:08 UTC, Sam Thursfield	none	Details \| Review
libtracker-extract: Add resource-helpers module (13.97 KB, patch) 2016-06-09 20:08 UTC, Sam Thursfield	none	Details \| Review
Use TrackerResource instead of TrackerSparqlBuilder in all extractors (355.42 KB, patch) 2016-06-09 20:09 UTC, Sam Thursfield	none	Details \| Review
Add support to extractors for outputting metadata as JSON-LD (12.62 KB, patch) 2016-06-09 20:09 UTC, Sam Thursfield	none	Details \| Review
*Speed and memory usage without* these changes** (2.47 KB, text/plain) 2016-06-09 20:10 UTC, Sam Thursfield		Details
*Speed and memory usage with* these changes** (2.47 KB, text/plain) 2016-06-09 20:10 UTC, Sam Thursfield		Details
Bump GLib dependency to 2.44 (719 bytes, patch) 2016-06-25 12:07 UTC, Sam Thursfield	committed	Details \| Review
libtracker-sparql: Add TRACKER_TYPE_URI (4.31 KB, patch) 2016-06-25 12:07 UTC, Sam Thursfield	committed	Details \| Review
libtracker-sparql: Add TrackerNamespaceManager (14.04 KB, patch) 2016-06-25 12:07 UTC, Sam Thursfield	committed	Details \| Review
libtracker-sparql: Add TrackerResource class (52.51 KB, patch) 2016-06-25 12:07 UTC, Sam Thursfield	committed	Details \| Review
libtracker-extract: Add resource-helpers module (13.98 KB, patch) 2016-06-25 12:07 UTC, Sam Thursfield	committed	Details \| Review
Use TrackerResource instead of TrackerSparqlBuilder in all extractors (354.95 KB, patch) 2016-06-25 12:07 UTC, Sam Thursfield	committed	Details \| Review
Add support to extractors for outputting metadata as JSON-LD (12.62 KB, patch) 2016-06-25 12:07 UTC, Sam Thursfield	none	Details \| Review
libtracker-extract: Restore trailing colon in nfo:Equipment URIs (1.17 KB, patch) 2016-09-01 08:40 UTC, Debarshi Ray	committed	Details \| Review

Description Sam Thursfield 2016-06-09 20:08:30 UTC

These patches introduce the TrackerResource API to libtracker-sparql, and
convert all of Tracker's extract modules to use that instead of
TrackerSparqlBuilder.

Rationale is described here:
<https://mail.gnome.org/archives/tracker-list/2016-April/msg00002.html>

I've tested this on a random set of inputs that cover each tracker-extract
module. (The tree of files was set up with the create-tree-from-real-data
script, which I've just pushed to 'master').

I compared against the wip/sam/extract-command branch to see if there were any
regressions in execution speed or memory usage, the command I ran was this:

    sudo sh -c 'echo 2 > /proc/sys/vm/drop_caches'; \
    time find 1 -type f -fprint /dev/stderr -exec \
        jhbuild -f /home/sam/.jhbuildrc run \
            /usr/bin/time -f "Time: %E  Mem: %MKb" \
                /opt/gnome/libexec/tracker-extract --verbosity=0 \
                                                   --file \{} \
                > /dev/null \;

Results are attached (as performance-old.txt an performance-new.txt); there is
some variance in speed both ways but it's clear that this branch doesn't
introduce any major changes there. The maximum RSS size is slightly higher in
most cases, but that could well be because I added a dependency on JSON-GLIB.
I have run each module under Valgrind and made sure there are no new memory
leaks.
The 400-extractor functional test is currently broken by these changes. I think
the best thing to do with that is delete the hacky ad-hoc SPARQL parsing code
it uses, and make use of the new JSON-LD output provided by tracker-extract
instead. This would be simpler than using the new Turtle output because Python
has a built-in JSON parser, but no built-in Turtle parser.

Code is available in branch wip/sam/resource, or attached to this bug.

These patches depend on <https://bugzilla.gnome.org/show_bug.cgi?id=751991>.

Comment 1 Sam Thursfield 2016-06-09 20:08:34 UTC

Created attachment 329508 [details] [review]
Bump GLib dependency to 2.44

So I can use G_DECLARE_DERIVABLE_TYPE() instead of writing out
everything manually.

Comment 2 Sam Thursfield 2016-06-09 20:08:39 UTC

Created attachment 329509 [details] [review]
libtracker-sparql: Add TRACKER_TYPE_URI

This is the same as G_TYPE_STRING, but we need to treat strings and URIs
differently when generating SPARQL from the TrackerResource class, so we
need a separate GType to keep track of which is which when given a
GValue.

Comment 3 Sam Thursfield 2016-06-09 20:08:44 UTC

Created attachment 329510 [details] [review]
libtracker-sparql: Add TrackerNamespaceManager

This will keep track of a set of namespaces and their prefixes. Then,
when we are serializing a resource, we can use it.

This is separate from the Namespace and Ontologies classes in
libtracker-data. They may be able to make use of it though.

Comment 4 Sam Thursfield 2016-06-09 20:08:49 UTC

Created attachment 329511 [details] [review]
libtracker-sparql: Add TrackerResource class

This provides a "resource-oriented" API for inserting and updating the
database. Rather than having to generate SPARQL queries, you can use the
TrackerResource abstraction to prepare information about a set of
resources, then generate a SPARQL query automatically.

TrackerResource can also serialize to Turtle directly.

Comment 5 Sam Thursfield 2016-06-09 20:08:54 UTC

Created attachment 329512 [details] [review]
libtracker-extract: Add resource-helpers module

This is factoring common code out of the individual extract modules.

Comment 6 Sam Thursfield 2016-06-09 20:09:01 UTC

Created attachment 329513 [details] [review]
Use TrackerResource instead of TrackerSparqlBuilder in all extractors

For a long time, all the Tracker extractors have manually constructed a
SPARQL update command using TrackerSparqlBuilder to represent their
output. This commit changes all of them to use the TrackerResource
class instead, which makes the code a lot more concise and readable.

This introduces some API breaks in the internal libtracker-extract
library. This has been a private library since Tracker 0.16 or earlier,
so it's fine.

If the extractors only output SPARQL then they are only useful to
people who are using a SPARQL store. Now we can output a serialization
format like Turtle as well. This will hopefully make the extract modules
useful outside of Tracker itself.

Comment 7 Sam Thursfield 2016-06-09 20:09:05 UTC

Created attachment 329514 [details] [review]
Add support to extractors for outputting metadata as JSON-LD

This adds a new dependency on the JSON-GLib library.

Comment 8 Sam Thursfield 2016-06-09 20:10:33 UTC

Created attachment 329515 [details]
Speed and memory usage *without* these changes

Comment 9 Sam Thursfield 2016-06-09 20:10:52 UTC

Created attachment 329516 [details]
Speed and memory usage *with* these changes

Comment 10 Hussam Al-Tayeb 2016-06-09 21:32:52 UTC

Hi. I tried compiling from wip/sam/resource branch and it stropped at:

../../../src/libtracker-extract/.libs/libtracker-extract.so: undefined reference to `tracker_resource_set_ensure_unique'
collect2: error: ld returned 1 exit status

Comment 11 Sam Thursfield 2016-06-10 11:06:08 UTC

Hi Hussam! Thanks for trying this out. Looks like I messed up while rebasing; I've pushed a fix, and tested the XMP parsing code (which was where the mistake was) against a JPEG with XMP metadata. I've not updated the attached patches on the assumption that there will be quite a few more things to fix yet :-)

Comment 12 Hussam Al-Tayeb 2016-06-10 21:49:55 UTC

Another one:

  CC       libextract_libav_la-tracker-extract-libav.lo
tracker-extract-libav.c: In function ‘tracker_extract_get_metadata’:
tracker-extract-libav.c:236:10: error: ‘performer_uri’ undeclared (first use in this function)
   g_free(performer_uri);
          ^~~~~~~~~~~~~


this is when configuring with --enable-generic-media-extractor=libav

Comment 13 Sam Thursfield 2016-06-11 00:47:06 UTC

Ah, I forgot about the libav extractor when testing. I'm glad you spotted this. I've now made sure the libav extractor works, and have pushed fixes for that and some other issues I spotted.

Comment 14 Hussam Al-Tayeb 2016-06-11 06:26:31 UTC

Another issue:
Full text search no longer works.
I can no longer do tracker search keyword.
It shows no results after recreating the tracker database.

Comment 15 Hussam Al-Tayeb 2016-06-11 06:43:47 UTC

(In reply to Hussam Al-Tayeb from comment #14)
> Another issue:
> Full text search no longer works.
> I can no longer do tracker search keyword.
> It shows no results after recreating the tracker database.

Just to confirm, with a build from master branch, this works as expected so only broken in wip/sam/resource branch.

Comment 16 Sam Thursfield 2016-06-13 16:21:34 UTC

I have to admit this is really puzzling me. I can reproduce the problem, but the correct data does *seem* to be being inserted into the store, so it's not obvious why search doesn't work. Will keep digging.

Comment 17 Sam Thursfield 2016-06-13 16:48:25 UTC

OK, I've pushed a fix. It was a bad DELETE {} command in the SPARQL generated by TrackerResource, which was deleting everything in the store rather than just the triples the extractor was about to replace!

Comment 18 Sam Thursfield 2016-06-25 12:07:08 UTC

Created attachment 330357 [details] [review]
Bump GLib dependency to 2.44

So I can use G_DECLARE_DERIVABLE_TYPE() instead of writing out
everything manually.

Comment 19 Sam Thursfield 2016-06-25 12:07:15 UTC

Created attachment 330358 [details] [review]
libtracker-sparql: Add TRACKER_TYPE_URI

This is the same as G_TYPE_STRING, but we need to treat strings and URIs
differently when generating SPARQL from the TrackerResource class, so we
need a separate GType to keep track of which is which when given a
GValue.

Comment 20 Sam Thursfield 2016-06-25 12:07:20 UTC

Created attachment 330359 [details] [review]
libtracker-sparql: Add TrackerNamespaceManager

This will keep track of a set of namespaces and their prefixes. Then,
when we are serializing a resource, we can use it.

This is separate from the Namespace and Ontologies classes in
libtracker-data. They may be able to make use of it though.

Comment 21 Sam Thursfield 2016-06-25 12:07:26 UTC

Created attachment 330360 [details] [review]
libtracker-sparql: Add TrackerResource class

This provides a "resource-oriented" API for inserting and updating the
database. Rather than having to generate SPARQL queries, you can use the
TrackerResource abstraction to prepare information about a set of
resources, then generate a SPARQL query automatically.

TrackerResource can also serialize to Turtle directly.

Comment 22 Sam Thursfield 2016-06-25 12:07:31 UTC

Created attachment 330361 [details] [review]
libtracker-extract: Add resource-helpers module

This is factoring common code out of the individual extract modules.

Comment 23 Sam Thursfield 2016-06-25 12:07:38 UTC

Created attachment 330362 [details] [review]
Use TrackerResource instead of TrackerSparqlBuilder in all extractors

For a long time, all the Tracker extractors have manually constructed a
SPARQL update command using TrackerSparqlBuilder to represent their
output. This commit changes all of them to use the TrackerResource
class instead, which makes the code a lot more concise and readable.

This introduces some API breaks in the internal libtracker-extract
library. This has been a private library since Tracker 0.16 or earlier,
so it's fine.

If the extractors only output SPARQL then they are only useful to
people who are using a SPARQL store. Now we can output a serialization
format like Turtle as well. This will hopefully make the extract modules
useful outside of Tracker itself.

Comment 24 Sam Thursfield 2016-06-25 12:07:44 UTC

Created attachment 330363 [details] [review]
Add support to extractors for outputting metadata as JSON-LD

This adds a new dependency on the JSON-GLib library.

Comment 25 Sam Thursfield 2016-06-25 12:09:43 UTC

Many thanks to Hussam for spotting my dumb mistakes. I've pushed a version with all the fixes to branch wip/sam/resource-rebase-1 <https://git.gnome.org/browse/tracker/log/?h=wip/sam/resource-rebase-1>.

Still depends on <https://bugzilla.gnome.org/show_bug.cgi?id=751991> (add a `tracker extract` command).

Comment 26 Sam Thursfield 2016-06-25 12:15:26 UTC

I should note the JSON-LD output is somewhat incomplete, it should generate an @context field, but does not. The JSON-LD stuff is all in the final commit, so the rest of the changes can be merged separately if need be. It might make sense to merge the JSON-LD stuff as-is but note that it's incomplete, since it won't break anything else; I don't mind either way

Comment 27 Hussam Al-Tayeb 2016-06-25 19:12:44 UTC

Hello. I'm not sure if this is related but libtracker-sparql/tracker-uri.h is no longer being installed on make install and it is #included from /usr/include/tracker-1.0/libtracker-sparql/tracker-sparql.h

which results in a compile error in nautilus 
/usr/include/tracker-1.0/libtracker-sparql/tracker-sparql.h:26:43: fatal error: libtracker-sparql/tracker-uri.h: No such file or directory

I manually installed the file to work around this.

Comment 28 Sam Thursfield 2016-06-27 00:05:57 UTC

Hi Hussam. That was indeed my fault, thanks for catching another issue ! I've pushed a fix.

Comment 29 Carlos Garnacho 2016-07-04 15:18:51 UTC

Review of attachment 330359 [details] [review]:

The object makes sense to me, given it's exposed in TrackerResource API, although I think: does it make sense to expose this at all? it probably only makes sense to print to turtle/sparql with the default TrackerNamespaceManager object anyway...

::: src/libtracker-sparql/tracker-namespace-manager.c
@@ +152,3 @@
+		tracker_namespace_manager_add_prefix (manager, "nmm", TRACKER_PREFIX_NMM);
+		tracker_namespace_manager_add_prefix (manager, "mlo", TRACKER_PREFIX_MLO);
+		tracker_namespace_manager_add_prefix (manager, "mfo", TRACKER_PREFIX_MFO);

It would be great if this initialization could be more dynamic (i.e. based on ontology data). It looks like one more place to change if ontology is added/changed. Perhaps letting TrackerOntology initialize this? it seems to go in the opposite direction that you seem to suggest in the commit log though.

Comment 30 Carlos Garnacho 2016-07-04 15:20:54 UTC

Review of attachment 330360 [details] [review]:

Nice API :), I've got mostly style comments.

::: src/libtracker-sparql/tracker-resource.c
@@ +25,3 @@
+#include <tracker-uri.h>
+#include <tracker-resource.h>
+#include <tracker-ontologies.h>

is this include needed?

This makes me think of a feature request for the future, that TrackerResource checks maxCardinality from the ontology for the added properties, so it's possible to error out in the miner/extractor, rather than catching this late when the sparql insertion fails.

But in the mean time, this include seems unused :).

@@ +202,3 @@
+
+	/* The identifier may be NULL, so I don't think we can pass it in the va_list
+	 * that g_object_new() accepts.

I think you can just do:

g_object_new (TRACKER_TYPE_RESOURCE,
              "identifier", NULL,
              NULL);

if it needs be, properties/values are dealt with in pairs, the NULL sentinel always applies when fetching the next property name.

@@ +240,3 @@
+tracker_resource_set_gvalue (TrackerResource *self,
+                             const char *property_uri,
+                             GValue *value)

I'd make these calls that copy the value take a const GValue *.

In provision to having TrackerResource perform basic ontology checks, perhaps all setter functions should be able to indicate that the property/value pair could not be added? Maybe just adding gboolean return values and in the future issuing warnings inside these functions? I'll be happy just being able to get proper backtraces to such invalid usages, so returning a GError in these functions seems overkill.

@@ +300,3 @@
+tracker_resource_add_gvalue (TrackerResource *self,
+                             const char *property_uri,
+                             GValue *value)

Same here AFAICS.

@@ +405,3 @@
+ */
+GList *tracker_resource_get_values (TrackerResource *self,
+                                    const char *property_uri)

Nit: we usually align variable names :). this happens in a few places...

@@ +511,3 @@
+
+	if (priv->identifier) {
+		g_free (priv->identifier);

No need for checking that priv->identifier is != NULL here.

@@ +958,3 @@
+	 * It would be better if we could use "DELETE DATA { pattern }". This is
+	 * allowed in SPARQL update 1.1, but not yet supported by Tracker's store.
+	 */

This is actually my bad... it used to work, but I broke it with the support for DELETE{}INSERT{}WHERE{}. I have some local patches fixing this and adding tests around.

::: src/libtracker-sparql/tracker-resource.h
@@ +76,3 @@
+char *tracker_resource_print_turtle(TrackerResource *self, TrackerNamespaceManager *namespaces);
+
+void tracker_resource_generate_sparql_update (TrackerResource *self, TrackerSparqlBuilder *builder, TrackerNamespaceManager *namespaces, const char *graph_id);

Generally, I think these two serializing functions should offer better error handling (return value, GError argument?). As already said, I find passing the namespaces manager a bit dubious... what happens if you pass a manager that doesn't have all the necessary prefixes to compose the ttl/sparql? what happens if the TrackerResource has a mix of shorthand and expanded property uris and a custom namespace manager has been given? etc...

::: src/libtracker-sparql/tracker-uri.h
@@ +29,3 @@
 #endif
 
+GType tracker_uri_get_type (void);

This belongs to the commit adding the tracker URI GType?

Comment 31 Carlos Garnacho 2016-07-04 15:21:53 UTC

Review of attachment 330361 [details] [review]:

Yay, it will be nice to have these in place. setting as a-c-n for the time being.

::: src/libtracker-extract/tracker-xmp.c
@@ +24,3 @@
 #include <libtracker-common/tracker-utils.h>
 
+#include "tracker-resource-helpers.h"

This change looks like it belongs in the commit modifying the extractor modules?

Comment 32 Carlos Garnacho 2016-07-04 15:22:56 UTC

Review of attachment 330357 [details] [review]:

Looks good

Comment 33 Carlos Garnacho 2016-07-04 15:23:26 UTC

Review of attachment 330358 [details] [review]:

Makes sense

Comment 34 Carlos Garnacho 2016-07-04 16:32:51 UTC

Review of attachment 330362 [details] [review]:

This patch is massive... nice code reduction also :).

Reading through the patch, it looks mostly correct to me, I kept an extra eye on:
- Places which look like could be using the new helper functions but don't, spotted a few of those.
- Places where we now generate named URNs where we used to generate a blank node, or viceversa. Spotted a couple too, and I'm a bit wary with those.
- Places where we inadvertently changed cardinality (eg. using set_string where add_string should be used)
- Places where we inadvertently changed the type

I spotted nothing about the last two on first sight, which is great :). Anyhow, the more eyes looking around, the better.

::: src/libtracker-extract/tracker-extract-info.c
@@ +223,3 @@
  *
+ * FIXME: you need a way to delete the logical resources when re-extracting a
+ * file -- still need to decide on API for that.

probably this needs handling at the miner-fs level, tracker-extract is supposed to always find the file as if just indexed by tracker-miner-fs.

::: src/libtracker-extract/tracker-xmp.c
@@ +916,3 @@
+		label = tracker_resource_new (NULL);
+		tracker_resource_set_uri (label, "rdf:type", "nao:Tag");
+		tracker_resource_set_string (label, "nao:prefLabel", p);

Doesn't this seem like a place for the shiny tracker_extract_new_tag()? although I see that function does not generate bnodes which seems to be what we want here.

@@ +931,3 @@
+		publisher = tracker_resource_new (NULL);
+		tracker_resource_set_uri (publisher, "rdf:type", "nco:Contact");
+		tracker_resource_set_string (publisher, "nco:fullname", data->publisher);

same here with tracker_extract_new_contact(), and a few other places below in this same file

::: src/tracker-extract/tracker-extract-abw.c
@@ +125,3 @@
+		creator = tracker_resource_new (NULL);
+		tracker_resource_set_uri (creator, "rdf:type", "nco:Contact");
+		tracker_resource_set_string (creator, "nco:fullname", str);

I'm starting to wonder whether the helper functions should be bnode friendly :).

::: src/tracker-extract/tracker-extract-flac.c
@@ +231,2 @@
+		album = tracker_resource_get_first_relation (album_disc, "nmm:albumDiscAlbum");
+		tracker_resource_set_string (album, "nmm:albumTrackCount", fd.trackcount);

This fd.trackcount handling used to be under a if (fd.trackcount) {} check, are TrackerResource functions NULL safe?

::: src/tracker-extract/tracker-extract-gif.c
@@ +357,2 @@
 		p = g_ptr_array_index (keywords, i);
+		tag = tracker_extract_new_tag (p);

Here you're replacing a blank node with a fixed urn tag, was that intended?

Comment 35 Carlos Garnacho 2016-07-04 16:38:39 UTC

(In reply to Sam Thursfield from comment #26)
> I should note the JSON-LD output is somewhat incomplete, it should generate
> an @context field, but does not. The JSON-LD stuff is all in the final
> commit, so the rest of the changes can be merged separately if need be. It
> might make sense to merge the JSON-LD stuff as-is but note that it's
> incomplete, since it won't break anything else; I don't mind either way

If we've got nothing on the queue looking forward for this, IMHO it can wait till this is resolved :).

One question though, it's only this last patch which depends on bug #751991, right? if that's the case, we might also defer these other patches till JSON-LD support is finalized.

Comment 36 Sam Thursfield 2016-07-04 20:50:33 UTC

Hi Carlos, cheers for reviewing!

A few of your points are about how the new code added to libtracker-sparql
has no knowledge of Tracker's specific ontologies. This was a deliberate
choice and is the same as how the current TrackerSparqlBuilder stuff works.
As I see it, the downsides of this (current) approach are...

* TrackerSparqlBuilder and TrackerResource can't ensure that the data being
  added is valid according to Tracker's ontologies, so instead when you
  try and add invalid stuff you get an error later on, and have to work
  backwards to find which bit of code generated the error.

* Adding or removing ontologies from src/ontologies requires corresponding
  updates to libtracker-sparql/tracker-ontologies.h (and now also
  libtracker-sparlq/tracker-namespace-manager.c).

The upside is mainly that the code is simpler. I like that libtracker-sparql
is generic and not tied up with the internals of libtracker-data, because
it makes the code more reusable. Right now, you can use TrackerResource
to generate SPARQL or Turtle for any standards-compliant triple store and any
set of ontologies. It's not limited to the things Tracker itself understands.

To get to the approach you're proposing, we need to move the code out of
libtracker-data that deals with ontology data, and into libtracker-sparql.
We can feasibly do this while keeping libtracker-sparql generic, because
it can be an opt-in thing. So my only worry is that it might be too hard.
I'll look into it and get back to you :-)

About TrackerNamespaceManager being public: this ties in with TrackerResource
being reusable. If you want to use TrackerResource to generate SPARQL for
completely non-Tracker related data, you can set up your own
TrackerNamespaceManager with the namespaces you care about. I don't expect
this will be useful right away, so we could hide it for now and make it
public later... but I don't see the API needing to change much.

> +#include <tracker-ontologies.h>
> is this include needed?

Yes, to use the TRACKER_PREFIX_RDF and TRACKER_PREFIX_RDFS macros.

> Nit: we usually align variable names :). this happens in a few places...

True. I have uncrustified it now :-)

> Generally, I think these two serializing functions should offer better error
> handling (return value, GError argument?). As already said, I find passing
> the namespaces manager a bit dubious... what happens if you pass a manager
> that doesn't have all the necessary prefixes to compose the ttl/sparql? what
> happens if the TrackerResource has a mix of shorthand and expanded property
> uris and a custom namespace manager has been given? etc...

I actually did have GError parameters there originally, and removed them
because there are very few expected ways it can fail. Right now the code
calls g_warning() and returns NULL in those cases, but the only time it
does so is: if the main resource has no identifier (which is a programmer
error), or if tracker_resource_set_gvalue() was called with a GType that
the code doesn't know how to serialize (which is again a programmer error).
I think g_warning() is appropriate for this kind of problem.

As for allowing a custom set of prefixes (TrackerNamespaceManager), hopefully
it makes more sense if you think about using this code independently of
Tracker-Store. If we want to allow generating arbitrary standards-compliant
SPARQL and Turtle, then we can't actually do *any* validation of prefixed
URIs and different URI schemes. The URI standard doesn't mandate anything
after the initial ':'. So if we see 'abc:bar', there's no way to know
whether this is a compacted URL using 'abc:' prefix, or a full URI using
scheme 'abc:'. The only way to know is to have the user tell us, by adding
'abc:' to the TrackerNamespaceManager that they pass in if 'abc:' is a prefix.

It would be nice to be a bit more helpful here. One way would be to split
tracker_resource_set_uri() into two functions:
tracker_resource_set_compact_uri() and tracker_resource_set_full_uri(). Then
we'd know which the user intended and could warn during serialization if there
are unexpanded compact URIs. But that makes code using this API a bit uglier.
Is it worth it?

> Doesn't this seem like a place for the shiny tracker_extract_new_tag()?
> although I see that function does not generate bnodes which seems to be what
> we want here.
> I'm starting to wonder whether the helper functions should be bnode friendly :).
> Here you're replacing a blank node with a fixed urn tag, was that intended?

I could do with advice here: what's the advantage of inserting tags as blank
nodes?

If you set the identifier for a TrackerResource to NULL it acts as a blank
node, so it's possible for the tracker_extract_new_tag() function to work
that way. But I'm confused why they are currently handled differently to how
we handle contacts, music albums, music artists, etc...

> This fd.trackcount handling used to be under a if (fd.trackcount) {} check,
> are TrackerResource functions NULL safe?

Hmm... this is a good question. There's not really a usecase for setting a
string value or URI value property to the SPARQL "null" type. We could add
tracker_resource_set_null() if that someone does need to do that. I've fixed
the code you found there, and added a 'validation' function for the
set_string(), add_string(), set_uri() and add_uri() functions that calls
g_warning() if NULL is passed.

> One question though, it's only this last patch which depends on bug #751991,
> right? if that's the case, we might also defer these other patches till
> JSON-LD support is finalized.

Right now the whole series depends on #751991. It could be rebased off there
fairly easily, but #751991 adds a nice way to run a Tracker extract module and
see the Turtle output, so it'd be good to merge it now if possible.

I'll move the JSON-LD patch into a separate bug.

Thanks again for the review, I've pushed branch wip/sam/resource-rebase-3 with
all the clear-cut fixes done. I've also deleted the older branches.

Comment 37 Carlos Garnacho 2016-07-05 12:37:34 UTC

(In reply to Sam Thursfield from comment #36)
> Hi Carlos, cheers for reviewing!

Yw, it was about time :), fyi I think it'd be great to merge this soon.

> 
> A few of your points are about how the new code added to libtracker-sparql
> has no knowledge of Tracker's specific ontologies. This was a deliberate
> choice and is the same as how the current TrackerSparqlBuilder stuff works.
> As I see it, the downsides of this (current) approach are...
> 
> * TrackerSparqlBuilder and TrackerResource can't ensure that the data being
>   added is valid according to Tracker's ontologies, so instead when you
>   try and add invalid stuff you get an error later on, and have to work
>   backwards to find which bit of code generated the error.
> 
> * Adding or removing ontologies from src/ontologies requires corresponding
>   updates to libtracker-sparql/tracker-ontologies.h (and now also
>   libtracker-sparlq/tracker-namespace-manager.c).

... which btw would be great to make fully dynamic, in provision for both making tracker a generic store in the future, and so more immediate ontology changes turn out seamless (which I see rather imminent tbh...)

> 
> The upside is mainly that the code is simpler. I like that libtracker-sparql
> is generic and not tied up with the internals of libtracker-data, because
> it makes the code more reusable. Right now, you can use TrackerResource
> to generate SPARQL or Turtle for any standards-compliant triple store and any
> set of ontologies. It's not limited to the things Tracker itself understands.

Fair point.

> 
> To get to the approach you're proposing, we need to move the code out of
> libtracker-data that deals with ontology data, and into libtracker-sparql.
> We can feasibly do this while keeping libtracker-sparql generic, because
> it can be an opt-in thing. So my only worry is that it might be too hard.
> I'll look into it and get back to you :-)

Let me suggest, how about implementing a TrackerNamespaceManager subclass in libtracker-data that adds all the ontology prefixes to self, and making the default namespace manager getter somehow hook into that?

> 
> About TrackerNamespaceManager being public: this ties in with TrackerResource
> being reusable. If you want to use TrackerResource to generate SPARQL for
> completely non-Tracker related data, you can set up your own
> TrackerNamespaceManager with the namespaces you care about. I don't expect
> this will be useful right away, so we could hide it for now and make it
> public later... but I don't see the API needing to change much.
> 
> > +#include <tracker-ontologies.h>
> > is this include needed?
> 
> Yes, to use the TRACKER_PREFIX_RDF and TRACKER_PREFIX_RDFS macros.

Oh, missed that :).

> 
> > Nit: we usually align variable names :). this happens in a few places...
> 
> True. I have uncrustified it now :-)
> 
> > Generally, I think these two serializing functions should offer better error
> > handling (return value, GError argument?). As already said, I find passing
> > the namespaces manager a bit dubious... what happens if you pass a manager
> > that doesn't have all the necessary prefixes to compose the ttl/sparql? what
> > happens if the TrackerResource has a mix of shorthand and expanded property
> > uris and a custom namespace manager has been given? etc...
> 
> I actually did have GError parameters there originally, and removed them
> because there are very few expected ways it can fail. Right now the code
> calls g_warning() and returns NULL in those cases, but the only time it
> does so is: if the main resource has no identifier (which is a programmer
> error), or if tracker_resource_set_gvalue() was called with a GType that
> the code doesn't know how to serialize (which is again a programmer error).
> I think g_warning() is appropriate for this kind of problem.

Point, if G_DEBUG=fatal* is able to get me to the place where the error happens, I'll be happy :).

> 
> As for allowing a custom set of prefixes (TrackerNamespaceManager), hopefully
> it makes more sense if you think about using this code independently of
> Tracker-Store. If we want to allow generating arbitrary standards-compliant
> SPARQL and Turtle, then we can't actually do *any* validation of prefixed
> URIs and different URI schemes. The URI standard doesn't mandate anything
> after the initial ':'. So if we see 'abc:bar', there's no way to know
> whether this is a compacted URL using 'abc:' prefix, or a full URI using
> scheme 'abc:'. The only way to know is to have the user tell us, by adding
> 'abc:' to the TrackerNamespaceManager that they pass in if 'abc:' is a
> prefix.
> 
> It would be nice to be a bit more helpful here. One way would be to split
> tracker_resource_set_uri() into two functions:
> tracker_resource_set_compact_uri() and tracker_resource_set_full_uri(). Then
> we'd know which the user intended and could warn during serialization if
> there
> are unexpanded compact URIs. But that makes code using this API a bit uglier.
> Is it worth it?

Nah, probably not. If the user mixes both, it's up to him. However, would be nice to spare the user from having to get the default namespace manager if this is the most common case, perhaps allowing NULL in the serialization functions and fetching the default manager internally in that case?

Oh, and now that I notice, IMO TrackerResource pretty much replaces the need for TrackerSparqlBuilder, to the point that I'd favor deprecating it. I think it's unfortunate that the tracker_resource_generate_sparql_update() function takes one then :). I wouldn't mind if for the moment one is used internally, but I think I'd prefer that function to just return gchar* as well :).

> 
> > Doesn't this seem like a place for the shiny tracker_extract_new_tag()?
> > although I see that function does not generate bnodes which seems to be what
> > we want here.
> > I'm starting to wonder whether the helper functions should be bnode friendly :).
> > Here you're replacing a blank node with a fixed urn tag, was that intended?
> 
> I could do with advice here: what's the advantage of inserting tags as blank
> nodes?
> 
> If you set the identifier for a TrackerResource to NULL it acts as a blank
> node, so it's possible for the tracker_extract_new_tag() function to work
> that way. But I'm confused why they are currently handled differently to how
> we handle contacts, music albums, music artists, etc...

Fwiw, the first comment was about a contact urn, inserted as a blank node :), I was wondering there if the helper functions could be used more extensively, there's a few places where they look like they might fit but still things are done "manually", I presumed keeping those a blank node was the main intention.

As for the second comment (and your reply), I guess my concern is somewhat moot with the way TrackerResource works. With fixed URNs it turned out too easy to screw up breaking cardinality and whatnot, eg. assuming that the node is not there previously and inserting slightly different data, getting unintended collisions, etc.

I, however, have higher level concerns around fixed URN schemes in an RDF world with multiple independent information sources, esp. around how is the info filled in the gaps (eg. contacts from IM conversations being transparently merged with contacts from emails). In the end I think that each miner should be individually smarter at looking up and completing or inserting these kind of shared resources, a fixed URN scheme might look like it might help, but falls short because it's entirely made up and there is nothing actually enforcing it. And I'll stop rambling now :).

I also recommend you to read bug #766579 and the mentioned bugs in there to see some limitations we're hitting with fixed URNs (although presumably "make those more specific" might be a valid answer too).

Fwiw, I don't expect you tackle any of this here, but I would prefer that this refactor didn't really involve any substantial change to the final stored content.

> 
> > This fd.trackcount handling used to be under a if (fd.trackcount) {} check,
> > are TrackerResource functions NULL safe?
> 
> Hmm... this is a good question. There's not really a usecase for setting a
> string value or URI value property to the SPARQL "null" type. We could add
> tracker_resource_set_null() if that someone does need to do that. I've fixed
> the code you found there, and added a 'validation' function for the
> set_string(), add_string(), set_uri() and add_uri() functions that calls
> g_warning() if NULL is passed.

That sounds good :).

> 
> > One question though, it's only this last patch which depends on bug #751991,
> > right? if that's the case, we might also defer these other patches till
> > JSON-LD support is finalized.
> 
> Right now the whole series depends on #751991. It could be rebased off there
> fairly easily, but #751991 adds a nice way to run a Tracker extract module
> and
> see the Turtle output, so it'd be good to merge it now if possible.

Aye, will review later in the day :).

> 
> I'll move the JSON-LD patch into a separate bug.
> 
> Thanks again for the review, I've pushed branch wip/sam/resource-rebase-3
> with
> all the clear-cut fixes done. I've also deleted the older branches.

Thanks!

Comment 38 Sam Thursfield 2016-07-07 00:00:10 UTC

OK, I think the only remaining sticking point is the blank node changes.

I agree it'd be good if this rewrite of the code didn't change how we insert
data into store. But there's a tradeoff of keeping thing stable, vs. keeping
code simple, and vs. making extractors be consistant with each other.

The existing way of creating tags seems to be this:

	INSERT {
		GRAPH <xxx> {
			_:tag a nao:Tag ; nao:prefLabel "Tag Label"
		}
	} WHERE {
		FILTER (NOT EXISTS {
			?tag a nao:Tag ; nao:prefLabel "Tag Label"
		})
	}

And then in the metadata we do:

	<file> nao:hasTag ?tag1
	WHERE { ?tag1 a nao:Tag ; nao:prefLabel "Tag Label" }

This approach means there's only ever 1 resource for each "Tag Label",
and the URI will be a random, unique ID.

Adapting TrackerResource to generate this SPARQL would be quite a bit of extra
code. As far as I can see, the only benefit of having a tag URI that isn't based
on the tag's label is that, if you relabel the tag, its URI doesn't become
misleading.

But, even if we do create a tag resource at <urn:tag:mefirst> with label "Me
First", then rename it to "Me Second", nothing breaks. Now we have a tag with
URI <urn:tag:mefirst> and label "Me Second", which is a bit misleading, but
it's totally backwards-compatible with how things currently work.

I'm not sure it's worth adding lots of extra code to TrackerResource to
support this use case.


For "contacts", different extractors took different approaches. Some already
generated fixed urn:contact:XXX URIs. Others did it like this:

	<file> nco:publisher [ a nco:Contact ; nco:fullname "Publisher Name" ]

None of them check whether a nco:Contact resource with that name already
exists, so users may end up with many nco:Contact resources each representing
"Publisher Name", each with a random, unique URI. It's easy to do recreate this
behaviour with TrackerResource, but I think it's a bit broken in the first
place.

The real problem there is that the XMP "publisher" tag and such things
shouldn't be nco:Contact resources. All we have is the "name" for these,
and people's names are certainly not globally unique, so we can't really
infer much from that data. We should probably have a separate ContentCreator
resource type for these things.

I'd like to make the extractors consistant in this regard one way or the
other. I've done the easiest thing right now, which is ensure all extractors
create nco:Contact resources with fixed URIs for content-creator info. I'm
open to changing every extractor to use a different approach. I'm less open
to doing extra work just so we preserve the existing mismatched behaviour of
the different extractors, because I can't see it causing much harm. I guess
the only issue is if you have a friend named "James Joyce" stored as a contact,
and then you extract a .pdf file where the creator is someone named "James
Joyce", some program might think that it was written by your friend. But since
some extractors already used fixed URIs, this isn't a new problem, it's just
more likely to happen now. And I think it's minor in any case.

Good spot that not everything was using the resource-helper classes: I've
updated the code to use those everywhere (I think). I've also updated the
commit message in the last patch to note what behaviour changes there are in
extraction.

> Let me suggest, how about implementing a TrackerNamespaceManager subclass in
> libtracker-data that adds all the ontology prefixes to self, and making the
> default namespace manager getter somehow hook into that?

Good idea, I've not done that now, but the existing API allows that to happen
in future.

> would be nice to spare the user from having to get the default namespace
> manager if this is the most common case, perhaps allowing NULL in the
> serialization functions and fetching the default manager internally in that
> case?

Done

> TrackerResource pretty much replaces the need for TrackerSparqlBuilder, to
> the point that I'd favor deprecating it. I think it's unfortunate that the
> tracker_resource_generate_sparql_update() function takes one then :). I
> wouldn't mind if for the moment one is used internally, but I think I'd
> prefer that function to just return gchar* as well :).

Done

New stuff pushed to branch wip/sam/resource-rebase-4

Comment 39 Sam Thursfield 2016-07-07 00:01:36 UTC

OK, I think the only remaining sticking point is the blank node changes.

I agree it'd be good if this rewrite of the code didn't change how we insert
data into store. But there's a tradeoff of keeping thing stable, vs. keeping
code simple, and vs. making extractors be consistant with each other.

The existing way of creating tags seems to be this:

	INSERT {
		GRAPH <xxx> {
			_:tag a nao:Tag ; nao:prefLabel "Tag Label"
		}
	} WHERE {
		FILTER (NOT EXISTS {
			?tag a nao:Tag ; nao:prefLabel "Tag Label"
		})
	}

And then in the metadata we do:

	<file> nao:hasTag ?tag1
	WHERE { ?tag1 a nao:Tag ; nao:prefLabel "Tag Label" }

This approach means there's only ever 1 resource for each "Tag Label",
and the URI will be a random, unique ID.

Adapting TrackerResource to generate this SPARQL would be quite a bit of extra
code. As far as I can see, the only benefit of having a tag URI that isn't based
on the tag's label is that, if you relabel the tag, its URI doesn't become
misleading.

But, even if we do create a tag resource at <urn:tag:mefirst> with label "Me
First", then rename it to "Me Second", nothing breaks. Now we have a tag with
URI <urn:tag:mefirst> and label "Me Second", which is a bit misleading, but
it's totally backwards-compatible with how things currently work.

I'm not sure it's worth adding lots of extra code to TrackerResource to
support this use case.


For "contacts", different extractors took different approaches. Some already
generated fixed urn:contact:XXX URIs. Others did it like this:

	<file> nco:publisher [ a nco:Contact ; nco:fullname "Publisher Name" ]

None of them check whether a nco:Contact resource with that name already
exists, so users may end up with many nco:Contact resources each representing
"Publisher Name", each with a random, unique URI. It's easy to do recreate this
behaviour with TrackerResource, but I think it's a bit broken in the first
place.

The real problem there is that the XMP "publisher" tag and such things
shouldn't be nco:Contact resources. All we have is the "name" for these,
and people's names are certainly not globally unique, so we can't really
infer much from that data. We should probably have a separate ContentCreator
resource type for these things.

I'd like to make the extractors consistant in this regard one way or the
other. I've done the easiest thing right now, which is ensure all extractors
create nco:Contact resources with fixed URIs for content-creator info. I'm
open to changing every extractor to use a different approach. I'm less open
to doing extra work just so we preserve the existing mismatched behaviour of
the different extractors, because I can't see it causing much harm. I guess
the only issue is if you have a friend named "James Joyce" stored as a contact,
and then you extract a .pdf file where the creator is someone named "James
Joyce", some program might think that it was written by your friend. But since
some extractors already used fixed URIs, this isn't a new problem, it's just
more likely to happen now. And I think it's minor in any case.

Good spot that not everything was using the resource-helper classes: I've
updated the code to use those everywhere (I think). I've also updated the
commit message in the last patch to note what behaviour changes there are in
extraction.

> Let me suggest, how about implementing a TrackerNamespaceManager subclass in
> libtracker-data that adds all the ontology prefixes to self, and making the
> default namespace manager getter somehow hook into that?

Good idea, I've not done that now, but the existing API allows that to happen
in future.

> would be nice to spare the user from having to get the default namespace
> manager if this is the most common case, perhaps allowing NULL in the
> serialization functions and fetching the default manager internally in that
> case?

Done

> TrackerResource pretty much replaces the need for TrackerSparqlBuilder, to
> the point that I'd favor deprecating it. I think it's unfortunate that the
> tracker_resource_generate_sparql_update() function takes one then :). I
> wouldn't mind if for the moment one is used internally, but I think I'd
> prefer that function to just return gchar* as well :).

Done

New stuff pushed to branch wip/sam/resource-rebase-4

Comment 40 Carlos Garnacho 2016-07-12 11:58:14 UTC

(In reply to Sam Thursfield from comment #39)
> OK, I think the only remaining sticking point is the blank node changes.
> 
> I agree it'd be good if this rewrite of the code didn't change how we insert
> data into store. But there's a tradeoff of keeping thing stable, vs. keeping
> code simple, and vs. making extractors be consistant with each other.
> 
> The existing way of creating tags seems to be this:
> 
> 	INSERT {
> 		GRAPH <xxx> {
> 			_:tag a nao:Tag ; nao:prefLabel "Tag Label"
> 		}
> 	} WHERE {
> 		FILTER (NOT EXISTS {
> 			?tag a nao:Tag ; nao:prefLabel "Tag Label"
> 		})
> 	}
> 
> And then in the metadata we do:
> 
> 	<file> nao:hasTag ?tag1
> 	WHERE { ?tag1 a nao:Tag ; nao:prefLabel "Tag Label" }
> 
> This approach means there's only ever 1 resource for each "Tag Label",
> and the URI will be a random, unique ID.
> 
> Adapting TrackerResource to generate this SPARQL would be quite a bit of
> extra
> code. As far as I can see, the only benefit of having a tag URI that isn't
> based
> on the tag's label is that, if you relabel the tag, its URI doesn't become
> misleading.
> 
> But, even if we do create a tag resource at <urn:tag:mefirst> with label "Me
> First", then rename it to "Me Second", nothing breaks. Now we have a tag with
> URI <urn:tag:mefirst> and label "Me Second", which is a bit misleading, but
> it's totally backwards-compatible with how things currently work.

My concern is, how would TrackerResource cope if at a later point the extractor calls again tracker_extract_new_tag() so a new <urn:tag:mefirst> label with "Me First" text is created? I see two possibilities:

- The previous "Me Second" label remains, the file points to a "wrong" tag
- The label is changed behind the user's back

tags are a tricky usecase because they are both extracted and user-settable content.

> 
> I'm not sure it's worth adding lots of extra code to TrackerResource to
> support this use case.
> 
> 
> For "contacts", different extractors took different approaches. Some already
> generated fixed urn:contact:XXX URIs. Others did it like this:
> 
> 	<file> nco:publisher [ a nco:Contact ; nco:fullname "Publisher Name" ]
> 
> None of them check whether a nco:Contact resource with that name already
> exists, so users may end up with many nco:Contact resources each representing
> "Publisher Name", each with a random, unique URI. It's easy to do recreate
> this
> behaviour with TrackerResource, but I think it's a bit broken in the first
> place.

Agreed, we're in a pretty bad state wrt data coalescing.

> 
> The real problem there is that the XMP "publisher" tag and such things
> shouldn't be nco:Contact resources. All we have is the "name" for these,
> and people's names are certainly not globally unique, so we can't really
> infer much from that data. We should probably have a separate ContentCreator
> resource type for these things.

Yeah, probably...

> 
> I'd like to make the extractors consistant in this regard one way or the
> other. I've done the easiest thing right now, which is ensure all extractors
> create nco:Contact resources with fixed URIs for content-creator info. I'm
> open to changing every extractor to use a different approach. I'm less open
> to doing extra work just so we preserve the existing mismatched behaviour of
> the different extractors, because I can't see it causing much harm. I guess
> the only issue is if you have a friend named "James Joyce" stored as a
> contact,
> and then you extract a .pdf file where the creator is someone named "James
> Joyce", some program might think that it was written by your friend. But
> since

Bonus trouble points if you also have some document from your friend around in the FS :).

> some extractors already used fixed URIs, this isn't a new problem, it's just
> more likely to happen now. And I think it's minor in any case.

Agreed, it is with nco:Contact. 

> 
> Good spot that not everything was using the resource-helper classes: I've
> updated the code to use those everywhere (I think). I've also updated the
> commit message in the last patch to note what behaviour changes there are in
> extraction.

Ok, I won't cling to these. I guess the nao:Tag point is moot anyway, since our tools allow only adding/deleting, not editing tags. Things can get broken with tracker sparql -u, but that's no big news.

> 
> > Let me suggest, how about implementing a TrackerNamespaceManager subclass in
> > libtracker-data that adds all the ontology prefixes to self, and making the
> > default namespace manager getter somehow hook into that?
> 
> Good idea, I've not done that now, but the existing API allows that to happen
> in future.

Fair enough, will put it in the "nice to clean up soon" list here.

> 
> > would be nice to spare the user from having to get the default namespace
> > manager if this is the most common case, perhaps allowing NULL in the
> > serialization functions and fetching the default manager internally in that
> > case?
> 
> Done
> 
> > TrackerResource pretty much replaces the need for TrackerSparqlBuilder, to
> > the point that I'd favor deprecating it. I think it's unfortunate that the
> > tracker_resource_generate_sparql_update() function takes one then :). I
> > wouldn't mind if for the moment one is used internally, but I think I'd
> > prefer that function to just return gchar* as well :).
> 
> Done
> 
> New stuff pushed to branch wip/sam/resource-rebase-4

Thanks! It's looking fit for master I'd say, I noticed some minor stuff:
- tracker_extract_new_tag/equipment() Adding an extra ':' at the end of the generated urn
- If the extractor module returns a NULL resource (ie. no data to be extracted), we still used to tag the file as successfully extracted through the nie:dataSource property, otherwise the file keeps coming on subsequent tracker-extract queries, and it gets stuck. https://paste.fedoraproject.org/390176/46832453/ fixes this here, feel free to squash.
- Missing g_return_* through the new public API calls

The first 2 could make sense to fix on the fly before you push, the last one can be fixed later on master, your call.

Comment 41 Carlos Garnacho 2016-07-12 12:39:31 UTC

Ugh, upon further inspection of the inserted data, I see some things like this in tracker info:

$ tracker info /some/audio/file.mp3
  ...
  'nie:isLogicalPartOf' = '_:7987'
  'nie:isLogicalPartOf' = '_:7988'
  ...
  'nmm:musicAlbum' = '_:7987'
  'nmm:musicAlbumDisc' = '_:7988'
  'nmm:performer' = '_:7986'

So there seems to be something fishy with the way we deal with URN objects.

Comment 42 Sam Thursfield 2016-07-14 01:29:49 UTC

> - tracker_extract_new_tag/equipment() Adding an extra ':' at the end of the generated urn

Fixed.

> - If the extractor module returns a NULL resource (ie. no data to be extracted), we still used to tag the file as successfully extracted through the nie:dataSource property, otherwise the file keeps coming on subsequent tracker-extract queries, and it gets stuck. https://paste.fedoraproject.org/390176/46832453/ fixes this here, feel free to squash.

I recently discovered this myself, the hard way! Now fixed.

> - Missing g_return_* through the new public API calls

Fixed.

> Ugh, upon further inspection of the inserted data, I see some things like this in tracker info: ...

That turned out to be a stray NULL in the g_object_new() call to create a
TrackerResource: the identifier was being ignored. But it also highlighted that
TrackerResource Turtle and SPARQL generation is actually broken if folk use 
blank nodes. I've now fixed both issues. In the process, I removed all usage of
TrackerSparqlBuilder from TrackerResource, which actually ended up allowing the
removal of some code because now the Turtle and SPARQL generation can share some
code paths.

I'm going away for a couple of months as of this Saturday. So, I'm going to
re-extract all my files again, check for any new errors, and then merge it
tomorrow if it looks OK. Of course, feel free to revert if some huge unforseen
disaster occurs.

Comment 43 Hussam Al-Tayeb 2016-07-14 06:21:10 UTC

It now fails to compile:

/usr/include/glib-2.0/gio/gsimpleasyncresult.h:104:21: note: declared here
 void                g_simple_async_result_complete         (GSimpleAsyncResult      *simple);
                     ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
  CC       tracker-cursor.lo
  CC       tracker-namespace-manager.lo
  CC       tracker-resource.lo
tracker-resource.c:30:31: fatal error: tracker-generated.h: No such file or directory
 #include "tracker-generated.h"
                               ^
compilation terminated.
make[4]: *** [Makefile:728: tracker-resource.lo] Error 1

git branch
* wip/sam/resource-rebase-5

Comment 44 Sam Thursfield 2016-07-14 12:21:00 UTC

Hmm, I guess that was hidden for me because I was compiling from the libtracker-sparql directory and -I. is in the GCC arguments... or something. I've pushed a fix to branch wip/sam/resource-rebase-6

Comment 45 Hussam Al-Tayeb 2016-07-14 12:34:56 UTC

I always compile clean checkouts to test for these things :)

Thank you. It compiles now.

Comment 46 Carlos Garnacho 2016-07-14 12:46:43 UTC

(In reply to Sam Thursfield from comment #42)
> > - tracker_extract_new_tag/equipment() Adding an extra ':' at the end of the generated urn
> 
> Fixed.
> 
> > - If the extractor module returns a NULL resource (ie. no data to be extracted), we still used to tag the file as successfully extracted through the nie:dataSource property, otherwise the file keeps coming on subsequent tracker-extract queries, and it gets stuck. https://paste.fedoraproject.org/390176/46832453/ fixes this here, feel free to squash.
> 
> I recently discovered this myself, the hard way! Now fixed.
> 
> > - Missing g_return_* through the new public API calls
> 
> Fixed.
> 
> > Ugh, upon further inspection of the inserted data, I see some things like this in tracker info: ...
> 
> That turned out to be a stray NULL in the g_object_new() call to create a
> TrackerResource: the identifier was being ignored. But it also highlighted
> that
> TrackerResource Turtle and SPARQL generation is actually broken if folk use 
> blank nodes. I've now fixed both issues. In the process, I removed all usage
> of
> TrackerSparqlBuilder from TrackerResource, which actually ended up allowing
> the
> removal of some code because now the Turtle and SPARQL generation can share
> some
> code paths.

Sounds good!

> 
> I'm going away for a couple of months as of this Saturday. So, I'm going to

I guess this means I won't see you in Karlsruhe :). Enjoy this summer break!

> re-extract all my files again, check for any new errors, and then merge it
> tomorrow if it looks OK. Of course, feel free to revert if some huge
> unforseen
> disaster occurs.

Please do, I'll deal with the possible fallout :). This is a big step forward, thanks for doing this!

Comment 47 Sam Thursfield 2016-07-14 23:48:35 UTC

OK, all merged! and no, sadly no GUADEC for me this year. But who knows, perhaps next year i'll see you here in Manchester :-)

Comment 48 Debarshi Ray 2016-08-31 15:36:18 UTC

Review of attachment 330361 [details] [review]:

::: src/libtracker-extract/tracker-resource-helpers.c
@@ +122,3 @@
+	g_return_val_if_fail (make != NULL || model != NULL, NULL);
+
+	equip_uri = tracker_sparql_escape_uri_printf ("urn:equipment:%s:%s:", make ? make : "", model ? model : "");

I see that the trailing colon was dropped in the patch that was actually pushed (ie. commit 032e2a7ed874) - presumably because of comment 40. However, the trailing colon was used by the extractors prior to this change. eg., urn:equipment:Nokia:N9:, urn:equipment:SONY:HDR-CX405:, etc.. I wonder if this was deliberate. I only ask because I'd probably have to drop it from gnome-online-miners to keep the DB clean.

Comment 49 Debarshi Ray 2016-08-31 16:42:24 UTC

(In reply to Debarshi Ray from comment #48)
> I only ask because I'd probably have to drop it from
> gnome-online-miners to keep the DB clean.

See bug 770658 for a related gnome-online-miners fix.

Comment 50 Sam Thursfield 2016-08-31 17:19:22 UTC

(In reply to Debarshi Ray from comment #48)
> Review of attachment 330361 [details] [review] [review]:
> 
> ::: src/libtracker-extract/tracker-resource-helpers.c
> @@ +122,3 @@
> +	g_return_val_if_fail (make != NULL || model != NULL, NULL);
> +
> +	equip_uri = tracker_sparql_escape_uri_printf ("urn:equipment:%s:%s:", make
> ? make : "", model ? model : "");
> 
> I see that the trailing colon was dropped in the patch that was actually
> pushed (ie. commit 032e2a7ed874) - presumably because of comment 40.
> However, the trailing colon was used by the extractors prior to this change.
> eg., urn:equipment:Nokia:N9:, urn:equipment:SONY:HDR-CX405:, etc.. I wonder
> if this was deliberate. I only ask because I'd probably have to drop it from
> gnome-online-miners to keep the DB clean.

Thanks for spotting this! It wasn't a deliberate change. I think we should  restore the old behaviour before the stable release -- do you agree?

Comment 51 Carlos Garnacho 2016-08-31 21:19:00 UTC

(In reply to Sam Thursfield from comment #50)
> (In reply to Debarshi Ray from comment #48)
> > Review of attachment 330361 [details] [review] [review] [review]:
> > 
> > ::: src/libtracker-extract/tracker-resource-helpers.c
> > @@ +122,3 @@
> > +	g_return_val_if_fail (make != NULL || model != NULL, NULL);
> > +
> > +	equip_uri = tracker_sparql_escape_uri_printf ("urn:equipment:%s:%s:", make
> > ? make : "", model ? model : "");
> > 
> > I see that the trailing colon was dropped in the patch that was actually
> > pushed (ie. commit 032e2a7ed874) - presumably because of comment 40.
> > However, the trailing colon was used by the extractors prior to this change.
> > eg., urn:equipment:Nokia:N9:, urn:equipment:SONY:HDR-CX405:, etc.. I wonder
> > if this was deliberate. I only ask because I'd probably have to drop it from
> > gnome-online-miners to keep the DB clean.
> 
> Thanks for spotting this! It wasn't a deliberate change. I think we should 
> restore the old behaviour before the stable release -- do you agree?

fwiw, I agree it's worth to preserve data compatibility :)

Comment 52 Debarshi Ray 2016-09-01 08:40:27 UTC

Created attachment 334590 [details] [review]
libtracker-extract: Restore trailing colon in nfo:Equipment URIs

Comment 53 Sam Thursfield 2016-09-03 10:26:19 UTC

Review of attachment 334590 [details] [review]:

Thanks :-)