GNOME Bugzilla – Bug 767472
Add TrackerResource and rewrite all extractors to use that instead of TrackerSparqlBuilder
Last modified: 2016-09-03 10:33:07 UTC
These patches introduce the TrackerResource API to libtracker-sparql, and convert all of Tracker's extract modules to use that instead of TrackerSparqlBuilder. Rationale is described here: <https://mail.gnome.org/archives/tracker-list/2016-April/msg00002.html> I've tested this on a random set of inputs that cover each tracker-extract module. (The tree of files was set up with the create-tree-from-real-data script, which I've just pushed to 'master'). I compared against the wip/sam/extract-command branch to see if there were any regressions in execution speed or memory usage, the command I ran was this: sudo sh -c 'echo 2 > /proc/sys/vm/drop_caches'; \ time find 1 -type f -fprint /dev/stderr -exec \ jhbuild -f /home/sam/.jhbuildrc run \ /usr/bin/time -f "Time: %E Mem: %MKb" \ /opt/gnome/libexec/tracker-extract --verbosity=0 \ --file \{} \ > /dev/null \; Results are attached (as performance-old.txt an performance-new.txt); there is some variance in speed both ways but it's clear that this branch doesn't introduce any major changes there. The maximum RSS size is slightly higher in most cases, but that could well be because I added a dependency on JSON-GLIB. I have run each module under Valgrind and made sure there are no new memory leaks. The 400-extractor functional test is currently broken by these changes. I think the best thing to do with that is delete the hacky ad-hoc SPARQL parsing code it uses, and make use of the new JSON-LD output provided by tracker-extract instead. This would be simpler than using the new Turtle output because Python has a built-in JSON parser, but no built-in Turtle parser. Code is available in branch wip/sam/resource, or attached to this bug. These patches depend on <https://bugzilla.gnome.org/show_bug.cgi?id=751991>.
Created attachment 329508 [details] [review] Bump GLib dependency to 2.44 So I can use G_DECLARE_DERIVABLE_TYPE() instead of writing out everything manually.
Created attachment 329509 [details] [review] libtracker-sparql: Add TRACKER_TYPE_URI This is the same as G_TYPE_STRING, but we need to treat strings and URIs differently when generating SPARQL from the TrackerResource class, so we need a separate GType to keep track of which is which when given a GValue.
Created attachment 329510 [details] [review] libtracker-sparql: Add TrackerNamespaceManager This will keep track of a set of namespaces and their prefixes. Then, when we are serializing a resource, we can use it. This is separate from the Namespace and Ontologies classes in libtracker-data. They may be able to make use of it though.
Created attachment 329511 [details] [review] libtracker-sparql: Add TrackerResource class This provides a "resource-oriented" API for inserting and updating the database. Rather than having to generate SPARQL queries, you can use the TrackerResource abstraction to prepare information about a set of resources, then generate a SPARQL query automatically. TrackerResource can also serialize to Turtle directly.
Created attachment 329512 [details] [review] libtracker-extract: Add resource-helpers module This is factoring common code out of the individual extract modules.
Created attachment 329513 [details] [review] Use TrackerResource instead of TrackerSparqlBuilder in all extractors For a long time, all the Tracker extractors have manually constructed a SPARQL update command using TrackerSparqlBuilder to represent their output. This commit changes all of them to use the TrackerResource class instead, which makes the code a lot more concise and readable. This introduces some API breaks in the internal libtracker-extract library. This has been a private library since Tracker 0.16 or earlier, so it's fine. If the extractors only output SPARQL then they are only useful to people who are using a SPARQL store. Now we can output a serialization format like Turtle as well. This will hopefully make the extract modules useful outside of Tracker itself.
Created attachment 329514 [details] [review] Add support to extractors for outputting metadata as JSON-LD This adds a new dependency on the JSON-GLib library.
Created attachment 329515 [details] Speed and memory usage *without* these changes
Created attachment 329516 [details] Speed and memory usage *with* these changes
Hi. I tried compiling from wip/sam/resource branch and it stropped at: ../../../src/libtracker-extract/.libs/libtracker-extract.so: undefined reference to `tracker_resource_set_ensure_unique' collect2: error: ld returned 1 exit status
Hi Hussam! Thanks for trying this out. Looks like I messed up while rebasing; I've pushed a fix, and tested the XMP parsing code (which was where the mistake was) against a JPEG with XMP metadata. I've not updated the attached patches on the assumption that there will be quite a few more things to fix yet :-)
Another one: CC libextract_libav_la-tracker-extract-libav.lo tracker-extract-libav.c: In function ‘tracker_extract_get_metadata’: tracker-extract-libav.c:236:10: error: ‘performer_uri’ undeclared (first use in this function) g_free(performer_uri); ^~~~~~~~~~~~~ this is when configuring with --enable-generic-media-extractor=libav
Ah, I forgot about the libav extractor when testing. I'm glad you spotted this. I've now made sure the libav extractor works, and have pushed fixes for that and some other issues I spotted.
Another issue: Full text search no longer works. I can no longer do tracker search keyword. It shows no results after recreating the tracker database.
(In reply to Hussam Al-Tayeb from comment #14) > Another issue: > Full text search no longer works. > I can no longer do tracker search keyword. > It shows no results after recreating the tracker database. Just to confirm, with a build from master branch, this works as expected so only broken in wip/sam/resource branch.
I have to admit this is really puzzling me. I can reproduce the problem, but the correct data does *seem* to be being inserted into the store, so it's not obvious why search doesn't work. Will keep digging.
OK, I've pushed a fix. It was a bad DELETE {} command in the SPARQL generated by TrackerResource, which was deleting everything in the store rather than just the triples the extractor was about to replace!
Created attachment 330357 [details] [review] Bump GLib dependency to 2.44 So I can use G_DECLARE_DERIVABLE_TYPE() instead of writing out everything manually.
Created attachment 330358 [details] [review] libtracker-sparql: Add TRACKER_TYPE_URI This is the same as G_TYPE_STRING, but we need to treat strings and URIs differently when generating SPARQL from the TrackerResource class, so we need a separate GType to keep track of which is which when given a GValue.
Created attachment 330359 [details] [review] libtracker-sparql: Add TrackerNamespaceManager This will keep track of a set of namespaces and their prefixes. Then, when we are serializing a resource, we can use it. This is separate from the Namespace and Ontologies classes in libtracker-data. They may be able to make use of it though.
Created attachment 330360 [details] [review] libtracker-sparql: Add TrackerResource class This provides a "resource-oriented" API for inserting and updating the database. Rather than having to generate SPARQL queries, you can use the TrackerResource abstraction to prepare information about a set of resources, then generate a SPARQL query automatically. TrackerResource can also serialize to Turtle directly.
Created attachment 330361 [details] [review] libtracker-extract: Add resource-helpers module This is factoring common code out of the individual extract modules.
Created attachment 330362 [details] [review] Use TrackerResource instead of TrackerSparqlBuilder in all extractors For a long time, all the Tracker extractors have manually constructed a SPARQL update command using TrackerSparqlBuilder to represent their output. This commit changes all of them to use the TrackerResource class instead, which makes the code a lot more concise and readable. This introduces some API breaks in the internal libtracker-extract library. This has been a private library since Tracker 0.16 or earlier, so it's fine. If the extractors only output SPARQL then they are only useful to people who are using a SPARQL store. Now we can output a serialization format like Turtle as well. This will hopefully make the extract modules useful outside of Tracker itself.
Created attachment 330363 [details] [review] Add support to extractors for outputting metadata as JSON-LD This adds a new dependency on the JSON-GLib library.
Many thanks to Hussam for spotting my dumb mistakes. I've pushed a version with all the fixes to branch wip/sam/resource-rebase-1 <https://git.gnome.org/browse/tracker/log/?h=wip/sam/resource-rebase-1>. Still depends on <https://bugzilla.gnome.org/show_bug.cgi?id=751991> (add a `tracker extract` command).
I should note the JSON-LD output is somewhat incomplete, it should generate an @context field, but does not. The JSON-LD stuff is all in the final commit, so the rest of the changes can be merged separately if need be. It might make sense to merge the JSON-LD stuff as-is but note that it's incomplete, since it won't break anything else; I don't mind either way
Hello. I'm not sure if this is related but libtracker-sparql/tracker-uri.h is no longer being installed on make install and it is #included from /usr/include/tracker-1.0/libtracker-sparql/tracker-sparql.h which results in a compile error in nautilus /usr/include/tracker-1.0/libtracker-sparql/tracker-sparql.h:26:43: fatal error: libtracker-sparql/tracker-uri.h: No such file or directory I manually installed the file to work around this.
Hi Hussam. That was indeed my fault, thanks for catching another issue ! I've pushed a fix.
Review of attachment 330359 [details] [review]: The object makes sense to me, given it's exposed in TrackerResource API, although I think: does it make sense to expose this at all? it probably only makes sense to print to turtle/sparql with the default TrackerNamespaceManager object anyway... ::: src/libtracker-sparql/tracker-namespace-manager.c @@ +152,3 @@ + tracker_namespace_manager_add_prefix (manager, "nmm", TRACKER_PREFIX_NMM); + tracker_namespace_manager_add_prefix (manager, "mlo", TRACKER_PREFIX_MLO); + tracker_namespace_manager_add_prefix (manager, "mfo", TRACKER_PREFIX_MFO); It would be great if this initialization could be more dynamic (i.e. based on ontology data). It looks like one more place to change if ontology is added/changed. Perhaps letting TrackerOntology initialize this? it seems to go in the opposite direction that you seem to suggest in the commit log though.
Review of attachment 330360 [details] [review]: Nice API :), I've got mostly style comments. ::: src/libtracker-sparql/tracker-resource.c @@ +25,3 @@ +#include <tracker-uri.h> +#include <tracker-resource.h> +#include <tracker-ontologies.h> is this include needed? This makes me think of a feature request for the future, that TrackerResource checks maxCardinality from the ontology for the added properties, so it's possible to error out in the miner/extractor, rather than catching this late when the sparql insertion fails. But in the mean time, this include seems unused :). @@ +202,3 @@ + + /* The identifier may be NULL, so I don't think we can pass it in the va_list + * that g_object_new() accepts. I think you can just do: g_object_new (TRACKER_TYPE_RESOURCE, "identifier", NULL, NULL); if it needs be, properties/values are dealt with in pairs, the NULL sentinel always applies when fetching the next property name. @@ +240,3 @@ +tracker_resource_set_gvalue (TrackerResource *self, + const char *property_uri, + GValue *value) I'd make these calls that copy the value take a const GValue *. In provision to having TrackerResource perform basic ontology checks, perhaps all setter functions should be able to indicate that the property/value pair could not be added? Maybe just adding gboolean return values and in the future issuing warnings inside these functions? I'll be happy just being able to get proper backtraces to such invalid usages, so returning a GError in these functions seems overkill. @@ +300,3 @@ +tracker_resource_add_gvalue (TrackerResource *self, + const char *property_uri, + GValue *value) Same here AFAICS. @@ +405,3 @@ + */ +GList *tracker_resource_get_values (TrackerResource *self, + const char *property_uri) Nit: we usually align variable names :). this happens in a few places... @@ +511,3 @@ + + if (priv->identifier) { + g_free (priv->identifier); No need for checking that priv->identifier is != NULL here. @@ +958,3 @@ + * It would be better if we could use "DELETE DATA { pattern }". This is + * allowed in SPARQL update 1.1, but not yet supported by Tracker's store. + */ This is actually my bad... it used to work, but I broke it with the support for DELETE{}INSERT{}WHERE{}. I have some local patches fixing this and adding tests around. ::: src/libtracker-sparql/tracker-resource.h @@ +76,3 @@ +char *tracker_resource_print_turtle(TrackerResource *self, TrackerNamespaceManager *namespaces); + +void tracker_resource_generate_sparql_update (TrackerResource *self, TrackerSparqlBuilder *builder, TrackerNamespaceManager *namespaces, const char *graph_id); Generally, I think these two serializing functions should offer better error handling (return value, GError argument?). As already said, I find passing the namespaces manager a bit dubious... what happens if you pass a manager that doesn't have all the necessary prefixes to compose the ttl/sparql? what happens if the TrackerResource has a mix of shorthand and expanded property uris and a custom namespace manager has been given? etc... ::: src/libtracker-sparql/tracker-uri.h @@ +29,3 @@ #endif +GType tracker_uri_get_type (void); This belongs to the commit adding the tracker URI GType?
Review of attachment 330361 [details] [review]: Yay, it will be nice to have these in place. setting as a-c-n for the time being. ::: src/libtracker-extract/tracker-xmp.c @@ +24,3 @@ #include <libtracker-common/tracker-utils.h> +#include "tracker-resource-helpers.h" This change looks like it belongs in the commit modifying the extractor modules?
Review of attachment 330357 [details] [review]: Looks good
Review of attachment 330358 [details] [review]: Makes sense
Review of attachment 330362 [details] [review]: This patch is massive... nice code reduction also :). Reading through the patch, it looks mostly correct to me, I kept an extra eye on: - Places which look like could be using the new helper functions but don't, spotted a few of those. - Places where we now generate named URNs where we used to generate a blank node, or viceversa. Spotted a couple too, and I'm a bit wary with those. - Places where we inadvertently changed cardinality (eg. using set_string where add_string should be used) - Places where we inadvertently changed the type I spotted nothing about the last two on first sight, which is great :). Anyhow, the more eyes looking around, the better. ::: src/libtracker-extract/tracker-extract-info.c @@ +223,3 @@ * + * FIXME: you need a way to delete the logical resources when re-extracting a + * file -- still need to decide on API for that. probably this needs handling at the miner-fs level, tracker-extract is supposed to always find the file as if just indexed by tracker-miner-fs. ::: src/libtracker-extract/tracker-xmp.c @@ +916,3 @@ + label = tracker_resource_new (NULL); + tracker_resource_set_uri (label, "rdf:type", "nao:Tag"); + tracker_resource_set_string (label, "nao:prefLabel", p); Doesn't this seem like a place for the shiny tracker_extract_new_tag()? although I see that function does not generate bnodes which seems to be what we want here. @@ +931,3 @@ + publisher = tracker_resource_new (NULL); + tracker_resource_set_uri (publisher, "rdf:type", "nco:Contact"); + tracker_resource_set_string (publisher, "nco:fullname", data->publisher); same here with tracker_extract_new_contact(), and a few other places below in this same file ::: src/tracker-extract/tracker-extract-abw.c @@ +125,3 @@ + creator = tracker_resource_new (NULL); + tracker_resource_set_uri (creator, "rdf:type", "nco:Contact"); + tracker_resource_set_string (creator, "nco:fullname", str); I'm starting to wonder whether the helper functions should be bnode friendly :). ::: src/tracker-extract/tracker-extract-flac.c @@ +231,2 @@ + album = tracker_resource_get_first_relation (album_disc, "nmm:albumDiscAlbum"); + tracker_resource_set_string (album, "nmm:albumTrackCount", fd.trackcount); This fd.trackcount handling used to be under a if (fd.trackcount) {} check, are TrackerResource functions NULL safe? ::: src/tracker-extract/tracker-extract-gif.c @@ +357,2 @@ p = g_ptr_array_index (keywords, i); + tag = tracker_extract_new_tag (p); Here you're replacing a blank node with a fixed urn tag, was that intended?
(In reply to Sam Thursfield from comment #26) > I should note the JSON-LD output is somewhat incomplete, it should generate > an @context field, but does not. The JSON-LD stuff is all in the final > commit, so the rest of the changes can be merged separately if need be. It > might make sense to merge the JSON-LD stuff as-is but note that it's > incomplete, since it won't break anything else; I don't mind either way If we've got nothing on the queue looking forward for this, IMHO it can wait till this is resolved :). One question though, it's only this last patch which depends on bug #751991, right? if that's the case, we might also defer these other patches till JSON-LD support is finalized.
Hi Carlos, cheers for reviewing! A few of your points are about how the new code added to libtracker-sparql has no knowledge of Tracker's specific ontologies. This was a deliberate choice and is the same as how the current TrackerSparqlBuilder stuff works. As I see it, the downsides of this (current) approach are... * TrackerSparqlBuilder and TrackerResource can't ensure that the data being added is valid according to Tracker's ontologies, so instead when you try and add invalid stuff you get an error later on, and have to work backwards to find which bit of code generated the error. * Adding or removing ontologies from src/ontologies requires corresponding updates to libtracker-sparql/tracker-ontologies.h (and now also libtracker-sparlq/tracker-namespace-manager.c). The upside is mainly that the code is simpler. I like that libtracker-sparql is generic and not tied up with the internals of libtracker-data, because it makes the code more reusable. Right now, you can use TrackerResource to generate SPARQL or Turtle for any standards-compliant triple store and any set of ontologies. It's not limited to the things Tracker itself understands. To get to the approach you're proposing, we need to move the code out of libtracker-data that deals with ontology data, and into libtracker-sparql. We can feasibly do this while keeping libtracker-sparql generic, because it can be an opt-in thing. So my only worry is that it might be too hard. I'll look into it and get back to you :-) About TrackerNamespaceManager being public: this ties in with TrackerResource being reusable. If you want to use TrackerResource to generate SPARQL for completely non-Tracker related data, you can set up your own TrackerNamespaceManager with the namespaces you care about. I don't expect this will be useful right away, so we could hide it for now and make it public later... but I don't see the API needing to change much. > +#include <tracker-ontologies.h> > is this include needed? Yes, to use the TRACKER_PREFIX_RDF and TRACKER_PREFIX_RDFS macros. > Nit: we usually align variable names :). this happens in a few places... True. I have uncrustified it now :-) > Generally, I think these two serializing functions should offer better error > handling (return value, GError argument?). As already said, I find passing > the namespaces manager a bit dubious... what happens if you pass a manager > that doesn't have all the necessary prefixes to compose the ttl/sparql? what > happens if the TrackerResource has a mix of shorthand and expanded property > uris and a custom namespace manager has been given? etc... I actually did have GError parameters there originally, and removed them because there are very few expected ways it can fail. Right now the code calls g_warning() and returns NULL in those cases, but the only time it does so is: if the main resource has no identifier (which is a programmer error), or if tracker_resource_set_gvalue() was called with a GType that the code doesn't know how to serialize (which is again a programmer error). I think g_warning() is appropriate for this kind of problem. As for allowing a custom set of prefixes (TrackerNamespaceManager), hopefully it makes more sense if you think about using this code independently of Tracker-Store. If we want to allow generating arbitrary standards-compliant SPARQL and Turtle, then we can't actually do *any* validation of prefixed URIs and different URI schemes. The URI standard doesn't mandate anything after the initial ':'. So if we see 'abc:bar', there's no way to know whether this is a compacted URL using 'abc:' prefix, or a full URI using scheme 'abc:'. The only way to know is to have the user tell us, by adding 'abc:' to the TrackerNamespaceManager that they pass in if 'abc:' is a prefix. It would be nice to be a bit more helpful here. One way would be to split tracker_resource_set_uri() into two functions: tracker_resource_set_compact_uri() and tracker_resource_set_full_uri(). Then we'd know which the user intended and could warn during serialization if there are unexpanded compact URIs. But that makes code using this API a bit uglier. Is it worth it? > Doesn't this seem like a place for the shiny tracker_extract_new_tag()? > although I see that function does not generate bnodes which seems to be what > we want here. > I'm starting to wonder whether the helper functions should be bnode friendly :). > Here you're replacing a blank node with a fixed urn tag, was that intended? I could do with advice here: what's the advantage of inserting tags as blank nodes? If you set the identifier for a TrackerResource to NULL it acts as a blank node, so it's possible for the tracker_extract_new_tag() function to work that way. But I'm confused why they are currently handled differently to how we handle contacts, music albums, music artists, etc... > This fd.trackcount handling used to be under a if (fd.trackcount) {} check, > are TrackerResource functions NULL safe? Hmm... this is a good question. There's not really a usecase for setting a string value or URI value property to the SPARQL "null" type. We could add tracker_resource_set_null() if that someone does need to do that. I've fixed the code you found there, and added a 'validation' function for the set_string(), add_string(), set_uri() and add_uri() functions that calls g_warning() if NULL is passed. > One question though, it's only this last patch which depends on bug #751991, > right? if that's the case, we might also defer these other patches till > JSON-LD support is finalized. Right now the whole series depends on #751991. It could be rebased off there fairly easily, but #751991 adds a nice way to run a Tracker extract module and see the Turtle output, so it'd be good to merge it now if possible. I'll move the JSON-LD patch into a separate bug. Thanks again for the review, I've pushed branch wip/sam/resource-rebase-3 with all the clear-cut fixes done. I've also deleted the older branches.
(In reply to Sam Thursfield from comment #36) > Hi Carlos, cheers for reviewing! Yw, it was about time :), fyi I think it'd be great to merge this soon. > > A few of your points are about how the new code added to libtracker-sparql > has no knowledge of Tracker's specific ontologies. This was a deliberate > choice and is the same as how the current TrackerSparqlBuilder stuff works. > As I see it, the downsides of this (current) approach are... > > * TrackerSparqlBuilder and TrackerResource can't ensure that the data being > added is valid according to Tracker's ontologies, so instead when you > try and add invalid stuff you get an error later on, and have to work > backwards to find which bit of code generated the error. > > * Adding or removing ontologies from src/ontologies requires corresponding > updates to libtracker-sparql/tracker-ontologies.h (and now also > libtracker-sparlq/tracker-namespace-manager.c). ... which btw would be great to make fully dynamic, in provision for both making tracker a generic store in the future, and so more immediate ontology changes turn out seamless (which I see rather imminent tbh...) > > The upside is mainly that the code is simpler. I like that libtracker-sparql > is generic and not tied up with the internals of libtracker-data, because > it makes the code more reusable. Right now, you can use TrackerResource > to generate SPARQL or Turtle for any standards-compliant triple store and any > set of ontologies. It's not limited to the things Tracker itself understands. Fair point. > > To get to the approach you're proposing, we need to move the code out of > libtracker-data that deals with ontology data, and into libtracker-sparql. > We can feasibly do this while keeping libtracker-sparql generic, because > it can be an opt-in thing. So my only worry is that it might be too hard. > I'll look into it and get back to you :-) Let me suggest, how about implementing a TrackerNamespaceManager subclass in libtracker-data that adds all the ontology prefixes to self, and making the default namespace manager getter somehow hook into that? > > About TrackerNamespaceManager being public: this ties in with TrackerResource > being reusable. If you want to use TrackerResource to generate SPARQL for > completely non-Tracker related data, you can set up your own > TrackerNamespaceManager with the namespaces you care about. I don't expect > this will be useful right away, so we could hide it for now and make it > public later... but I don't see the API needing to change much. > > > +#include <tracker-ontologies.h> > > is this include needed? > > Yes, to use the TRACKER_PREFIX_RDF and TRACKER_PREFIX_RDFS macros. Oh, missed that :). > > > Nit: we usually align variable names :). this happens in a few places... > > True. I have uncrustified it now :-) > > > Generally, I think these two serializing functions should offer better error > > handling (return value, GError argument?). As already said, I find passing > > the namespaces manager a bit dubious... what happens if you pass a manager > > that doesn't have all the necessary prefixes to compose the ttl/sparql? what > > happens if the TrackerResource has a mix of shorthand and expanded property > > uris and a custom namespace manager has been given? etc... > > I actually did have GError parameters there originally, and removed them > because there are very few expected ways it can fail. Right now the code > calls g_warning() and returns NULL in those cases, but the only time it > does so is: if the main resource has no identifier (which is a programmer > error), or if tracker_resource_set_gvalue() was called with a GType that > the code doesn't know how to serialize (which is again a programmer error). > I think g_warning() is appropriate for this kind of problem. Point, if G_DEBUG=fatal* is able to get me to the place where the error happens, I'll be happy :). > > As for allowing a custom set of prefixes (TrackerNamespaceManager), hopefully > it makes more sense if you think about using this code independently of > Tracker-Store. If we want to allow generating arbitrary standards-compliant > SPARQL and Turtle, then we can't actually do *any* validation of prefixed > URIs and different URI schemes. The URI standard doesn't mandate anything > after the initial ':'. So if we see 'abc:bar', there's no way to know > whether this is a compacted URL using 'abc:' prefix, or a full URI using > scheme 'abc:'. The only way to know is to have the user tell us, by adding > 'abc:' to the TrackerNamespaceManager that they pass in if 'abc:' is a > prefix. > > It would be nice to be a bit more helpful here. One way would be to split > tracker_resource_set_uri() into two functions: > tracker_resource_set_compact_uri() and tracker_resource_set_full_uri(). Then > we'd know which the user intended and could warn during serialization if > there > are unexpanded compact URIs. But that makes code using this API a bit uglier. > Is it worth it? Nah, probably not. If the user mixes both, it's up to him. However, would be nice to spare the user from having to get the default namespace manager if this is the most common case, perhaps allowing NULL in the serialization functions and fetching the default manager internally in that case? Oh, and now that I notice, IMO TrackerResource pretty much replaces the need for TrackerSparqlBuilder, to the point that I'd favor deprecating it. I think it's unfortunate that the tracker_resource_generate_sparql_update() function takes one then :). I wouldn't mind if for the moment one is used internally, but I think I'd prefer that function to just return gchar* as well :). > > > Doesn't this seem like a place for the shiny tracker_extract_new_tag()? > > although I see that function does not generate bnodes which seems to be what > > we want here. > > I'm starting to wonder whether the helper functions should be bnode friendly :). > > Here you're replacing a blank node with a fixed urn tag, was that intended? > > I could do with advice here: what's the advantage of inserting tags as blank > nodes? > > If you set the identifier for a TrackerResource to NULL it acts as a blank > node, so it's possible for the tracker_extract_new_tag() function to work > that way. But I'm confused why they are currently handled differently to how > we handle contacts, music albums, music artists, etc... Fwiw, the first comment was about a contact urn, inserted as a blank node :), I was wondering there if the helper functions could be used more extensively, there's a few places where they look like they might fit but still things are done "manually", I presumed keeping those a blank node was the main intention. As for the second comment (and your reply), I guess my concern is somewhat moot with the way TrackerResource works. With fixed URNs it turned out too easy to screw up breaking cardinality and whatnot, eg. assuming that the node is not there previously and inserting slightly different data, getting unintended collisions, etc. I, however, have higher level concerns around fixed URN schemes in an RDF world with multiple independent information sources, esp. around how is the info filled in the gaps (eg. contacts from IM conversations being transparently merged with contacts from emails). In the end I think that each miner should be individually smarter at looking up and completing or inserting these kind of shared resources, a fixed URN scheme might look like it might help, but falls short because it's entirely made up and there is nothing actually enforcing it. And I'll stop rambling now :). I also recommend you to read bug #766579 and the mentioned bugs in there to see some limitations we're hitting with fixed URNs (although presumably "make those more specific" might be a valid answer too). Fwiw, I don't expect you tackle any of this here, but I would prefer that this refactor didn't really involve any substantial change to the final stored content. > > > This fd.trackcount handling used to be under a if (fd.trackcount) {} check, > > are TrackerResource functions NULL safe? > > Hmm... this is a good question. There's not really a usecase for setting a > string value or URI value property to the SPARQL "null" type. We could add > tracker_resource_set_null() if that someone does need to do that. I've fixed > the code you found there, and added a 'validation' function for the > set_string(), add_string(), set_uri() and add_uri() functions that calls > g_warning() if NULL is passed. That sounds good :). > > > One question though, it's only this last patch which depends on bug #751991, > > right? if that's the case, we might also defer these other patches till > > JSON-LD support is finalized. > > Right now the whole series depends on #751991. It could be rebased off there > fairly easily, but #751991 adds a nice way to run a Tracker extract module > and > see the Turtle output, so it'd be good to merge it now if possible. Aye, will review later in the day :). > > I'll move the JSON-LD patch into a separate bug. > > Thanks again for the review, I've pushed branch wip/sam/resource-rebase-3 > with > all the clear-cut fixes done. I've also deleted the older branches. Thanks!
OK, I think the only remaining sticking point is the blank node changes. I agree it'd be good if this rewrite of the code didn't change how we insert data into store. But there's a tradeoff of keeping thing stable, vs. keeping code simple, and vs. making extractors be consistant with each other. The existing way of creating tags seems to be this: INSERT { GRAPH <xxx> { _:tag a nao:Tag ; nao:prefLabel "Tag Label" } } WHERE { FILTER (NOT EXISTS { ?tag a nao:Tag ; nao:prefLabel "Tag Label" }) } And then in the metadata we do: <file> nao:hasTag ?tag1 WHERE { ?tag1 a nao:Tag ; nao:prefLabel "Tag Label" } This approach means there's only ever 1 resource for each "Tag Label", and the URI will be a random, unique ID. Adapting TrackerResource to generate this SPARQL would be quite a bit of extra code. As far as I can see, the only benefit of having a tag URI that isn't based on the tag's label is that, if you relabel the tag, its URI doesn't become misleading. But, even if we do create a tag resource at <urn:tag:mefirst> with label "Me First", then rename it to "Me Second", nothing breaks. Now we have a tag with URI <urn:tag:mefirst> and label "Me Second", which is a bit misleading, but it's totally backwards-compatible with how things currently work. I'm not sure it's worth adding lots of extra code to TrackerResource to support this use case. For "contacts", different extractors took different approaches. Some already generated fixed urn:contact:XXX URIs. Others did it like this: <file> nco:publisher [ a nco:Contact ; nco:fullname "Publisher Name" ] None of them check whether a nco:Contact resource with that name already exists, so users may end up with many nco:Contact resources each representing "Publisher Name", each with a random, unique URI. It's easy to do recreate this behaviour with TrackerResource, but I think it's a bit broken in the first place. The real problem there is that the XMP "publisher" tag and such things shouldn't be nco:Contact resources. All we have is the "name" for these, and people's names are certainly not globally unique, so we can't really infer much from that data. We should probably have a separate ContentCreator resource type for these things. I'd like to make the extractors consistant in this regard one way or the other. I've done the easiest thing right now, which is ensure all extractors create nco:Contact resources with fixed URIs for content-creator info. I'm open to changing every extractor to use a different approach. I'm less open to doing extra work just so we preserve the existing mismatched behaviour of the different extractors, because I can't see it causing much harm. I guess the only issue is if you have a friend named "James Joyce" stored as a contact, and then you extract a .pdf file where the creator is someone named "James Joyce", some program might think that it was written by your friend. But since some extractors already used fixed URIs, this isn't a new problem, it's just more likely to happen now. And I think it's minor in any case. Good spot that not everything was using the resource-helper classes: I've updated the code to use those everywhere (I think). I've also updated the commit message in the last patch to note what behaviour changes there are in extraction. > Let me suggest, how about implementing a TrackerNamespaceManager subclass in > libtracker-data that adds all the ontology prefixes to self, and making the > default namespace manager getter somehow hook into that? Good idea, I've not done that now, but the existing API allows that to happen in future. > would be nice to spare the user from having to get the default namespace > manager if this is the most common case, perhaps allowing NULL in the > serialization functions and fetching the default manager internally in that > case? Done > TrackerResource pretty much replaces the need for TrackerSparqlBuilder, to > the point that I'd favor deprecating it. I think it's unfortunate that the > tracker_resource_generate_sparql_update() function takes one then :). I > wouldn't mind if for the moment one is used internally, but I think I'd > prefer that function to just return gchar* as well :). Done New stuff pushed to branch wip/sam/resource-rebase-4
(In reply to Sam Thursfield from comment #39) > OK, I think the only remaining sticking point is the blank node changes. > > I agree it'd be good if this rewrite of the code didn't change how we insert > data into store. But there's a tradeoff of keeping thing stable, vs. keeping > code simple, and vs. making extractors be consistant with each other. > > The existing way of creating tags seems to be this: > > INSERT { > GRAPH <xxx> { > _:tag a nao:Tag ; nao:prefLabel "Tag Label" > } > } WHERE { > FILTER (NOT EXISTS { > ?tag a nao:Tag ; nao:prefLabel "Tag Label" > }) > } > > And then in the metadata we do: > > <file> nao:hasTag ?tag1 > WHERE { ?tag1 a nao:Tag ; nao:prefLabel "Tag Label" } > > This approach means there's only ever 1 resource for each "Tag Label", > and the URI will be a random, unique ID. > > Adapting TrackerResource to generate this SPARQL would be quite a bit of > extra > code. As far as I can see, the only benefit of having a tag URI that isn't > based > on the tag's label is that, if you relabel the tag, its URI doesn't become > misleading. > > But, even if we do create a tag resource at <urn:tag:mefirst> with label "Me > First", then rename it to "Me Second", nothing breaks. Now we have a tag with > URI <urn:tag:mefirst> and label "Me Second", which is a bit misleading, but > it's totally backwards-compatible with how things currently work. My concern is, how would TrackerResource cope if at a later point the extractor calls again tracker_extract_new_tag() so a new <urn:tag:mefirst> label with "Me First" text is created? I see two possibilities: - The previous "Me Second" label remains, the file points to a "wrong" tag - The label is changed behind the user's back tags are a tricky usecase because they are both extracted and user-settable content. > > I'm not sure it's worth adding lots of extra code to TrackerResource to > support this use case. > > > For "contacts", different extractors took different approaches. Some already > generated fixed urn:contact:XXX URIs. Others did it like this: > > <file> nco:publisher [ a nco:Contact ; nco:fullname "Publisher Name" ] > > None of them check whether a nco:Contact resource with that name already > exists, so users may end up with many nco:Contact resources each representing > "Publisher Name", each with a random, unique URI. It's easy to do recreate > this > behaviour with TrackerResource, but I think it's a bit broken in the first > place. Agreed, we're in a pretty bad state wrt data coalescing. > > The real problem there is that the XMP "publisher" tag and such things > shouldn't be nco:Contact resources. All we have is the "name" for these, > and people's names are certainly not globally unique, so we can't really > infer much from that data. We should probably have a separate ContentCreator > resource type for these things. Yeah, probably... > > I'd like to make the extractors consistant in this regard one way or the > other. I've done the easiest thing right now, which is ensure all extractors > create nco:Contact resources with fixed URIs for content-creator info. I'm > open to changing every extractor to use a different approach. I'm less open > to doing extra work just so we preserve the existing mismatched behaviour of > the different extractors, because I can't see it causing much harm. I guess > the only issue is if you have a friend named "James Joyce" stored as a > contact, > and then you extract a .pdf file where the creator is someone named "James > Joyce", some program might think that it was written by your friend. But > since Bonus trouble points if you also have some document from your friend around in the FS :). > some extractors already used fixed URIs, this isn't a new problem, it's just > more likely to happen now. And I think it's minor in any case. Agreed, it is with nco:Contact. > > Good spot that not everything was using the resource-helper classes: I've > updated the code to use those everywhere (I think). I've also updated the > commit message in the last patch to note what behaviour changes there are in > extraction. Ok, I won't cling to these. I guess the nao:Tag point is moot anyway, since our tools allow only adding/deleting, not editing tags. Things can get broken with tracker sparql -u, but that's no big news. > > > Let me suggest, how about implementing a TrackerNamespaceManager subclass in > > libtracker-data that adds all the ontology prefixes to self, and making the > > default namespace manager getter somehow hook into that? > > Good idea, I've not done that now, but the existing API allows that to happen > in future. Fair enough, will put it in the "nice to clean up soon" list here. > > > would be nice to spare the user from having to get the default namespace > > manager if this is the most common case, perhaps allowing NULL in the > > serialization functions and fetching the default manager internally in that > > case? > > Done > > > TrackerResource pretty much replaces the need for TrackerSparqlBuilder, to > > the point that I'd favor deprecating it. I think it's unfortunate that the > > tracker_resource_generate_sparql_update() function takes one then :). I > > wouldn't mind if for the moment one is used internally, but I think I'd > > prefer that function to just return gchar* as well :). > > Done > > New stuff pushed to branch wip/sam/resource-rebase-4 Thanks! It's looking fit for master I'd say, I noticed some minor stuff: - tracker_extract_new_tag/equipment() Adding an extra ':' at the end of the generated urn - If the extractor module returns a NULL resource (ie. no data to be extracted), we still used to tag the file as successfully extracted through the nie:dataSource property, otherwise the file keeps coming on subsequent tracker-extract queries, and it gets stuck. https://paste.fedoraproject.org/390176/46832453/ fixes this here, feel free to squash. - Missing g_return_* through the new public API calls The first 2 could make sense to fix on the fly before you push, the last one can be fixed later on master, your call.
Ugh, upon further inspection of the inserted data, I see some things like this in tracker info: $ tracker info /some/audio/file.mp3 ... 'nie:isLogicalPartOf' = '_:7987' 'nie:isLogicalPartOf' = '_:7988' ... 'nmm:musicAlbum' = '_:7987' 'nmm:musicAlbumDisc' = '_:7988' 'nmm:performer' = '_:7986' So there seems to be something fishy with the way we deal with URN objects.
> - tracker_extract_new_tag/equipment() Adding an extra ':' at the end of the generated urn Fixed. > - If the extractor module returns a NULL resource (ie. no data to be extracted), we still used to tag the file as successfully extracted through the nie:dataSource property, otherwise the file keeps coming on subsequent tracker-extract queries, and it gets stuck. https://paste.fedoraproject.org/390176/46832453/ fixes this here, feel free to squash. I recently discovered this myself, the hard way! Now fixed. > - Missing g_return_* through the new public API calls Fixed. > Ugh, upon further inspection of the inserted data, I see some things like this in tracker info: ... That turned out to be a stray NULL in the g_object_new() call to create a TrackerResource: the identifier was being ignored. But it also highlighted that TrackerResource Turtle and SPARQL generation is actually broken if folk use blank nodes. I've now fixed both issues. In the process, I removed all usage of TrackerSparqlBuilder from TrackerResource, which actually ended up allowing the removal of some code because now the Turtle and SPARQL generation can share some code paths. I'm going away for a couple of months as of this Saturday. So, I'm going to re-extract all my files again, check for any new errors, and then merge it tomorrow if it looks OK. Of course, feel free to revert if some huge unforseen disaster occurs.
It now fails to compile: /usr/include/glib-2.0/gio/gsimpleasyncresult.h:104:21: note: declared here void g_simple_async_result_complete (GSimpleAsyncResult *simple); ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ CC tracker-cursor.lo CC tracker-namespace-manager.lo CC tracker-resource.lo tracker-resource.c:30:31: fatal error: tracker-generated.h: No such file or directory #include "tracker-generated.h" ^ compilation terminated. make[4]: *** [Makefile:728: tracker-resource.lo] Error 1 git branch * wip/sam/resource-rebase-5
Hmm, I guess that was hidden for me because I was compiling from the libtracker-sparql directory and -I. is in the GCC arguments... or something. I've pushed a fix to branch wip/sam/resource-rebase-6
I always compile clean checkouts to test for these things :) Thank you. It compiles now.
(In reply to Sam Thursfield from comment #42) > > - tracker_extract_new_tag/equipment() Adding an extra ':' at the end of the generated urn > > Fixed. > > > - If the extractor module returns a NULL resource (ie. no data to be extracted), we still used to tag the file as successfully extracted through the nie:dataSource property, otherwise the file keeps coming on subsequent tracker-extract queries, and it gets stuck. https://paste.fedoraproject.org/390176/46832453/ fixes this here, feel free to squash. > > I recently discovered this myself, the hard way! Now fixed. > > > - Missing g_return_* through the new public API calls > > Fixed. > > > Ugh, upon further inspection of the inserted data, I see some things like this in tracker info: ... > > That turned out to be a stray NULL in the g_object_new() call to create a > TrackerResource: the identifier was being ignored. But it also highlighted > that > TrackerResource Turtle and SPARQL generation is actually broken if folk use > blank nodes. I've now fixed both issues. In the process, I removed all usage > of > TrackerSparqlBuilder from TrackerResource, which actually ended up allowing > the > removal of some code because now the Turtle and SPARQL generation can share > some > code paths. Sounds good! > > I'm going away for a couple of months as of this Saturday. So, I'm going to I guess this means I won't see you in Karlsruhe :). Enjoy this summer break! > re-extract all my files again, check for any new errors, and then merge it > tomorrow if it looks OK. Of course, feel free to revert if some huge > unforseen > disaster occurs. Please do, I'll deal with the possible fallout :). This is a big step forward, thanks for doing this!
OK, all merged! and no, sadly no GUADEC for me this year. But who knows, perhaps next year i'll see you here in Manchester :-)
Review of attachment 330361 [details] [review]: ::: src/libtracker-extract/tracker-resource-helpers.c @@ +122,3 @@ + g_return_val_if_fail (make != NULL || model != NULL, NULL); + + equip_uri = tracker_sparql_escape_uri_printf ("urn:equipment:%s:%s:", make ? make : "", model ? model : ""); I see that the trailing colon was dropped in the patch that was actually pushed (ie. commit 032e2a7ed874) - presumably because of comment 40. However, the trailing colon was used by the extractors prior to this change. eg., urn:equipment:Nokia:N9:, urn:equipment:SONY:HDR-CX405:, etc.. I wonder if this was deliberate. I only ask because I'd probably have to drop it from gnome-online-miners to keep the DB clean.
(In reply to Debarshi Ray from comment #48) > I only ask because I'd probably have to drop it from > gnome-online-miners to keep the DB clean. See bug 770658 for a related gnome-online-miners fix.
(In reply to Debarshi Ray from comment #48) > Review of attachment 330361 [details] [review] [review]: > > ::: src/libtracker-extract/tracker-resource-helpers.c > @@ +122,3 @@ > + g_return_val_if_fail (make != NULL || model != NULL, NULL); > + > + equip_uri = tracker_sparql_escape_uri_printf ("urn:equipment:%s:%s:", make > ? make : "", model ? model : ""); > > I see that the trailing colon was dropped in the patch that was actually > pushed (ie. commit 032e2a7ed874) - presumably because of comment 40. > However, the trailing colon was used by the extractors prior to this change. > eg., urn:equipment:Nokia:N9:, urn:equipment:SONY:HDR-CX405:, etc.. I wonder > if this was deliberate. I only ask because I'd probably have to drop it from > gnome-online-miners to keep the DB clean. Thanks for spotting this! It wasn't a deliberate change. I think we should restore the old behaviour before the stable release -- do you agree?
(In reply to Sam Thursfield from comment #50) > (In reply to Debarshi Ray from comment #48) > > Review of attachment 330361 [details] [review] [review] [review]: > > > > ::: src/libtracker-extract/tracker-resource-helpers.c > > @@ +122,3 @@ > > + g_return_val_if_fail (make != NULL || model != NULL, NULL); > > + > > + equip_uri = tracker_sparql_escape_uri_printf ("urn:equipment:%s:%s:", make > > ? make : "", model ? model : ""); > > > > I see that the trailing colon was dropped in the patch that was actually > > pushed (ie. commit 032e2a7ed874) - presumably because of comment 40. > > However, the trailing colon was used by the extractors prior to this change. > > eg., urn:equipment:Nokia:N9:, urn:equipment:SONY:HDR-CX405:, etc.. I wonder > > if this was deliberate. I only ask because I'd probably have to drop it from > > gnome-online-miners to keep the DB clean. > > Thanks for spotting this! It wasn't a deliberate change. I think we should > restore the old behaviour before the stable release -- do you agree? fwiw, I agree it's worth to preserve data compatibility :)
Created attachment 334590 [details] [review] libtracker-extract: Restore trailing colon in nfo:Equipment URIs
Review of attachment 334590 [details] [review]: Thanks :-)