Bug 746401 - epub: handle multiple dc:identifier tags
Product: tracker
Classification: Core
Component: Extractor
Hardware: Other Linux
Importance: Normal normal
Target Milestone: ---
Assigned To: tracker-extractor
Depends on:
Reported: 2015-03-18 13:40 UTC by Carlos Garnacho
Modified: 2015-04-09 14:51 UTC
See Also:
GNOME target: ---
GNOME version: ---

tracker-extract-epub: Ensure we only have one nie:identifier (1.83 KB, patch)
2015-03-18 13:41 UTC, Carlos Garnacho
accepted-commit_now

Description Carlos Garnacho 2015-03-18 13:40:40 UTC
I received an epub here that in its .opf metadata contains:

<dc:identifier id="uuid_id" opf:scheme="uuid">urn:uuid:</dc:identifier>
<dc:identifier opf:scheme="UUID">urn:uuid:dff6037b-450e-4912-9433-5e6ca7937669</dc:identifier>

Having multiple UUID identifiers looks like a defect in the file (amusingly, it is the first, broken UUID URN that is referenced in the TOC file). It translates directly into several nie:identifier values being added to the SPARQL query, which causes SPARQL warnings because we exceed the property's cardinality.
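
To make the failure mode concrete, here is a minimal Python sketch (not the extractor's actual C code) that reads the two elements quoted above and shows that both yield a candidate nie:identifier value. The wrapping `<metadata>` element and the `collect_identifiers` helper are assumptions made for illustration; the two namespace URIs are the usual Dublin Core and OPF ones.

```python
import xml.etree.ElementTree as ET

DC_NS = "http://purl.org/dc/elements/1.1/"
OPF_NS = "http://www.idpf.org/2007/opf"

# The two dc:identifier elements from this report, wrapped in a
# hypothetical <metadata> element so the fragment parses on its own.
OPF_METADATA = f"""
<metadata xmlns:dc="{DC_NS}" xmlns:opf="{OPF_NS}">
  <dc:identifier id="uuid_id" opf:scheme="uuid">urn:uuid:</dc:identifier>
  <dc:identifier opf:scheme="UUID">urn:uuid:dff6037b-450e-4912-9433-5e6ca7937669</dc:identifier>
</metadata>
"""


def collect_identifiers(opf_xml):
    """Return (scheme, value) pairs for every dc:identifier, in document order."""
    root = ET.fromstring(opf_xml.strip())
    pairs = []
    for element in root.iter(f"{{{DC_NS}}}identifier"):
        scheme = element.get(f"{{{OPF_NS}}}scheme")
        value = (element.text or "").strip()
        pairs.append((scheme, value))
    return pairs


identifiers = collect_identifiers(OPF_METADATA)
for scheme, value in identifiers:
    # Each pair would become one nie:identifier value on the same resource.
    print(scheme, value)
```

With this file, two values are collected for a single resource, which is one more than a property with maxCardinality=1 accepts.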

Looking at the code, AFAICS the same can happen if a file has both UUID and ISBN identifiers, as both are translated to nie:identifier.

I'm attaching a patch that adds only the first identifier found as nie:identifier and ignores the rest, which makes the insertion succeed in these cases. The patch should be considered after the hard code freeze is lifted.
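
The "first identifier wins" policy can be sketched as follows. This is an illustrative Python model rather than the patch itself: `build_identifier_triples` and the `has_identifier` flag are hypothetical names, and matching the scheme case-insensitively is an assumption of this sketch.

```python
def build_identifier_triples(identifiers):
    """Map (scheme, value) pairs to at most one nie:identifier triple.

    UUID and ISBN schemes both map to nie:identifier, so only the first
    matching pair is kept and later ones are skipped.
    """
    triples = []
    has_identifier = False
    for scheme, value in identifiers:
        if (scheme or "").lower() not in ("uuid", "isbn"):
            continue  # not a scheme that maps to nie:identifier
        if has_identifier:
            continue  # a value was already emitted; ignore the rest
        triples.append(("nie:identifier", value))
        has_identifier = True
    return triples


# The identifiers from the file described in this report.
reported = [
    ("uuid", "urn:uuid:"),
    ("UUID", "urn:uuid:dff6037b-450e-4912-9433-5e6ca7937669"),
]
triples = build_identifier_triples(reported)
print(triples)
```

Note that for the file in this report the value kept is the incomplete `urn:uuid:`, since it comes first. The policy only guarantees a single value, not the most useful one.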
Comment 1 Carlos Garnacho 2015-03-18 13:41:21 UTC
Created attachment 299707
tracker-extract-epub: Ensure we only have one nie:identifier

This property has maxCardinality=1, we are however possibly adding
multiple values there, either in both UUID/ISBN forms, or as multiple
UUIDs in faulty epubs.

ISBN should probably be its own rdf:Property. In the meantime, stick
to the first nie:identifier found, and ignore the rest.
Comment 2 Martyn Russell 2015-03-20 19:51:56 UTC
Review of attachment 299707:

Looks right to me. I presume the first ID is the best to use of all of them?
Comment 3 Carlos Garnacho 2015-04-09 14:49:26 UTC
(In reply to Martyn Russell from comment #2)
> Review of attachment 299707:
> Looks right to me. I presume the first ID is the best to use of all of them?

We just don't know; files could be broken in any number of ways. Although this property has nrl:maxCardinality 1, it is not an nrl:InverseFunctionalProperty, so we don't have to care about its uniqueness.
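
The distinction can be illustrated with a toy model. This Python sketch is not Tracker's store; `ToyStore` and its behaviour are assumptions chosen only to show that a maxCardinality=1 property limits values per resource, while nothing prevents two resources from sharing the same value.

```python
class ToyStore:
    """Toy triple store enforcing only a per-resource cardinality limit."""

    def __init__(self, max_cardinality_one):
        self.max_cardinality_one = set(max_cardinality_one)
        self.values = {}  # (subject, predicate) -> list of values

    def insert(self, subject, predicate, value):
        existing = self.values.setdefault((subject, predicate), [])
        if predicate in self.max_cardinality_one and existing:
            raise ValueError(f"{predicate} already set on {subject}")
        existing.append(value)


store = ToyStore(max_cardinality_one=["nie:identifier"])

# Two different resources may carry the same identifier value, because
# the property is not modelled as inverse functional here.
store.insert("file:a.epub", "nie:identifier", "urn:uuid:")
store.insert("file:b.epub", "nie:identifier", "urn:uuid:")

# A second value on the same resource exceeds the cardinality limit.
try:
    store.insert("file:a.epub", "nie:identifier", "urn:uuid:other")
    second_value_rejected = False
except ValueError:
    second_value_rejected = True
```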

The one piece of info I consider interesting is the ISBN, but we don't have ontology for it. A bug about this has been open against Nepomuk for quite a long time [1], and it is still unresolved... Nepomuk doesn't seem to see much activity at all nowadays :(

Anyway, I'm pushing to master/1.2

Comment 4 Carlos Garnacho 2015-04-09 14:51:38 UTC
Attachment 299707 pushed as 5c70907 - tracker-extract-epub: Ensure we only have one nie:identifier