Bug 735460 – ePub/eBooks indexing bugs



Summary:	ePub/eBooks indexing bugs


Status:	RESOLVED FIXED

Product:	tracker
Classification:	Core
Component:	Extractor
Version:	unspecified
Hardware:	Other Linux

Importance:	Normal normal
Target Milestone:	---
Assigned To:	tracker-extractor
QA Contact:

URL:
Whiteboard:

Depends on:
Blocks:	704316

Reported:	2014-08-26 16:18 UTC by Bastien Nocera
Modified:	2014-08-28 16:00 UTC

See Also:
GNOME target:	---
GNOME version:	---

Attachments
Part-fix for the ontology (1.22 KB, patch) 2014-08-28 13:15 UTC, Martyn Russell	committed
tracker-extract: Add dummy extractor (2.91 KB, patch) 2014-08-28 14:40 UTC, Bastien Nocera	committed
tracker-extract: Mark EPub files as e-Books (730 bytes, patch) 2014-08-28 14:49 UTC, Bastien Nocera	committed
tracker-extract: Add support for comic book formats (821 bytes, patch) 2014-08-28 14:49 UTC, Bastien Nocera	none
tracker-extract: Add support for more eBook formats (1.38 KB, patch) 2014-08-28 14:49 UTC, Bastien Nocera	none
tracker-extract: Add support for comic book formats (1.28 KB, patch) 2014-08-28 15:06 UTC, Bastien Nocera	committed
tracker-extract: Add support for more eBook formats (1.31 KB, patch) 2014-08-28 15:06 UTC, Bastien Nocera	committed

Description Bastien Nocera 2014-08-26 16:18:02 UTC

First, tracker-extract-epub.c seems not to add the PaginatedTextDocument rdf:type to epub files.

Furthermore, as the Nepomuk ontology is missing an RDF type to tag e-books, it's currently impossible to filter those from PDFs by RDF type. Could a tracker-specific RDF type be added?

Comment 1 Martyn Russell 2014-08-26 17:23:19 UTC

(In reply to comment #0)
> First, tracker-extract-epub.c seems not to add the PaginatedTextDocument
> rdf:type to epub files.

We should add that to the rules files, it's a quick text file update.
 
> Furthermore, as the Nepomuk ontology is missing an RDF type to tag e-books,
> it's currently impossible to filter those from PDFs by RDF type. Could a
> tracker-specific RDF type be added?

About Nepomuk, I can double check that later in the week for you.
About specific RDF types, we could do it, but it's not necessary. You can also filter based on mime type for more granularity on paginated text documents.

I actually wonder if there is a type for "readonly" or "non-editable" type documents like PDFs - which they kind of are.

Comment 2 Ivan Frade 2014-08-26 18:05:40 UTC

An nfo:EBook class would fit nicely in the ontology (similar to nfo:Presentation or nfo:Spreadsheet).

Is there any heuristic to know if a PDF is an ebook (and not a regular document)? Otherwise PDF ebooks won't be classified correctly. Although the user could also have a way to mark a document as an ebook.


About readonly: IIRC there is no way to indicate in the ontology if an object is "readonly": if it is a "native" tracker object (it only exists in Tracker DB), we cannot enforce that restriction, and if it is a FS object the filesystem takes care of it...

Comment 3 Bastien Nocera 2014-08-27 14:03:21 UTC

(In reply to comment #1)
> (In reply to comment #0)
> > First, tracker-extract-epub.c seems not to add the PaginatedTextDocument
> > rdf:type to epub files.
> 
> We should add that to the rules files, it's a quick text file update.

OK.

> > Furthermore, as the Nepomuk ontology is missing an RDF type to tag e-books,
> > it's currently impossible to filter those from PDFs by RDF type. Could a
> > tracker-specific RDF type be added?
> 
> About Nepomuk, I can double check that later in the week for you.
> About specific RDF types, we could do it, but it's not necessary. You can also
> filter based on mime type for more granularity on paginated text documents.

Hmm, I'd need to filter for:
- epub
- cbz, cbr, cbt, cb7 (comic book formats)
- fb2 (fiction book) and its compressed version
- Mobi

That's quite a lot of mime-types to filter for, and having a specific RDF type would certainly help.
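
To make the difference concrete, here is a minimal Python sketch that builds both kinds of SPARQL query as plain strings. It is only an illustration: the MIME type names are assumptions based on shared-mime-info naming, nfo:EBook is the class proposed in this bug (not something that exists in Nepomuk), and the helper names are hypothetical.

```python
# Assumed MIME type names (shared-mime-info style); adjust to what your system reports.
EBOOK_MIME_TYPES = [
    "application/epub+zip",
    "application/x-cbz",
    "application/x-cbr",
    "application/x-cbt",
    "application/x-cb7",
    "application/x-fictionbook+xml",
    "application/x-zip-compressed-fb2",
    "application/x-mobipocket-ebook",
]


def query_by_mime(mime_types):
    """Build a query that matches files whose nie:mimeType is in the list."""
    conditions = " || ".join(f"?mime = '{m}'" for m in mime_types)
    return (
        "SELECT ?urn WHERE { ?urn nie:mimeType ?mime . "
        f"FILTER ({conditions}) }}"
    )


def query_by_rdf_type(rdf_type="nfo:EBook"):
    """Build a query that matches by RDF type (nfo:EBook is the proposed class)."""
    return f"SELECT ?urn WHERE {{ ?urn a {rdf_type} }}"


if __name__ == "__main__":
    print(query_by_mime(EBOOK_MIME_TYPES))
    print(query_by_rdf_type())
```

The MIME-based query grows with every format added, while the RDF-type query stays a single triple pattern.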

> I actually wonder if there is a type for "readonly" or "non-editable" type
> documents like PDFs - which they kind of are.

Won't need that, but sure.

(In reply to comment #2)
> An nfo:EBook class would fit nicely in the ontology (similar to
> nfo:Presentation or nfo:Spreadsheet).

Nod.

> Is there any heuristic to know if a PDF is an ebook (and not a regular
> document)? Otherwise PDF ebooks won't be classified correctly. Although the user
> could also have a way to mark a document as an ebook.

That's the plan. I plan to have a way to mark PDFs as books/comics, so that they don't appear in gnome-documents, but only in the books application.

> About readonly: IIRC there is no way to indicate in the ontology if an object
> is "readonly": if it is a "native" tracker object (it only exists in Tracker
> DB), we cannot enforce that restriction, and if it is a FS object the
> filesystem takes care of it...

I think that Martyn meant "readonly" as "cannot be edited", not "is read only on the filesystem".

Comment 4 Martyn Russell 2014-08-28 10:59:10 UTC

(In reply to comment #3)
> (In reply to comment #1)
> > (In reply to comment #0)
> > > First, tracker-extract-epub.c seems not to add the PaginatedTextDocument
> > > rdf:type to epub files.
> > 
> > We should add that to the rules files, it's a quick text file update.
> 
> OK.
> 
> > > Furthermore, as the Nepomuk ontology is missing an RDF type to tag e-books,
> > > it's currently impossible to filter those from PDFs by RDF type. Could a
> > > tracker-specific RDF type be added?
> > 
> > About Nepomuk, I can double check that later in the week for you.
> > About specific RDF types, we could do it, but it's not necessary. You can also
> > filter based on mime type for more granularity on paginated text documents.
> 
> Hmm, I'd need to filter for:
> - epub
> - cbz, cbr, cbt, cb7 (comic book formats)
> - fb2 (fiction book) and its compressed version
> - Mobi
> 
> That's quite a lot of mime-types to filter for, and having a specific RDF type
> would certainly help.

In some ways, I think it would be better to have a separate library or integration with shared-mime-info to relate MIME types to RDF types.

But anyway.

Here is an example file (I added this file last week, no coding was needed and now we have MIME types mapped to RDF type automatically):

  https://git.gnome.org/browse/tracker/tree/src/tracker-extract/15-source-code.rule

We do have "generic" rules files too, e.g. for GStreamer:

  https://git.gnome.org/browse/tracker/tree/src/tracker-extract/90-gstreamer-video-generic.rule

But I would use globbing like that sparingly.
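
As a rough illustration of what the rules files achieve, the following Python sketch maps MIME types to RDF types, including one glob pattern. It is not the tracker-extract implementation or its rule file format: the RULES table and rdf_types_for() are hypothetical, and the RDF type names are assumptions chosen to match the examples in this comment.

```python
from fnmatch import fnmatchcase

# Hypothetical stand-in for the .rule files: (MIME pattern, RDF types).
# Specific patterns come first; the generic glob is last, like the "90-" prefix.
RULES = [
    ("application/epub+zip", ["nfo:EBook"]),   # nfo:EBook is the proposed class
    ("text/x-csrc", ["nfo:SourceCode"]),
    ("video/*", ["nmm:Video"]),                # generic glob, use sparingly
]


def rdf_types_for(mime_type, rules=RULES):
    """Return the RDF types of the first rule whose pattern matches."""
    for pattern, rdf_types in rules:
        if fnmatchcase(mime_type, pattern):
            return rdf_types
    return []


if __name__ == "__main__":
    for mime in ("application/epub+zip", "video/x-matroska", "image/png"):
        print(mime, rdf_types_for(mime))
```

Because the first match wins, a broad glob placed early would shadow more specific rules, which is one reason to keep globs few and last.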

The file you probably want to update is:

  https://git.gnome.org/browse/tracker/tree/src/tracker-extract/10-epub.rule

Do you know all the MIME types you need to cover, Bastien?
 
> > Is there any heuristic to know if a PDF is an ebook (and not a regular
> > document)? Otherwise PDF ebooks won't be classified correctly. Although the user
> > could also have a way to mark a document as an ebook.
> 
> That's the plan. I plan to have a way to mark PDFs as books/comics, so that
> they don't appear in gnome-documents, but only in the books application.

Nice!
How did you get on with the suggestion I gave you on IRC yesterday btw?

(In reply to comment #2)
> An nfo:EBook class would fit nicely in the ontology (similar to
> nfo:Presentation or nfo:Spreadsheet).

I checked the ontology:

  http://www.semanticdesktop.org/ontologies/2007/03/22/nfo/#PaginatedTextDocument

I didn't see anything representing a book sadly, but we can add nfo:EBook indeed.
Bastien, do you need help with this?

Comment 5 Martyn Russell 2014-08-28 13:15:48 UTC

Created attachment 284694
Part-fix for the ontology

This is what you would need on the ontology side. I am not sure whether it should be a subclass of PaginatedTextDocument or Document. Ivan, any comments?

Comment 6 Bastien Nocera 2014-08-28 14:40:56 UTC

Created attachment 284707
tracker-extract: Add dummy extractor

For use with data types that don't have any additional metadata
inside the file, but need tagging with specific RDF types.

Comment 7 Bastien Nocera 2014-08-28 14:49:01 UTC

Created attachment 284708
tracker-extract: Mark EPub files as e-Books

Comment 8 Bastien Nocera 2014-08-28 14:49:06 UTC

Created attachment 284709
tracker-extract: Add support for comic book formats

Through the dummy extractor

Comment 9 Bastien Nocera 2014-08-28 14:49:12 UTC

Created attachment 284710
tracker-extract: Add support for more eBook formats

For which metadata extraction isn't currently available.

Comment 10 Bastien Nocera 2014-08-28 15:06:22 UTC

Created attachment 284711
tracker-extract: Add support for comic book formats

Through the dummy extractor

Comment 11 Bastien Nocera 2014-08-28 15:06:28 UTC

Created attachment 284712
tracker-extract: Add support for more eBook formats

For which metadata extraction isn't currently available.

Comment 12 Bastien Nocera 2014-08-28 15:59:34 UTC

Attachment 284707 pushed as ba4944f - tracker-extract: Add dummy extractor
Attachment 284708 pushed as 1952b51 - tracker-extract: Mark EPub files as e-Books
Attachment 284711 pushed as 3f86f6d - tracker-extract: Add support for comic book formats
Attachment 284712 pushed as b73efe8 - tracker-extract: Add support for more eBook formats