GNOME Bugzilla – Bug 735460
ePub/eBooks indexing bugs
Last modified: 2014-08-28 16:00:01 UTC
First, tracker-extract-epub.c seems not to add the PaginatedTextDocument rdf:type to epub files. Furthermore, as the Nepomuk ontology is missing an RDF type to tag e-books, it's currently impossible to filter those from PDFs by RDF type. Could a tracker-specific RDF type be added?
(In reply to comment #0)
> First, tracker-extract-epub.c seems not to add the PaginatedTextDocument
> rdf:type to epub files.

We should add that to the rules files, it's a quick text file update.

> Furthermore, as the Nepomuk ontology is missing an RDF type to tag e-books,
> it's currently impossible to filter those from PDFs by RDF type. Could a
> tracker-specific RDF type be added?

About Nepomuk, I can double check that later in the week for you.

About specific RDF types, we could do it, but it's not necessary. You can also filter based on mime type for more granularity on paginated text documents.

I actually wonder if there is a type for "readonly" or "non-editable" type documents like PDFs - which they kind of are.
An nfo:EBook class would fit nicely in the ontology (similar to nfo:Presentation or nfo:Spreadsheet). Is there any heuristic to know if a PDF is an ebook (and not a regular document)? Otherwise PDF ebooks won't be classified correctly. Although the user could also have a way to mark a document as an ebook.

About readonly: IIRC there is no way to indicate in the ontology if an object is "readonly": if it is a "native" tracker object (it only exists in the Tracker DB), we cannot enforce that restriction, and if it is a FS object the filesystem takes care of it...
(In reply to comment #1)
> (In reply to comment #0)
> > First, tracker-extract-epub.c seems not to add the PaginatedTextDocument
> > rdf:type to epub files.
>
> We should add that to the rules files, it's a quick text file update.

OK.

> > Furthermore, as the Nepomuk ontology is missing an RDF type to tag e-books,
> > it's currently impossible to filter those from PDFs by RDF type. Could a
> > tracker-specific RDF type be added?
>
> About Nepomuk, I can double check that later in the week for you.
> About specific RDF types, we could do it, but it's not necessary. You can also
> filter based on mime type for more granularity on paginated text documents.

Hmm, I'd need to filter for:
- epub
- cbz, cbr, cbt, cb7 (comic book formats)
- fb2 (fiction book) and its compressed version
- Mobi

That's quite a lot of mime-types to filter for, and having a specific RDF type would certainly help.

> I actually wonder if there is a type for "readonly" or "non-editable" type
> documents like PDFs - which they kind of are.

Won't need that, but sure.

(In reply to comment #2)
> An nfo:EBook class would fit nicely in the ontology (similar to
> nfo:Presentation or nfo:Spreadsheet).

Nod.

> Is there any heuristic to know if a PDF is an ebook (and not a regular
> document)? Otherwise PDF ebooks won't be classified correctly. Although user
> could also have a way to mark a document as ebook.

That's the plan. I plan to have a way to mark PDFs as books/comics, so that they don't appear in gnome-documents, but only in the books application.

> About readonly: IIRC there is not way to indicate in the ontology if an object
> is "readonly": if it is a "native" tracker object (it only exists in Tracker
> DB), we cannot enforce that restriction, and if it is a FS object the
> filesystem takes care of it...

I think that Martyn meant "readonly" as "cannot be edited", not "is read only on the filesystem".
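To illustrate the trade-off raised in this comment, here is a minimal Python sketch that builds the two kinds of SPARQL query an application might send to Tracker: one filtering on every e-book MIME type, and one filtering on a single RDF type. The MIME type strings are assumptions based on common shared-mime-info names and are not taken from this bug. The helper names are hypothetical, and `nfo:EBook` is the class proposed in this thread, not one that exists in the Nepomuk ontology.

```python
# Assumed MIME type names (shared-mime-info style); verify against the
# MIME database on the target system before relying on them.
EBOOK_MIME_TYPES = [
    "application/epub+zip",
    "application/x-cbz",
    "application/x-cbr",
    "application/x-cbt",
    "application/x-cb7",
    "application/x-fictionbook+xml",
    "application/x-zip-compressed-fb2",
    "application/x-mobipocket-ebook",
]


def query_by_mime_types(mime_types):
    """Build a SPARQL query matching resources with any of the MIME types."""
    conditions = " || ".join('?mime = "{}"'.format(m) for m in mime_types)
    return (
        "SELECT ?urn WHERE { ?urn nie:mimeType ?mime . "
        "FILTER (" + conditions + ") }"
    )


def query_by_rdf_type(rdf_type="nfo:EBook"):
    """Build a SPARQL query matching resources of one RDF type."""
    return "SELECT ?urn WHERE { ?urn a " + rdf_type + " }"


if __name__ == "__main__":
    print(query_by_mime_types(EBOOK_MIME_TYPES))
    print(query_by_rdf_type())
```

The MIME-based query grows with every supported format and has to be updated in each application, while the type-based query stays the same as long as the extractor tags new formats with the shared class.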
(In reply to comment #3)
> (In reply to comment #1)
> > (In reply to comment #0)
> > > First, tracker-extract-epub.c seems not to add the PaginatedTextDocument
> > > rdf:type to epub files.
> >
> > We should add that to the rules files, it's a quick text file update.
>
> OK.
>
> > > Furthermore, as the Nepomuk ontology is missing an RDF type to tag e-books,
> > > it's currently impossible to filter those from PDFs by RDF type. Could a
> > > tracker-specific RDF type be added?
> >
> > About Nepomuk, I can double check that later in the week for you.
> > About specific RDF types, we could do it, but it's not necessary. You can also
> > filter based on mime type for more granularity on paginated text documents.
>
> Hmm, I'd need to filter for:
> - epub
> - cbz, cbr, cbt, cb7 (comic book formats)
> - fb2 (fiction book) and its compressed version
> - Mobi
>
> That's quite a lot of mime-types to filter for, and having a specific RDF type
> would certainly help.

In some ways, I think it would be better to have a separate library or integration with shared-mime-info to relate MIME types to RDF types. But anyway.

Here is an example file (I added this file last week, no coding was needed and now we have MIME types mapped to RDF type automatically):
https://git.gnome.org/browse/tracker/tree/src/tracker-extract/15-source-code.rule

We do have "generic" rules files too, e.g. for GStreamer:
https://git.gnome.org/browse/tracker/tree/src/tracker-extract/90-gstreamer-video-generic.rule

But I would use globbing like that sparingly.

The file you probably want to update is:
https://git.gnome.org/browse/tracker/tree/src/tracker-extract/10-epub.rule

Do you know all the MIME types you need to cover Bastien?

> > Is there any heuristic to know if a PDF is an ebook (and not a regular
> > document)? Otherwise PDF ebooks won't be classified correctly. Although user
> > could also have a way to mark a document as ebook.
>
> That's the plan. I plan to have a way to mark PDFs as books/comics, so that
> they don't appear in gnome-documents, but only in the books application.

Nice! How did you get on with the suggestion I gave you on IRC yesterday btw?

(In reply to comment #2)
> An nfo:EBook class would fit nicely in the ontology (similar to
> nfo:Presentation or nfo:Spreadsheet).

I checked the ontology:
http://www.semanticdesktop.org/ontologies/2007/03/22/nfo/#PaginatedTextDocument

I didn't see anything representing a book sadly, but we can add nfo:EBook indeed.

Bastien, do you need help with this?
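As a rough illustration of the idea behind the rule files mentioned in this comment (MIME types mapped to RDF types with no extraction code), here is a small Python sketch of such a lookup. It is a conceptual model only, not Tracker's rule file format or its implementation. The `RULES` table, its field names, the `rdf_types_for` helper and the `video/*` to `nmm:Video` entry are all hypothetical.

```python
import fnmatch

# Hypothetical rule table: each rule pairs MIME patterns with the RDF
# types to assign. Explicit MIME types come first; the glob rule is last,
# reflecting the advice to use globbing sparingly.
RULES = [
    {"mime_types": ["application/epub+zip"], "rdf_types": ["nfo:EBook"]},
    {"mime_types": ["video/*"], "rdf_types": ["nmm:Video"]},
]


def rdf_types_for(mime_type, rules=RULES):
    """Return the RDF types of the first rule matching mime_type, else []."""
    for rule in rules:
        if any(fnmatch.fnmatchcase(mime_type, pattern)
               for pattern in rule["mime_types"]):
            return list(rule["rdf_types"])
    return []


if __name__ == "__main__":
    for mime in ("application/epub+zip", "video/mp4", "text/plain"):
        print(mime, rdf_types_for(mime))
```

Keeping the mapping in data instead of code is what makes adding a new format a text-only change, which is the property the comment highlights.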
Created attachment 284694
Part-fix for the ontology

This is what you would need on the ontology side. I am not sure if it should be a subclass of PaginatedDocument or Document, Ivan comments?

Created attachment 284707
tracker-extract: Add dummy extractor

For use with data types that don't have any additional metadata inside the file, but need tagging with specific RDF types.

Created attachment 284708
tracker-extract: Mark EPub files as e-Books

Created attachment 284709
tracker-extract: Add support for comic book formats

Through the dummy extractor

Created attachment 284710
tracker-extract: Add support for more eBook formats

For which metadata extraction isn't currently available.

Created attachment 284711
tracker-extract: Add support for comic book formats

Through the dummy extractor

Created attachment 284712
tracker-extract: Add support for more eBook formats

For which metadata extraction isn't currently available.
Attachment 284707 pushed as ba4944f - tracker-extract: Add dummy extractor
Attachment 284708 pushed as 1952b51 - tracker-extract: Mark EPub files as e-Books
Attachment 284711 pushed as 3f86f6d - tracker-extract: Add support for comic book formats
Attachment 284712 pushed as b73efe8 - tracker-extract: Add support for more eBook formats