GNOME Bugzilla – Bug 735645
EPub extractor bug fixes
Last modified: 2021-05-26 22:23:19 UTC
Created attachment 284761 [details] [review]
tracker-extract: Add fallback for creation date in EPubs

If we only have this in the OPF file:
<dc:date>2011-04-13</dc:date>
use this as the date.
Created attachment 284762 [details] [review]
tracker-extract: Show where parsing errors happen in EPubs

Error extracting EPUB contents (OEBPS/Text/info.xhtml): Error on line 59: Entity name 'copy' is not known

is better than:

Error extracting EPUB contents: Error on line 59: Entity name 'copy' is not known
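As an illustration of the message format described above, here is a minimal sketch in plain C. The helper name `format_epub_error` and its parameters are hypothetical and are not taken from the actual patch, which presumably uses GLib's error facilities; only the two message shapes come from the attachment description.

```c
#include <stdio.h>

/* Hypothetical helper (not from the patch): build the error message,
 * including the path of the file inside the EPUB when it is known. */
void
format_epub_error (char *buf, size_t buf_len,
                   const char *path, const char *parser_message)
{
    if (path != NULL) {
        /* New style: says which file in the archive failed to parse. */
        snprintf (buf, buf_len, "Error extracting EPUB contents (%s): %s",
                  path, parser_message);
    } else {
        /* Old style: no hint about where the error happened. */
        snprintf (buf, buf_len, "Error extracting EPUB contents: %s",
                  parser_message);
    }
}
```

Calling it with the path `OEBPS/Text/info.xhtml` and the parser message from the example yields the first, more useful form; passing `NULL` as the path yields the second.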
Created attachment 284763 [details] [review]
tracker-extract: Try harder when getting EPub contents

GMarkup is really not that good at parsing XML, so we need to try harder to ignore errors parsing the contents of EPub files, and populate the index with *some* data.
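The idea of indexing *some* data despite parse errors can be sketched as follows. This is an assumption about the general approach, not the code in the attachment: `parse_page` is a hypothetical stand-in for a strict parser that fails on an unknown entity, and `collect_contents` is a hypothetical helper that skips pages that fail instead of aborting the whole extraction.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-in for a strict parser: copies the page text, but
 * fails (returns -1) when it meets an entity it does not know. */
int
parse_page (const char *page, char *out, size_t out_len)
{
    if (strstr (page, "&copy;") != NULL)
        return -1;
    snprintf (out, out_len, "%s", page);
    return 0;
}

/* Concatenate the text of every page that parses, separated by spaces.
 * Pages that fail are logged and skipped. Returns the number skipped. */
int
collect_contents (const char **pages, char *out, size_t out_len)
{
    int skipped = 0;
    size_t used = 0;
    size_t i;

    if (out_len == 0)
        return 0;
    out[0] = '\0';

    for (i = 0; pages[i] != NULL; i++) {
        char text[256];
        int n;

        if (parse_page (pages[i], text, sizeof text) != 0) {
            fprintf (stderr, "Skipping page %zu: parse error\n", i);
            skipped++;
            continue;
        }

        n = snprintf (out + used, out_len - used, "%s%s",
                      used > 0 ? " " : "", text);
        if (n < 0 || (size_t) n >= out_len - used)
            break; /* output buffer is full, stop collecting */
        used += (size_t) n;
    }

    return skipped;
}
```

With three pages where the middle one contains `&copy;`, the sketch keeps the text of the first and third pages and reports one skipped page.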
Comment on attachment 284761 [details] [review]
tracker-extract: Add fallback for creation date in EPubs

The patch here looks OK. I am more concerned with why data->element == OPF_TAG_TYPE_UNKNOWN here.

1. Are there multiple dc:date cases?
2. Related to #1, do we even need the conditional check before setting the element to OPF_TAG_TYPE_CREATED? I wonder if we even need that entire attribute-checking block of code; what else would dc:date be used for?
Comment on attachment 284762 [details] [review] tracker-extract: Show where parsing errors happen in EPubs Looks good thanks Bastien!
Comment on attachment 284763 [details] [review]
tracker-extract: Try harder when getting EPub contents

My main concern here is that I prefer not to log errors or warnings at debug level. I would use g_message() here.

Also, just for consistency, we use g_warning ("Foo '%s', %s", file, error->message) in most other places in the code base. Just a nitpick though :)

When done please commit, looks fine otherwise.
Comment on attachment 284762 [details] [review] tracker-extract: Show where parsing errors happen in EPubs Attachment 284762 [details] pushed as ba23d6e - tracker-extract: Show where parsing errors happen in EPubs
Created attachment 284961 [details] [review]
tracker-extract: Try harder when getting EPub contents

GMarkup is really not that good at parsing XML, so we need to try harder to ignore errors parsing the contents of EPub files, and populate the index with *some* data.
Comment on attachment 284961 [details] [review] tracker-extract: Try harder when getting EPub contents Attachment 284961 [details] pushed as 3e993a9 - tracker-extract: Try harder when getting EPub contents
(In reply to comment #4)
> (From update of attachment 284761 [details] [review])
> The patch here looks OK, I am more concerned with why the data->element ==
> OPF_TAG_TYPE_UNKNOWN here.
>
> 1. Are there multiple dc:date cases?

Yes.

> 2. Related to #1, do we even need the conditional check before setting the
> element to OPF_TAG_TYPE_CREATED? I wonder if we even need that entire attribute
> checking block of code, what else would dc:date be used for?

See http://netkingcol.blogspot.co.uk/2010/01/closer-look-at-opf.html for example:
<dc:date opf:event="original-publication">1922</dc:date>
<dc:date opf:event="epub-publication">2009-09-24</dc:date>
(In reply to comment #10)
> (In reply to comment #4)
> > (From update of attachment 284761 [details] [review] [details])
> > The patch here looks OK, I am more concerned with why the data->element ==
> > OPF_TAG_TYPE_UNKNOWN here.
> >
> > 1. Are there multiple dc:date cases?
>
> Yes.

OK, so from the link you gave (below):

"The set of values for event are not defined by this specification; possible values may include: creation, publication, and modification."

Which might explain why there is ONLY one date tag (OPF_TAG_TYPE_CREATED), because we can't know which it is anyway.

> > 2. Related to #1, do we even need the conditional check before setting the
> > element to OPF_TAG_TYPE_CREATED? I wonder if we even need that entire attribute
> > checking block of code, what else would dc:date be used for?
>
> See at http://netkingcol.blogspot.co.uk/2010/01/closer-look-at-opf.html for
> example:
> <dc:date opf:event="original-publication">1922</dc:date>
> <dc:date opf:event="epub-publication">2009-09-24</dc:date>

So, we should either:

a. Check for "epub-publication", to capture all date cases.
b. Remove the entire block if we only have one type of date (OPF_TAG_TYPE_CREATED):

    for (i = 0; attribute_names[i] != NULL; i++) {
        if (g_strcmp0 (attribute_names[i], "opf:event") == 0 &&
            g_strcmp0 (attribute_values[i], "original-publication") == 0) {
            data->element = OPF_TAG_TYPE_CREATED;
            break;
        }
    }

I don't see any value in parsing the attributes at all here.
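To make the discussion concrete, here is a minimal, self-contained sketch of one way the attribute check and the proposed fallback could fit together. It is an assumption for illustration, not the code in attachment 284761: the helper name `classify_dc_date` is hypothetical, the enum lists only the two values named in this thread, and plain `strcmp` stands in for `g_strcmp0` so the sketch builds without GLib.

```c
#include <stddef.h>
#include <string.h>

/* Only the two tag types named in this thread; the real enum in
 * tracker-extract has more members. */
typedef enum {
    OPF_TAG_TYPE_UNKNOWN,
    OPF_TAG_TYPE_CREATED
} OPFTagType;

/* Hypothetical helper: decide how to treat one <dc:date> element given
 * its NULL-terminated attribute name and value arrays. */
OPFTagType
classify_dc_date (const char **attribute_names,
                  const char **attribute_values)
{
    int saw_event = 0;
    size_t i;

    for (i = 0; attribute_names[i] != NULL; i++) {
        if (strcmp (attribute_names[i], "opf:event") != 0)
            continue;

        saw_event = 1;
        if (strcmp (attribute_values[i], "original-publication") == 0)
            return OPF_TAG_TYPE_CREATED;
    }

    /* Assumed fallback: a bare <dc:date> with no opf:event attribute is
     * treated as the creation date; other events stay unclassified. */
    return saw_event ? OPF_TAG_TYPE_UNKNOWN : OPF_TAG_TYPE_CREATED;
}
```

Under this sketch, `<dc:date>2011-04-13</dc:date>` and `<dc:date opf:event="original-publication">1922</dc:date>` both map to OPF_TAG_TYPE_CREATED, while `opf:event="epub-publication"` is left as OPF_TAG_TYPE_UNKNOWN. Option (b) above would amount to returning OPF_TAG_TYPE_CREATED unconditionally.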
GNOME is going to shut down bugzilla.gnome.org in favor of gitlab.gnome.org. As part of that, we are mass-closing older open tickets in bugzilla.gnome.org which have not seen updates for a longer time (resources are unfortunately quite limited so not every ticket can get handled). If you can still reproduce the situation described in this ticket in a recent and supported software version, then please follow https://wiki.gnome.org/GettingInTouch/BugReportingGuidelines and create a new enhancement request ticket at https://gitlab.gnome.org/GNOME/tracker/-/issues/ Thank you for your understanding and your help.