Bug 735645 – EPub extractor bug fixes

After an evaluation, GNOME has moved from Bugzilla to GitLab. Learn more about GitLab.
No new issues can be reported in GNOME Bugzilla anymore.
To report an issue in a GNOME project, go to GNOME GitLab.
Do not go to GNOME Gitlab for: Bluefish, Doxygen, GnuCash, GStreamer, java-gnome, LDTP, NetworkManager, Tomboy.

Bug 735645 - EPub extractor bug fixes


Summary:	EPub extractor bug fixes


Status:	RESOLVED OBSOLETE

Product:	tracker
Classification:	Core
Component:	Extractor
Version:	unspecified
Hardware:	Other All

Importance:	Normal normal
Target Milestone:	---
Assigned To:	tracker-extractor
QA Contact:	tracker-extractor

URL:
Whiteboard:

Depends on:
Blocks:

Reported:	2014-08-28 22:32 UTC by Bastien Nocera
Modified:	2021-05-26 22:23 UTC

See Also:
GNOME target:	---
GNOME version:	---

Attachments
tracker-extract: Add fallback for creation date in EPubs (1.04 KB, patch) 2014-08-28 22:32 UTC, Bastien Nocera	rejected	Details \| Review
tracker-extract: Show where parsing errors happen in EPubs (1.39 KB, patch) 2014-08-28 22:32 UTC, Bastien Nocera	committed	Details \| Review
tracker-extract: Try harder when getting EPub contents (2.33 KB, patch) 2014-08-28 22:32 UTC, Bastien Nocera	needs-work	Details \| Review
tracker-extract: Try harder when getting EPub contents (2.33 KB, patch) 2014-08-31 21:14 UTC, Bastien Nocera	committed	Details \| Review

Description Bastien Nocera 2014-08-28 22:32:18 UTC

Comment 1 Bastien Nocera 2014-08-28 22:32:21 UTC

Created attachment 284761 [details] [review]
tracker-extract: Add fallback for creation date in EPubs

If we only have this in the OPF file:
<dc:date>2011-04-13</dc:date>
use this as the date.

Comment 2 Bastien Nocera 2014-08-28 22:32:27 UTC

Created attachment 284762 [details] [review]
tracker-extract: Show where parsing errors happen in EPubs

Error extracting EPUB contents (OEBPS/Text/info.xhtml): Error on line 59: Entity name 'copy' is not known
is better than:
Error extracting EPUB contents: Error on line 59: Entity name 'copy' is not known

Comment 3 Bastien Nocera 2014-08-28 22:32:37 UTC

Created attachment 284763 [details] [review]
tracker-extract: Try harder when getting EPub contents

GMarkup is really not that good at parsing XML, so we need to try
harder to ignore errors parsing the contents of EPub files, and
populate the index with *some* data.

Comment 4 Martyn Russell 2014-08-29 09:09:05 UTC

Comment on attachment 284761 [details] [review]
tracker-extract: Add fallback for creation date in EPubs

The patch here looks OK, I am more concerned with why the data->element == OPF_TAG_TYPE_UNKNOWN here.

1. Are there multiple dc:date cases?

2. Related to #1, do we even need the conditional check before setting the element to OPF_TAG_TYPE_CREATED? I wonder if we even need that entire attribute checking block of code, what else would dc:date be used for?

Comment 5 Martyn Russell 2014-08-29 09:09:50 UTC

Comment on attachment 284762 [details] [review]
tracker-extract: Show where parsing errors happen in EPubs

Looks good thanks Bastien!

Comment 6 Martyn Russell 2014-08-29 09:13:03 UTC

Comment on attachment 284763 [details] [review]
tracker-extract: Try harder when getting EPub contents

My main concern here is that I prefer not to have errors or warnings as debug logging. I would use g_message() here.

Also, just for consistency, we use g_warning ("Foo '%s', %s", file, error->message) most other places in the code base. Just a nitpick though :)

When done please commit, looks fine otherwise.

Comment 7 Bastien Nocera 2014-08-29 14:22:14 UTC

Comment on attachment 284762 [details] [review]
tracker-extract: Show where parsing errors happen in EPubs

Attachment 284762 [details] pushed as ba23d6e - tracker-extract: Show where parsing errors happen in EPubs

Comment 8 Bastien Nocera 2014-08-31 21:14:33 UTC

Created attachment 284961 [details] [review]
tracker-extract: Try harder when getting EPub contents

GMarkup is really not that good at parsing XML, so we need to try
harder to ignore errors parsing the contents of EPub files, and
populate the index with *some* data.

Comment 9 Bastien Nocera 2014-08-31 21:16:04 UTC

Comment on attachment 284961 [details] [review]
tracker-extract: Try harder when getting EPub contents

Attachment 284961 [details] pushed as 3e993a9 - tracker-extract: Try harder when getting EPub contents

Comment 10 Bastien Nocera 2014-08-31 21:18:22 UTC

(In reply to comment #4)
> (From update of attachment 284761 [details] [review])
> The patch here looks OK, I am more concerned with why the data->element ==
> OPF_TAG_TYPE_UNKNOWN here.
> 
> 1. Are there multiple dc:date cases?

Yes.

> 2. Related to #1, do we even need the conditional check before setting the
> element to OPF_TAG_TYPE_CREATED? I wonder if we even need that entire attribute
> checking block of code, what else would dc:date be used for?

See at http://netkingcol.blogspot.co.uk/2010/01/closer-look-at-opf.html for example:
<dc:date opf:event="original-publication">1922</dc:date>
<dc:date opf:event="epub-publication">2009-09-24</dc:date>

Comment 11 Martyn Russell 2014-09-02 08:40:53 UTC

(In reply to comment #10)
> (In reply to comment #4)
> > (From update of attachment 284761 [details] [review] [details])
> > The patch here looks OK, I am more concerned with why the data->element ==
> > OPF_TAG_TYPE_UNKNOWN here.
> > 
> > 1. Are there multiple dc:date cases?
> 
> Yes.

OK, so from the link you gave (below):

"The set of values for event are not defined by this specification; possible values may include: creation, publication, and modification."

Which might explain why there is ONLY one date tag (OPF_TAG_TYPE_CREATED), because we can't know which it is anyway.

> > 2. Related to #1, do we even need the conditional check before setting the
> > element to OPF_TAG_TYPE_CREATED? I wonder if we even need that entire attribute
> > checking block of code, what else would dc:date be used for?
> 
> See at http://netkingcol.blogspot.co.uk/2010/01/closer-look-at-opf.html for
> example:
> <dc:date opf:event="original-publication">1922</dc:date>
> <dc:date opf:event="epub-publication">2009-09-24</dc:date>

So, we should either:

a. Check for "epub-publication", to capture all date cases.

b. Remove the entire block if we only have one type of date (OPF_TAG_TYPE_CREATED):

	for (i = 0; attribute_names[i] != NULL; i++) {
		if (g_strcmp0 (attribute_names[i], "opf:event") == 0 &&
		    g_strcmp0 (attribute_values[i], "original-publication") == 0) {		
			data->element = OPF_TAG_TYPE_CREATED;
			break;
		}
	}

I don't see any value in parsing the attributes at all here.

Comment 12 Sam Thursfield 2021-05-26 22:23:19 UTC

GNOME is going to shut down bugzilla.gnome.org in favor of gitlab.gnome.org.
As part of that, we are mass-closing older open tickets in bugzilla.gnome.org
which have not seen updates for a longer time (resources are unfortunately
quite limited so not every ticket can get handled).

If you can still reproduce the situation described in this ticket in a recent
and supported software version, then please follow
  https://wiki.gnome.org/GettingInTouch/BugReportingGuidelines
and create a new enhancement request ticket at
  https://gitlab.gnome.org/GNOME/tracker/-/issues/

Thank you for your understanding and your help.