Bug 616493 – Remove unzip dependency from the OASIS extractor

After an evaluation, GNOME has moved from Bugzilla to GitLab. Learn more about GitLab.
No new issues can be reported in GNOME Bugzilla anymore.
To report an issue in a GNOME project, go to GNOME GitLab.
Do not go to GNOME Gitlab for: Bluefish, Doxygen, GnuCash, GStreamer, java-gnome, LDTP, NetworkManager, Tomboy.

Bug 616493 - Remove unzip dependency from the OASIS extractor


Summary:	Remove unzip dependency from the OASIS extractor


Status:	RESOLVED FIXED

Product:	tracker
Classification:	Core
Component:	Extractor
Version:	0.9.x
Hardware:	Other Linux

Importance:	Normal minor
Target Milestone:	---
Assigned To:	tracker-extractor
QA Contact:	Jamie McCracken

URL:
Whiteboard:

Depends on:
Blocks:

Reported:	2010-04-22 09:33 UTC by Aleksander Morgado
Modified:	2016-12-15 14:07 UTC

See Also:
GNOME target:	---
GNOME version:	---

Attachments
Use libgsf instead of unzip in the OASIS extractor (17.79 KB, patch) 2010-04-22 09:43 UTC, Aleksander Morgado	none	Details \| Review
Fixed pvanhoof's comments (17.79 KB, patch) 2010-04-26 13:52 UTC, Aleksander Morgado	needs-work	Details \| Review
Fixed martyn's comments (17.38 KB, patch) 2010-04-27 09:34 UTC, Aleksander Morgado	none	Details \| Review

Description Aleksander Morgado 2010-04-22 09:33:48 UTC

Following changes stated in bug #615765, for OASIS/metadata files.

Currently, OASIS extractor uses "unzip -p" to get the contents of the meta.xml file inside the OASIS file. This dependency can be removed using libgsf, in the same way as done for msoffice/docx files.

Comment 1 Aleksander Morgado 2010-04-22 09:43:40 UTC

Created attachment 159318 [details] [review]
Use libgsf instead of unzip in the OASIS extractor

Created new files for the code common to both msoffice and oasis extractors, in tracker-extract. The idea of putting this common code in libtracker-extract was not so good as the common code was libgsf-dependent, and libgsf is optional during compilation.

Comment 2 Philip Van Hoof 2010-04-26 09:40:23 UTC

Why are you adding these checks? I don't see them in the original code. Also, if you do want to check for it, shouldn't you report an error then? Not sure if it's necessary and a line above the construct I see a "info.content = g_string_new ("");", so since you aren't freeing it at "return" you're probably leaking that in case of (!context).


+	if (!context) {
+		return;
+	}

Please move these near the top of the file, and our usual style is no space between the # character and define.

+	/* Size of the buffer to use */
+#   define XML_BUFFER_SIZE            8192         /* bytes */
+	/* Note: 20 MBytes of max size is really assumed to be a safe limit. */
+#   define XML_MAX_BYTES_READ         (20u << 20)  /* bytes */



I like a lot of code-commenting, but this is a bit overkill:

+	/* it's safe to call g_free on NULL pointers */

It's also safe to call g_clear_error on NULL, if you want to avoid the if statements

+	if (error)
+		g_error_free (error);

Comment 3 Aleksander Morgado 2010-04-26 13:37:33 UTC

> Why are you adding these checks? I don't see them in the original code. Also,
> if you do want to check for it, shouldn't you report an error then? Not sure if
> it's necessary and a line above the construct I see a "info.content =
> g_string_new ("");", so since you aren't freeing it at "return" you're probably
> leaking that in case of (!context).
> 
> 
> +    if (!context) {
> +        return;
> +    }
> 

Yeah, actually not needed as long as the address of the parser is a proper one. Removing the checks then.


> Please move these near the top of the file, and our usual style is no space
> between the # character and define.
> 
> +    /* Size of the buffer to use */
> +#   define XML_BUFFER_SIZE            8192         /* bytes */
> +    /* Note: 20 MBytes of max size is really assumed to be a safe limit. */
> +#   define XML_MAX_BYTES_READ         (20u << 20)  /* bytes */
> 


Ok.

> 
> I like a lot of code-commenting, but this is a bit overkill:
> 
> +    /* it's safe to call g_free on NULL pointers */
> 

Ok, removing it :-)

> It's also safe to call g_clear_error on NULL, if you want to avoid the if
> statements
> 
> +    if (error)
> +        g_error_free (error);

In this specific case where error variable is no longer used, it's probably better to call g_error_free() directly.

Comment 4 Aleksander Morgado 2010-04-26 13:52:39 UTC

Created attachment 159597 [details] [review]
Fixed pvanhoof's comments

Comment 5 Philip Van Hoof 2010-04-26 17:02:59 UTC

For me that commit is good. Perhaps ask Martyn or Carlos to do a quick review?

Comment 6 Philip Van Hoof 2010-04-26 17:04:06 UTC

Review of attachment 159597 [details] [review]:

I'll set it to accepted-commit_now. Up to you if you want a Martyn-or-Carlos review on top.

Comment 7 Martyn Russell 2010-04-27 08:41:24 UTC

Comment on attachment 159597 [details] [review]
Fixed pvanhoof's comments

>+if HAVE_LIBGSF
>+tracker_extract_SOURCES += tracker-gsf-common.c tracker-gsf-common.h
>+tracker_extract_LDADD += $(LIBGSF_LIBS)
>+endif

I don't think we need to call it -common, tracker-gsf.[ch] should be fine. 

> static gboolean
> xml_read (MsOfficeXMLParserInfo *parser_info,
>           const gchar           *xml_filename,
>@@ -2254,7 +2145,9 @@ xml_read (MsOfficeXMLParserInfo *parser_info,
> 	if (context) {
> 		/* Load the internal XML file from the Zip archive, and parse it
> 		 * using the given context */
>-		parse_xml_contents (parser_info->uri, xml_filename, context);
>+		tracker_gsf_common_parse_xml_in_zip (parser_info->uri,
>+		                                     xml_filename,
>+		                                     context);

Yea, using _common_ in the API is a bit ugly here, I would prefer tracker_gsf_parse_*()

> 	context = g_markup_parse_context_new (&parser, 0, &info, NULL);
>+	if (!context) {
>+		return;
>+	}

Hmm, I presume the NULL is a space for GError, we should probably make use of that so we know why markup parsing fails if it does. Otherwise we will get reports like "my file doesn't extract properly or I don't see the details in tracker..." and we won't be able to see from the logs quickly why this is the case.

>-	/* No need to unlink meta.xml, as it goes to stdout of the
>-	 *  spawned child (-p option in unzip) */
>+	/* ---- First, parse metadata ---- */

Hmm, I don't think we need the ---- in there.

>+	/* ---- Next, parse contents ---- */

Same here.

>diff --git a/src/tracker-extract/tracker-gsf-common.c b/src/tracker-extract/tracker-gsf-common.c
>new file mode 100644
>index 0000000..a4072d0
>--- /dev/null
>+++ b/src/tracker-extract/tracker-gsf-common.c
>@@ -0,0 +1,153 @@
>+/*
>+ * Copyright (C) 2010 Lanedo GmbH <aleksander@lanedo.com>

We should use Copyright to Nokia. Since they're paying for the work to be done and that's what we do everywhere else. This is only different if done in your spare time and not used in the Maemo/Meego.

We use (Ivan is the contact we use in all cases right now):

  Copyright (C) 2010, Nokia <ivan.frade@nokia.com>

>+static GsfInput *
>+find_member (GsfInfile *arch,
>+             char const *name)
>+{
>+	gchar const *slash = strchr (name, '/');

Probably a silly question, but why not const gchar* ? I am thinking more for consistency with the rest of Tracker really.

>+	if (slash) {
>+		gchar *dirname = g_strndup (name, slash - name);

I prefer function calls like this to be done after variable declaration, otherwise it just gets messy. Exceptions are things like integers/NULL/etc.

>+/**
>+ * tracker_parse_xml_in_zip:

Hmm, this function name doesn't match the actual function name and would cause gtkdoc warnings.

>+ * @zip_file_uri: URI of the ZIP archive
>+ * @xml_filename: Name of the XML file stored inside the ZIP archive
>+ * @context: Markup context to be used when parsing the XML
>+ *
>+ * This function reads and parses the contents of an XML file stored
>+ *  inside a ZIP compressed archive. Reading and parsing is done buffered, and
>+ *  maximum size of the uncompressed XML file is limited to be to 20MBytes.

Just wondering, why indent 2nd and 3rd line above?

>+tracker_gsf_common_parse_xml_in_zip (const gchar *zip_file_uri,
>+                                     const gchar *xml_filename,
>+                                     GMarkupParseContext *context)

Coding style, the above function doesn't line up parameters.

>+{
>+	gchar *filename;
>+	GError *error = NULL;
>+	GsfInfile *infile = NULL;;
>+	GsfInput *src = NULL;
>+	GsfInput *member = NULL;
>+
>+	/* Size of the buffer to use */
>+#   define XML_BUFFER_SIZE            8192         /* bytes */
>+	/* Note: 20 MBytes of max size is really assumed to be a safe limit. */
>+#   define XML_MAX_BYTES_READ         (20u << 20)  /* bytes */

Please don't use #defines inside functions, as Philip already said, we prefer those at the top of the file.

>+#   undef XML_BUFFER_SIZE
>+#   undef XML_MAX_BYTES_READ

Not sure why we're undefing these.

--

Generally a good patch. I wonder if this would be more useful in libtracker-extract/ though rather than tracker-extract/ If so the license would be changed to LGPL of course. Depends how useful this is outside our extractors internally. It probably isn't that useful.

So if you could rename those s/tracker-gsf-common/tracker-gsf/g I would be happy :)

Thanks for the patch Aleksander :)

Comment 8 Aleksander Morgado 2010-04-27 08:59:30 UTC

> >+if HAVE_LIBGSF
> >+tracker_extract_SOURCES += tracker-gsf-common.c tracker-gsf-common.h
> >+tracker_extract_LDADD += $(LIBGSF_LIBS)
> >+endif
> 
> I don't think we need to call it -common, tracker-gsf.[ch] should be fine. 
> 

Ok.

> > static gboolean
> > xml_read (MsOfficeXMLParserInfo *parser_info,
> >           const gchar           *xml_filename,
> >@@ -2254,7 +2145,9 @@ xml_read (MsOfficeXMLParserInfo *parser_info,
> > 	if (context) {
> > 		/* Load the internal XML file from the Zip archive, and parse it
> > 		 * using the given context */
> >-		parse_xml_contents (parser_info->uri, xml_filename, context);
> >+		tracker_gsf_common_parse_xml_in_zip (parser_info->uri,
> >+		                                     xml_filename,
> >+		                                     context);
> 
> Yea, using _common_ in the API is a bit ugly here, I would prefer
> tracker_gsf_parse_*()
> 

Ok.

> > 	context = g_markup_parse_context_new (&parser, 0, &info, NULL);
> >+	if (!context) {
> >+		return;
> >+	}
> 
> Hmm, I presume the NULL is a space for GError, we should probably make use of
> that so we know why markup parsing fails if it does. Otherwise we will get
> reports like "my file doesn't extract properly or I don't see the details in
> tracker..." and we won't be able to see from the logs quickly why this is the
> case.
> 

Actually it's not space for a GError. The only situation where g_markup_parse_context_new() returns NULL is when &parser is NULL, which cannot happen in this case as parser is in stack. The last parameter is a GDestroyNotify. Anyway, I believe I removed the check for (!context) in the updated patch after pvanhoof's comments.

> >-	/* No need to unlink meta.xml, as it goes to stdout of the
> >-	 *  spawned child (-p option in unzip) */
> >+	/* ---- First, parse metadata ---- */
> 
> Hmm, I don't think we need the ---- in there.

Ok.

> 
> >+	/* ---- Next, parse contents ---- */
> 
> Same here.

Ok.

> 
> >diff --git a/src/tracker-extract/tracker-gsf-common.c b/src/tracker-extract/tracker-gsf-common.c
> >new file mode 100644
> >index 0000000..a4072d0
> >--- /dev/null
> >+++ b/src/tracker-extract/tracker-gsf-common.c
> >@@ -0,0 +1,153 @@
> >+/*
> >+ * Copyright (C) 2010 Lanedo GmbH <aleksander@lanedo.com>
> 
> We should use Copyright to Nokia. Since they're paying for the work to be done
> and that's what we do everywhere else. This is only different if done in your
> spare time and not used in the Maemo/Meego.
> 
> We use (Ivan is the contact we use in all cases right now):
> 
>   Copyright (C) 2010, Nokia <ivan.frade@nokia.com>
> 

Ah, understood.

> >+static GsfInput *
> >+find_member (GsfInfile *arch,
> >+             char const *name)
> >+{
> >+	gchar const *slash = strchr (name, '/');
> 
> Probably a silly question, but why not const gchar* ? I am thinking more for
> consistency with the rest of Tracker really.

That's actually coming from the original find_member() method here:
http://vsdump.sourcearchive.com/documentation/0.0.44/vsd__utils_8c-source.html

Will change it anyway.

> 
> >+	if (slash) {
> >+		gchar *dirname = g_strndup (name, slash - name);
> 
> I prefer function calls like this to be done after variable declaration,
> otherwise it just gets messy. Exceptions are things like integers/NULL/etc.
> 

Yeah, same here. Will also change it.

> >+/**
> >+ * tracker_parse_xml_in_zip:
> 
> Hmm, this function name doesn't match the actual function name and would cause
> gtkdoc warnings.

Oops :-)

> 
> >+ * @zip_file_uri: URI of the ZIP archive
> >+ * @xml_filename: Name of the XML file stored inside the ZIP archive
> >+ * @context: Markup context to be used when parsing the XML
> >+ *
> >+ * This function reads and parses the contents of an XML file stored
> >+ *  inside a ZIP compressed archive. Reading and parsing is done buffered, and
> >+ *  maximum size of the uncompressed XML file is limited to be to 20MBytes.
> 
> Just wondering, why indent 2nd and 3rd line above?

Bah, just because they're in the same paragraph. So if multiple paragraphs included one after the other it's easier to read it in the source code.

> 
> >+tracker_gsf_common_parse_xml_in_zip (const gchar *zip_file_uri,
> >+                                     const gchar *xml_filename,
> >+                                     GMarkupParseContext *context)
> 
> Coding style, the above function doesn't line up parameters.

Ok.

> 
> >+{
> >+	gchar *filename;
> >+	GError *error = NULL;
> >+	GsfInfile *infile = NULL;;
> >+	GsfInput *src = NULL;
> >+	GsfInput *member = NULL;
> >+
> >+	/* Size of the buffer to use */
> >+#   define XML_BUFFER_SIZE            8192         /* bytes */
> >+	/* Note: 20 MBytes of max size is really assumed to be a safe limit. */
> >+#   define XML_MAX_BYTES_READ         (20u << 20)  /* bytes */
> 
> Please don't use #defines inside functions, as Philip already said, we prefer
> those at the top of the file.
> 
> >+#   undef XML_BUFFER_SIZE
> >+#   undef XML_MAX_BYTES_READ
> 
> Not sure why we're undefing these.

It seems the second patch I uploaded was exactly the same as the first one without pvanhoof's comments fixed, because I had also fixed that. Bad day yesterday... :-)

> 
> --
> 
> Generally a good patch. I wonder if this would be more useful in
> libtracker-extract/ though rather than tracker-extract/ If so the license would
> be changed to LGPL of course. Depends how useful this is outside our extractors
> internally. It probably isn't that useful.
> 

Actually, left it in tracker-extract at the end and not in libtracker-extract because libgsf was a compilation option, and was not a good idea probably to have a method in libtracker-extract which was only available if libgsf was being used.

> So if you could rename those s/tracker-gsf-common/tracker-gsf/g I would be
> happy :)
> 

Sure, updating the patch now.

Comment 9 Aleksander Morgado 2010-04-27 09:34:00 UTC

Created attachment 159677 [details] [review]
Fixed martyn's comments

Comment 10 Aleksander Morgado 2010-04-27 10:19:12 UTC

Pushed to git master after martyn's approval.