Bug 615948 – Improve reading msoffice/xml files

After an evaluation, GNOME has moved from Bugzilla to GitLab. Learn more about GitLab.
No new issues can be reported in GNOME Bugzilla anymore.
To report an issue in a GNOME project, go to GNOME GitLab.
Do not go to GNOME Gitlab for: Bluefish, Doxygen, GnuCash, GStreamer, java-gnome, LDTP, NetworkManager, Tomboy.

Bug 615948 - Improve reading msoffice/xml files


Summary:	Improve reading msoffice/xml files


Status:	RESOLVED FIXED

Product:	tracker
Classification:	Core
Component:	Extractor
Version:	0.9.x
Hardware:	Other Linux

Importance:	Normal normal
Target Milestone:	---
Assigned To:	tracker-extractor
QA Contact:	Jamie McCracken

URL:
Whiteboard:

Depends on:
Blocks:

Reported:	2010-04-16 11:26 UTC by Aleksander Morgado
Modified:	2010-04-20 20:00 UTC

See Also:
GNOME target:	---
GNOME version:	---

Attachments
Buffered reading with no extra heap, and limited to 20MBytes (4.44 KB, patch) 2010-04-16 11:36 UTC, Aleksander Morgado	needs-work	Details \| Review

Description Aleksander Morgado 2010-04-16 11:26:24 UTC

As background, msoffice/xml (docx) are just zip archives containing XML files.

As per bug #615765, the contents of the msoffice/xml (docx) files are currently read in the following way:
 * Using libgsf, load the contents of the whole uncompressed XML file in a newly allocated string in heap.
 * Pass the whole string to the GMarkupParseContext parser

This has two main problems:
 * Unsafe (the docx may be malicious, and the uncompressed XML file may be of Gigabytes)
 * The whole XML document is loaded in heap, and then parsed

Currently, this can be improved in the following way:
 * Limit the max number of bytes to be read from the XML file, to some safe limit like 20 MBytes.
 * Don't load the whole doc in heap. Use a buffer in stack to read the contents chunk by chunk and pass each of them to the GMarkupParseContext.

Comment 1 Aleksander Morgado 2010-04-16 11:36:44 UTC

Created attachment 158887 [details] [review]
Buffered reading with no extra heap, and limited to 20MBytes

Comment 2 Martyn Russell 2010-04-20 12:58:07 UTC

Comment on attachment 158887 [details] [review]
Buffered reading with no extra heap, and limited to 20MBytes

>+		guint8 buf [XML_BUFFER_SIZE];

No space between variable and square brackets needed.

>+		while ((accum  <= XML_MAX_BYTES_READ) &&
>+		       (chunk_size > 0) &&
>+		       (gsf_input_read (GSF_INPUT (member), chunk_size, buf) != NULL)) {

No need for extra brackets.

>+			/* update accumulated count */
>+			accum += chunk_size;

Space needed between next comment below and this previous code block.

>+			/* Pass the read stream to the context parser... */
>+			g_markup_parse_context_parse (context, buf, chunk_size, NULL);

Another space here please.

>+			/* update bytes to be read */
>+			remaining_size -= chunk_size;
>+			chunk_size = MIN (remaining_size, XML_BUFFER_SIZE);

Rest looks fine to me.
Thanks for the patch Alexander.
You can commit after fixing the small issues above, thanks.

Comment 3 Aleksander Morgado 2010-04-20 20:00:20 UTC

In git master now.