After an evaluation, GNOME has moved from Bugzilla to GitLab. Learn more about GitLab.
No new issues can be reported in GNOME Bugzilla anymore.
To report an issue in a GNOME project, go to GNOME GitLab.
Do not go to GNOME Gitlab for: Bluefish, Doxygen, GnuCash, GStreamer, java-gnome, LDTP, NetworkManager, Tomboy.
Bug 615948 - Improve reading msoffice/xml files
Improve reading msoffice/xml files
Status: RESOLVED FIXED
Product: tracker
Classification: Core
Component: Extractor
0.9.x
Other Linux
: Normal normal
: ---
Assigned To: tracker-extractor
Jamie McCracken
Depends on:
Blocks:
 
 
Reported: 2010-04-16 11:26 UTC by Aleksander Morgado
Modified: 2010-04-20 20:00 UTC
See Also:
GNOME target: ---
GNOME version: ---


Attachments
Buffered reading with no extra heap, and limited to 20MBytes (4.44 KB, patch)
2010-04-16 11:36 UTC, Aleksander Morgado
needs-work Details | Review

Description Aleksander Morgado 2010-04-16 11:26:24 UTC
As background, msoffice/xml (docx) are just zip archives containing XML files.

As per bug #615765, the contents of the msoffice/xml (docx) files are currently read in the following way:
 * Using libgsf, load the contents of the whole uncompressed XML file in a newly allocated string in heap.
 * Pass the whole string to the GMarkupParseContext parser

This has two main problems:
 * Unsafe (the docx may be malicious, and the uncompressed XML file may be of Gigabytes)
 * The whole XML document is loaded in heap, and then parsed

Currently, this can be improved in the following way:
 * Limit the max number of bytes to be read from the XML file, to some safe limit like 20 MBytes.
 * Don't load the whole doc in heap. Use a buffer in stack to read the contents chunk by chunk and pass each of them to the GMarkupParseContext.
Comment 1 Aleksander Morgado 2010-04-16 11:36:44 UTC
Created attachment 158887 [details] [review]
Buffered reading with no extra heap, and limited to 20MBytes
Comment 2 Martyn Russell 2010-04-20 12:58:07 UTC
Comment on attachment 158887 [details] [review]
Buffered reading with no extra heap, and limited to 20MBytes

>+		guint8 buf [XML_BUFFER_SIZE];

No space between variable and square brackets needed.

>+		while ((accum  <= XML_MAX_BYTES_READ) &&
>+		       (chunk_size > 0) &&
>+		       (gsf_input_read (GSF_INPUT (member), chunk_size, buf) != NULL)) {

No need for extra brackets.

>+			/* update accumulated count */
>+			accum += chunk_size;

Space needed between next comment below and this previous code block.

>+			/* Pass the read stream to the context parser... */
>+			g_markup_parse_context_parse (context, buf, chunk_size, NULL);

Another space here please.

>+			/* update bytes to be read */
>+			remaining_size -= chunk_size;
>+			chunk_size = MIN (remaining_size, XML_BUFFER_SIZE);

Rest looks fine to me.
Thanks for the patch Alexander.
You can commit after fixing the small issues above, thanks.
Comment 3 Aleksander Morgado 2010-04-20 20:00:20 UTC
In git master now.