GNOME Bugzilla – Bug 615948
Improve reading msoffice/xml files
Last modified: 2010-04-20 20:00:20 UTC
As background, msoffice/xml (docx) files are just zip archives containing XML files. As per bug #615765, the contents of msoffice/xml (docx) files are currently read in the following way:
* Using libgsf, load the whole uncompressed XML file into a newly allocated string on the heap.
* Pass the whole string to the GMarkupParseContext parser.

This has two main problems:
* It is unsafe: the docx may be malicious, and the uncompressed XML file may be gigabytes in size.
* The whole XML document is loaded onto the heap before any of it is parsed.

This can be improved in the following way:
* Limit the maximum number of bytes read from the XML file to some safe limit, such as 20 MBytes.
* Don't load the whole document onto the heap. Use a buffer on the stack to read the contents chunk by chunk, and pass each chunk to the GMarkupParseContext.
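The proposed approach can be sketched roughly as follows. This is only an illustration in plain C, not the attached patch: `mem_input`, `mem_input_read`, `chunk_sink`, `count_bytes` and `read_chunked` are hypothetical stand-ins for the libgsf input and the GMarkupParseContext parser, and the 8192-byte value of `XML_BUFFER_SIZE` is an assumption (only the 20 MBytes limit comes from the report).

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define XML_BUFFER_SIZE    8192                    /* assumed chunk size */
#define XML_MAX_BYTES_READ (20u * 1024u * 1024u)   /* 20 MBytes limit from the report */

/* Hypothetical stand-in for the libgsf input: a memory block with a cursor. */
typedef struct {
    const unsigned char *data;
    size_t size;
    size_t pos;
} mem_input;

/* Hypothetical stand-in for the parser: receives one chunk at a time. */
typedef void (*chunk_sink)(const unsigned char *chunk, size_t len, void *user_data);

/* Copy up to n bytes into buf, advance the cursor, return the bytes copied. */
static size_t mem_input_read(mem_input *in, size_t n, unsigned char *buf)
{
    size_t avail = in->size - in->pos;
    if (n > avail)
        n = avail;
    memcpy(buf, in->data + in->pos, n);
    in->pos += n;
    return n;
}

/* Example sink that only counts the bytes it is handed. */
static void count_bytes(const unsigned char *chunk, size_t len, void *user_data)
{
    (void)chunk;
    *(size_t *)user_data += len;
}

/* Read the input chunk by chunk through a stack buffer, never handing more
 * than max_bytes in total to the sink. Returns the number of bytes passed on. */
static size_t read_chunked(mem_input *in, size_t max_bytes,
                           chunk_sink sink, void *user_data)
{
    unsigned char buf[XML_BUFFER_SIZE];   /* lives on the stack, no heap copy */
    size_t remaining;
    size_t accum = 0;

    assert(in != NULL && sink != NULL);
    remaining = in->size - in->pos;

    while (remaining > 0 && accum < max_bytes) {
        size_t chunk = remaining < XML_BUFFER_SIZE ? remaining : XML_BUFFER_SIZE;

        /* Clamp the last chunk so the total never exceeds the limit. */
        if (chunk > max_bytes - accum)
            chunk = max_bytes - accum;

        if (mem_input_read(in, chunk, buf) != chunk)
            break;                        /* short read: stop early */

        sink(buf, chunk, user_data);
        accum += chunk;
        remaining -= chunk;
    }
    return accum;
}
```

In the real code the reading role would be played by gsf_input_read and the sink by g_markup_parse_context_parse, which accepts a document in successive pieces. The sketch clamps the final chunk so that the limit is a hard upper bound; whether the actual patch does the same is not something this sketch claims.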
Created attachment 158887
Buffered reading with no extra heap, and limited to 20MBytes
Comment on attachment 158887
Buffered reading with no extra heap, and limited to 20MBytes

>+ guint8 buf [XML_BUFFER_SIZE];

No space between variable and square brackets needed.

>+ while ((accum <= XML_MAX_BYTES_READ) &&
>+ (chunk_size > 0) &&
>+ (gsf_input_read (GSF_INPUT (member), chunk_size, buf) != NULL)) {

No need for extra brackets.

>+ /* update accumulated count */
>+ accum += chunk_size;

Space needed between next comment below and this previous code block.

>+ /* Pass the read stream to the context parser... */
>+ g_markup_parse_context_parse (context, buf, chunk_size, NULL);

Another space here please.

>+ /* update bytes to be read */
>+ remaining_size -= chunk_size;
>+ chunk_size = MIN (remaining_size, XML_BUFFER_SIZE);

Rest looks fine to me.

Thanks for the patch Alexander. You can commit after fixing the small issues above, thanks.
In git master now.