GNOME Bugzilla – Bug 615765
Unsafe extracting of data when using heap to store the whole file
Last modified: 2010-04-26 14:01:40 UTC
Some extractors may have problems with big files, as they load the whole file contents into the heap. This can be quite dangerous with file formats which really are compressed files, as you don't know the real compression ratio of the file being decompressed: if the whole decompressed document turns out to be gigabytes in size and is read into memory, it may crash the process (think of malicious 'bomb' files).

The following list shows which extractors may be affected by this issue:

* oasis extractor: Forks & spawns (g_spawn_sync) into an unzip process, and the whole output of the execution is stored in a newly allocated string in heap. Actually, the stdout of the "unzip -p" command being executed is the whole uncompressed file content, which may get too big. Solution: either use g_spawn_async_with_pipes() and read the output buffered up to a maximum (as in the ps-gz extractor, see below), or use libgsf to read the contents chunk by chunk without the need of forking (a rough sketch of the bounded-pipe idea follows this comment).

* msoffice extractor: Reads (chunk by chunk) the whole file contents into memory, converts them to UTF-8, and only afterwards are the contents normalized and limited to the configured max words. A mixed method of normalizing and counting words while reading would probably be less heap-consuming and faster, as you could stop when the max words is reached.

* msoffice/xml extractor: (after the fix for GB#615035) Decompresses the Zip file(s) contents using libgsf, and loads the whole decompressed file(s) contents into newly allocated memory in heap. Not sure anyway if the contents are being indexed at all; I can't see any check for counting words up to the configured maximum. The same approach as for OASIS should be used, as in the end they're pretty similar (both are Zip files with XML docs inside).

* totem extractor: Forks & spawns (g_spawn_sync) into a totem-video-indexer process, and the whole output of the execution is stored in a newly allocated string in heap. As long as the output of totem-video-indexer is not too big this shouldn't give any problem, so not really a big issue.

* mplayer extractor: Same as totem, but calling mplayer instead.

Somewhat safer extractors:

* ps-gz extractor: Forks & spawns (g_spawn_async_with_pipes) into a gunzip process. This is actually considerably safer than the g_spawn_sync method, as the stdout of the executed process is readable from a file descriptor rather than fully stored in heap, so the decompressed file can be read in chunks. Anyway, this extractor just extracts the file contents into a temp file, with the size of the decompressed file limited to 20 MB (hardcoded), and then its contents are extracted with the normal ps extractor. It would be much faster if it avoided the need for the temp file.

* pdf extractor: This one is probably the safest. It reads the contents of the PDF file page by page, normalizing and counting words in each iteration, and stopping either when there are no more pages or when the number of words reaches the configured maximum. Just added to the list to indicate that all the other extractors should follow this same approach where possible.

Please, comments are welcome, as I may be missing things here.
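To make the suggestion above concrete, here is a minimal sketch of the bounded-pipe approach, assuming GLib. It is not the actual Tracker code: the MAX_BYTES cap and the helper name are illustrative assumptions. The idea is to spawn "unzip -p" with g_spawn_async_with_pipes() and stop reading its stdout once the cap is reached.

#include <glib.h>
#include <string.h>
#include <unistd.h>

#define MAX_BYTES (1024 * 1024)  /* hypothetical cap on extracted content */

/* Sketch only: spawn "unzip -p" and read its stdout through a pipe,
 * chunk by chunk, never letting the buffer grow past MAX_BYTES. */
static gchar *
read_unzip_bounded (const gchar *zip_path, const gchar *member, GError **error)
{
        gchar *argv[] = { (gchar *) "unzip", (gchar *) "-p",
                          (gchar *) zip_path, (gchar *) member, NULL };
        gint stdout_fd;
        GPid pid;
        GString *text;
        gchar buf[4096];
        gssize n;

        if (!g_spawn_async_with_pipes (NULL, argv, NULL,
                                       G_SPAWN_SEARCH_PATH,
                                       NULL, NULL, &pid,
                                       NULL, &stdout_fd, NULL,
                                       error))
                return NULL;

        text = g_string_new (NULL);

        while (text->len < MAX_BYTES &&
               (n = read (stdout_fd, buf, sizeof (buf))) > 0) {
                gsize take = MIN ((gsize) n, MAX_BYTES - text->len);
                g_string_append_len (text, buf, take);
        }

        close (stdout_fd);
        g_spawn_close_pid (pid);

        return g_string_free (text, FALSE);
}

Once the cap is reached the read end of the pipe is closed, so the child hits EOF/SIGPIPE on its next write and stops producing output instead of filling memory.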
> * oasis extractor: Forks & spawns (g_spawn_sync) into an unzip process,
> and the whole output of the execution is stored in a newly allocated
> string in heap. Actually, the stdout of the "unzip -p" command being
> executed is the whole uncompressed file content, which may get too big.
> Solution: either use g_spawn_async_with_pipes() and read the output
> buffered up to a maximum (as in the ps-gz extractor, see below), or use
> libgsf to read the contents chunk by chunk without the need of forking.

The OASIS extractor actually uses the previously described method only to get the 'meta.xml' file contents. The real data is read using another fork & spawn (g_spawn_command_line_sync) calling odt2txt, and the whole output of odt2txt is stored in heap; only after that are the words normalized and limited to the maximum number of words, so it shows the same issue.
> The OASIS extractor actually uses the previously described method only to
> get the 'meta.xml' file contents. The real data is read using another fork
> & spawn (g_spawn_command_line_sync) calling odt2txt, and the whole output
> of odt2txt is stored in heap; only after that are the words normalized and
> limited to the maximum number of words, so it shows the same issue.

This issue is now covered in bug #615858.
> * msoffice/xml extractor: (after the fix for GB#615035) Decompresses the
> Zip file(s) contents using libgsf, and loads the whole decompressed file(s)
> contents into newly allocated memory in heap. Not sure anyway if the
> contents are being indexed at all; I can't see any check for counting words
> up to the configured maximum. The same approach as for OASIS should be
> used, as in the end they're pretty similar (both are Zip files with XML
> docs inside).

This issue is now covered in bug #615948.
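For the Zip-based formats (OASIS, msoffice/xml), the libgsf route mentioned in the first comment could look roughly like the sketch below. This is just an illustration, not the code from the referenced bugs; the member name, the CHUNK_SIZE/MAX_EXTRACT values and the helper name are assumptions.

#include <glib.h>
#include <gsf/gsf-utils.h>
#include <gsf/gsf-input-stdio.h>
#include <gsf/gsf-input.h>
#include <gsf/gsf-infile.h>
#include <gsf/gsf-infile-zip.h>

#define CHUNK_SIZE  4096
#define MAX_EXTRACT (1024 * 1024)  /* hypothetical cap */

/* Sketch only: open one member of a Zip container with libgsf and
 * decompress it chunk by chunk, stopping at MAX_EXTRACT. */
static gchar *
read_zip_member_bounded (const gchar *path, const gchar *member)
{
        GsfInput *file, *stream;
        GsfInfile *zip;
        GString *text;
        guint8 buf[CHUNK_SIZE];

        /* In a real extractor gsf_init() would be called once at startup. */
        gsf_init ();

        file = gsf_input_stdio_new (path, NULL);
        if (!file)
                return NULL;

        zip = gsf_infile_zip_new (file, NULL);
        g_object_unref (file);
        if (!zip)
                return NULL;

        stream = gsf_infile_child_by_name (zip, member);
        if (!stream) {
                g_object_unref (zip);
                return NULL;
        }

        text = g_string_new (NULL);

        while (text->len < MAX_EXTRACT && gsf_input_remaining (stream) > 0) {
                gsf_off_t left = gsf_input_remaining (stream);
                gsize want = MIN ((gsize) MIN (left, (gsf_off_t) CHUNK_SIZE),
                                  (gsize) (MAX_EXTRACT - text->len));

                if (!gsf_input_read (stream, want, buf))
                        break;
                g_string_append_len (text, (const gchar *) buf, want);
        }

        g_object_unref (stream);
        g_object_unref (zip);

        return g_string_free (text, FALSE);
}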
> * msoffice extractor: Reads (chunk by chunk) the whole file contents into
> memory, converts them to UTF-8, and only afterwards are the contents
> normalized and limited to the configured max words. A mixed method of
> normalizing and counting words while reading would probably be less
> heap-consuming and faster, as you could stop when the max words is reached.

This issue is now covered in bug #616158.
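As a rough illustration of the "count while reading" idea from the quoted item (again just a sketch, not the code that went into the bug above; the plain GInputStream source and the helper name are assumptions, since the real extractor reads the document through libgsf):

#include <gio/gio.h>
#include <glib.h>

/* Sketch only: accumulate text chunk by chunk, counting word boundaries
 * as we go and stopping as soon as max_words is reached, instead of
 * loading and normalizing the whole document first. */
static gchar *
accumulate_words_bounded (GInputStream *stream, guint max_words)
{
        GString *text = g_string_new (NULL);
        gchar buf[4096];
        gssize n;
        guint words = 0;
        gboolean in_word = FALSE;

        while (words < max_words &&
               (n = g_input_stream_read (stream, buf, sizeof (buf),
                                         NULL, NULL)) > 0) {
                for (gssize i = 0; i < n && words < max_words; i++) {
                        gboolean is_space = g_ascii_isspace (buf[i]);

                        if (!is_space && !in_word) {
                                in_word = TRUE;       /* a new word starts */
                        } else if (is_space && in_word) {
                                in_word = FALSE;
                                words++;              /* a word just ended */
                        }

                        g_string_append_c (text, buf[i]);
                }
        }

        return g_string_free (text, FALSE);
}

The point is simply that the loop can bail out early; UTF-8 conversion and normalization would of course happen on the same pass in the real extractor.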
> * ps-gz extractor: Forks & spawns (g_spawn_async_with_pipes) into a gunzip
> process. This is actually considerably safer than the g_spawn_sync method,
> as the stdout of the executed process is readable from a file descriptor
> rather than fully stored in heap, so the decompressed file can be read in
> chunks. Anyway, this extractor just extracts the file contents into a temp
> file, with the size of the decompressed file limited to 20 MB (hardcoded),
> and then its contents are extracted with the normal ps extractor. It would
> be much faster if it avoided the need for the temp file.

This issue is now covered in bug #616165.
Improvement in msoffice/excel files now covered in bug #616329
Improvement in msoffice/powerpoint files now covered in bug #616403
I think this bug can be marked as completed, right Aleksander? Was there anything else you had in mind to fix?
Just the OASIS/metadata files fix is missing, which will avoid the need for the unzip command to open those files. Will probably send the patch for that one today.
> * oasis extractor: Forks & spawns (g_spawn_sync) into an unzip process,
> and the whole output of the execution is stored in a newly allocated
> string in heap. Actually, the stdout of the "unzip -p" command being
> executed is the whole uncompressed file content, which may get too big.
> Solution: either use g_spawn_async_with_pipes() and read the output
> buffered up to a maximum (as in the ps-gz extractor, see below), or use
> libgsf to read the contents chunk by chunk without the need of forking.

The last one, OASIS/metadata files, is now covered in bug #616493.
Closed as all issues have been addressed in separate bugs.