Bug 615765 - Unsafe extraction of data when using the heap to store the whole file
Status: RESOLVED FIXED
Product: tracker
Classification: Core
Component: Extractor
Version: 0.9.x
Hardware/OS: Other Linux
Priority: Normal / Severity: normal
Target Milestone: ---
Assigned To: tracker-extractor
QA Contact: Jamie McCracken
Depends on:
Blocks:
 
 
Reported: 2010-04-14 17:02 UTC by Aleksander Morgado
Modified: 2010-04-26 14:01 UTC
See Also:
GNOME target: ---
GNOME version: ---



Description Aleksander Morgado 2010-04-14 17:02:10 UTC
Some extractors may cause problems with big files, as they load the whole file contents into the heap. This can be quite dangerous with file formats which are really compressed files, as you don't know the real compression ratio of the file being decompressed; if the whole decompressed document turns out to be gigabytes in size and is read into memory, it may crash the process (think of malicious 'bomb' files). The following list shows which extractors may be affected by this issue:


 * oasis extractor: Forks & spawns (g_spawn_sync) an unzip process, and the whole output of the execution is stored in a newly allocated string on the heap. The stdout of the "unzip -p" command being executed is the whole uncompressed file content, which may get too big. Solution: either use g_spawn_async_with_pipes() and read the output buffered up to a maximum (as in the ps-gz extractor, see below), or use libgsf to read the contents chunk by chunk without the need to fork (sketches of both approaches follow this list).


 * msoffice extractor: Reads (chunk by chunk) the whole file contents into memory, converts them to UTF-8, and only afterwards normalizes the contents and limits them to the configured maximum number of words. A mixed method of normalizing and counting words while reading would probably be less heap-consuming and faster, as you could stop as soon as the maximum number of words is reached (see the word-counting sketch after this list).


 * msoffice/xml extractor: (after the fix for GB#615035) Decompresses the Zip file(s) contents using libgsf, and loads the whole decompressed file(s) contents into newly allocated memory on the heap. Not sure whether the contents are being indexed at all; I can't see any check counting words up to the configured maximum. The same approach as for OASIS should be used, as in the end they are pretty similar (both are Zip files with XML docs inside); the libgsf sketch after this list applies here too.


 * totem extractor: Forks & spawns (g_spawn_sync) a totem-video-indexer process, and the whole output of the execution is stored in a newly allocated string on the heap. As long as the output of totem-video-indexer is not too big this shouldn't cause any problem, so it is not really a big issue.

 * mplayer extractor: Same as totem, but calling mplayer instead.
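
For reference, a minimal sketch of the bounded piped-spawn approach suggested above (not actual tracker code; the command is whatever the extractor needs to run, and the buffer size and byte cap are made up for illustration):

#include <glib.h>
#include <unistd.h>

#define MAX_EXTRACTED_BYTES (1024 * 1024)   /* illustrative cap */

static gchar *
spawn_and_read_bounded (gchar **argv, GError **error)
{
  GPid child_pid;
  gint stdout_fd;
  GString *output;
  guint8 buffer[4096];
  gssize bytes_read;

  /* Spawn asynchronously so stdout is read through a pipe instead of
   * being collected whole by g_spawn_sync(). */
  if (!g_spawn_async_with_pipes (NULL, argv, NULL,
                                 G_SPAWN_SEARCH_PATH,
                                 NULL, NULL,
                                 &child_pid,
                                 NULL, &stdout_fd, NULL,
                                 error))
    return NULL;

  output = g_string_new (NULL);

  /* Read in small chunks and stop once the cap is reached, so a
   * decompression 'bomb' cannot blow up the heap. */
  while (output->len < MAX_EXTRACTED_BYTES &&
         (bytes_read = read (stdout_fd, buffer, sizeof (buffer))) > 0)
    {
      gsize to_copy = MIN ((gsize) bytes_read,
                           MAX_EXTRACTED_BYTES - output->len);

      g_string_append_len (output, (const gchar *) buffer, to_copy);
    }

  /* A real implementation would also kill or reap the child here if the
   * cap was hit before EOF. */
  close (stdout_fd);
  g_spawn_close_pid (child_pid);

  return g_string_free (output, FALSE);
}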
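
And a sketch of the libgsf alternative, reading a Zip member chunk by chunk without forking (the member name, chunk size and cap are illustrative only, and gsf_init() is assumed to have been called already):

#include <glib.h>
#include <gsf/gsf-utils.h>
#include <gsf/gsf-input-stdio.h>
#include <gsf/gsf-infile.h>
#include <gsf/gsf-infile-zip.h>

#define MAX_EXTRACTED_BYTES (1024 * 1024)   /* illustrative cap */

/* Read one member of a Zip-based document chunk by chunk, stopping at
 * MAX_EXTRACTED_BYTES, instead of loading the whole decompressed stream
 * into the heap at once. */
static gchar *
read_zip_member_bounded (const gchar *filename,
                         const gchar *member_name,
                         GError     **error)
{
  GsfInput *input, *member;
  GsfInfile *infile;
  GString *contents;
  guint8 buffer[8192];

  input = gsf_input_stdio_new (filename, error);
  if (!input)
    return NULL;

  infile = gsf_infile_zip_new (input, error);
  g_object_unref (input);
  if (!infile)
    return NULL;

  member = gsf_infile_child_by_name (infile, member_name);
  g_object_unref (infile);
  if (!member)
    return NULL;

  contents = g_string_new (NULL);

  while (contents->len < MAX_EXTRACTED_BYTES &&
         gsf_input_remaining (member) > 0)
    {
      gsize chunk = MIN (sizeof (buffer),
                         (gsize) gsf_input_remaining (member));

      chunk = MIN (chunk, MAX_EXTRACTED_BYTES - contents->len);

      if (!gsf_input_read (member, chunk, buffer))
        break;

      g_string_append_len (contents, (const gchar *) buffer, chunk);
    }

  g_object_unref (member);

  return g_string_free (contents, FALSE);
}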
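
The normalize-and-count-while-reading idea for the msoffice extractor boils down to a helper along these lines (a sketch only: the function name is made up, and real normalization, UTF-8 validation, unaccenting and so on are left out):

#include <glib.h>

/* Append whitespace-separated words from 'chunk' to 'text', updating
 * *n_words and stopping once max_words is reached.  Calling this per
 * chunk lets the extractor stop reading as soon as the configured
 * maximum is hit, instead of limiting the word count only after the
 * whole file is in memory. */
static gboolean
append_words_bounded (GString     *text,
                      const gchar *chunk,
                      guint       *n_words,
                      guint        max_words)
{
  const gchar *p = chunk;

  while (*p && *n_words < max_words)
    {
      const gchar *start;

      while (*p && g_ascii_isspace (*p))
        p++;
      if (!*p)
        break;

      start = p;
      while (*p && !g_ascii_isspace (*p))
        p++;

      if (text->len > 0)
        g_string_append_c (text, ' ');
      g_string_append_len (text, start, p - start);
      (*n_words)++;
    }

  /* FALSE means: the limit was reached, stop reading further chunks. */
  return *n_words < max_words;
}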

Somewhat safer extractors:

 * ps-gz extractor: Forks & spawns (g_spawn_async_with_pipes) a gunzip process. This is actually quite a bit safer than the g_spawn_sync method, as the stdout of the spawned process is readable through a file descriptor rather than fully stored on the heap, which allows reading the decompressed file in chunks (the pattern shown in the first sketch above). However, this extractor just extracts the file contents into a temp file, limiting the decompressed size to 20 MB (hardcoded), and then extracts its contents with the normal ps extractor. It would be much faster if it avoided the need for the temp file.

 * pdf extractor: This one is probably the safest. It reads the contents of the PDF file page by page, normalizing and counting words in each iteration, and stops either when there are no more pages or when the number of words reaches the configured maximum. It is only added to the list to indicate that all the other extractors should follow this same approach where possible (a poppler-glib sketch follows).
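
For reference, the page-by-page pattern looks roughly like this with poppler-glib (assuming a poppler-glib version where poppler_page_get_text() takes only the page; word counting and normalization are heavily simplified):

#include <glib.h>
#include <poppler.h>

/* Extract text one page at a time and stop as soon as the configured
 * maximum number of words is reached, so the whole document text is
 * never held in memory at once. */
static gchar *
extract_pdf_text_bounded (const gchar *filename,
                          guint        max_words,
                          GError     **error)
{
  PopplerDocument *document;
  GString *text;
  gchar *uri;
  guint n_words = 0;
  gint n_pages, i;

  uri = g_filename_to_uri (filename, NULL, error);
  if (!uri)
    return NULL;

  document = poppler_document_new_from_file (uri, NULL, error);
  g_free (uri);
  if (!document)
    return NULL;

  text = g_string_new (NULL);
  n_pages = poppler_document_get_n_pages (document);

  for (i = 0; i < n_pages && n_words < max_words; i++)
    {
      PopplerPage *page = poppler_document_get_page (document, i);
      gchar *page_text = poppler_page_get_text (page);
      gchar **words = g_strsplit_set (page_text ? page_text : "", " \t\r\n", -1);
      gchar **w;

      /* Naive word handling; the real extractor also normalizes the
       * text while counting. */
      for (w = words; *w && n_words < max_words; w++)
        {
          if (**w == '\0')
            continue;
          if (text->len > 0)
            g_string_append_c (text, ' ');
          g_string_append (text, *w);
          n_words++;
        }

      g_strfreev (words);
      g_free (page_text);
      g_object_unref (page);
    }

  g_object_unref (document);

  return g_string_free (text, FALSE);
}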


Comments are welcome, as I may be missing things here.
Comment 1 Aleksander Morgado 2010-04-15 08:26:18 UTC
>  * oasis extractor: Forks & spawns (g_spawn_sync) into an unzip process, and
> the whole output of the execution is stored in a newly allocated string in
> heap. Actually, the stdout of the "unzip -p" being executed is the whole
> uncompressed file content, which may get too big. Solution: either
> g_spawn_async_with_pipes() and read the output buffered until a max (as in
> ps-gz extractor, see below) or use libgsf to read the contents chunk by chunk
> without the need of forking.
> 

The OASIS extractor actually uses the previously described method only to get the 'meta.xml' file contents. The real data is read using another fork & spawn (g_spawn_command_line_sync) calling odt2txt, and the whole output of odt2txt is stored on the heap; only after that are the words normalized and limited to the maximum number of words, so it shows the same issue.
Comment 2 Aleksander Morgado 2010-04-19 07:59:44 UTC
 
> The OASIS extractor actually uses the previous described method to get the
> 'meta.xml' file contents. The real data is read using another fork & spawn
> (g_spawn_command_line_sync) calling odt2txt, and the whole output of odt2txt is
> stored in heap, and only after that the words are normalized and limited to the
> maximum number of words, so also shows the same issue.

This issue is covered now in bug #615858
Comment 3 Aleksander Morgado 2010-04-19 08:01:05 UTC
>  * msoffice/xml extractor: (after fix for GB#615035) Decompresses the Zip
> file(s) contents using libgsf, and loads the whole decompressed file(s)
> contents in newly allocated memory in heap. Not sure anyway if the contents are
> being indexed at all, can't see any check for counting words up to the maximum
> configured. Same approach as for OASIS should be used, as they're at the end
> pretty similar (both Zip files with XML docs inside).
> 

This issue is covered now in bug #615948
Comment 4 Aleksander Morgado 2010-04-19 10:57:17 UTC
 
>  * msoffice extractor: Reads (chunk by chunk) the whole file contents in
> memory, converts them to UTF-8, and afterwards the contents are normalized and
> limited to the configured max words. A mixed method of normalizing and counting
> words while reading would probably be less heap-consuming and faster, as you
> could stop when the max words is reached.
> 

This issue is covered now in bug #616158
Comment 5 Aleksander Morgado 2010-04-19 12:18:44 UTC
 
>  * ps-gz extractor: Forks & spawns (g_spawn_async_with_pipes) into a gunzip
> process. This actually is quite safer than the g_spawn_sync method, as the
> stdout of the process executed is readable in a file descriptor, not fully
> stored in heap, so this allows reading the decompressed file in chunks. Anyway,
> this extractor just extracts the file contents in a temp file, with a size of
> the decompressed file up to 20Mbytes (hardcoded), and then its contents are
> extracted with the normal ps extractor. It would be much faster if avoided the
> need for the temp file.

This issue is covered now in bug #616165
Comment 6 Aleksander Morgado 2010-04-20 20:02:43 UTC
Improvement in msoffice/excel files now covered in bug #616329
Comment 7 Aleksander Morgado 2010-04-21 14:32:23 UTC
Improvement in msoffice/powerpoint files now covered in bug #616403
Comment 8 Martyn Russell 2010-04-22 06:40:53 UTC
I think this bug can be marked as completed, right Aleksander? Was there anything else you had in mind to fix?
Comment 9 Aleksander Morgado 2010-04-22 07:52:07 UTC
Just the OASIS/metadata files fix is missing, to avoid the need for the unzip command to open those files.

Will probably send the patch for that one today.
Comment 10 Aleksander Morgado 2010-04-22 09:49:01 UTC
>  * oasis extractor: Forks & spawns (g_spawn_sync) into an unzip process, and
> the whole output of the execution is stored in a newly allocated string in
> heap. Actually, the stdout of the "unzip -p" being executed is the whole
> uncompressed file content, which may get too big. Solution: either
> g_spawn_async_with_pipes() and read the output buffered until a max (as in
> ps-gz extractor, see below) or use libgsf to read the contents chunk by chunk
> without the need of forking.
> 

The last one, OASIS/metadata files, is now covered in bug #616493.
Comment 11 Aleksander Morgado 2010-04-26 14:01:40 UTC
Closed as all issues have been addressed in separate bugs.