Bug 615858 - Improve reading OASIS files
Status: RESOLVED FIXED
Product: tracker
Classification: Core
Component: Extractor
Version: 0.9.x
Hardware: Other Linux
Importance: Normal normal
Target Milestone: ---
Assigned To: tracker-extractor
QA Contact: Jamie McCracken
Depends on:
Blocks:
 
 
Reported: 2010-04-15 15:16 UTC by Aleksander Morgado
Modified: 2010-04-15 16:50 UTC
See Also:
GNOME target: ---
GNOME version: ---


Attachments
Improved OASIS extractor (4.85 KB, patch)
2010-04-15 15:21 UTC, Aleksander Morgado
Review status: none
Updated patch (6.14 KB, patch)
2010-04-15 16:37 UTC, Aleksander Morgado
Review status: accepted-commit_now

Description Aleksander Morgado 2010-04-15 15:16:28 UTC
As per bug #615765, the contents of OASIS files are currently read in the following way:
 * Fork and spawn an odt2txt process, and wait for it to finish.
 * Read the whole stdout of the child process, however large it is, into a heap-allocated string.
 * Normalize the contents of the whole string and truncate it to the configured maximum number of words.

This can be improved in the following way:
 * Fork and spawn an odt2txt process, without waiting for the child to finish.
 * Read the child's stdout in buffered chunks, up to a predefined maximum number of bytes.
 * After each buffered read, normalize the chunk and count the normalized words.
 * Stop reading when any of the following holds:
  a) There is no more content to read from stdout
  b) The maximum number of bytes to read has been reached (1 MB, for example)
  c) The maximum number of words to read (from the configuration) has been reached
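
The proposed read loop can be sketched roughly as follows. This is not the attached patch (which targets the tracker extractor itself), only a minimal Python illustration of the three stop conditions. The function name `read_limited_text`, its parameters, and the use of plain whitespace splitting as a stand-in for the real normalization step are all assumptions made for the example.

```python
import codecs
import subprocess


def read_limited_text(command, max_words, max_bytes, chunk_size=4096):
    """Run `command` and collect at most `max_words` whitespace-separated
    words from its stdout, reading no more than `max_bytes` bytes.

    Hypothetical helper: whitespace splitting stands in for the real
    normalization performed by the extractor.
    """
    # Incremental decoder so a multi-byte character split across two
    # chunks is decoded correctly.
    decoder = codecs.getincrementaldecoder("utf-8")(errors="replace")
    words = []
    pending = ""      # partial word carried over to the next chunk
    bytes_read = 0

    # Spawn the child without waiting for it to finish.
    proc = subprocess.Popen(command, stdout=subprocess.PIPE)
    try:
        # Conditions (b) and (c): byte limit and word limit.
        while len(words) < max_words and bytes_read < max_bytes:
            chunk = proc.stdout.read(min(chunk_size, max_bytes - bytes_read))
            if not chunk:
                break  # condition (a): nothing left on stdout
            bytes_read += len(chunk)
            text = pending + decoder.decode(chunk)
            parts = text.split()
            # If the chunk ended mid-word, hold the last token back.
            if parts and not text[-1].isspace():
                pending = parts.pop()
            else:
                pending = ""
            words.extend(parts)
        if pending:
            words.append(pending)
    finally:
        # The child may still be producing output; stop it and reap it.
        proc.stdout.close()
        proc.kill()
        proc.wait()
    return " ".join(words[:max_words])
```

A call such as `read_limited_text(["odt2txt", "file.odt"], max_words=1000, max_bytes=1024 * 1024)` would then read at most 1 MB of converter output, assuming `odt2txt` is installed and `file.odt` exists. Because the child is no longer read to completion, memory use is bounded by the chunk size and the word limit rather than by the size of the document.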
Comment 1 Aleksander Morgado 2010-04-15 15:21:34 UTC
Created attachment 158821
Improved OASIS extractor
Comment 2 Aleksander Morgado 2010-04-15 16:37:53 UTC
Created attachment 158825
Updated patch

Reindented, and made the maximum number of bytes to read non-hardcoded. It is now computed as:
 3 * max_words * max_word_length
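
As a worked example of that formula, the sketch below computes the byte limit for one pair of settings. The function name and the sample values (10000 words, 30 characters per word) are hypothetical, chosen only to show the arithmetic. The thread does not say why the factor is 3.

```python
def max_bytes_to_read(max_words, max_word_length):
    """Byte limit derived from the configured word limits,
    using the formula given in the comment above."""
    return 3 * max_words * max_word_length


# Hypothetical configuration values, for illustration only.
limit = max_bytes_to_read(max_words=10000, max_word_length=30)
print(limit)  # 900000
```

With these sample values the limit is a little under 1 MB, which is in the same range as the fixed limit mentioned as an example in the description.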
Comment 3 Philip Van Hoof 2010-04-15 16:41:41 UTC
Reviewed, please push to master
Comment 4 Aleksander Morgado 2010-04-15 16:50:54 UTC
Pushed.