Bug 169758 - .doc filter is taking a very long time to process a large document
Status: RESOLVED FIXED
Product: beagle
Classification: Other
Component: General
Version: 0.0.x
Hardware: Other All
Importance: Normal normal
Target Milestone: Milestone 2
Assigned To: Veerapuram Varadhan
Reported: 2005-03-09 19:17 UTC by Jon Trowbridge
Modified: 2005-04-05 15:19 UTC

Description Jon Trowbridge 2005-03-09 19:17:40 UTC
Please describe the problem:
The document is a 364k file, and beagle-extract-content reports:
fixme:appname = Microsoft Word 10.1
fixme:template = Normal
fixme:revisionnumber = 7
fixme:page-count = 265
fixme:word-count = 34894
(Unfortunately I cannot provide the document in question.)

It takes a full minute for beagle-extract-content to run to completion.  The
above metadata appears immediately.

I added code to the FilterDOC.IndexText method to print debug spew every 100 or
so tokens.  The debug messages came very quickly at first, and gradually slowed
down as it processed the document.  Something must not be scaling very well.
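
The kind of instrumentation described above can be sketched as follows. This is only an illustration in Python, not the actual Beagle code: `time_batches`, `process`, and the injectable `clock` parameter are hypothetical names, and the quadratic `slow_process` stand-in is an assumption used to produce a visible slowdown, not the known cause of this bug.

```python
import time
from typing import Callable, Iterable, List


def time_batches(
    tokens: Iterable,
    process: Callable[[object], None],
    batch_size: int = 100,
    clock: Callable[[], float] = time.perf_counter,
) -> List[float]:
    """Run `process` on every token and return the elapsed time of each
    consecutive batch of `batch_size` tokens (the last batch may be partial)."""
    timings: List[float] = []
    count = 0
    start = clock()
    for token in tokens:
        process(token)
        count += 1
        if count % batch_size == 0:
            now = clock()
            timings.append(now - start)
            start = now
    if count % batch_size != 0:
        # Record the trailing partial batch.
        timings.append(clock() - start)
    return timings


if __name__ == "__main__":
    seen: List[object] = []

    def slow_process(token: object) -> None:
        # Hypothetical stand-in: inserting at the front of a list costs
        # time proportional to its length, so later batches take longer.
        seen.insert(0, token)

    for i, secs in enumerate(time_batches(range(20000), slow_process, batch_size=5000)):
        print(f"batch {i}: {secs:.4f}s")
```

Roughly flat per-batch times suggest linear behaviour; times that keep growing from batch to batch, as reported here, suggest that the cost per token depends on how much has already been processed.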

A good first step would be to try to reproduce it with other large documents.
The file I discuss above is, unfortunately, the only large .doc file I have lying
around.

Comment 1 Veerapuram Varadhan 2005-03-10 12:45:38 UTC
Fixed in CVS.  I have tested the fix against Bruce Eckel's TICPP ebook
(approx. 600 pages), and it now takes between 93 and 110 seconds, compared
to 3400+ seconds before the fix.

Jon: Can you verify this fix against your special ;-) doc?
Comment 2 Christian Kirbach 2005-03-10 12:48:11 UTC
*** Bug 169822 has been marked as a duplicate of this bug. ***
Comment 3 Jon Trowbridge 2005-03-10 23:05:00 UTC
Is any more optimization possible?  Here is a (much larger) document that takes
two minutes to process with beagle-extract-content:

http://www.trowbridge.org/c-sharp-standard.doc
Comment 4 Jon Trowbridge 2005-04-05 05:47:43 UTC
Varadhan: Are there any other easy optimizations left?  If not, feel free to
close this bug.
Comment 5 Veerapuram Varadhan 2005-04-05 15:19:45 UTC
Jon: Not really, though I am working on a different optimization, as mentioned
in my commit message.  Closing this bug. ;)