GNOME Bugzilla – Bug 169758
.doc filter is taking a very long time to process a large document
Last modified: 2005-04-05 15:19:45 UTC
Please describe the problem:
It is a 364k file and beagle-extract-content reports:

fixme:appname = Microsoft Word 10.1
fixme:template = Normal
fixme:revisionnumber = 7
fixme:page-count = 265
fixme:word-count = 34894

(Unfortunately I cannot provide the document in question.)

It takes a full minute for beagle-extract-content to run to completion. The above metadata appears immediately. I added code to the FilterDOC.IndexText method to print debug spew every 100 or so tokens. The debug messages came very quickly at first, and gradually slowed down as it processed the document. Something must not be scaling very well.

A good first step would be to try to reproduce it with other large documents. The file I discuss above is, fortunately, the only large .doc file I have lying around.
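For anyone trying to reproduce this with another filter or document, the "print something every 100 tokens" measurement described in the report can be sketched roughly as below. This is only an illustration in Python, not Beagle's code: `batch_timings`, `slowdown_ratio`, and the `handle` callback are hypothetical names, and the sketch says nothing about what actually causes the slowdown in FilterDOC. It records how long each batch of tokens takes, so that a steadily growing per-batch time (cost that grows faster than linearly with document size) shows up as a number rather than an impression.

```python
import time
from typing import Callable, Iterable, List


def batch_timings(
    tokens: Iterable[object],
    handle: Callable[[object], None],
    batch_size: int = 100,
    clock: Callable[[], float] = time.perf_counter,
) -> List[float]:
    """Return the time spent on each consecutive full batch of tokens.

    `handle` stands in for whatever per-token work is being measured
    (hypothetical; in the report this would be the filter's token loop).
    A trailing partial batch is not reported.
    """
    timings: List[float] = []
    start = clock()
    count = 0
    for token in tokens:
        handle(token)
        count += 1
        if count % batch_size == 0:
            now = clock()
            timings.append(now - start)  # duration of this batch only
            start = now
    return timings


def slowdown_ratio(timings: List[float]) -> float:
    """Ratio of the last batch's duration to the first batch's duration.

    A value near 1 suggests roughly constant per-token cost; a value well
    above 1 suggests the per-token cost grows as more input is processed.
    Returns 1.0 when there is not enough data to compare.
    """
    if len(timings) < 2 or timings[0] <= 0:
        return 1.0
    return timings[-1] / timings[0]
```

Passing the clock in as a parameter is a design choice made here so the helper can be checked with a fake clock; with real input one would simply call `batch_timings(tokens, handle)` and look at `slowdown_ratio` of the result.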
Fixed in CVS. I have tested the fix against Bruce Eckel's TICPP ebook (approx. 600 pages), and it now takes between 93 and 110 seconds, compared to 3400+ seconds before. Jon: Can you verify this fix against your special ;-) doc?
*** Bug 169822 has been marked as a duplicate of this bug. ***
Is any more optimization possible? Here is a (much larger) document that takes two minutes to process with beagle-extract-content: http://www.trowbridge.org/c-sharp-standard.doc
Varadhan: Are there any other easy optimizations left? If not, feel free to close this bug.
Jon: Not really, though I am working on a different optimization, as mentioned in my commit message. Closing this bug. ;)