Bug 169758 – .doc filter is taking a very long time to process a large document



Summary:	.doc filter is taking a very long time to process a large document


Status:	RESOLVED FIXED

Product:	beagle
Classification:	Other
Component:	General
Version:	0.0.x
Hardware:	Other All

Importance:	Normal normal
Target Milestone:	Milestone 2
Assigned To:	Veerapuram Varadhan
QA Contact:	Veerapuram Varadhan


Reported:	2005-03-09 19:17 UTC by Jon Trowbridge
Modified:	2005-04-05 15:19 UTC


Description Jon Trowbridge 2005-03-09 19:17:40 UTC

Please describe the problem:
It is a 364k file and beagle-extract-content reports:
fixme:appname = Microsoft Word 10.1
fixme:template = Normal
fixme:revisionnumber = 7
fixme:page-count = 265
fixme:word-count = 34894
(Unfortunately I cannot provide the document in question.)

It takes a full minute for beagle-extract-content to run to completion.  The
above metadata appears immediately.

I added code to the FilterDOC.IndexText method to print debug spew every 100 or
so tokens.  The debug messages came very quickly at first, and gradually slowed
down as it processed the document.  Something must not be scaling very well.
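The kind of instrumentation described above can be sketched as follows. This is only an illustration in Python, not the actual FilterDOC code (Beagle is written in C#); the names batch_timings, slows_down, and process are hypothetical, and the 100-token batch size simply mirrors the "every 100 or so tokens" mentioned above. It records how long each batch of tokens takes, so a steadily growing per-batch time would point at something scaling worse than linearly.

```python
import time
from typing import Callable, Iterable, List


def batch_timings(tokens: Iterable[str],
                  process: Callable[[str], None],
                  batch_size: int = 100,
                  clock: Callable[[], float] = time.perf_counter) -> List[float]:
    """Return the seconds spent on each full batch of `batch_size` tokens.

    `process` stands in for whatever the filter does per token (hypothetical).
    `clock` is injectable so the function can be checked deterministically.
    """
    timings: List[float] = []
    start = clock()
    for count, token in enumerate(tokens, start=1):
        process(token)
        if count % batch_size == 0:
            now = clock()
            timings.append(now - start)  # time taken by this batch only
            start = now
    return timings


def slows_down(timings: List[float], factor: float = 2.0) -> bool:
    """Rough heuristic: are the last batches much slower than the first ones?

    Compares the mean of the first quarter of batches with the mean of the
    last quarter; the factor of 2.0 is an arbitrary assumption.
    """
    quarter = max(1, len(timings) // 4)
    head = sum(timings[:quarter]) / quarter
    tail = sum(timings[-quarter:]) / quarter
    return tail > factor * head
```

If per-batch times stay roughly flat, the work is linear in the number of tokens; if they climb steadily, as the debug output here suggested, some per-token step probably depends on how much has already been processed.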

A good first step would be to try to reproduce it with other large documents. 
The file I discuss above is, fortunately, the only large .doc file I have lying
around.


Comment 1 Veerapuram Varadhan 2005-03-10 12:45:38 UTC

Fixed in CVS.  I have tested the fix against Bruce Eckel's TICPP ebook
(approx. 600 pages), and it now takes between 93 and 110 seconds, as compared
to 3400+ seconds before the fix.

Jon: Can you verify this fix against your special ;-) doc?

Comment 2 Christian Kirbach 2005-03-10 12:48:11 UTC

*** Bug 169822 has been marked as a duplicate of this bug. ***

Comment 3 Jon Trowbridge 2005-03-10 23:05:00 UTC

Is any more optimization possible?  Here is a (much larger) document that takes
two minutes to process with beagle-extract-content:

http://www.trowbridge.org/c-sharp-standard.doc

Comment 4 Jon Trowbridge 2005-04-05 05:47:43 UTC

Varadhan: Are there any other easy optimizations left?  If not, feel free to
close this bug.

Comment 5 Veerapuram Varadhan 2005-04-05 15:19:45 UTC

Jon: Not really, though I am working on a different optimization, as mentioned
in my commit message.  Closing this bug. ;)