Bug 323276 - Crash While Indexing .doc File
Status: RESOLVED FIXED
Product: beagle
Classification: Other
Component: General
Version: unspecified
Hardware: Other All
Importance: Normal normal
Target Milestone: ---
Assigned To: Veerapuram Varadhan
QA Contact: Beagle Bugs
Duplicates: 330533
Depends on:
Blocks:
 
 
Reported: 2005-12-05 15:35 UTC by David Richards
Modified: 2007-01-30 21:50 UTC
See Also:
GNOME target: ---
GNOME version: ---


Attachments
Offending Document (182.50 KB, application/msword)
2005-12-05 15:36 UTC, David Richards
Fix for the crash (6.61 KB, patch)
2005-12-05 17:58 UTC, Veerapuram Varadhan

Description David Richards 2005-12-05 15:35:18 UTC
Please describe the problem:
Current CVS Beagle crashes when trying to index the attached .doc file.

This is what displays on the command line:

Debug: +file:///home/largo/pd/lesson2.doc
Error: DocumentSummaryInformationStream not found in /home/largo/pd/lesson2.doc

=================================================================
Got a SIGSEGV while executing native code. This usually indicates
a fatal error in the mono runtime or one of the native libraries 
used by your application.
=================================================================

Stacktrace:



Comment 1 David Richards 2005-12-05 15:36:32 UTC
Created attachment 55641
Offending Document
Comment 2 David Richards 2005-12-05 15:56:22 UTC
Possibly related?  Another .doc failed:

Debug: +file:///home/largo/es/pocts/ERP.doc
Error: DocumentSummaryInformationStream not found in /home/largo/es/pocts/ERP.doc

=================================================================
Got a SIGSEGV while executing native code. This usually indicates
a fatal error in the mono runtime or one of the native libraries 
used by your application.
=================================================================

Stacktrace:

in <0x4> (wrapper managed-to-native) Beagle.Filters.FilterDOC:wv1_glue_init_doc_parsing (string,Beagle.Filters.FilterDOC/TextHandlerCallback)
in <0xfffffd87> (wrapper managed-to-native) Beagle.Filters.FilterDOC:wv1_glue_init_doc_parsing (string,Beagle.Filters.FilterDOC/TextHandlerCallback)
in <0x8f> Beagle.Filters.FilterDOC:DoPull ()
in <0x2e> Beagle.Daemon.Filter:Pull ()
in <0x2f> Beagle.Daemon.Filter:PullFromArray (System.Collections.ArrayList,System.Text.StringBuilder)
in <0x26> Beagle.Daemon.Filter:PullTextCarefully (System.Collections.ArrayList,System.Text.StringBuilder)
in <0x13> Beagle.Daemon.Filter:PullText (System.Text.StringBuilder)
in <0xffffff1f> (wrapper delegate-invoke) System.MulticastDelegate:invoke_bool_StringBuilder (System.Text.StringBuilder)
in <0x2f> Beagle.Util.PullingReader:DoPull (int)
in <0x15> Beagle.Util.PullingReader:Read (char[],int,int)
in <0xcf> Lucene.Net.Analysis.Standard.FastCharStream:Refill ()
in <0x1b> Lucene.Net.Analysis.Standard.FastCharStream:ReadChar ()
in <0x18> Lucene.Net.Analysis.Standard.FastCharStream:BeginToken ()
in <0x43> Lucene.Net.Analysis.Standard.StandardTokenizerTokenManager:GetNextToken ()
in <0x4a> Lucene.Net.Analysis.Standard.StandardTokenizer:Jj_ntk ()
in <0x24> Lucene.Net.Analysis.Standard.StandardTokenizer:Next ()
in <0x33> Lucene.Net.Analysis.Standard.StandardFilter:Next ()
in <0x15> Lucene.Net.Analysis.LowerCaseFilter:Next ()
in <0x13> Lucene.Net.Analysis.StopFilter:Next ()
in <0x6c> Beagle.Daemon.NoiseFilter:Next ()
in <0x14> Lucene.Net.Analysis.PorterStemFilter:Next ()
in <0x2f8> Lucene.Net.Index.DocumentWriter:InvertDocument (Lucene.Net.Documents.Document)
in <0x1e4> Lucene.Net.Index.DocumentWriter:AddDocument (string,Lucene.Net.Documents.Document)
in <0x73> Lucene.Net.Index.IndexWriter:AddDocument (Lucene.Net.Documents.Document,Lucene.Net.Analysis.Analyzer)
in <0x17> Lucene.Net.Index.IndexWriter:AddDocument (Lucene.Net.Documents.Document)
in <0xd61> Beagle.Daemon.LuceneIndexingDriver:Flush_Unlocked (Beagle.Daemon.IndexerRequest)
in <0x2a> Beagle.Daemon.LuceneIndexingDriver:Flush (Beagle.Daemon.IndexerRequest)
in <0x69> Beagle.Daemon.BuildIndex:FlushIndexer (Beagle.Daemon.IIndexer,Beagle.Daemon.IndexerRequest)
in <0x2ec> Beagle.Daemon.BuildIndex:IndexWorker ()
in <0x67> (wrapper delegate-invoke) System.MulticastDelegate:invoke_void ()
in <0x1f> Beagle.Util.ExceptionHandlingThread:ThreadStarted ()
in <0xff738700> (wrapper delegate-invoke) System.MulticastDelegate:invoke_void ()
in <0x71d3db7> (wrapper runtime-invoke) System.Object:runtime_invoke_void (object,intptr,intptr,intptr)

Native stacktrace:

        mono(mono_handle_native_sigsegv+0xba) [0x81471da]
        mono [0x81354cf]
        /lib/libpthread.so.0 [0xa3d3e0]
        /usr/local/lib/libwv-1.0.so.3(wvDecodeSimple+0x11ce) [0x131566a]
        /usr/local/lib/libwv-1.0.so.3(wvText+0x3b) [0x131dc5b]
        /usr/local/lib/beagle/libbeagleglue.so(wv1_glue_init_doc_parsing+0x126) [0x1fdc1a]
        [0x6dee7a9]
        [0x6dee4f0]
        [0x81f720f]
        [0x81f7068]
        [0x81f6f87]
        [0x81f6f44]
        [0x81f6ef7]
        [0x81f6dd8]
        [0x81f6cde]
        [0x81f28f0]
        [0x81f27b4]
        [0x81f2791]
        [0x81f24dc]
        [0x81f244b]
        [0x81f21f5]
        [0x81f1fcc]
        [0x81f1f6e]
        [0x81f1f0c]
        [0x81f6c9d]
        [0x81f6bfd]
        [0x81f0619]
        [0x81eec0d]
        [0x81ee584]
        [0x81ee508]
        [0x7025fba]
        [0x7024e3b]
        [0x7024a52]
        [0x182a065]
        [0x1828f28]
        [0x1828f58]
        [0x1828f20]
        [0xf615f1]
        mono [0x8135380]
        mono(mono_runtime_invoke+0x27) [0x80d42b7]
        mono(mono_runtime_delegate_invoke+0x3b) [0x80d4a5b]
        mono [0x8098d8b]
        mono [0x8102db7]
        mono [0x810d835]
        /lib/libpthread.so.0 [0xa37b80]
        /lib/libc.so.6(__clone+0x5e) [0x74eb9e]
Comment 3 Veerapuram Varadhan 2005-12-05 16:27:22 UTC
Which version of wv1 do you have?
Comment 4 David Richards 2005-12-05 17:00:34 UTC
wv1 version is wv-1.0.3

I just saw that 1.2 has come out...is that a requirement?  ./configure didn't
report that.
Comment 5 Veerapuram Varadhan 2005-12-05 17:32:03 UTC
OK, crash fixed.  Will upload the patch in a minute or two.

wv 1.2 is not a hard requirement.  The Beagle filter is backward compatible. ;-)
However, using wv-1.2 is better than wv-1.0.x.
Comment 6 Veerapuram Varadhan 2005-12-05 17:58:46 UTC
Created attachment 55647
Fix for the crash

The attached patch fixes the crash.  However, I still have to track down the
double-free error that occurs while freeing the wvParseStruct.
Comment 7 Joe Shaw 2006-02-09 18:32:24 UTC
*** Bug 330533 has been marked as a duplicate of this bug. ***
Comment 8 Joe Shaw 2006-02-09 18:33:45 UTC
Varadhan, it's been two months and this is still popping up from time to time.  What's the status of the patch?
Comment 9 Veerapuram Varadhan 2006-02-13 17:09:39 UTC
@joe:  The patch works fine.  I still have to run a memory checker on it, though.  Let me verify it today and close this bug today as well.  Sorry for holding this up for so long. :-)
Comment 10 Debajyoti Bera 2006-02-13 17:16:14 UTC
A related question on filter behaviour: if a filter fails by throwing an exception, is that handled gracefully, or does it take the whole indexhelper down with it? There are all kinds of files out there; it doesn't take much effort to find a file that will crash a filter!
Comment 11 Veerapuram Varadhan 2006-02-13 17:50:19 UTC
Most cases are handled so that the file is skipped gracefully.  Filters should probably wrap *text pulling* operations in a try-catch and fail gracefully.  I guess calling "finished()" inside the catch in IndexText() would probably keep this one from crashing beagle.
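
A minimal sketch of the "wrap text pulling in a try-catch" idea, written in Python rather than Beagle's C#. The names `FlakyFilter`, `do_pull` and `pull_text_carefully` are hypothetical stand-ins, not Beagle's actual classes or methods; the point is only that an exception during pulling marks the filter as finished and the caller keeps whatever text was already extracted.

```python
class FlakyFilter:
    """Hypothetical stand-in for a document filter (not Beagle's Filter class)."""

    def __init__(self, chunks):
        # Each item is either a text chunk or an Exception to raise when pulled.
        self._chunks = iter(chunks)
        self.finished = False

    def do_pull(self):
        chunk = next(self._chunks, None)
        if chunk is None:
            self.finished = True
            return None
        if isinstance(chunk, Exception):
            raise chunk
        return chunk


def pull_text_carefully(flt, log):
    """Pull text until the filter finishes; on an exception, log it and stop."""
    parts = []
    while not flt.finished:
        try:
            chunk = flt.do_pull()
        except Exception as exc:
            # Fail gracefully: record the problem and mark the filter finished
            # so the caller can move on to the next file.
            log.append(f"filter failed, skipping rest of file: {exc}")
            flt.finished = True
            break
        if chunk is not None:
            parts.append(chunk)
    return "".join(parts)
```

A guard like this only helps with exceptions raised in managed code; a crash inside a native library still ends the process before any handler runs.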
Comment 12 David Richards 2006-02-13 17:54:47 UTC
I would like to see all of these filters do just that.  Report the error and keep going.  For me, I would never see them because they run automated at 2am.  I would never know there was a problem.  Part of my job still is to run these by hand and watch them run every few weeks on the command line to see if they are going ok.
Comment 13 Joe Shaw 2006-02-13 19:29:12 UTC
Exceptions in filters are handled gracefully, and the file is skipped.

If there is a segfault, there's nothing we can do.  That's the benefit we get from using managed (C#) code whenever possible. :)
Comment 14 Veerapuram Varadhan 2006-02-13 22:51:25 UTC
OK.  The bug is actually triggered from wv, as it appears to try to read past the EOF.  I have filed a bug in bugzilla.abisource.com:  http://bugzilla.abisource.com/show_bug.cgi?id=10025
Comment 15 Debajyoti Bera 2006-04-12 01:01:54 UTC
> Ok.  The bug actually is triggered from wv, as it (appears to be) tries to read
> past the EOF.  I have filed a bug in bugzilla.abisource.com. 
> http://bugzilla.abisource.com/show_bug.cgi?id=10025

I am all confused (nothing new :P).
The bug is in wv, I understand. Then what is the status of the attached patch? Is it a workaround, an earlier attempt, or obsolete?
I found something in the IRC logs:
[21:35] <varadhan> and since the entire wv1 logic is based on callback mechanism.. I guess, I will manually maintain a total size and try to match it with current charpos that is being read.

Any update on this?
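
For illustration only, the idea in the IRC quote (keep a running position and compare it against a known total size inside the callback) could look roughly like the Python sketch below. `make_guarded_handler`, its parameters, and the convention that the callback returns `False` to ask the parser to stop are all assumptions made for this example; they are not wv's or Beagle's API.

```python
def make_guarded_handler(total_size, sink):
    """Return a text callback that never accepts more than total_size characters."""
    state = {"pos": 0}  # running character position, maintained manually

    def handler(text):
        remaining = total_size - state["pos"]
        if remaining <= 0:
            # Already at the expected end of the data: refuse further input.
            return False
        accepted = text[:remaining]
        sink.append(accepted)
        state["pos"] += len(accepted)
        # True means "keep going", False means "stop parsing".
        return state["pos"] < total_size

    return handler
```

In this sketch the parser would have to honour the `False` return value; whether a callback-driven parser can be stopped that way depends on the library.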
Comment 16 Joe Shaw 2006-04-14 15:41:02 UTC
I don't know if this is directly related, but our OLE filter was opening files using GSF's mmap() interface, which maps the files as MAP_SHARED rather than MAP_PRIVATE.  This means that if the file changed on disk, it would cause problems if the filter were in the middle of reading from it.  This particularly seems to be a problem when saving these files with OpenOffice, which triggers two inotify events, and causes the file to be indexed twice.

I've just checked this in, so it'd be good if people could test it.
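
As a small illustration of the shared versus private distinction, here is a Python sketch using the standard `mmap` module, where `ACCESS_WRITE` gives a shared mapping and `ACCESS_COPY` gives a private, copy-on-write one. It is not the GSF code path, and the helper names are made up for the example. It only demonstrates one half of the difference: a write through a private mapping stays in memory, while a write through a shared mapping reaches the file. What a private mapping sees when another process rewrites the file is left unspecified by POSIX, so this sketch does not try to show it.

```python
import mmap
import os
import tempfile


def write_through_mapping(path, access):
    """Overwrite the first byte via a mapping, then return the file's contents."""
    with open(path, "r+b") as f:
        with mmap.mmap(f.fileno(), 0, access=access) as mm:
            mm[0:1] = b"X"
            if access == mmap.ACCESS_WRITE:
                mm.flush()  # push the shared mapping's change to the file
    with open(path, "rb") as f:
        return f.read()


def demo():
    fd, path = tempfile.mkstemp()
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(b"hello")
        # Private (copy-on-write) mapping: the file on disk is untouched.
        private_result = write_through_mapping(path, mmap.ACCESS_COPY)
        # Shared mapping: the change is written back to the file.
        shared_result = write_through_mapping(path, mmap.ACCESS_WRITE)
        return private_result, shared_result
    finally:
        os.remove(path)
```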
Comment 17 Debajyoti Bera 2006-04-14 17:38:58 UTC
In case the issue still persists, use an external filter. I just tried antiword on the attached "buggy" file.
Using this in external-filters.xml,
<filter>
   <mimetype>application/msword</mimetype>
   <extension>.doc</extension>
   <command>antiword</command>
   <arguments>-t %s</arguments>
</filter>

and beagle-extract-content worked fine on the crasher doc file.
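
The reason an external filter sidesteps this kind of crash is that the parser runs in a separate process, so a segfault there cannot take the indexer down. A rough Python sketch of that pattern follows; `run_external_filter` is a hypothetical helper, and the way it substitutes `%s` with the file path is an assumption for the example, not a description of how Beagle reads external-filters.xml.

```python
import subprocess
import sys


def run_external_filter(command, arguments, path, timeout=30):
    """Run an external text extractor; return its output, or None on any failure."""
    # Assumed convention: a bare "%s" argument is replaced by the file path.
    argv = [command] + [path if arg == "%s" else arg for arg in arguments.split()]
    try:
        proc = subprocess.run(argv, capture_output=True, timeout=timeout)
    except (OSError, subprocess.TimeoutExpired) as exc:
        print(f"external filter could not run: {exc}", file=sys.stderr)
        return None
    if proc.returncode != 0:
        # A non-zero status covers both ordinary errors and death by a signal
        # such as SIGSEGV (reported as a negative return code on POSIX).
        print(f"external filter failed with status {proc.returncode}", file=sys.stderr)
        return None
    return proc.stdout.decode("utf-8", errors="replace")
```

With the configuration above this would be called as `run_external_filter("antiword", "-t %s", "/path/to/file.doc")`; a crash in the child simply yields `None` and the caller can skip the file.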
Comment 18 Joe Shaw 2006-05-01 21:36:09 UTC
Dave, is this still an issue (not using the external filter workaround)?  I think the fix I did in comment #16 may have solved it.
Comment 19 David Richards 2006-05-02 17:42:54 UTC
Confirmed that it's fixed now in CVS.  When wv1 fails, the script now seems to just continue to the next item.  The run for the user that previously could not complete has now completed.  I now have .doc support for the whole city, thanks.
Comment 20 David Richards 2006-05-02 17:43:18 UTC
Marking as closed.
Comment 21 Joe Shaw 2007-01-30 21:50:20 UTC
These types of bugs are handled more gracefully in 0.2.15; setting up an external filter with antiword shouldn't be necessary.