GNOME Bugzilla – Bug 323276
Crash While Indexing .doc File
Last modified: 2007-01-30 21:50:20 UTC
Please describe the problem:
Current CVS Beagle crashes when trying to index the attached .doc file. This is what displays on the command line:

Debug: +file:///home/largo/pd/lesson2.doc
Error: DocumentSummaryInformationStream not found in /home/largo/pd/lesson2.doc
=================================================================
Got a SIGSEGV while executing native code. This usually indicates
a fatal error in the mono runtime or one of the native libraries
used by your application.
=================================================================
Stacktrace:
Created attachment 55641 [details] Offending Document
Possibly related? Another .doc failed:

Debug: +file:///home/largo/es/pocts/ERP.doc
Error: DocumentSummaryInformationStream not found in /home/largo/es/pocts/ERP.doc
=================================================================
Got a SIGSEGV while executing native code. This usually indicates
a fatal error in the mono runtime or one of the native libraries
used by your application.
=================================================================
Stacktrace:
in <0x4> (wrapper managed-to-native) Beagle.Filters.FilterDOC:wv1_glue_init_doc_parsing (string,Beagle.Filters.FilterDOC/TextHandlerCallback)
in <0xfffffd87> (wrapper managed-to-native) Beagle.Filters.FilterDOC:wv1_glue_init_doc_parsing (string,Beagle.Filters.FilterDOC/TextHandlerCallback)
in <0x8f> Beagle.Filters.FilterDOC:DoPull ()
in <0x2e> Beagle.Daemon.Filter:Pull ()
in <0x2f> Beagle.Daemon.Filter:PullFromArray (System.Collections.ArrayList,System.Text.StringBuilder)
in <0x26> Beagle.Daemon.Filter:PullTextCarefully (System.Collections.ArrayList,System.Text.StringBuilder)
in <0x13> Beagle.Daemon.Filter:PullText (System.Text.StringBuilder)
in <0xffffff1f> (wrapper delegate-invoke) System.MulticastDelegate:invoke_bool_StringBuilder (System.Text.StringBuilder)
in <0x2f> Beagle.Util.PullingReader:DoPull (int)
in <0x15> Beagle.Util.PullingReader:Read (char[],int,int)
in <0xcf> Lucene.Net.Analysis.Standard.FastCharStream:Refill ()
in <0x1b> Lucene.Net.Analysis.Standard.FastCharStream:ReadChar ()
in <0x18> Lucene.Net.Analysis.Standard.FastCharStream:BeginToken ()
in <0x43> Lucene.Net.Analysis.Standard.StandardTokenizerTokenManager:GetNextToken ()
in <0x4a> Lucene.Net.Analysis.Standard.StandardTokenizer:Jj_ntk ()
in <0x24> Lucene.Net.Analysis.Standard.StandardTokenizer:Next ()
in <0x33> Lucene.Net.Analysis.Standard.StandardFilter:Next ()
in <0x15> Lucene.Net.Analysis.LowerCaseFilter:Next ()
in <0x13> Lucene.Net.Analysis.StopFilter:Next ()
in <0x6c> Beagle.Daemon.NoiseFilter:Next ()
in <0x14> Lucene.Net.Analysis.PorterStemFilter:Next ()
in <0x2f8> Lucene.Net.Index.DocumentWriter:InvertDocument (Lucene.Net.Documents.Document)
in <0x1e4> Lucene.Net.Index.DocumentWriter:AddDocument (string,Lucene.Net.Documents.Document)
in <0x73> Lucene.Net.Index.IndexWriter:AddDocument (Lucene.Net.Documents.Document,Lucene.Net.Analysis.Analyzer)
in <0x17> Lucene.Net.Index.IndexWriter:AddDocument (Lucene.Net.Documents.Document)
in <0xd61> Beagle.Daemon.LuceneIndexingDriver:Flush_Unlocked (Beagle.Daemon.IndexerRequest)
in <0x2a> Beagle.Daemon.LuceneIndexingDriver:Flush (Beagle.Daemon.IndexerRequest)
in <0x69> Beagle.Daemon.BuildIndex:FlushIndexer (Beagle.Daemon.IIndexer,Beagle.Daemon.IndexerRequest)
in <0x2ec> Beagle.Daemon.BuildIndex:IndexWorker ()
in <0x67> (wrapper delegate-invoke) System.MulticastDelegate:invoke_void ()
in <0x1f> Beagle.Util.ExceptionHandlingThread:ThreadStarted ()
in <0xff738700> (wrapper delegate-invoke) System.MulticastDelegate:invoke_void ()
in <0x71d3db7> (wrapper runtime-invoke) System.Object:runtime_invoke_void (object,intptr,intptr,intptr)

Native stacktrace:
mono(mono_handle_native_sigsegv+0xba) [0x81471da]
mono [0x81354cf]
/lib/libpthread.so.0 [0xa3d3e0]
/usr/local/lib/libwv-1.0.so.3(wvDecodeSimple+0x11ce) [0x131566a]
/usr/local/lib/libwv-1.0.so.3(wvText+0x3b) [0x131dc5b]
/usr/local/lib/beagle/libbeagleglue.so(wv1_glue_init_doc_parsing+0x126) [0x1fdc1a]
[0x6dee7a9]
[0x6dee4f0]
[0x81f720f]
[0x81f7068]
[0x81f6f87]
[0x81f6f44]
[0x81f6ef7]
[0x81f6dd8]
[0x81f6cde]
[0x81f28f0]
[0x81f27b4]
[0x81f2791]
[0x81f24dc]
[0x81f244b]
[0x81f21f5]
[0x81f1fcc]
[0x81f1f6e]
[0x81f1f0c]
[0x81f6c9d]
[0x81f6bfd]
[0x81f0619]
[0x81eec0d]
[0x81ee584]
[0x81ee508]
[0x7025fba]
[0x7024e3b]
[0x7024a52]
[0x182a065]
[0x1828f28]
[0x1828f58]
[0x1828f20]
[0xf615f1]
mono [0x8135380]
mono(mono_runtime_invoke+0x27) [0x80d42b7]
mono(mono_runtime_delegate_invoke+0x3b) [0x80d4a5b]
mono [0x8098d8b]
mono [0x8102db7]
mono [0x810d835]
/lib/libpthread.so.0 [0xa37b80]
/lib/libc.so.6(__clone+0x5e) [0x74eb9e]
Which version of wv1 do you have?
The wv1 version is wv-1.0.3. I just saw that 1.2 has come out. Is that a requirement? ./configure didn't report that.
OK, crash fixed. I will upload the patch in a minute or two. wv 1.2 is not a hard requirement; the Beagle filter is backward compatible. ;-) However, using wv-1.2 is better than wv-1.0.x.
Created attachment 55647 [details] [review] Fix for the crash
The attached patch fixes the crash. However, I still have to track down the double-free error that occurs while freeing the wvParseStruct.
*** Bug 330533 has been marked as a duplicate of this bug. ***
Varadhan, it's been two months and this is still popping up from time to time. What's the status of the patch?
@joe: Patch works fine. Have to run memchecker though. Let me verify it today and close this by today itself. Sorry for holding this for a long time. :-)
A related question on filter behaviour: if a filter fails by throwing an exception, is it handled gracefully, or does it take the whole indexhelper down with it? There are all kinds of files out there; it doesn't take much effort to find a file that will crash a filter!
Most cases are handled by skipping the file gracefully. Filters should probably wrap *text pulling* operations within a try-catch and fail gracefully. I guess calling "finished()" inside a catch in IndexText() would probably keep this one from crashing Beagle.
I would like to see all of these filters do just that. Report the error and keep going. For me, I would never see them because they run automated at 2am. I would never know there was a problem. Part of my job still is to run these by hand and watch them run every few weeks on the command line to see if they are going ok.
Exceptions in filters are handled gracefully, and the file is skipped. If there is a segfault in native code, there's nothing we can do. That's the benefit of using managed (C#) code whenever possible. :)
OK. The bug is actually triggered from wv, which appears to try to read past the EOF. I have filed a bug at bugzilla.abisource.com: http://bugzilla.abisource.com/show_bug.cgi?id=10025
> Ok. The bug actually is triggered from wv, as it (appears to be) tries to read
> past the EOF. I have filed a bug in bugzilla.abisource.com.
> http://bugzilla.abisource.com/show_bug.cgi?id=10025

I am all confused (nothing new :P). The bug is in wv, I understand. Then what is the status of the attached patch? Is it a workaround, an earlier attempt, or obsolete? I found something in the IRC logs:

[21:35] <varadhan> and since the entire wv1 logic is based on callback mechanism.. I guess, I will manually maintain a total size and try to match it witch current charpos that is being read.

Any update on this?
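For readers following the IRC idea of keeping a total size and comparing it with the current read position, here is a minimal sketch in C. It is not the attached patch and does not use the wv API; `bounded_reader`, `bounded_reader_init` and `bounded_read` are hypothetical names made up for illustration. The point is only that every read is clamped to the bytes that remain, so a parser cannot run past the end of the data.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical cursor over an in-memory document stream (not a wv type). */
typedef struct {
    const unsigned char *data; /* start of the buffer */
    size_t size;               /* total number of bytes available */
    size_t pos;                /* current read position, always <= size */
} bounded_reader;

void bounded_reader_init(bounded_reader *r, const void *data, size_t size)
{
    assert(r != NULL);
    r->data = (const unsigned char *)data;
    r->size = size;
    r->pos = 0;
}

/* Copy at most `want` bytes into `out`, never reading past the end of the
 * buffer. Returns the number of bytes actually copied; 0 means end of data. */
size_t bounded_read(bounded_reader *r, void *out, size_t want)
{
    assert(r != NULL && out != NULL);
    size_t remaining = r->size - r->pos;
    size_t n = want < remaining ? want : remaining;
    if (n > 0) {
        memcpy(out, r->data + r->pos, n);
        r->pos += n;
    }
    return n;
}
```

A caller would treat a short or zero return as "end of document" and stop pulling text, instead of trusting a length field read from a possibly corrupt file.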
I don't know if this is directly related, but our OLE filter was opening files using GSF's mmap() interface, which maps the files as MAP_SHARED rather than MAP_PRIVATE. This means that if the file changed on disk, it would cause problems if the filter were in the middle of reading from it. This particularly seems to be a problem when saving these files with OpenOffice, which triggers two inotify events, and causes the file to be indexed twice. I've just checked this in, so it'd be good if people could test it.
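To illustrate the MAP_SHARED versus MAP_PRIVATE distinction mentioned above, here is a small sketch using plain POSIX calls rather than GSF. `map_file_private` and `unmap_file` are hypothetical helper names, not part of Beagle or GSF. Note the assumption being hedged: POSIX leaves it unspecified whether later changes to the file show up in a MAP_PRIVATE mapping, and truncating the file can still fault, so this is not a full guarantee against a file changing underneath a reader.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical helper: map `path` read-only with MAP_PRIVATE.
 * On success returns the mapping and stores its length in *len_out.
 * Returns NULL on any failure, including an empty file. */
void *map_file_private(const char *path, size_t *len_out)
{
    assert(path != NULL && len_out != NULL);

    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return NULL;

    struct stat st;
    if (fstat(fd, &st) != 0 || st.st_size <= 0) {
        close(fd);
        return NULL;
    }

    void *p = mmap(NULL, (size_t)st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd); /* the mapping stays valid after the descriptor is closed */
    if (p == MAP_FAILED)
        return NULL;

    *len_out = (size_t)st.st_size;
    return p;
}

/* Release a mapping obtained from map_file_private(). */
void unmap_file(void *p, size_t len)
{
    if (p != NULL)
        munmap(p, len);
}
```

A filter that needs a truly stable view of the file would have to copy the contents into its own buffer before parsing; the mapping flag alone is only part of the picture.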
In case the issue still persists, use an external filter. I just tried antiword on the attached "buggy" file. Using this in external-filters.xml:

  <filter>
    <mimetype>application/msword</mimetype>
    <extension>.doc</extension>
    <command>antiword</command>
    <arguments>-t %s</arguments>
  </filter>

beagle-extract-content worked fine on the crasher doc file.
Dave, is this still an issue (not using the external filter workaround)? I think the fix I did in comment #16 may have solved it.
Confirmed that it's fixed now in CVS. When wv1 fails, indexing now seems to just continue with the next item in the script. The run for the user that previously could not complete has now completed. I now have .doc support for the whole city, thanks.
mark close
These types of bugs are handled more gracefully in 0.2.15; setting up an external filter with antiword shouldn't be necessary.