GNOME Bugzilla – Bug 466891
Too many open files Exception with following Segmentation Fault
Last modified: 2007-08-15 19:23:07 UTC
Steps to reproduce:
1. Start beagle-build-index with many, many email files.
2. After some hundred emails were indexed, the first IOExceptions (too many files open) are thrown.
3. Some files later, a segmentation fault stops beagle-build-index.

Stack trace:
NOTE: the email path and file names had to be anonymized due to privacy issues.
==============================================================================
Debug: +file:///PATH/EMAIL.eml
Warn: Unable to filter PATH/EMAIL.eml: System.IO.IOException: Unable to read PATH/EMAIL.eml for parsing mail
  at Beagle.Filters.FilterMail.DoOpen (System.IO.FileInfo info) [0x00000]
  at Beagle.Daemon.Filter.DoOpen (System.IO.FileSystemInfo info) [0x00000]
  at Beagle.Daemon.Filter.Open (System.IO.FileSystemInfo info) [0x00000]
Debug: First attempt to index file:///PATH/EMAIL.eml failed
System.IO.IOException: Too many open files
  at System.IO.FileStream..ctor (System.String name, FileMode mode, FileAccess access, FileShare share, Int32 bufferSize, Boolean anonymous, FileOptions options) [0x00000]
  at System.IO.FileStream..ctor (System.String name, FileMode mode, FileAccess access, FileShare share) [0x00000]
  at (wrapper remoting-invoke-with-check) System.IO.FileStream:.ctor (string,System.IO.FileMode,System.IO.FileAccess,System.IO.FileShare)
  at Beagle.Indexable.StreamFromUri (System.Uri uri) [0x00000]
  at Beagle.Indexable.ReaderFromUri (System.Uri uri) [0x00000]
  at Beagle.Indexable.GetTextReader () [0x00000]
  at Beagle.Daemon.LuceneCommon.BuildDocuments (Beagle.Indexable indexable, Lucene.Net.Documents.Document& primary_doc, Lucene.Net.Documents.Document& secondary_doc) [0x00000]
  at Beagle.Daemon.LuceneIndexingDriver.Flush_Unlocked (Beagle.Daemon.IndexerRequest request) [0x00000]
Debug: +file:///PATH/EMAIL2.eml#0
Error: Unable to filter file:///PATH/EMAIL1.eml#0 (mimetype=text/plain)
System.IO.IOException: Too many open files
  at System.IO.FileStream..ctor (System.String name, FileMode mode, FileAccess access, FileShare share, Int32 bufferSize, Boolean anonymous, FileOptions options) [0x00000]
  at System.IO.FileStream..ctor (System.String name, FileMode mode, FileAccess access, FileShare share) [0x00000]
  at (wrapper remoting-invoke-with-check) System.IO.FileStream:.ctor (string,System.IO.FileMode,System.IO.FileAccess,System.IO.FileShare)
  at Beagle.Daemon.Filter.Open (System.IO.FileSystemInfo info) [0x00000]
  at Beagle.Daemon.Filter.Open (System.String path) [0x00000]
  at Beagle.Daemon.FilterFactory.FilterIndexable (Beagle.Indexable indexable, Beagle.Daemon.TextCache text_cache, Beagle.Daemon.Filter& filter) [0x00000]
  at Beagle.Daemon.LuceneIndexingDriver.Flush_Unlocked (Beagle.Daemon.IndexerRequest request) [0x00000]
Debug: +file:///PATH/EMAIL2.eml
Debug: No filter for file:///PATH/EMAIL2.eml (PATH/EMAIL2.eml) [application/octet-stream]
Debug: +file:///PATH/EMAIL2.eml#2
Error: Unable to filter file:///PATH/EMAIL2.eml#2 (mimetype=application/pdf)
System.IO.IOException: Too many open files
  at System.IO.FileStream..ctor (System.String name, FileMode mode, FileAccess access, FileShare share, Int32 bufferSize, Boolean anonymous, FileOptions options) [0x00000]
  at System.IO.FileStream..ctor (System.String name, FileMode mode, FileAccess access, FileShare share) [0x00000]
  at (wrapper remoting-invoke-with-check) System.IO.FileStream:.ctor (string,System.IO.FileMode,System.IO.FileAccess,System.IO.FileShare)
  at Beagle.Daemon.Filter.Open (System.IO.FileSystemInfo info) [0x00000]
  at Beagle.Daemon.Filter.Open (System.String path) [0x00000]
  at Beagle.Daemon.FilterFactory.FilterIndexable (Beagle.Indexable indexable, Beagle.Daemon.TextCache text_cache, Beagle.Daemon.Filter& filter) [0x00000]
  at Beagle.Daemon.LuceneIndexingDriver.Flush_Unlocked (Beagle.Daemon.IndexerRequest request) [0x00000]
Debug: +file:///PATH/EMAIL3.eml
Debug: No filter for file:///PATH/EMAIL3.eml (PATH/EMAIL3.eml) [application/octet-stream]
Debug: +file:///PATH/EMAIL4.eml
Warn: Unable to filter PATH/EMAIL4.eml: System.IO.IOException: Unable to read PATH/EMAIL4.eml for parsing mail
  at Beagle.Filters.FilterMail.DoOpen (System.IO.FileInfo info) [0x00000]
  at Beagle.Daemon.Filter.DoOpen (System.IO.FileSystemInfo info) [0x00000]
  at Beagle.Daemon.Filter.Open (System.IO.FileSystemInfo info) [0x00000]
Debug: First attempt to index file:///PATH/EMAIL4.eml failed
System.IO.IOException: Too many open files
  at System.IO.FileStream..ctor (System.String name, FileMode mode, FileAccess access, FileShare share, Int32 bufferSize, Boolean anonymous, FileOptions options) [0x00000]
  at System.IO.FileStream..ctor (System.String name, FileMode mode, FileAccess access, FileShare share) [0x00000]
  at (wrapper remoting-invoke-with-check) System.IO.FileStream:.ctor (string,System.IO.FileMode,System.IO.FileAccess,System.IO.FileShare)
  at Beagle.Indexable.StreamFromUri (System.Uri uri) [0x00000]
  at Beagle.Indexable.ReaderFromUri (System.Uri uri) [0x00000]
  at Beagle.Indexable.GetTextReader () [0x00000]
  at Beagle.Daemon.LuceneCommon.BuildDocuments (Beagle.Indexable indexable, Lucene.Net.Documents.Document& primary_doc, Lucene.Net.Documents.Document& secondary_doc) [0x00000]
  at Beagle.Daemon.LuceneIndexingDriver.Flush_Unlocked (Beagle.Daemon.IndexerRequest request) [0x00000]
Debug: Second attempt to index file:///PATH/EMAIL4.eml failed, giving up...
System.IO.IOException: Too many open files
  at System.IO.FileStream..ctor (System.String name, FileMode mode, FileAccess access, FileShare share, Int32 bufferSize, Boolean anonymous, FileOptions options) [0x00000]
  at System.IO.FileStream..ctor (System.String name, FileMode mode, FileAccess access, FileShare share) [0x00000]
  at (wrapper remoting-invoke-with-check) System.IO.FileStream:.ctor (string,System.IO.FileMode,System.IO.FileAccess,System.IO.FileShare)
  at Lucene.Net.Store.FSIndexOutput..ctor (System.IO.FileInfo path) [0x00000]
  at Lucene.Net.Store.FSDirectory.CreateOutput (System.String name) [0x00000]
  at Lucene.Net.Index.FieldInfos.Write (Lucene.Net.Store.Directory d, System.String name) [0x00000]
  at Lucene.Net.Index.SegmentMerger.MergeFields () [0x00000]
  at Lucene.Net.Index.SegmentMerger.Merge () [0x00000]
  at Lucene.Net.Index.IndexWriter.MergeSegments (Int32 minSegment, Int32 end) [0x00000]
  at Lucene.Net.Index.IndexWriter.MergeSegments (Int32 minSegment) [0x00000]
  at Lucene.Net.Index.IndexWriter.MaybeMergeSegments () [0x00000]
  at Lucene.Net.Index.IndexWriter.AddDocument (Lucene.Net.Documents.Document doc, Lucene.Net.Analysis.Analyzer analyzer) [0x00000]
  at Lucene.Net.Index.IndexWriter.AddDocument (Lucene.Net.Documents.Document doc) [0x00000]
  at Beagle.Daemon.LuceneIndexingDriver.Flush_Unlocked (Beagle.Daemon.IndexerRequest request) [0x00000]
Debug: Encountered exception while indexing: System.IO.IOException: Too many open files
  at System.IO.FileStream..ctor (System.String name, FileMode mode, FileAccess access, FileShare share, Int32 bufferSize, Boolean anonymous, FileOptions options) [0x00000]
  at System.IO.FileStream..ctor (System.String name, FileMode mode, FileAccess access, FileShare share) [0x00000]
  at (wrapper remoting-invoke-with-check) System.IO.FileStream:.ctor (string,System.IO.FileMode,System.IO.FileAccess,System.IO.FileShare)
  at Lucene.Net.Store.FSIndexOutput..ctor (System.IO.FileInfo path) [0x00000]
  at Lucene.Net.Store.FSDirectory.CreateOutput (System.String name) [0x00000]
  at Lucene.Net.Index.FieldInfos.Write (Lucene.Net.Store.Directory d, System.String name) [0x00000]
  at Lucene.Net.Index.SegmentMerger.MergeFields () [0x00000]
  at Lucene.Net.Index.SegmentMerger.Merge () [0x00000]
  at Lucene.Net.Index.IndexWriter.MergeSegments (Int32 minSegment, Int32 end) [0x00000]
  at Lucene.Net.Index.IndexWriter.MergeSegments (Int32 minSegment) [0x00000]
  at Lucene.Net.Index.IndexWriter.FlushRamSegments () [0x00000]
  at Lucene.Net.Index.IndexWriter.Close () [0x00000]
  at Beagle.Daemon.LuceneIndexingDriver.Flush_Unlocked (Beagle.Daemon.IndexerRequest request) [0x00000]
  at Beagle.Daemon.LuceneIndexingDriver.Flush (Beagle.Daemon.IndexerRequest request) [0x00000]
  at Beagle.Daemon.BuildIndex.FlushIndexer (IIndexer indexer, Beagle.Daemon.IndexerRequest request) [0x00000]
  at Beagle.Daemon.BuildIndex.AddToRequest (Beagle.Daemon.IndexerRequest request, Beagle.Indexable indexable) [0x00000]
  at Beagle.Daemon.BuildIndex.DoIndexing () [0x00000]
  at Beagle.Daemon.BuildIndex.IndexWorker () [0x00000]
Debug: IndexWorker Done
libgcc_s.so.1 must be installed for pthread_cancel to work
=================================================================
Got a SIGABRT while executing native code. This usually indicates a fatal error in the mono runtime or one of the native libraries used by your application.
=================================================================
Stacktrace:

Other information:
The problem occurs because the filter opens the content file via the open system call but never calls the corresponding close system call.
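For readers unfamiliar with the "Too many open files" error, here is a minimal Python sketch of how a per-file descriptor leak eventually makes every later open fail with EMFILE. It is not Beagle code: the function name `leak_until_exhausted` and the limit of 64 are made up for illustration, and it assumes a POSIX system where the `resource` module is available.

```python
import errno
import os
import resource


def leak_until_exhausted(limit=64):
    """Open /dev/null repeatedly without closing it.

    Returns (number of descriptors leaked, errno of the failing open).
    """
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    # Lower only the soft limit so the demo fails quickly and can be undone.
    if soft == resource.RLIM_INFINITY or soft > limit:
        resource.setrlimit(resource.RLIMIT_NOFILE, (limit, hard))
    leaked = []
    failure = None
    try:
        while True:
            # The leak: each iteration opens a descriptor that is never closed.
            leaked.append(os.open(os.devnull, os.O_RDONLY))
    except OSError as exc:
        failure = exc.errno
    finally:
        # Clean up so the rest of the process is unaffected.
        for fd in leaked:
            os.close(fd)
        resource.setrlimit(resource.RLIMIT_NOFILE, (soft, hard))
    return len(leaked), failure


count, err = leak_until_exhausted()
print(count, errno.errorcode.get(err))
```

Once the limit is hit, unrelated code in the same process (here, Lucene's index writer in the trace above) fails too, because the limit is per process, not per component.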
Created attachment 93709: patch that fixes the bug
This fix simply closes the file using the corresponding close system call; with it, the exception and the segmentation fault disappear.
By the way, the code in FilterMail.cs did not change between 0.2.14 and 0.2.17 with respect to this bug, so the bug presumably still exists in 0.2.17.
Nice catch! ;-) Fixed in r3855.
I don't think this patch is correct; stream.Dispose() should take care of closing the file descriptor. Something else is going on.
Are there any exceptions prior to the "too many open files" error?
Nope, before the first IOException is thrown by FilterMail.cs, indexing is just fine. There are some arbitrary exceptions from various filters complaining about wrong gzip compression or images of the wrong format. I don't think that is related?! However, without that fix, beagle-build-index crashes at the very same file; with it, it completes indexing without problems (except the mentioned parsing issues).
Took a look at this. Dispose does not seem to be overridden for StreamFS or Stream. So from what I can tell this will call GObject.Dispose(). I don't know whether that calls anything that would close the file, though. I am not familiar with the glib object destruction process.
I backed out the change in SVN, r3856.

I should describe how things are supposed to work: Beagle uses GMime for mail parsing, which is written in C. It uses objects and uses reference counting for memory management. When we create a GMime.StreamFS and pass in the file descriptor, we're passing ownership of that file descriptor to the stream. We no longer have ownership of that fd, so we can't close it. That's why the patch isn't right, and it can have serious side effects.

The created GMime.StreamFS has a ref count of 1. When we create a GMime.Parser and pass in the GMime.StreamFS, the parser has a ref count of 1, and the stream's count is increased to 2. When we construct a GMime.Message from the parser, the message gets a ref count of 1, and the parser's count is increased to 2. So if you're keeping score at home, GMime.StreamFS has 2, GMime.Parser has 2, GMime.Message has 1.

Calling Dispose() on the stream decreases its ref count to 1, and calling it on the parser decreases its ref count to 1. This is where the indexing process takes place. When it's finished, DoClose() is supposed to be called. That disposes of the GMime.Message, which decreases its ref count to 0. That causes a chain reaction. When GMime.Message's refcount reaches 0, it releases its reference on GMime.Parser which drops to 0, which releases its reference on GMime.StreamFS. When GMime.StreamFS's refcount reaches 0, it closes the file descriptor.

So, if file descriptors are being leaked, there's also a very good chance that tons and tons of memory is being leaked as well. By closing the file descriptor early like that, you're probably not actually getting any mail data in the index or possibly worse: random data. Likewise, if the close process actually was working you would be closing a random file (since file descriptors are reused)... you could be closing an important index file.

The patch in essence treats the symptom but not the disease. It'd be like giving aspirin to someone with encephalitis... It might make their headache go away, but they're still going to die from a swollen brain. :)

I didn't notice originally that you were talking about beagle-build-index; my guess is that it simply isn't doing the Close() process correctly. I'll see if I can duplicate the issue locally.
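The reference-count walkthrough in this comment can be followed step by step with a small toy model. This is a Python sketch, not GMime or gtk-sharp code: the classes `RefCounted` and `Stream` and the `fd_open` flag are hypothetical stand-ins for GMime.Message, GMime.Parser, GMime.StreamFS and the owned file descriptor.

```python
class RefCounted:
    """Toy reference-counted object that may keep one other object alive."""

    def __init__(self, name, holds=None):
        self.name = name
        self.refs = 1          # the creator's reference
        self.holds = holds     # object this one keeps a reference on
        if holds is not None:
            holds.ref()

    def ref(self):
        self.refs += 1

    def unref(self):
        self.refs -= 1
        if self.refs == 0:
            self.finalize()
            if self.holds is not None:
                self.holds.unref()   # chain reaction down the ownership chain

    def finalize(self):
        pass


class Stream(RefCounted):
    """Stands in for the stream that owns the file descriptor."""

    def __init__(self):
        super().__init__("stream")
        self.fd_open = True

    def finalize(self):
        self.fd_open = False   # the descriptor is closed only here


stream = Stream()
parser = RefCounted("parser", holds=stream)    # stream: 2, parser: 1
message = RefCounted("message", holds=parser)  # parser: 2, message: 1

stream.unref()   # Dispose() on the stream -> 1
parser.unref()   # Dispose() on the parser -> 1
still_open_during_indexing = stream.fd_open

message.unref()  # DoClose(): message -> 0, parser -> 0, stream -> 0
print(still_open_during_indexing, stream.fd_open)
```

In this model the descriptor stays open for as long as the message is alive, which is why closing it by hand from outside the chain would pull the file out from under the parser.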
Putting a console.readline before extract-content returns, and adding print statements in DoClose(), it looks like:
- DoClose() is called
- message.dispose() is called
- lsof shows the file descriptor is not closed

Something is wrong in gmime.
Forgot to add, this is even for a valid email file.
I see this too, although this might be a red herring: it's possible the GC or something didn't run and close the file. I regularly run Beagle over several thousand emails without incident, so I think it's probably more than just something wrong with gmime.
Looks like GLib.Object.Dispose() queues up object unrefs using GLib.Timeout, which means that if a main loop isn't running they'll never get triggered. This seems to be the issue.
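To make the diagnosis concrete, here is a toy Python model of deferred release. It is a sketch under stated assumptions, not the gtk-sharp implementation: `ToyMainLoop`, `schedule`, `iterate` and `ToyObject` are invented names, and the only behavior modeled is that Dispose queues the unref instead of performing it.

```python
from collections import deque


class ToyMainLoop:
    """Holds callbacks until someone actually iterates the loop."""

    def __init__(self):
        self.pending = deque()

    def schedule(self, callback):
        # Hypothetical stand-in for queuing work on a timeout source.
        self.pending.append(callback)

    def iterate(self):
        ran = 0
        while self.pending:
            self.pending.popleft()()
            ran += 1
        return ran


class ToyObject:
    def __init__(self, loop):
        self.loop = loop
        self.released = False

    def dispose(self):
        # Dispose does not release anything itself; it only queues the unref.
        self.loop.schedule(self._unref)

    def _unref(self):
        self.released = True


loop = ToyMainLoop()
obj = ToyObject(loop)
obj.dispose()
released_without_loop = obj.released   # nothing has run the queue yet

loop.iterate()                         # what a running main loop would do
released_with_loop = obj.released
print(released_without_loop, released_with_loop)
```

A program that never iterates the loop keeps every queued unref pending forever, which in this model matches descriptors staying open after Dispose.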
This appears to have been fixed in gtk-sharp svn for some time, but I don't think there is a release which incorporates it.
When I index my files with beagled using the "--backend Files --indexing-test-mode" options, it processes the same files, but there it succeeds.
Yeah, beagled uses a main loop, so the objects are getting disposed of properly. dBera mentioned that the file was staying open with beagle-extract-content. There was a small inefficiency in GMime that would keep the file open longer than needed, and I just checked in a fix for that, but that's not the crux of the issue for you. We'll probably need to add some GLib main loop action to beagle-build-index. I think we can just do that in a separate thread. I need to play around with that some more.
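The idea of running a loop in a separate thread so that queued releases get processed can be sketched as follows. This is a Python toy, not the change that went into Beagle: `BackgroundLoop`, `schedule` and `quit` are hypothetical names, and a plain `queue.Queue` stands in for the GLib main loop.

```python
import queue
import threading


class BackgroundLoop:
    """Runs queued callbacks on a dedicated thread, in FIFO order."""

    def __init__(self):
        self._tasks = queue.Queue()
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def schedule(self, callback):
        self._tasks.put(callback)

    def _run(self):
        while True:
            task = self._tasks.get()
            if task is None:       # sentinel: stop the loop
                break
            task()

    def quit(self):
        # Everything queued before the sentinel still runs before we join.
        self._tasks.put(None)
        self._thread.join()


released = []
loop = BackgroundLoop()
for n in range(3):
    # Each callback models one deferred unref from the indexing thread.
    loop.schedule(lambda n=n: released.append(n))
loop.quit()
print(released)
```

The indexing thread only enqueues work; the loop thread drains it, so releases happen even though the main thread never iterates a loop itself.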
I was able to duplicate the problem, and I've checked in a fix (a workaround, really) to both beagle-build-index and beagle-extract-content. r3857.