GNOME Bugzilla – Bug 380328
try to recover from crashes instead of purging the index
Last modified: 2007-04-09 19:52:15 UTC
Summary says it all. Mostly a placeholder for patches and crasher test documents.
Even if it's a slightly stale checkpoint, it would make Beagle usable for those of us with a million files to index.
I checked in a patch (svn r3638) which first checks whether the index is at all corrupted before deleting it. If the index is actually corrupted (which is highly improbable given the robustness of Lucene), then unfortunately it has to be rebuilt from scratch. Checkpointing isn't possible on desktop machines; it's too expensive.
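For illustration, the check-before-purge flow is roughly the following (LoadOrRebuildIndex and IndexLooksSane are made-up names, not the actual r3638 code, and the simple open-and-catch check here stands in for the fuller verification walk the patch actually does):

using System;
using System.IO;
using Lucene.Net.Index;

class IndexSanity {
	// Cheapest possible check: a corrupt index usually surfaces as an
	// IOException as soon as we try to open it.
	static bool IndexLooksSane (string index_dir)
	{
		try {
			IndexReader reader = IndexReader.Open (index_dir);
			reader.Close ();
			return true;
		} catch (IOException) {
			return false;
		}
	}

	static void LoadOrRebuildIndex (string index_dir)
	{
		if (IndexLooksSane (index_dir))
			return; // index is fine; don't throw anything away

		Console.Error.WriteLine ("Index looks corrupted; rebuilding from scratch.");
		Directory.Delete (index_dir, true); // purge, then reindex everything
	}
}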
dbera: Is your implementation a "best practice" for determining a corrupt index, or is it a guess as to how a corrupt index would act in reality? For the code, I have one suggestion: index corruption is more likely to happen at the end of the index where documents are appended, so instead of incrementing from 0 to IndexReader.MaxDoc(), I would start at IndexReader.MaxDoc() and decrement to 0. I wonder if there's a way we can only walk some subset of the docs, like all the docs in the most recent Lucene segment.
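As a sketch, the reverse walk could look something like this (the names and exact Lucene.Net calls are my best guess at the API Beagle's bundled Lucene exposes, so treat it as illustrative rather than an actual patch):

using System.IO;
using Lucene.Net.Index;
using Lucene.Net.Documents;

class ReverseWalkCheck {
	public static bool DocumentsReadable (string index_dir)
	{
		IndexReader reader = null;
		try {
			reader = IndexReader.Open (index_dir);
			// Corruption is most likely near the end of the index where
			// documents were last appended, so walk from MaxDoc() down
			// to 0 to hit the suspect region first.
			for (int i = reader.MaxDoc () - 1; i >= 0; i--) {
				if (reader.IsDeleted (i))
					continue;
				Document doc = reader.Document (i); // throws IOException on corruption
				doc = null; // drop the reference so the GC can collect it sooner
			}
			return true;
		} catch (IOException) {
			return false;
		} finally {
			if (reader != null)
				reader.Close ();
		}
	}
}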
I read somewhere long ago (I don't remember where) that trying to construct all the documents is one way, perhaps the only way, to check whether an index is corrupted. We can try the reverse walk. How long (in time and memory) does it take on your huge indexes? In most cases the index will be fine (I tried a couple of crash files and tested against them), so we will have to instantiate all of the documents anyway. I set doc = null right after creating each one, so hopefully the GC will get the hint faster. Walking a subset, like the most recent Lucene segment, is definitely an option, but it will require a better understanding of Lucene segments and the .cfs files. If you come to know of anything, I will be happy to learn about it.
The rationale behind this is somewhat described in the page below. I am adding it from http://209.85.165.104/search?q=cache:2WxQbob93wcJ:www.usit.uio.no/it/vortex/fokusomrader/metadata/lucene/lucene-technical.html+lucene+corrupt&hl=en&ct=clnk&cd=11&gl=us&client=firefox-a (the original document is gone; this is the Google cache). Quoting:

"... Detecting index corruption

Simple tests of Lucene's reaction to intentionally corrupted indexes have been done. The result is almost always thrown IOExceptions when trying to open the damaged indexes. Some corruptions are not detected (i.e. stored data contents are altered, probably other smaller alterations to non-structural data). Lucene will not tell us anything useful if there's a corruption, only the IOException with an accompanying stack trace, etc. There are no finely grained exception hierarchies for different classes of Lucene index problems. However, in most cases the IOException will be because of corruption or locking. If we avoid locking problems altogether, like in the Tavle project, we are left with corruption as the most probable cause.

Additional tests might be performed to determine if index-rescue operations should be initiated, or perhaps to find other causes as to why Lucene failed to open an index. Rudimentary causes of failure should be checked no matter what. This includes validating that the physical index directory exists on the OS file system, write permissions, and so on...

An easy way to explicitly check the integrity of an index would be to gain exclusive access to it, open it, iterate through all documents, open each document, then close the index. If no IOExceptions occur during this process, the index is most probably OK. Otherwise, after eliminating all basic error conditions..."
Also, this is interesting: http://mail-archives.apache.org/mod_mbox/lucene-java-user/200504.mbox/%3c4265767B.5090307@getopt.org%3e Quoting:

"In some use scenarios it's not that simple... Anyway, back to the original question: indexExists() just checks for the presence of the "segments" file, so it says nothing about the index consistency. The best way to make sure the index is valid is to open it, and catch an IOException. To purposefully break the index you can do several things:

* delete the "segments" file itself (this will trash the whole index)
* delete one of the segments from the index (should generate an exception when opening)
* write a bunch of zeros in the middle of a segment file. This should result in an exception - but I'm not sure when; whether during open(), or during the actual reading of affected data.

You could then do the following: loop through all terms in the index (see the IndexReader API), and for each term get its TermPositions. This will have to read the complete index. Looping through all documents and reading each document doesn't guarantee that - unstored fields are not loaded into documents."
I updated to this latter process in r3645.
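For the record, that term walk looks roughly like this (again, TermsReadable is a made-up name and the Lucene.Net calls are from memory, not the actual r3645 code):

using System.IO;
using Lucene.Net.Index;

class TermWalkCheck {
	public static bool TermsReadable (string index_dir)
	{
		IndexReader reader = null;
		try {
			reader = IndexReader.Open (index_dir);
			TermEnum terms = reader.Terms ();
			while (terms.Next ()) {
				// Reading every posting of every term forces Lucene to
				// touch the complete index, including unstored fields
				// that a plain document walk would never load.
				TermPositions positions = reader.TermPositions (terms.Term ());
				while (positions.Next ()) {
					for (int f = 0; f < positions.Freq (); f++)
						positions.NextPosition ();
				}
				positions.Close ();
			}
			terms.Close ();
			return true;
		} catch (IOException) {
			return false; // any read failure is taken as corruption
		} finally {
			if (reader != null)
				reader.Close ();
		}
	}
}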