GNOME Bugzilla – Bug 380328
try to recover from crashes instead of purging the index
Last modified: 2007-04-09 19:52:15 UTC
Summary says it all. Mostly a placeholder for patches and crasher test documents.
Even if it's a slightly stale checkpoint, it would make Beagle usable for those of us with a million files to index.
I checked in a patch (svn r3638) which first checks whether the index is at all corrupted before deleting it. If the index is actually corrupted (which is highly improbable given the robustness of Lucene), then unfortunately it has to be rebuilt from scratch. Checkpointing isn't possible on desktop machines; it's too expensive.
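For illustration, the check-before-purge flow is roughly the following (LoadOrRebuildIndex and IndexLooksSane are made-up names, not the actual r3638 code, and the simple open-and-catch check here stands in for the fuller verification walk the patch actually does):

using System;
using System.IO;
using Lucene.Net.Index;

class IndexSanity {
	// Cheapest possible check: a corrupt index usually surfaces as an
	// IOException as soon as we try to open it.
	static bool IndexLooksSane (string index_dir)
	{
		try {
			IndexReader reader = IndexReader.Open (index_dir);
			reader.Close ();
			return true;
		} catch (IOException) {
			return false;
		}
	}

	static void LoadOrRebuildIndex (string index_dir)
	{
		if (IndexLooksSane (index_dir))
			return; // index is fine; don't throw anything away

		Console.Error.WriteLine ("Index looks corrupted; rebuilding from scratch.");
		Directory.Delete (index_dir, true); // purge, then reindex everything
	}
}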
dbera: Is your implementation a "best practice" for determining a corrupt index, or is it a guess as to how a corrupt index would act in reality? For the code, I have one suggestion: index corruption is more likely to happen at the end of the index where documents are appended, so instead of incrementing from 0 to IndexReader.MaxDoc(), I would start at IndexReader.MaxDoc() and decrement to 0. I wonder if there's a way we can only walk some subset of the docs, like all the docs in the most recent Lucene segment.
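As a sketch, the reverse walk could look something like this (the names and exact Lucene.Net calls are my best guess at the API Beagle's bundled Lucene exposes, so treat it as illustrative rather than an actual patch):

using System.IO;
using Lucene.Net.Index;
using Lucene.Net.Documents;

class ReverseWalkCheck {
	public static bool DocumentsReadable (string index_dir)
	{
		IndexReader reader = null;
		try {
			reader = IndexReader.Open (index_dir);
			// Corruption is most likely near the end of the index where
			// documents were last appended, so walk from MaxDoc() down
			// to 0 to hit the suspect region first.
			for (int i = reader.MaxDoc () - 1; i >= 0; i--) {
				if (reader.IsDeleted (i))
					continue;
				Document doc = reader.Document (i); // throws IOException on corruption
				doc = null; // drop the reference so the GC can collect it sooner
			}
			return true;
		} catch (IOException) {
			return false;
		} finally {
			if (reader != null)
				reader.Close ();
		}
	}
}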
I read somewhere long ago (I don't remember where) that trying to construct all the documents is one way, perhaps the only way, to check whether an index is corrupted. We can try the reverse walk. How long (in time and memory) does it take on your huge indexes? In most cases the index will be fine (I tried a couple of crash files and tested against them), so we will have to instantiate all of the documents anyway. I set doc = null right after creating each one, so hopefully the GC will get the hint faster. Walking a subset, like the most recent Lucene segment, is definitely an option, but it will require a better understanding of Lucene segments and the .cfs files. If you come to know of anything, I will be happy to learn about it.
The rationale behind this is somewhat described in the page below. I am adding it from http://209.85.165.104/search?q=cache:2WxQbob93wcJ:www.usit.uio.no/it/vortex/fokusomrader/metadata/lucene/lucene-technical.html+lucene+corrupt&hl=en&ct=clnk&cd=11&gl=us&client=firefox-a (the original document is gone; this is the Google cache). Quoting:

"... Detecting index corruption

Simple tests of Lucene's reaction to intentionally corrupted indexes have been done. The result is almost always thrown IOExceptions when trying to open the damaged indexes. Some corruptions are not detected (i.e. stored data contents are altered, probably other smaller alterations to non-structural data). Lucene will not tell us anything useful if there's a corruption, only the IOException with an accompanying stack trace, etc. There are no finely grained exception hierarchies for different classes of Lucene index problems. However, in most cases the IOException will be because of corruption or locking. If we avoid locking problems altogether, like in the Tavle project, we are left with corruption as the most probable cause.

Additional tests might be performed to determine if index-rescue operations should be initiated, or perhaps to find other causes as to why Lucene failed to open an index. Rudimentary causes of failure should be checked no matter what. This includes validating that the physical index directory exists on the OS file system, write permissions, and so on...

An easy way to explicitly check the integrity of an index would be to gain exclusive access to it, open it, iterate through all documents, open each document, then close the index. If no IOExceptions occur during this process, the index is most probably OK. Otherwise, after eliminating all basic error conditions..."
Also, this is interesting: http://mail-archives.apache.org/mod_mbox/lucene-java-user/200504.mbox/%3c4265767B.5090307@getopt.org%3e Quoting:

"In some use scenarios it's not that simple... Anyway, back to the original question: indexExists() just checks for the presence of the "segments" file, so it says nothing about the index consistency. The best way to make sure the index is valid is to open it, and catch an IOException. To purposefully break the index you can do several things:

* delete the "segments" file itself (this will trash the whole index)
* delete one of the segments from the index (should generate an exception when opening)
* write a bunch of zeros in the middle of a segment file. This should result in an exception - but I'm not sure when; whether during open(), or during the actual reading of affected data.

You could then do the following: loop through all terms in the index (see the IndexReader API), and for each term get its TermPositions. This will have to read the complete index. Looping through all documents and reading each document doesn't guarantee that - unstored fields are not loaded into documents."
I updated to this latter process in r3645.
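For the record, that term walk looks roughly like this (again, TermsReadable is a made-up name and the Lucene.Net calls are from memory, not the actual r3645 code):

using System.IO;
using Lucene.Net.Index;

class TermWalkCheck {
	public static bool TermsReadable (string index_dir)
	{
		IndexReader reader = null;
		try {
			reader = IndexReader.Open (index_dir);
			TermEnum terms = reader.Terms ();
			while (terms.Next ()) {
				// Reading every posting of every term forces Lucene to
				// touch the complete index, including unstored fields
				// that a plain document walk would never load.
				TermPositions positions = reader.TermPositions (terms.Term ());
				while (positions.Next ()) {
					for (int f = 0; f < positions.Freq (); f++)
						positions.NextPosition ();
				}
				positions.Close ();
			}
			terms.Close ();
			return true;
		} catch (IOException) {
			return false; // any read failure is taken as corruption
		} finally {
			if (reader != null)
				reader.Close ();
		}
	}
}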