GNOME Bugzilla – Bug 339470
Store textcache files compressed
Last modified: 2006-05-31 18:55:39 UTC
~/.beagle/Indexes is quite small but ~/.beagle/TextCache is generally huge (by orders of magnitude). Well, beagle stores a text snapshot/cached version of non-text files in TextCache and uses it to retrieve snippets later. Since TextCache files are pure text files, storing them compressed instead of as plain text would save a lot of storage.
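To make the idea concrete, here is a minimal sketch using SharpZipLib's GZipOutputStream/GZipInputStream (the path and text below are made up, not from any patch):

using System;
using System.IO;
using ICSharpCode.SharpZipLib.GZip;

class TextCacheSketch {
	static void Main ()
	{
		string path = "/tmp/textcache-entry"; // hypothetical cache file

		// Store the extracted text gzipped instead of as plain text.
		using (StreamWriter w = new StreamWriter (new GZipOutputStream (File.Create (path))))
			w.Write ("full text extracted from some non-text document...");

		// Snippeting later opens a decompressing reader over the same file.
		using (StreamReader r = new StreamReader (new GZipInputStream (File.OpenRead (path))))
			Console.WriteLine (r.ReadToEnd ());
	}
}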
This shouldn't be too difficult to implement; my only question is what our concerns about performance are. Snippeting is already one of the slower processes when it comes to rendering results; are we OK with adding to that delay?
Created attachment 64167 [details] [review] GZip all TextCache Files A rudimentary implementation of a compressed TextCache.
Ok, that patch doesn't work. I think I have a workaround, but let me test it a little more before I post.
Created attachment 64170 [details] [review] Updated Patch GZips all TextCache files without any real changes to the snippeting infrastructure.
Great. Now the difficult part: testing :). From past experience, GZip libraries are usually pretty fast and have very little overhead when reading from a file, so I don't expect that reading from gzipped files would incur any noticeable overhead. But this needs to be verified.
Yeah, I've been stress testing it as best I can, and haven't noticed any serious performance issues yet. This seems like something Bludgeon would be perfect for, but I've never used it... Regardless, at the moment it's crash-free for me, with a significant size reduction. We could use BZip2 compression to crunch the files much more, but that has a much greater overhead.
I agree with dBera, I think that verifying the overhead is important. The purpose of the TextCache is solely to retrieve snippets as fast as possible, and retrieving snippets is already one of the slowest things.
Definitely agreed. I've been playing with Bludgeon to try to get something to test snippets. In practice, I haven't been able to notice much of a difference. (The primary reason I chose GZip and not BZip2 or Zip is that it handles low-overhead streams the best.)
Kevin, you could do the following experiment: find a large text file (run pdftotext on a large PDF paper/thesis ;)), run the snippet extraction test on it in a loop about 100 times, both in GZip mode and in uncompressed mode, and look at the time difference. From my past experience, gzip is very fast and adds negligible overhead to stream decompression.
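Something along these lines would do. This is only a sketch that measures raw read/decompress time over made-up file names, leaving SnippetFu itself out of it:

using System;
using System.IO;
using ICSharpCode.SharpZipLib.GZip;

class DecompressBench {
	// Read every line, roughly like a snippet scan over the whole text.
	static void Drain (TextReader r)
	{
		while (r.ReadLine () != null)
			;
	}

	static void Main ()
	{
		const int iterations = 100;

		DateTime start = DateTime.Now;
		for (int i = 0; i < iterations; i++)
			using (StreamReader r = new StreamReader ("thesis.txt")) // hypothetical plain copy
				Drain (r);
		Console.WriteLine ("plain: {0}", DateTime.Now - start);

		start = DateTime.Now;
		for (int i = 0; i < iterations; i++)
			using (StreamReader r = new StreamReader (new GZipInputStream (File.OpenRead ("thesis.txt.gz")))) // gzipped copy
				Drain (r);
		Console.WriteLine ("gzip:  {0}", DateTime.Now - start);
	}
}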
I dunno how long we're planning to wait before the next release, but if we have some time, the best stress test might be a few days in CVS. I've done about all the damage I can, and can't seem to generate any serious performance slowdowns (with beagle-search, that is; it pulls snippets on demand, as opposed to some other frontends which might load them all before displaying). My $0.02: the ~/.beagle dir is a good 60% smaller, which was startling. I didn't realize the TextCache was so large.
Oops, mid-air collision there. Lemme give it a shot. I happen to have a copy of ESR's The Cathedral and the Bazaar in PostScript; that should be a nice text block ;)
Ok, rough test results are as follows (this is over a remote connection, so it's by no means authoritative): GZip'd snippeting is taking about 0.6 seconds on average, while uncompressed is closer to 0.4. I'll attach the file I was extracting from if anyone wants to verify this a little more scientifically. (I just hacked up SnippetText.cs.)
Created attachment 64246 [details] Textfile used during snippet extraction tests
(In reply to comment #12)
> Ok, rough test results are as follows (this is over a remote connection, so
> it's by no means authoritative): GZip'd snippeting is taking about 0.6 seconds
> on average, while uncompressed is closer to 0.4. I'll attach the file I was
> extracting from if anyone wants to verify this a little more scientifically.
> (I just hacked up SnippetText.cs.)

Scanning that single uncompressed file took close to 0.4 seconds?! No way... you must be doing something like 10/20 iterations of SnippetTest. I tried for a snippet containing a word in the last few lines of the text and it took 0.2-0.3 seconds for 10 iterations.
> Scanning that single uncompressed file took close to 0.4 seconds?!
> No way... you must be doing something like 10/20 iterations of SnippetTest. I
> tried for a snippet containing a word in the last few lines of the text and it
> took 0.2-0.3 seconds for 10 iterations.

I was doing it with a query to the beagle daemon. I was snippeting all results returned for the query 'cathedral bazaar', which only returned that one result. The extra overhead was probably the result of the query happening each time, which produces a lot of extra disk spin. It was also over a remote connection on a wireless laptop; the idea reeks of laggy. I won't get a chance to run this locally until much later tonight. (School computers -> Windows.) If someone wants to win a 'super cool' badge, it's pretty easy to do what we need: just look at the attached patch and use one of the compressed streams to read a file and extract snippets.
Created attachment 64320 [details] [review] Update Maybe I shouldn't flush the write stream BEFORE writing... yeah... that sounds like a good idea...
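For anyone following along, the write path only works in this order. A sketch with a hypothetical helper (with SharpZipLib the gzip trailer only goes out on Finish/Close):

using System.IO;
using ICSharpCode.SharpZipLib.GZip;

class WriterOrder {
	// Hypothetical helper showing the intended write path for one cache entry.
	static void WriteEntry (string path, string text)
	{
		GZipOutputStream gz = new GZipOutputStream (File.Create (path));
		StreamWriter w = new StreamWriter (gz);

		w.Write (text);   // 1. write the cached text
		w.Flush ();       // 2. push buffered chars down into the gzip stream
		gz.Finish ();     // 3. emit the deflate tail and gzip CRC/length trailer
		w.Close ();       // 4. close everything (also closes the FileStream)
	}
}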
I haven't looked at the patch minutely, but make sure you handle the following issue: some of the files in the textcache are marked self-cached. Self-cached files are _not_ in the textcache store but in the normal filesystem. (Basically this means that for those files no separate cache is required; any snippet information can be obtained relatively quickly from the original files themselves.) I don't remember exactly how the textcache code handles these, but the textcache mustn't try to open a gzip stream on a self-cached file.
(In reply to comment #17)
> I haven't looked at the patch minutely, but make sure you handle the following
> issue: some of the files in the textcache are marked self-cached. Self-cached
> files are _not_ in the textcache store but in the normal filesystem. (Basically
> this means that for those files no separate cache is required; any snippet
> information can be obtained relatively quickly from the original files
> themselves.) I don't remember exactly how the textcache code handles these,
> but the textcache mustn't try to open a gzip stream on a self-cached file.

I already took this into consideration. If you check the patch, the GetReader() method checks that the file is gzipped first. This was also easy to implement thanks to the setup of the TextCache system, in which we implement GetReader() and GetWriter() methods that both return streams.
Checking just the mimetype isn't enough. If, for example, some gzipped file were marked self-cached (no reason why it should be, but nothing in the software or design prevents that from happening either), then your textcache would try to open it. You need to do a lookup yourself (in the textcache), determine whether it's marked self-cached, and then decide to either leave it alone or open a gzip stream.
Perhaps I am mistaken, but I was under the impression that LookupPath() returned the path to the relevant file, either the textcache file or the self-cached file. The method is at about line 258; check it out. I think it returns either the path of the actual file (in which case we open a regular stream [1]) or the textcache file (in which case we open a GZipInputStream). [1] That's in the most recent patch update; I'll attach it in a second.
I didn't _literally_ mean the LookupPath() method, but one of the lookup methods in the textcache that tell you whether a uri is marked self-cached or not :). I blame my school for all the confusion caused by me ;-).
Haha, my bad for the response being unclear. The LookupPath() method already does check for the self-cached flag. If a file is self-cached, then it is expected that GetReader() will return a stream referencing that file. I have a change I need to test quickly, but I will upload the new patch as soon as I'm sure it works.
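To spell out what I mean, the flow is roughly this. This is a sketch, not the patch itself: LookupPath() and the self-cache marker are the real TextCache names, but the stub body and tag value here are made up.

using System;
using System.IO;
using ICSharpCode.SharpZipLib.GZip;

public class TextCacheSketch {
	// Placeholder value; the real constant lives in TextCache.SELF_CACHE_TAG.
	const string SELF_CACHE_TAG = "self-cache-tag";

	public TextReader GetReader (Uri uri)
	{
		string path = LookupPath (uri);

		if (path == null)
			return null; // nothing cached for this uri

		if (path == SELF_CACHE_TAG)
			// Self-cached: the original file *is* the cache; never gzip-wrap it.
			return new StreamReader (uri.LocalPath);

		// Normal cache entry: stored gzipped on disk.
		return new StreamReader (new GZipInputStream (File.OpenRead (path)));
	}

	string LookupPath (Uri uri)
	{
		// Stub: the real method consults the sqlite db and may return
		// the self-cache marker for self-cached uris.
		return null;
	}
}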
Created attachment 64346 [details] [review] GZip TextCache files This adds a fix to the FSQ backend to actually use the compressed reader stuff (like the other backends do...) and adds some debugging output to help people be happy.
Any thoughts? Ideally someone else can try applying the patch and give it a little testing. I've been using it all week with no real issues...
Created attachment 64900 [details] [review] Update to fix some snippeting quirks on self-cached files Fixes some self-cached issues in the FSQ backend and removes some debugging output. In addition, I have continued some basic testing of compressed stream performance vs. uncompressed streams, and I think we're still spending most of our time just waiting for the disk to spin up and in the actual extraction method. I might just be testing incorrectly at the moment, but even with 100 extractions of the cathedral/bazaar text file, I can't get much more than a 0.06 to 0.07 second difference...
Created attachment 65177 [details] [review] Fix potential Xdgmime error Our Xdgmime bindings (and Xdgmime itself) don't handle non-existent paths very well; we don't want that to happen here.
Do we really need to use xdgmime to look up the mimetype? IMHO it's a small but real overhead, since xdgmime has to read in part of the file to determine whether it is a gzipped file. Couldn't we just name the gzipped files differently?
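Or, instead of a different name, we could just peek at the two gzip magic bytes ourselves and skip xdgmime entirely. A sketch of what I mean (not from any patch):

using System.IO;

class GZipCheck {
	// gzip files start with the two magic bytes 0x1f 0x8b.
	public static bool IsGZipped (string path)
	{
		using (FileStream fs = File.OpenRead (path)) {
			int b1 = fs.ReadByte ();
			int b2 = fs.ReadByte ();
			return b1 == 0x1f && b2 == 0x8b;
		}
	}
}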
We could; when I get home I'll give it a try. It was part of my first means of making sure we didn't try to open a GZip stream on self-cached files, but as I went, I added more conditions, so it's probably not needed anymore. Although, if we drop it, we would need to name compressed files differently, or (my preference) just require a wipe of ~/.beagle/TextCache and ~/.beagle/Indexes. I dunno, either one is doable.
Created attachment 65927 [details] [review] Remove Xdgmime and redundant path validity checking Ok, removed the unneeded XdgMime check and the check for the file's existence (as we already have a try/catch there to handle that). This loses the 'I've been running for weeks without error' status on this patch, but I still use it daily without issue.
Created attachment 65957 [details] [review] Proposed solution Cleaned up and reworked logic, based on the last patch from Kevin. This should work fine with both old and new TextCache files. Please test it; if no issues pop up, it should be fine to go in.
Created attachment 65959 [details] [review] Proposed solution Removed unneeded modifications from my tree which popped up in the previous patch.
The only thing you're missing is the slight rework of snippeting in the FSQ backend, since self-cached files can complicate things. The change is pretty mild, and you can probably pull it straight from one of my older patches.
(In reply to comment #32)
> The only thing you're missing is the slight rework of snippeting in the FSQ
> backend, since self-cached files can complicate things. The change is pretty
> mild, and you can probably pull it straight from one of my older patches.

Why do we need to change the snippeting in FSQ?
Hey, I e-mailed Lukas the core of it:

> FSQ uses GetSnippetFromFile, which does not call GetReader (thus not getting a
> gzipped stream), so my old patch checked if the file was self-cached and, if it
> was, continued with snippet extraction from the file; otherwise it went the
> more traditional route and used GetReader. If we don't do this, then beagle
> will throw exceptions whenever snippeting a self-cached file.

Here is the portion of the patch that is of interest:

Index: ./beagled/FileSystemQueryable/FileSystemQueryable.cs
===================================================================
RCS file: /cvs/gnome/beagle/beagled/FileSystemQueryable/FileSystemQueryable.cs,v
retrieving revision 1.106
diff -u -r1.106 FileSystemQueryable.cs
--- ./beagled/FileSystemQueryable/FileSystemQueryable.cs	29 Apr 2006 15:44:25 -0000	1.106
+++ ./beagled/FileSystemQueryable/FileSystemQueryable.cs	21 May 2006 02:52:36 -0000
@@ -1392,9 +1400,10 @@
 			// If this is self-cached, use the remapped Uri
 			if (path == TextCache.SELF_CACHE_TAG)
-				path = hit.Uri.LocalPath;
+				return SnippetFu.GetSnippetFromFile (query_terms, hit.Uri.LocalPath);
 
-			return SnippetFu.GetSnippetFromFile (query_terms, path);
+
+			return SnippetFu.GetSnippet (query_terms, TextCache.UserCache.GetReader(uri));
 		}
 
 		override public void Start ()
Created attachment 65988 [details] [review] Final Okay, I see what you meant. I took a slightly different approach than in your previous patch, which now doesn't need to query the sqlite database twice during path lookup. BTW, I didn't get your email :-)
(In reply to comment #35)
> Created an attachment (id=65988) [edit]
> Final
> Okay, I see what you meant. I took a slightly different approach than in your
> previous patch, which now doesn't need to query the sqlite database twice
> during path lookup.

Sweet, that makes a little more sense.

> BTW, I didn't get your email :-)

My bad, I sent it remotely from the metro; looks like it's still sitting in my outbox...
Lukas: why comment out the "FileAdvise.FlushCache (stream);" line? That's probably the most important line in the entire file. :)
Created attachment 66007 [details] [review] Final 2 Eeek! Fixed. But then again, it's not that important since it still worked without it. :-)
Checked this in after testing it quite a bit today. Thanks guys!