GNOME Bugzilla – Bug 339470
Store textcache files compressed
Last modified: 2006-05-31 18:55:39 UTC
~/.beagle/Indexes is quite small but ~/.beagle/TextCache is generally huge (by orders of magnitude). Well, beagle stores a text snapshot/cached version of non-text files in TextCache and uses it to retrieve snippets later. Since TextCache files are pure text files, storing them compressed instead of as plain text would save a lot of storage.
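To make the idea concrete, here is a minimal sketch using SharpZipLib's GZipOutputStream/GZipInputStream (the path and text below are made up, not from any patch):

using System;
using System.IO;
using ICSharpCode.SharpZipLib.GZip;

class TextCacheSketch {
	static void Main ()
	{
		string path = "/tmp/textcache-entry"; // hypothetical cache file

		// Store the extracted text gzipped instead of as plain text.
		using (StreamWriter w = new StreamWriter (new GZipOutputStream (File.Create (path))))
			w.Write ("full text extracted from some non-text document...");

		// Snippeting later opens a decompressing reader over the same file.
		using (StreamReader r = new StreamReader (new GZipInputStream (File.OpenRead (path))))
			Console.WriteLine (r.ReadToEnd ());
	}
}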
This shouldn't be too difficult to implement; my only question is what our concerns about performance are. Snippeting is already one of the slower processes when it comes to rendering results; are we OK with adding to that delay?
Created attachment 64167 [details] [review] GZip all TextCache Files A rudimentary implementation of a compressed TextCache.
Ok, that patch doesn't work. I think I have a workaround, but let me test it a little more before I post.
Created attachment 64170 [details] [review] Updated Patch GZips all TextCache files without any real changes to the snippeting infrastructure.
Great. Now the difficult part: testing :). From past experience, GZip libraries are usually pretty fast and have very little overhead when reading from a file, so I don't expect that reading from gzipped files would incur any noticeable overhead. But this needs to be verified.
Yeah, I've been stress testing it as best I can, and haven't noticed any serious performance issues yet. This seems like something Bludgeon would be perfect for, but I've never used it... Regardless, at the moment it's crash-free for me, with a significant size reduction. We could use BZip2 compression to crunch the files much more, but that has a much greater overhead.
I agree with dBera, I think that verifying the overhead is important. The purpose of the TextCache is solely to retrieve snippets as fast as possible, and retrieving snippets is already one of the slowest things.
Definitely agreed. I've been playing with Bludgeon to try to get something to test snippets. In practice, I haven't been able to notice much of a difference. (The primary reason I chose GZip and not BZip2 or Zip is that it handles low-overhead streams the best.)
Kevin, you could do the following experiment: find a large text file (run pdftotext on a large PDF paper/thesis ;)), run the snippet extraction test on it in a loop about 100 times, both in GZip mode and in uncompressed mode, and look at the time difference. From my past experience, gzip is very fast and adds negligible overhead to stream decompression.
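Something along these lines would do. This is only a sketch that measures raw read/decompress time over made-up file names, leaving SnippetFu itself out of it:

using System;
using System.IO;
using ICSharpCode.SharpZipLib.GZip;

class DecompressBench {
	// Read every line, roughly like a snippet scan over the whole text.
	static void Drain (TextReader r)
	{
		while (r.ReadLine () != null)
			;
	}

	static void Main ()
	{
		const int iterations = 100;

		DateTime start = DateTime.Now;
		for (int i = 0; i < iterations; i++)
			using (StreamReader r = new StreamReader ("thesis.txt")) // hypothetical plain copy
				Drain (r);
		Console.WriteLine ("plain: {0}", DateTime.Now - start);

		start = DateTime.Now;
		for (int i = 0; i < iterations; i++)
			using (StreamReader r = new StreamReader (new GZipInputStream (File.OpenRead ("thesis.txt.gz")))) // gzipped copy
				Drain (r);
		Console.WriteLine ("gzip:  {0}", DateTime.Now - start);
	}
}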
I dunno how long we're planning to wait before the next release, but if we have some time, the best stress test might be a few days in CVS. I've done about all the damage I can, and can't seem to generate any serious performance slowdowns (with beagle-search, that is; it pulls snippets on demand, as opposed to some other frontends which might load them all before displaying). My $0.02: the ~/.beagle dir is a good 60% smaller, which was startling. I didn't realize the TextCache was so large.
Oops, mid-air collision there. Lemme give it a shot. I happen to have a copy of ESR's The Cathedral and the Bazaar in PostScript; that should be a nice text block ;)
Ok, rough test results are as follows (this is over a remote connection, so it's by no means authoritative): GZip'd snippeting is taking about 0.6 seconds on average, while uncompressed is closer to 0.4. I'll attach the file I was extracting from if anyone wants to verify this a little more scientifically. (I just hacked up SnippetText.cs.)
Created attachment 64246 [details] Textfile used during snippet extraction tests
(In reply to comment #12)
> Ok, rough test results are as follows (this is over a remote connection, so
> it's by no means authoritative): GZip'd snippeting is taking about 0.6 seconds
> on average, while uncompressed is closer to 0.4. I'll attach the file I was
> extracting from if anyone wants to verify this a little more scientifically.
> (I just hacked up SnippetText.cs.)

Scanning that single uncompressed file took close to 0.4 seconds?! No way... you must be doing something like 10/20 iterations of SnippetTest. I tried for a snippet containing a word in the last few lines of the text and it took 0.2-0.3 seconds for 10 iterations.
> Scanning that single uncompressed file took close to 0.4 seconds?!
> No way... you must be doing something like 10/20 iterations of SnippetTest. I
> tried for a snippet containing a word in the last few lines of the text and it
> took 0.2-0.3 seconds for 10 iterations.

I was doing it with a query to the beagle daemon. I was snippeting all results returned for the query 'cathedral bazaar', which only returned that one result. The extra overhead was probably the result of the query happening each time, which produces a lot of extra disk spin. It was also over a remote connection on a wireless laptop; the idea reeks of laggy. I won't get a chance to run this locally until much later tonight. (School computers -> Windows.) If someone wants to win a 'super cool' badge, it's pretty easy to do what we need: just look at the attached patch and use one of the compressed streams to read a file and extract snippets.
Created attachment 64320 [details] [review] Update Maybe I shouldn't flush the write stream BEFORE writing... yeah... that sounds like a good idea...
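For anyone following along, the write path only works in this order. A sketch with a hypothetical helper (with SharpZipLib the gzip trailer only goes out on Finish/Close):

using System.IO;
using ICSharpCode.SharpZipLib.GZip;

class WriterOrder {
	// Hypothetical helper showing the intended write path for one cache entry.
	static void WriteEntry (string path, string text)
	{
		GZipOutputStream gz = new GZipOutputStream (File.Create (path));
		StreamWriter w = new StreamWriter (gz);

		w.Write (text);   // 1. write the cached text
		w.Flush ();       // 2. push buffered chars down into the gzip stream
		gz.Finish ();     // 3. emit the deflate tail and gzip CRC/length trailer
		w.Close ();       // 4. close everything (also closes the FileStream)
	}
}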
I haven't looked at the patch minutely, but make sure you handle the following issue: some of the files in the textcache are marked self-cached. Self-cached files are _not_ in the textcache store but in the normal filesystem. (Basically this means that for those files no separate cache is required; any snippet information can be obtained relatively quickly from the original files themselves.) I don't remember exactly how the textcache code handles these, but the textcache mustn't try to open a gzip stream on a self-cached file.
(In reply to comment #17)
> I haven't looked at the patch minutely, but make sure you handle the following
> issue: some of the files in the textcache are marked self-cached. Self-cached
> files are _not_ in the textcache store but in the normal filesystem. (Basically
> this means that for those files no separate cache is required; any snippet
> information can be obtained relatively quickly from the original files
> themselves.) I don't remember exactly how the textcache code handles these,
> but the textcache mustn't try to open a gzip stream on a self-cached file.

I already took this into consideration. If you check the patch, the GetReader() method checks that the file is gzipped first. This was also easy to implement thanks to the setup of the TextCache system, in which we implement GetReader() and GetWriter() methods that both return streams.
Checking just the mimetype isn't enough. If, for example, some gzipped file were marked self-cached (no reason why it should be, but nothing in the software or design prevents that from happening either), then your textcache would try to open it. You need to do a lookup yourself (in the textcache), determine whether it's marked self-cached, and then decide to either leave it alone or open a gzip stream.
Perhaps I am mistaken, but I was under the impression that LookupPath() returned the path to the relevant file, either the textcache file or the self-cached file. The method is at about line 258; check it out. I think it returns either the path of the actual file (in which case we open a regular stream [1]) or the textcache file (in which case we open a GZipInputStream). [1] That's in the most recent patch update; I'll attach it in a second.
I didn't _literally_ mean the LookupPath() method, but one of the lookup methods in the textcache that tell you whether a uri is marked self-cached or not :). I blame my school for all the confusion caused by me ;-).
Haha, my bad for the response being unclear. The LookupPath() method already does check for the self-cached flag. If a file is self-cached, then it is expected that GetReader() will return a stream referencing that file. I have a change I need to test quickly, but I will upload the new patch as soon as I'm sure it works.
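To spell out what I mean, the flow is roughly this. This is a sketch, not the patch itself: LookupPath() and the self-cache marker are the real TextCache names, but the stub body and tag value here are made up.

using System;
using System.IO;
using ICSharpCode.SharpZipLib.GZip;

public class TextCacheSketch {
	// Placeholder value; the real constant lives in TextCache.SELF_CACHE_TAG.
	const string SELF_CACHE_TAG = "self-cache-tag";

	public TextReader GetReader (Uri uri)
	{
		string path = LookupPath (uri);

		if (path == null)
			return null; // nothing cached for this uri

		if (path == SELF_CACHE_TAG)
			// Self-cached: the original file *is* the cache; never gzip-wrap it.
			return new StreamReader (uri.LocalPath);

		// Normal cache entry: stored gzipped on disk.
		return new StreamReader (new GZipInputStream (File.OpenRead (path)));
	}

	string LookupPath (Uri uri)
	{
		// Stub: the real method consults the sqlite db and may return
		// the self-cache marker for self-cached uris.
		return null;
	}
}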
Created attachment 64346 [details] [review] GZip TextCache files This adds a fix to the FSQ backend to actually use the compressed reader stuff (like the other backends do...) and adds some debugging output to help people be happy.
Any thoughts? Ideally someone else can try applying the patch and give it a little testing. I've been using it all week with no real issues...
Created attachment 64900 [details] [review] Update to fix some snippeting quirks on self-cached files Fixes some self-cached issues in the FSQ backend and removes some debugging output. In addition, I have continued some basic testing of compressed stream performance vs. uncompressed streams, and I think we're still spending most of our time just waiting for the disk to spin up and in the actual extraction method. I might just be testing incorrectly at the moment, but even with 100 extractions of the cathedral/bazaar text file, I can't get much more than a 0.06 to 0.07 second difference...
Created attachment 65177 [details] [review] Fix potential Xdgmime error Our Xdgmime bindings (and Xdgmime itself) don't handle non-existent paths very well; we don't want that to happen here.
Do we really need to use xdgmime to look up the mimetype? IMHO it's a small but real overhead, since xdgmime has to read in part of the file to determine whether it is a gzipped file. Couldn't we just name the gzipped files differently?
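Or, instead of a different name, we could just peek at the two gzip magic bytes ourselves and skip xdgmime entirely. A sketch of what I mean (not from any patch):

using System.IO;

class GZipCheck {
	// gzip files start with the two magic bytes 0x1f 0x8b.
	public static bool IsGZipped (string path)
	{
		using (FileStream fs = File.OpenRead (path)) {
			int b1 = fs.ReadByte ();
			int b2 = fs.ReadByte ();
			return b1 == 0x1f && b2 == 0x8b;
		}
	}
}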
We could; when I get home I'll give it a try. It was part of my first means of making sure we didn't try to open a GZip stream on self-cached files, but as I went, I added more conditions, so it's probably not needed anymore. Although, if we drop it, we would need to name compressed files differently, or (my preference) just require a wipe of ~/.beagle/TextCache and ~/.beagle/Indexes. I dunno, either one is doable.
Created attachment 65927 [details] [review] Remove Xdgmime and redundant path validity checking Ok, removed the unneeded XdgMime check and the check for the file's existence (as we already have a try/catch there to handle that). This loses the 'I've been running for weeks without error' status on this patch, but I still use it daily without issue.
Created attachment 65957 [details] [review] Proposed solution Cleaned up and reworked logic, based on the last patch from Kevin. This should work fine with both old and new TextCache files. Please test it; if no issues pop up, it should be fine to go in.
Created attachment 65959 [details] [review] Proposed solution Removed unneeded modifications from my tree which popped up in the previous patch.
The only thing you're missing is the slight rework of snippeting in the FSQ backend, since self-cached files can complicate things. The change is pretty mild, and you can probably pull it straight from one of my older patches.
(In reply to comment #32)
> The only thing you're missing is the slight rework of snippeting in the FSQ
> backend, since self-cached files can complicate things. The change is pretty
> mild, and you can probably pull it straight from one of my older patches.

Why do we need to change the snippeting in FSQ?
Hey, I e-mailed Lukas the core of it:

> FSQ uses GetSnippetFromFile, which does not call GetReader (thus not getting a
> gzipped stream), so my old patch checked if the file was self-cached and, if it
> was, continued with snippet extraction from the file; otherwise it went the
> more traditional route and used GetReader. If we don't do this, then beagle
> will throw exceptions whenever snippeting a self-cached file.

Here is the portion of the patch that is of interest:

Index: ./beagled/FileSystemQueryable/FileSystemQueryable.cs
===================================================================
RCS file: /cvs/gnome/beagle/beagled/FileSystemQueryable/FileSystemQueryable.cs,v
retrieving revision 1.106
diff -u -r1.106 FileSystemQueryable.cs
--- ./beagled/FileSystemQueryable/FileSystemQueryable.cs	29 Apr 2006 15:44:25 -0000	1.106
+++ ./beagled/FileSystemQueryable/FileSystemQueryable.cs	21 May 2006 02:52:36 -0000
@@ -1392,9 +1400,10 @@
 			// If this is self-cached, use the remapped Uri
 			if (path == TextCache.SELF_CACHE_TAG)
-				path = hit.Uri.LocalPath;
+				return SnippetFu.GetSnippetFromFile (query_terms, hit.Uri.LocalPath);
 
-			return SnippetFu.GetSnippetFromFile (query_terms, path);
+
+			return SnippetFu.GetSnippet (query_terms, TextCache.UserCache.GetReader(uri));
 		}
 
 		override public void Start ()
Created attachment 65988 [details] [review] Final Okay, I see what you meant. I took a slightly different approach than in your previous patch, which now doesn't need to query the sqlite database twice during path lookup. BTW, I didn't get your email :-)
(In reply to comment #35)
> Created an attachment (id=65988) [edit]
> Final
> Okay, I see what you meant. I took a slightly different approach than in your
> previous patch, which now doesn't need to query the sqlite database twice
> during path lookup.

Sweet, that makes a little more sense.

> BTW, I didn't get your email :-)

My bad, I sent it remotely from the metro; looks like it's still sitting in my outbox...
Lukas: why comment out the "FileAdvise.FlushCache (stream);" line? That's probably the most important line in the entire file. :)
Created attachment 66007 [details] [review] Final 2 Eeek! Fixed. But then again, it's not that important since it still worked without it. :-)
Checked this in after testing it quite a bit today. Thanks guys!