GNOME Bugzilla – Bug 595389
Auto-vacuum Camel DB on expunge
Last modified: 2014-12-03 08:14:27 UTC
See http://rbtcollins.wordpress.com/2009/08/05/06-aug-2009/ for details. There are other probable performance-impacting issues (like SELECTing more columns than needed), but this is a potentially big one. I have, right now, no idea how big the default cache should be. I will test some scenarios. But -- in my case, with an IMAP folders.db of about 90M, raising the cache to an (arbitrarily large) value of 50000 *did* make a lot of difference.
While I agree that a 2MB cache is probably too small for all but the most basic Evolution usage, a hard-coded value -- no matter what we pick -- is always going to suck for someone. Would it be possible to gradually increase a folders.db's cache size over time based on some heuristic? In other words, create a new folders.db with a small initial cache size (2MB, 8MB, whatever) but then double it each time it hits some condition. What that condition should be requires some research. I'm open to ideas.
I thought I had an option to set this as an environment variable. I don't fully remember now.
I originally thought of adding a new field to the account setup for the buffer size, but I am not sure this would be a good idea, as Matthew points out :-( An environment variable is useful for debugging purposes only, as far as I can understand. We could still implement it in such a way that if this environment variable is defined, then we use it, instead of whatever other value, computed or defaulted.

There are also other pragma options we might want to look at, since we are going there:

* auto_vacuum
* incremental_vacuum (if auto_vacuum=2)
* synchronous
* page_size

auto_vacuum and incremental_vacuum could both help with the vacuuming process for extremely active Evolution users; obviously, this adds in the question of *how* to vacuum..., and what the user impact is. page_size will probably affect the cache size. It is possible to dynamically resize the cache (use 'PRAGMA cache_size=N' for that).

But, still, a good question is what heuristic to use... I am not familiar with the DB structure and usage, so I cannot comment.
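As an illustration of the auto_vacuum/incremental_vacuum behaviour mentioned above (this is not Camel code -- it is a minimal sketch using Python's stdlib sqlite3 bindings on a throwaway database):

```python
# Sketch: with auto_vacuum=INCREMENTAL, deleted pages accumulate on the
# freelist until "PRAGMA incremental_vacuum" releases them back to the OS.
import os
import sqlite3
import tempfile

path = os.path.join(tempfile.mkdtemp(), "folders.db")
db = sqlite3.connect(path, isolation_level=None)  # autocommit mode

# auto_vacuum must be set before the first table is created; on an existing
# database it only takes effect after a full VACUUM.
db.execute("PRAGMA auto_vacuum = INCREMENTAL")
db.execute("CREATE TABLE msgs (uid TEXT, flags INTEGER)")
db.executemany("INSERT INTO msgs VALUES (?, ?)",
               ((str(i), 0) for i in range(10000)))

db.execute("DELETE FROM msgs")  # leaves free pages on the freelist

free_before = db.execute("PRAGMA freelist_count").fetchone()[0]
# fetchall() steps the statement to completion; incremental_vacuum does its
# work as the statement is stepped.
db.execute("PRAGMA incremental_vacuum").fetchall()
free_after = db.execute("PRAGMA freelist_count").fetchone()[0]

print(free_before > 0, free_after < free_before)
```

Note that incremental_vacuum only shrinks the file; like auto_vacuum, it does not defragment the pages that remain.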
(In reply to comment #3)
> (auto|incremental)-vacuum both could help on the vacuuming process for
> extremely active Evolution users; obviously, this adds in the question of
> *how* to vacuum..., and what is the user impact.

I have a feeling trying to vacuum the database in the background is just going to be problematic, at least while Camel remains multi-threaded. Incremental vacuums may be feasible in the future.

The simplest approach I can think of is to just get it done when Evolution starts. Show a progress dialog explaining what's going on before anything else is shown, and maybe add a --skip-db-maint command-line option to bypass it. The first run may take a while, but I think users generally restart Evolution frequently enough (unfortunately) that the database probably won't accumulate vast amounts of garbage from session to session. So subsequent runs should be pretty quick, I'd think.

If we go this way, maybe we could throw in a database integrity check at startup too (PRAGMA integrity_check or quick_check -- maybe user-selectable, but phrased to be understandable)? I think I already pitched that idea in another disk-summary bug...
(In reply to comment #3)
> An environment variable is useful for debugging purposes only, as far as I
> can understand. We could still implement it in such a way that if this
> environment variable is defined, then we use it, instead of whatever other
> value, computed, or defaulted.

That's a good idea for testing and tuning. In fact, SQLite lets you override the default cache size for the current session only: environment variable -> PRAGMA cache_size.
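The session-only override could look like the following sketch (Python's stdlib sqlite3 module; the variable name EXAMPLE_SQLITE_CACHE_SIZE and the open_db() helper are made up for this illustration):

```python
# Sketch: honour an environment-variable override for cache_size, otherwise
# leave SQLite's built-in default alone. cache_size is not persisted in the
# database file, so it must be reapplied on every open.
import os
import sqlite3

def open_db(path):
    db = sqlite3.connect(path)
    override = os.getenv("EXAMPLE_SQLITE_CACHE_SIZE")
    if override is not None:
        # Applies to this connection only.
        db.execute("PRAGMA cache_size = %d" % int(override))
    return db

os.environ["EXAMPLE_SQLITE_CACHE_SIZE"] = "5000"
db = open_db(":memory:")
print(db.execute("PRAGMA cache_size").fetchone()[0])  # → 5000
```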
Well. After a gentle prod from Milan, I started really looking at the code, and almost immediately found the bug title to be wrong -- it is not 2000 pages we use for cache, but 100 (from camel-db.c, in camel_db_open):

    if (g_getenv ("CAMEL_SQLITE_DEFAULT_CACHE_SIZE") != NULL)
            cache = g_strdup_printf ("PRAGMA cache_size=%s",
                                     g_getenv ("CAMEL_SQLITE_DEFAULT_CACHE_SIZE"));
    else
            cache = g_strdup ("PRAGMA cache_size=100");

    camel_db_command (cdb, cache, NULL);

So, unless overridden by CAMEL_SQLITE_DEFAULT_CACHE_SIZE, we will actually open the db with a cache size of 100 pages. I have not followed the calculations for the page size, but on my machine it is 1024 bytes -- so the cache size in use ends up a mere 100 kB. Hmm.

I will test a bit and see what happens, and may ask some others to test via evolution-list. Marking as NEW, right now.
We should kill the 'else' branch in that code immediately. Start with what SQLite defaults to and tune it from there.
Created attachment 146068 [details] [review] else branch removal on camel_db_open Here it is.
1. I do not see how to get vacuum to run automagically -- SQLite does not provide any official API to see how fragmented a database is. Additionally, 'PRAGMA auto_vacuum' [1] does not defragment the DB, and is stated to risk increasing fragmentation. Without a full understanding of how we use the tables, I do not feel comfortable going this way. I found a link on Mozilla about vacuuming [2]; it looks like an open issue, right now, with SQLite. So there does not seem to be a way, right now, to automate *when* to vacuum. The only option I can think of is to provide a user-selectable way to do that. Not good, though.

[NOTE] 'PRAGMA freelist_count' will return the number of free/unused pages, while 'PRAGMA page_count' will return the total number of pages in the database file (see [1] for details). So, perhaps we could vacuum the DBs when the ratio of free to total pages goes above a pre-set value. We would have to do that either at startup or at shutdown, since vacuuming makes the DBs inaccessible.

2. Another option might be to run 'ANALYZE' [3] every so often. The ANALYZE command creates/updates an internal table (sqlite_stat1) that SQLite uses internally to optimise queries. Not sure how this would affect speed, though.

3. So we are pretty much left, right now, with giving Evo a larger cache size. How large, though, I do not know yet. SQLite documentation states the cache is only used when needed -- so, if only (say) 1000 pages are needed, only 1000 pages will be allocated. At least we should take out the restriction to only 100 pages ASAP, while these options are looked at.

[1] http://www.sqlite.org/pragma.html#syntax
[2] https://wiki.mozilla.org/Firefox/Projects/Places_Vacuum
[3] http://www.sqlite.org/lang_analyze.html
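The free/total-page heuristic from the note above can be sketched as follows (Python's stdlib sqlite3 module; maybe_vacuum() and the threshold value are illustrative, not Camel API):

```python
# Sketch: vacuum only when freelist_count / page_count exceeds a threshold.
import sqlite3

def maybe_vacuum(db, ratio_threshold=0.30):
    free = db.execute("PRAGMA freelist_count").fetchone()[0]
    total = db.execute("PRAGMA page_count").fetchone()[0]
    if total == 0:  # guard against a divide-by-zero on an empty file
        return False
    if free / total <= ratio_threshold:
        return False
    db.execute("VACUUM")  # must run outside any open transaction
    return True

db = sqlite3.connect(":memory:", isolation_level=None)  # autocommit mode
db.execute("CREATE TABLE t (x)")
db.executemany("INSERT INTO t VALUES (?)", ((i,) for i in range(20000)))
db.execute("DELETE FROM t")  # most pages are now on the freelist
print(maybe_vacuum(db))
```

After the vacuum, the freelist is empty, so an immediate second call returns False.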
Patch committed to master and gnome-2-28: http://git.gnome.org/cgit/evolution-data-server/commit/?id=181fe09402e1dbbac33b1553edb70d5b51439b89 http://git.gnome.org/cgit/evolution-data-server/commit/?h=gnome-2-28&id=0fa2e95b03a1748fda49ff0cf59438833454489e So what else needs to happen to close this bug?
I am writing a vacuum option on startup. We can get it done off this bug, if you wish.
Created attachment 146362 [details] [review] implements 'vacuum' -- must be tested

OK. This patch is a first approach to dynamically vacuuming the SQLite databases. To try it, the environment variable CAMEL_SQLITE_FREE_PAGE_RATIO must be set to a value between 1 and 99 (the percentage of free to used pages). The code prints debug lines starting with the string 'freelist', making it easier to zero in on the calls. Also, any 'vacuum' commands are enclosed in START/END macros to print the time spent on vacuum.

I have no idea what would be a good value for the free page ratio, so courageous testers may try whatever they want. 30 seems nice ;-). Depending on what we find, we can move it to (say) a gconf value, and add nice warnings.
Created attachment 146374 [details] [review] new version, adds check for denominator=0, corrects indentation

I had forgotten my vi is set to replace tabs... also, Milan spotted a potential divide-by-zero.
Hmm. Should I g_mutex_lock the DB while going through a vacuum?
I would prefer the vacuum command be split into a separate function, so we have more control over when and if it runs (and yes, you should be holding the DB lock while vacuuming). We'll want to vacuum all the databases concurrently.

I would also prefer that -Evolution- decides whether to vacuum, since this is an application-specific policy decision, and so that we can present a dialog first. The logic should be, roughly, in Evolution's mail backend at startup:

- Decide which databases need to be vacuumed.
- Ask the user if it's okay to vacuum, or just tell him we're doing it.
- Kick off all the vacuum commands, each in a separate thread.
- Wait for vacuuming to finish and close the dialog.
- Proceed with account loading.

It would be worth looking into whether SQLite has an async API with cancellation, so we could avoid using threads altogether here.
Async-ish APIs: http://www.sqlite.org/c3ref/progress_handler.html

"If the progress callback returns non-zero, the operation is interrupted. This feature can be used to implement a "Cancel" button on a GUI progress dialog box."

You may still need threads for GUI responsiveness, but you can definitely provide cancellation.

Concurrency: I recommend only vacuuming one DB at a time, for the following reasons:

- vacuum can need more than twice the DB's size in spare working space (e.g. it took ~1.5GB of spare working space to vacuum my 900MB folders.db)
- vacuum is IO-heavy: it reads every page in the DB, writes every page in the DB to the journal for recovery, and finally writes the compacted pages out. It's CPU-cheap.

You're better off vacuuming every DB serially, at least for desktop-class machines. You might get good performance with one thread per CPU on some recent machines.

Performance tuning: another useful thing to do is 'REINDEX;', which is pretty cheap to run and corrects a number of index issues that can accumulate -- and that vacuum does not fix. A quick_check will report whether this is useful.
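The progress-handler hook linked above can be exercised from Python's stdlib sqlite3 bindings, where set_progress_handler wraps sqlite3_progress_handler. This is a sketch, using a slow SELECT as a stand-in for VACUUM; the user_cancelled flag imitates what a GUI "Cancel" button would flip:

```python
# Sketch: a nonzero return from the progress callback interrupts the
# statement currently executing on the connection.
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE t (x)")
db.executemany("INSERT INTO t VALUES (?)", ((i,) for i in range(50000)))

user_cancelled = True  # pretend the Cancel button was pressed

def on_progress():
    # Return nonzero to interrupt the running statement.
    return 1 if user_cancelled else 0

db.set_progress_handler(on_progress, 1000)  # invoked every ~1000 VM opcodes

try:
    # A deliberately expensive statement standing in for VACUUM.
    db.execute("SELECT count(*) FROM t AS a, t AS b").fetchone()
    result = "completed"
except sqlite3.OperationalError:
    result = "interrupted"
print(result)
```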
hggdh, are you working on an updated version of the patch?
I finished this so that the database is vacuumed on a folder's (or store's) expunge, if the free page count is more than 5% of the total pages. That way it will not happen too often, but it still gets done. My 300MB folders.db took a couple of seconds to finish.

Created commit 00054ce in eds master (3.13.9+) [1]

[1] https://git.gnome.org/browse/evolution-data-server/commit/?id=00054ce