GNOME Bugzilla – Bug 555937
The database should be regularly ANALYZE'd
Last modified: 2009-03-26 20:35:16 UTC
The ANALYZE statement might improve performance with sqlite indexes. It is run in BansheeDbFormatMigrator.cs when a migration adds an index, but it is not run on a fresh database. I've seen one example of a banshee.db where, after ANALYZE, a query went from 300ms execution time to 22ms.
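To show what ANALYZE actually does, here is a minimal Python sketch (not Banshee code; the table and index names are illustrative). Running ANALYZE populates the sqlite_stat1 table with per-index statistics that the query planner can then use:

```python
import sqlite3

# Illustrative only: build a small indexed table and run ANALYZE.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE CoreTracks (TrackID INTEGER PRIMARY KEY, ArtistID INTEGER)")
con.execute("CREATE INDEX CoreTracksArtistIndex ON CoreTracks (ArtistID)")
con.executemany("INSERT INTO CoreTracks (ArtistID) VALUES (?)",
                [(i % 10,) for i in range(500)])
con.execute("ANALYZE")  # populates the sqlite_stat1 statistics table
rows = con.execute("SELECT tbl, idx, stat FROM sqlite_stat1").fetchall()
print(rows)  # e.g. [('CoreTracks', 'CoreTracksArtistIndex', '500 50')]
```

Without the ANALYZE step, sqlite_stat1 does not exist at all, which is why a fresh database gets no benefit until ANALYZE is run at least once.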
Do you mean just after migrating from 0.13.2, or just after a fresh db w/o any migration, or should we do this after every import of more than N (200?) songs? Are you planning on working on this?
I meant a fresh db w/o any migration, but, on second thought, it's probably useless to ANALYZE an empty db. So, as you suggested, we should run ANALYZE when more than N (to be determined) tracks have been imported. Updating summary to reflect that. I have no immediate plans to work on this, so feel free to have a go at it! ;)
That is awesome. We should do this ASAP.
Created attachment 125373 [details] [review] Patch to ANALYZE the DB at startup

This patch ANALYZEs the database if the number of tracks that have been modified since the last ANALYZE is greater than some configurable threshold. The default threshold is 100; I have no idea what an appropriate value would be. It adds two entries to CoreConfiguration:
- Database.AnalyzeThreshold (number of tracks that require updating before triggering ANALYZE)
- Database.LastAnalyzed (timestamp of the last ANALYZE, used to build the count)

Please test! It would be useful to back up your database first, not because I think this patch will eat your database, but because it would be good to have something to compare against.
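The actual patch is C#; the following is a hypothetical Python sketch of the same decision logic, just to make the flow concrete. `ANALYZE_THRESHOLD` and `last_analyzed` stand in for the proposed Database.AnalyzeThreshold and Database.LastAnalyzed config entries, and the `DateUpdatedStamp` column name is an assumption for illustration:

```python
import sqlite3, time

ANALYZE_THRESHOLD = 100  # stands in for Database.AnalyzeThreshold

def maybe_analyze(con, last_analyzed):
    # Count tracks modified since the last ANALYZE (Database.LastAnalyzed);
    # the column name here is assumed, not taken from the real schema.
    (modified,) = con.execute(
        "SELECT COUNT(*) FROM CoreTracks WHERE DateUpdatedStamp > ?",
        (last_analyzed,)).fetchone()
    if modified > ANALYZE_THRESHOLD:
        con.execute("ANALYZE")
        return time.time()  # new timestamp to store back as LastAnalyzed
    return last_analyzed

# Tiny demo: 150 modified tracks exceed the threshold, so ANALYZE runs.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE CoreTracks (TrackID INTEGER PRIMARY KEY, DateUpdatedStamp INTEGER)")
con.executemany("INSERT INTO CoreTracks (DateUpdatedStamp) VALUES (?)", [(100,)] * 150)
stamp = maybe_analyze(con, 0)
```

Calling `maybe_analyze(con, stamp)` again immediately would be a no-op, since no tracks have been modified after the stored timestamp.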
aaron@giggidy:~/svn/banshee$ (for ((i=0;i<100;i++)); do (time $SQLITE ~/.config/banshee-1/banshee.db 'select * from CoreTracks' 1>/dev/null) 2>&1 | awk '/^real/{sub(/^real[0-9\t]+m/,""); sub(/s$/,""); print}'; done;) | awk '{sum = sum + $0; } END { print sum / NR }'
0.14137
aaron@giggidy:~/svn/banshee$ (for ((i=0;i<100;i++)); do (time $SQLITE ~/.config/banshee-1/banshee.db.orig 'select * from CoreTracks' 1>/dev/null) 2>&1 | awk '/^real/{sub(/^real[0-9\t]+m/,""); sub(/s$/,""); print}'; done;) | awk '{sum = sum + $0; } END { print sum / NR }'
0.14247

I don't get much difference:
OLD: 0.14247
NEW: 0.14137

Maybe there's more difference with INSERT or UPDATE, but I really can't be bothered to test that right now :)

Bertrand: can you provide a test case where ANALYZE really helps?
Very inconclusive for me; I did two runs each:
OLD: 0.14374, 0.14518
NEW: 0.14998, 0.14221
From what I understand, ANALYZE helps sqlite make better index choices, so the query has to use indexes for it to make any difference. I'm using the following query, modified from --debug-sql output:

SELECT * FROM CoreTracks,CoreAlbums,CoreArtists WHERE CoreArtists.ArtistID = CoreTracks.ArtistID AND CoreAlbums.AlbumID = CoreTracks.AlbumID AND CoreTracks.PrimarySourceID = 1;

The data from ANALYZE is stored in the sqlite_stat1 table. From my tests, it seems it's worse to have bad data than none at all. So a scenario would be:
1. Create an empty DB
2. Import a few tracks
3. ANALYZE the DB (banshee does it in some schema migrations)
4. Import lots of tracks

I have a test DB (from a user, so I can send it by e-mail if needed), with 862 tracks in CoreTracks:
- banshee_carp3_orig.db : the original file from the user; sqlite_stat1 says CoreTracks has 146 rows
- banshee_carp3_analyzed.db : the same db, after ANALYZE

(for ((i=0;i<100;i++)); do (time sqlite3 banshee_carp3_orig.db < query.sql 1>/dev/null) 2>&1 | awk '/^real/{sub(/^real[0-9\t]+m/,""); sub(/s$/,""); print}'; done;) | awk '{sum = sum + $0; } END { print sum / NR }'
0.18002
(for ((i=0;i<100;i++)); do (time sqlite3 banshee_carp3_analyzed.db < query.sql 1>/dev/null) 2>&1 | awk '/^real/{sub(/^real[0-9\t]+m/,""); sub(/s$/,""); print}'; done;) | awk '{sum = sum + $0; } END { print sum / NR }'
0.03384
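The stale-statistics scenario described above can be reproduced in a few lines of Python (table and index names are illustrative, but the 146/862 counts mirror the test DB). ANALYZE after a small import, then a large import: sqlite_stat1 keeps reporting the old, much smaller row count until ANALYZE is run again, which is exactly the "bad data" the planner then trusts:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE CoreTracks (TrackID INTEGER PRIMARY KEY, AlbumID INTEGER)")
con.execute("CREATE INDEX CoreTracksAlbumIndex ON CoreTracks (AlbumID)")

# Step 2-3 of the scenario: import a few tracks, then ANALYZE.
con.executemany("INSERT INTO CoreTracks (AlbumID) VALUES (?)", [(i,) for i in range(146)])
con.execute("ANALYZE")  # stats now record 146 rows

# Step 4: import lots more tracks; the stats go stale.
con.executemany("INSERT INTO CoreTracks (AlbumID) VALUES (?)", [(i,) for i in range(716)])
(stale,) = con.execute(
    "SELECT stat FROM sqlite_stat1 WHERE idx = 'CoreTracksAlbumIndex'").fetchone()
con.execute("ANALYZE")  # refresh the statistics
(fresh,) = con.execute(
    "SELECT stat FROM sqlite_stat1 WHERE idx = 'CoreTracksAlbumIndex'").fetchone()
print(stale.split()[0], fresh.split()[0])  # prints "146 862"
```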
That looks pretty significant to me. Maybe this is a good idea to add? I am also contemplating doing something similar with VACUUM.
I do think the patch should be applied. At the very least, we'll have one less thing to ask users who are complaining about banshee being slow. Btw, the patch works fine for me.

A quick test of VACUUM: with another DB from a user (I'm starting a collection), the original file was 12MB, and 8.5MB after VACUUM. I used the same query as before.

(for ((i=0;i<100;i++)); do (time sqlite3 banshee_dn_analyzed.db < query.sql 1>/dev/null) 2>&1 | awk '/^real/{sub(/^real[0-9\t]+m/,""); sub(/s$/,"");print}'; done;) | awk '{sum = sum + $0; } END { print sum / NR }'
0.01611
(for ((i=0;i<100;i++)); do (time sqlite3 banshee_dn_analyzed_vacuumed.db < query.sql 1>/dev/null) 2>&1 | awk '/^real/{sub(/^real[0-9\t]+m/,""); sub(/s$/,"");print}'; done;) | awk '{sum = sum + $0; } END { print sum / NR }'
0.01497

So an improvement of less than 10% in this case.
This patch is pretty complicated. I mentioned to Aaron that we could just run ANALYZE every 10th startup or something. Here are some numbers for general consideration:

size of db | count(CoreTracks) | ANALYZE | 2nd ANALYZE | VACUUM  | size after
      7.6M |               806 |  0.257s |      0.144s |  3.096s | 4.5M
       31M |             22582 |  0.946s |      0.197s |  6.121s | 29M
      2.6M |              1212 |  0.179s |      0.129s |  0.905s | 2.6M
      3.2M |                 7 |  0.092s |      0.105s |  0.653s | 1.8M
      6.5M |              9193 |  0.228s |      0.108s |  1.333s | 6.3M
       68M |             31391 |  6.239s |      0.265s | 11.185s | 65M
      110M |             58099 | 17.320s |      0.445s | 19.169s | 108M
*** Bug 573071 has been marked as a duplicate of this bug. ***
This idea just hit me while reading Gabriel's comment on bug #573071: ANALYZE creates a table named sqlite_stat1 and populates it with some stats. It has 3 columns:
- tbl : table name
- idx : index name
- stat : a space-separated list of integers. The first number seems to be the number of records in the table when ANALYZE was last run.

We could use that to figure out whether we should run ANALYZE again because the number of records has changed significantly since the last time it was run. I might have a go at adapting Aaron's patch to do that, if the weather stays bad this weekend ;)
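The idea above can be sketched in a few lines of Python (illustrative only, not the actual patch; the 2x ratio is an arbitrary placeholder for "changed significantly"):

```python
import sqlite3

def needs_analyze(con, table, ratio=2.0):
    """Compare the row count recorded in sqlite_stat1 (first integer of the
    stat column) against the current row count for `table`."""
    try:
        row = con.execute(
            "SELECT stat FROM sqlite_stat1 WHERE tbl = ? LIMIT 1", (table,)).fetchone()
    except sqlite3.OperationalError:
        return True  # sqlite_stat1 doesn't exist yet: never analyzed
    if row is None:
        return True  # no recorded stats for this table
    analyzed = int(row[0].split()[0])  # record count at the last ANALYZE
    # `table` is interpolated directly, so only pass trusted names.
    (current,) = con.execute("SELECT COUNT(*) FROM %s" % table).fetchone()
    return current > ratio * max(analyzed, 1)

# Demo: ANALYZE at 100 rows, then grow the table to 400 rows.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE CoreTracks (TrackID INTEGER PRIMARY KEY, ArtistID INTEGER)")
con.execute("CREATE INDEX CoreTracksArtistIndex ON CoreTracks (ArtistID)")
con.executemany("INSERT INTO CoreTracks (ArtistID) VALUES (?)", [(i,) for i in range(100)])
con.execute("ANALYZE")
con.executemany("INSERT INTO CoreTracks (ArtistID) VALUES (?)", [(i,) for i in range(300)])
print(needs_analyze(con, "CoreTracks"))  # True: 400 rows vs 100 recorded
```

Once ANALYZE is run again, the recorded and current counts match and the check returns False until the table grows substantially again.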
Nice, yeah, we can also check that the table even exists (it doesn't when you first create a db, afaict). Assuming we've ANALYZE'd everything before, we can actually run ANALYZE for a particular table as needed, when its counts differ.
Created attachment 130176 [details] banshee --debug-sql following ANALYZE

Bertrand / Gabriel: attached is the log from banshee --debug-sql after I ran "VACUUM; ANALYZE" in "sqlite3 banshee.db".
Created attachment 130236 [details] [review] Analyze the DB if needed

This patch modifies Aaron's patch to implement the idea in comment #12 and the suggestion in comment #13. I've explicitly listed the tables that have indexes, instead of looking them up in sqlite_master, for the following reasons:
- there might be useless indexes left in the DB, from 0.X for example.
- I wanted the tables listed with the most relevant ones first, to decide as quickly as possible whether ANALYZE is needed.

The ANALYZE is always done for the whole DB; I don't think doing it for each table separately would bring much. On my system, this adds about 5ms to a 2s startup time when ANALYZE is not run.
Committed, with a few style changes suggested by Gabriel.