GNOME Bugzilla – Bug 555937
The database should be regularly ANALYZE'd
Last modified: 2009-03-26 20:35:16 UTC
The ANALYZE statement might improve performance with sqlite indexes. It is run in BansheeDbFormatMigrator.cs when a migration adds an index, but it is not run on a fresh database. I've seen one example of a banshee.db where, after ANALYZE, a query went from 300ms execution time to 22ms.
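To show what ANALYZE actually does, here is a minimal Python sketch (not Banshee code; the table and index names are illustrative). Running ANALYZE populates the sqlite_stat1 table with per-index statistics that the query planner can then use:

```python
import sqlite3

# Illustrative only: build a small indexed table and run ANALYZE.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE CoreTracks (TrackID INTEGER PRIMARY KEY, ArtistID INTEGER)")
con.execute("CREATE INDEX CoreTracksArtistIndex ON CoreTracks (ArtistID)")
con.executemany("INSERT INTO CoreTracks (ArtistID) VALUES (?)",
                [(i % 10,) for i in range(500)])
con.execute("ANALYZE")  # populates the sqlite_stat1 statistics table
rows = con.execute("SELECT tbl, idx, stat FROM sqlite_stat1").fetchall()
print(rows)  # e.g. [('CoreTracks', 'CoreTracksArtistIndex', '500 50')]
```

Without the ANALYZE step, sqlite_stat1 does not exist at all, which is why a fresh database gets no benefit until ANALYZE is run at least once.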
Do you mean just after migrating from 0.13.2, or just after a fresh db w/o any migration, or should we do this after every import of more than N (200?) songs? Are you planning on working on this?
I meant a fresh db w/o any migration, but, on second thought, it's probably useless to ANALYZE an empty db. So, as you suggested, we should run ANALYZE when more than N (to be determined) tracks have been imported. Updating summary to reflect that. I have no immediate plans to work on this, so feel free to have a go at it! ;)
That is awesome. We should do this ASAP.
Created attachment 125373 [details] [review] Patch to ANALYZE the DB at startup

This patch ANALYZEs the database if the number of tracks that have been modified since the last ANALYZE is greater than some configurable threshold. The default threshold is 100; I have no idea what an appropriate value would be. It adds two entries to CoreConfiguration:
- Database.AnalyzeThreshold (number of tracks that require updating before triggering ANALYZE)
- Database.LastAnalyzed (timestamp of the last ANALYZE, used to build the count)

Please test! It would be useful to back up your database first, not because I think this patch will eat your database, but because it would be good to have something to compare against.
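The actual patch is C#; the following is a hypothetical Python sketch of the same decision logic, just to make the flow concrete. `ANALYZE_THRESHOLD` and `last_analyzed` stand in for the proposed Database.AnalyzeThreshold and Database.LastAnalyzed config entries, and the `DateUpdatedStamp` column name is an assumption for illustration:

```python
import sqlite3, time

ANALYZE_THRESHOLD = 100  # stands in for Database.AnalyzeThreshold

def maybe_analyze(con, last_analyzed):
    # Count tracks modified since the last ANALYZE (Database.LastAnalyzed);
    # the column name here is assumed, not taken from the real schema.
    (modified,) = con.execute(
        "SELECT COUNT(*) FROM CoreTracks WHERE DateUpdatedStamp > ?",
        (last_analyzed,)).fetchone()
    if modified > ANALYZE_THRESHOLD:
        con.execute("ANALYZE")
        return time.time()  # new timestamp to store back as LastAnalyzed
    return last_analyzed

# Tiny demo: 150 modified tracks exceed the threshold, so ANALYZE runs.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE CoreTracks (TrackID INTEGER PRIMARY KEY, DateUpdatedStamp INTEGER)")
con.executemany("INSERT INTO CoreTracks (DateUpdatedStamp) VALUES (?)", [(100,)] * 150)
stamp = maybe_analyze(con, 0)
```

Calling `maybe_analyze(con, stamp)` again immediately would be a no-op, since no tracks have been modified after the stored timestamp.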
aaron@giggidy:~/svn/banshee$ (for ((i=0;i<100;i++)); do (time $SQLITE ~/.config/banshee-1/banshee.db 'select * from CoreTracks' 1>/dev/null) 2>&1 | awk '/^real/{sub(/^real[0-9\t]+m/,""); sub(/s$/,""); print}'; done;) | awk '{sum = sum + $0; } END { print sum / NR }'
0.14137
aaron@giggidy:~/svn/banshee$ (for ((i=0;i<100;i++)); do (time $SQLITE ~/.config/banshee-1/banshee.db.orig 'select * from CoreTracks' 1>/dev/null) 2>&1 | awk '/^real/{sub(/^real[0-9\t]+m/,""); sub(/s$/,""); print}'; done;) | awk '{sum = sum + $0; } END { print sum / NR }'
0.14247

I don't get much difference:
OLD: 0.14247
NEW: 0.14137

Maybe there's more difference with INSERT or UPDATE, but I really can't be bothered to test that right now :)

Bertrand: can you provide a test case where ANALYZE really helps?
Very inconclusive for me; I did two runs each:
OLD: 0.14374, 0.14518
NEW: 0.14998, 0.14221
From what I understand, ANALYZE helps sqlite make better index choices, so the query has to use indexes for it to make any difference. I'm using the following query, modified from --debug-sql output:

SELECT * FROM CoreTracks,CoreAlbums,CoreArtists WHERE CoreArtists.ArtistID = CoreTracks.ArtistID AND CoreAlbums.AlbumID = CoreTracks.AlbumID AND CoreTracks.PrimarySourceID = 1;

The data from ANALYZE is stored in the sqlite_stat1 table. From my tests, it seems it's worse to have bad data than none at all. So a scenario would be:
1. Create an empty DB
2. Import a few tracks
3. ANALYZE the DB (banshee does it in some schema migrations)
4. Import lots of tracks

I have a test DB (from a user, so I can send it by e-mail if needed), with 862 tracks in CoreTracks:
- banshee_carp3_orig.db : the original file from the user; sqlite_stat1 says CoreTracks has 146 rows
- banshee_carp3_analyzed.db : the same db, after ANALYZE

(for ((i=0;i<100;i++)); do (time sqlite3 banshee_carp3_orig.db < query.sql 1>/dev/null) 2>&1 | awk '/^real/{sub(/^real[0-9\t]+m/,""); sub(/s$/,""); print}'; done;) | awk '{sum = sum + $0; } END { print sum / NR }'
0.18002
(for ((i=0;i<100;i++)); do (time sqlite3 banshee_carp3_analyzed.db < query.sql 1>/dev/null) 2>&1 | awk '/^real/{sub(/^real[0-9\t]+m/,""); sub(/s$/,""); print}'; done;) | awk '{sum = sum + $0; } END { print sum / NR }'
0.03384
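The stale-statistics scenario described above can be reproduced in a few lines of Python (table and index names are illustrative, but the 146/862 counts mirror the test DB). ANALYZE after a small import, then a large import: sqlite_stat1 keeps reporting the old, much smaller row count until ANALYZE is run again, which is exactly the "bad data" the planner then trusts:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE CoreTracks (TrackID INTEGER PRIMARY KEY, AlbumID INTEGER)")
con.execute("CREATE INDEX CoreTracksAlbumIndex ON CoreTracks (AlbumID)")

# Step 2-3 of the scenario: import a few tracks, then ANALYZE.
con.executemany("INSERT INTO CoreTracks (AlbumID) VALUES (?)", [(i,) for i in range(146)])
con.execute("ANALYZE")  # stats now record 146 rows

# Step 4: import lots more tracks; the stats go stale.
con.executemany("INSERT INTO CoreTracks (AlbumID) VALUES (?)", [(i,) for i in range(716)])
(stale,) = con.execute(
    "SELECT stat FROM sqlite_stat1 WHERE idx = 'CoreTracksAlbumIndex'").fetchone()
con.execute("ANALYZE")  # refresh the statistics
(fresh,) = con.execute(
    "SELECT stat FROM sqlite_stat1 WHERE idx = 'CoreTracksAlbumIndex'").fetchone()
print(stale.split()[0], fresh.split()[0])  # prints "146 862"
```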
That looks pretty significant to me. Maybe this is a good idea to add? I am also contemplating doing something similar with VACUUM.
I do think the patch should be applied. At the very least, we'll have one less thing to ask users who are complaining about banshee being slow. Btw, the patch works fine for me.

A quick test of VACUUM: with another DB from a user (I'm starting a collection), the original file was 12MB, and 8.5MB after VACUUM. I used the same query as before.

(for ((i=0;i<100;i++)); do (time sqlite3 banshee_dn_analyzed.db < query.sql 1>/dev/null) 2>&1 | awk '/^real/{sub(/^real[0-9\t]+m/,""); sub(/s$/,"");print}'; done;) | awk '{sum = sum + $0; } END { print sum / NR }'
0.01611
(for ((i=0;i<100;i++)); do (time sqlite3 banshee_dn_analyzed_vacuumed.db < query.sql 1>/dev/null) 2>&1 | awk '/^real/{sub(/^real[0-9\t]+m/,""); sub(/s$/,"");print}'; done;) | awk '{sum = sum + $0; } END { print sum / NR }'
0.01497

So an improvement of less than 10% in this case.
This patch is pretty complicated. I mentioned to Aaron that we could just run ANALYZE every 10th startup or something. Here are some numbers for general consideration:

size of db | count(CoreTracks) | ANALYZE | 2nd ANALYZE | VACUUM  | size after
      7.6M |               806 |  0.257s |      0.144s |  3.096s | 4.5M
       31M |             22582 |  0.946s |      0.197s |  6.121s | 29M
      2.6M |              1212 |  0.179s |      0.129s |  0.905s | 2.6M
      3.2M |                 7 |  0.092s |      0.105s |  0.653s | 1.8M
      6.5M |              9193 |  0.228s |      0.108s |  1.333s | 6.3M
       68M |             31391 |  6.239s |      0.265s | 11.185s | 65M
      110M |             58099 | 17.320s |      0.445s | 19.169s | 108M
*** Bug 573071 has been marked as a duplicate of this bug. ***
This idea just hit me while reading Gabriel's comment on bug #573071: ANALYZE creates a table named sqlite_stat1 and populates it with some stats. It has 3 columns:
- tbl : table name
- idx : index name
- stat : a space-separated list of integers. The first number seems to be the number of records in the table when ANALYZE was last run.

We could use that to figure out whether we should run ANALYZE again because the number of records has changed significantly since the last time it was run. I might have a go at adapting Aaron's patch to do that, if the weather stays bad this weekend ;)
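The idea above can be sketched in a few lines of Python (illustrative only, not the actual patch; the 2x ratio is an arbitrary placeholder for "changed significantly"):

```python
import sqlite3

def needs_analyze(con, table, ratio=2.0):
    """Compare the row count recorded in sqlite_stat1 (first integer of the
    stat column) against the current row count for `table`."""
    try:
        row = con.execute(
            "SELECT stat FROM sqlite_stat1 WHERE tbl = ? LIMIT 1", (table,)).fetchone()
    except sqlite3.OperationalError:
        return True  # sqlite_stat1 doesn't exist yet: never analyzed
    if row is None:
        return True  # no recorded stats for this table
    analyzed = int(row[0].split()[0])  # record count at the last ANALYZE
    # `table` is interpolated directly, so only pass trusted names.
    (current,) = con.execute("SELECT COUNT(*) FROM %s" % table).fetchone()
    return current > ratio * max(analyzed, 1)

# Demo: ANALYZE at 100 rows, then grow the table to 400 rows.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE CoreTracks (TrackID INTEGER PRIMARY KEY, ArtistID INTEGER)")
con.execute("CREATE INDEX CoreTracksArtistIndex ON CoreTracks (ArtistID)")
con.executemany("INSERT INTO CoreTracks (ArtistID) VALUES (?)", [(i,) for i in range(100)])
con.execute("ANALYZE")
con.executemany("INSERT INTO CoreTracks (ArtistID) VALUES (?)", [(i,) for i in range(300)])
print(needs_analyze(con, "CoreTracks"))  # True: 400 rows vs 100 recorded
```

Once ANALYZE is run again, the recorded and current counts match and the check returns False until the table grows substantially again.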
Nice, yeah, we can also check that the table even exists (it doesn't when you first create a db, afaict). Assuming we've ANALYZE'd everything before, we can actually run ANALYZE for a particular table as needed, when its counts differ.
Created attachment 130176 [details] banshee --debug-sql following ANALYZE

Bertrand / Gabriel: attached is the log from banshee --debug-sql after I ran "VACUUM; ANALYZE" in "sqlite3 banshee.db".
Created attachment 130236 [details] [review] Analyze the DB if needed

This patch modifies Aaron's patch to implement the idea in comment #12 and the suggestion in comment #13. I've explicitly listed the tables that have indexes, instead of looking them up in sqlite_master, for the following reasons:
- there might be useless indexes left in the DB, from 0.X for example.
- I wanted the tables listed with the most relevant ones first, to decide as quickly as possible whether ANALYZE is needed.

The ANALYZE is always done for the whole DB; I don't think doing it for each table separately would bring much. On my system, this adds about 5ms to a 2s startup time when ANALYZE is not run.
Committed, with a few style changes suggested by Gabriel.