Bug 590044 – Drop index on all columns of sqlite database

After an evaluation, GNOME has moved from Bugzilla to GitLab. Learn more about GitLab.
No new issues can be reported in GNOME Bugzilla anymore.
To report an issue in a GNOME project, go to GNOME GitLab.
Do not go to GNOME Gitlab for: Bluefish, Doxygen, GnuCash, GStreamer, java-gnome, LDTP, NetworkManager, Tomboy.

Bug 590044 - Drop index on all columns of sqlite database


Summary:	Drop index on all columns of sqlite database


Status:	RESOLVED FIXED

Product:	evolution-data-server
Classification:	Platform
Component:	Mailer
Version:	unspecified
Hardware:	Other Linux

Importance:	Normal trivial
Target Milestone:	---
Assigned To:	evolution-mail-maintainers
QA Contact:	Evolution QA team

URL:
Whiteboard:	evolution[disk-summary]

Depends on:
Blocks:

Reported:	2009-07-28 17:48 UTC by Romuald Brunet
Modified:	2009-08-13 07:44 UTC

See Also:
GNOME target:	---
GNOME version:	Unversioned Enhancement

Attachments
Patch proposal (803 bytes, patch) 2009-07-28 17:49 UTC, Romuald Brunet	committed	Details \| Review

Description Romuald Brunet 2009-07-28 17:48:35 UTC

Hello

I've been using Evolution for quite some time now, with a good number
of emails (4 millions total)

After a bit of exploring the folders.db sqlite file, I've found out
that there is an index on every table named SINDEX-table that is an
index for *every* column in the table.

In practice, that means that the index will never be used, since it
would require a query mixing all those fields (or at least, the first
ones in order : uid, flags, size, .. )

It also means that the index will take unnecessary place on the
filesystem (in my case, that was around 100MB by the time I removed
them), and slow down inserts into the database (in my case, when
moving a lots of mails from a folder to another).

I've patched my evolution version and I'm working with it since at
least 6 months without any problem. So I thought I might as well
provide it to you :)

Comment 1 Romuald Brunet 2009-07-28 17:49:09 UTC

Created attachment 139405 [details] [review]
Patch proposal

Comment 2 Matthew Barnes 2009-07-28 18:13:16 UTC

Romuald's observations are correct.  Certainly in its present form that index is not doing any good.

Srini, what was the intent of this index?  Should we drop it until we can do some serious analysis of what indices (if any) are really needed?

Comment 3 Srinivasa Ragavan 2009-08-02 19:02:06 UTC

Matt, my (initial) experience was that, it was slower without this index. I'm not a SQL expert. But with this, I could speed up the message list loading. 

Better we may have to timestamp it on empty disk cache (page-cache drop) and check this out. 

Romauld, did you timestamp it the queries to pick all fields, on a no page-cache/disk cache situation with and without index? I remember it was faster and which is why I had added it, but it was old and I don't remember the exact details.

Comment 4 Matthew Barnes 2009-08-02 20:06:00 UTC

Correct me if I'm wrong here but as I understand it a database index acts as a kind of hash table.  The tuple of fields you're indexing form the unique key.  The index cannot used without a complete key.

Searching an index of all fields in a table requires a SELECT expression specifying a value for every field in the table.  Such expressions are not very common.  Much more common are queries on one or two fields (e.g. "Subject or Sender contains"), which would not use the index because you can't form a complete key from such expressions.

Also, indexing free-form text fields like "subject" or even an email address is a dubious proposition because you're usually not searching for an exact match.  It's usually a fuzzy match that you want -- subject LIKE "%database%" -- and I don't think indexes are of any help there.

I have not taken any real measurements, but my guess is the index is only causing additional and unnecessary overhead as Romuald stated.

Comment 5 Srinivasa Ragavan 2009-08-03 03:23:57 UTC

Hmm, as I said I wasn't an expert, so I believe what you guys say. But when the indices were same as the query that we pass, it was faster. Specially the message info reads. Just do a small test and take these then.

Comment 6 Reid Thompson 2009-08-03 16:16:49 UTC

(In reply to comment #5)
> Hmm, as I said I wasn't an expert, so I believe what you guys say. But when the
> indices were same as the query that we pass, it was faster. Specially the
> message info reads. Just do a small test and take these then.
> 

http://www.sqlite.org/optoverview.html

In some instances LIKE queries can use an index.
I'm in agreement with Romuald and Matthew that the index on every table named SINDEX-table that is an index for *every* column in the table is likely just bloat.  Copied below from the above url ( there's more on sqlite query optimization on the page if you're interested ) -- the last 3-4 paragraphs are pertinent to the conversation about this specific index.

1.0 WHERE clause analysis

The WHERE clause on a query is broken up into "terms" where each term is separated from the others by an AND operator.

All terms of the WHERE clause are analyzed to see if they can be satisfied using indices. Terms that cannot be satisfied through the use of indices become tests that are evaluated against each row of the relevant input tables. No tests are done for terms that are completely satisfied by indices. Sometimes one or more terms will provide hints to indices but still must be evaluated against each row of the input tables.

The analysis of a term might cause new "virtual" terms to be added to the WHERE clause. Virtual terms can be used with indices to restrict a search. But virtual terms never generate code that is tested against input rows.

To be usable by an index a term must be of one of the following forms:


      column = expression
      column > expression
      column >= expression
      column < expression
      column <= expression
      expression = column
      expression > column
      expression >= column
      expression < column
      expression <= column
      column IN (expression-list)
      column IN (subquery)
      column IS NULL

If an index is created using a statement like this:

      CREATE INDEX idx_ex1 ON ex1(a,b,c,d,e,...,y,z);

Then the index might be used if the initial columns of the index (columns a, b, and so forth) appear in WHERE clause terms. All index columns must be used with the = or IN operators except for the right-most column which can use inequalities. For the right-most column of an index that is used, there can be up to two inequalities that must sandwich the allowed values of the column between two extremes.

It is not necessary for every column of an index to appear in a WHERE clause term in order for that index to be used. But there can not be gaps in the columns of the index that are used. Thus for the example index above, if there is no WHERE clause term that constraints column c, then terms that constraint columns a and b can be used with the index but not terms that constraint columns d through z. Similarly, no index column will be used (for indexing purposes) that is to the right of a column that is constrained only by inequalities. For the index above and WHERE clause like this:

      ... WHERE a=5 AND b IN (1,2,3) AND c>12 AND d='hello'

Only columns a, b, and c of the index would be usable. The d column would not be usable because it occurs to the right of c and c is constrained only by inequalities.

Comment 7 Reid Thompson 2009-08-03 16:19:54 UTC

(In reply to comment #6)
> (In reply to comment #5)
> > Hmm, as I said I wasn't an expert, so I believe what you guys say. But when the
> > indices were same as the query that we pass, it was faster. Specially the
> > message info reads. Just do a small test and take these then.
> > 
> 
> http://www.sqlite.org/optoverview.html
> 
> In some instances LIKE queries can use an index.
> I'm in agreement with Romuald and Matthew that the index on every table named
> SINDEX-table that is an index for *every* column in the table is likely just
> bloat.  Copied below from the above url ( there's more on sqlite query
> optimization on the page if you're interested ) -- the last 3-4 paragraphs are
> pertinent to the conversation about this specific index.
> 
> 1.0 WHERE clause analysis
> 
> The WHERE clause on a query is broken up into "terms" where each term is
> separated from the others by an AND operator.
> 
> All terms of the WHERE clause are analyzed to see if they can be satisfied
> using indices. Terms that cannot be satisfied through the use of indices become
> tests that are evaluated against each row of the relevant input tables. No
> tests are done for terms that are completely satisfied by indices. Sometimes
> one or more terms will provide hints to indices but still must be evaluated
> against each row of the input tables.
> 
> The analysis of a term might cause new "virtual" terms to be added to the WHERE
> clause. Virtual terms can be used with indices to restrict a search. But
> virtual terms never generate code that is tested against input rows.
> 
> To be usable by an index a term must be of one of the following forms:
> 
> 
>       column = expression
>       column > expression
>       column >= expression
>       column < expression
>       column <= expression
>       expression = column
>       expression > column
>       expression >= column
>       expression < column
>       expression <= column
>       column IN (expression-list)
>       column IN (subquery)
>       column IS NULL
> 
> If an index is created using a statement like this:
> 
>       CREATE INDEX idx_ex1 ON ex1(a,b,c,d,e,...,y,z);
> 
> Then the index might be used if the initial columns of the index (columns a, b,
> and so forth) appear in WHERE clause terms. All index columns must be used with
> the = or IN operators except for the right-most column which can use
> inequalities. For the right-most column of an index that is used, there can be
> up to two inequalities that must sandwich the allowed values of the column
> between two extremes.
> 
> It is not necessary for every column of an index to appear in a WHERE clause
> term in order for that index to be used. But there can not be gaps in the
> columns of the index that are used. Thus for the example index above, if there
> is no WHERE clause term that constraints column c, then terms that constraint
> columns a and b can be used with the index but not terms that constraint
> columns d through z. Similarly, no index column will be used (for indexing
> purposes) that is to the right of a column that is constrained only by
> inequalities. For the index above and WHERE clause like this:
> 
>       ... WHERE a=5 AND b IN (1,2,3) AND c>12 AND d='hello'
> 
> Only columns a, b, and c of the index would be usable. The d column would not
> be usable because it occurs to the right of c and c is constrained only by
> inequalities. 
> 

I haven't looked at the indices on the sqlite tables, so this may be a moot comment -- In general, you should only create indexes on a table to match your sql query parameters.

Comment 8 Matthew Barnes 2009-08-03 16:54:25 UTC

Thanks for the pointer, that's some good reading there!

Comment 9 Reid Thompson 2009-08-03 18:26:50 UTC

(In reply to comment #8)
> Thanks for the pointer, that's some good reading there!
> 

If anyone has specific questions, the sqlite mailing list is very responsive.

Comment 10 Milan Crha 2009-08-12 20:11:36 UTC

I tried the patch and I'm totally for it. My test on a 60K+ messages folder showed the folder selecting is significantly quicker. Before the patch the folder selection, or update on the new mail arrival, meant few seconds (slightly less than 10 I guess) chewing my HDD. With the patch no chewing any more, only 1 or 2 seconds on higher CPU usage, which is nothing important, with compare of the without-patch behaviour. Also, with this patch the folders.db is less than half in size (after the vacuum).

I thought of adding a UID index on the table, as we use this one in a WHERE clause, but I realized that I do not notice any difference with and without it. Thus no such thing.

I'm so brave to mark the patch as accepted. :)

Comment 11 Matthew Barnes 2009-08-12 20:37:02 UTC

That's proof enough for me.  Let's get this in 2.28.

Comment 12 Milan Crha 2009-08-13 07:44:41 UTC

Created commit 10752ca in eds master (2.29.1+)
Created commit 2009001 in eds gnome-2-28 (2.27.91+)

Note that for the first selection/creation of a folder, when the drop command is in effect, it takes some time until the drop is done, but after that it's noticeably better.