GNOME Bugzilla – Bug 590044
Drop index on all columns of sqlite database
Last modified: 2009-08-13 07:44:41 UTC
Hello I've been using Evolution for quite some time now, with a good number of emails (4 millions total) After a bit of exploring the folders.db sqlite file, I've found out that there is an index on every table named SINDEX-table that is an index for *every* column in the table. In practice, that means that the index will never be used, since it would require a query mixing all those fields (or at least, the first ones in order : uid, flags, size, .. ) It also means that the index will take unnecessary place on the filesystem (in my case, that was around 100MB by the time I removed them), and slow down inserts into the database (in my case, when moving a lots of mails from a folder to another). I've patched my evolution version and I'm working with it since at least 6 months without any problem. So I thought I might as well provide it to you :)
Created attachment 139405 [details] [review] Patch proposal
Romuald's observations are correct. Certainly in its present form that index is not doing any good. Srini, what was the intent of this index? Should we drop it until we can do some serious analysis of what indices (if any) are really needed?
Matt, my (initial) experience was that, it was slower without this index. I'm not a SQL expert. But with this, I could speed up the message list loading. Better we may have to timestamp it on empty disk cache (page-cache drop) and check this out. Romauld, did you timestamp it the queries to pick all fields, on a no page-cache/disk cache situation with and without index? I remember it was faster and which is why I had added it, but it was old and I don't remember the exact details.
Correct me if I'm wrong here but as I understand it a database index acts as a kind of hash table. The tuple of fields you're indexing form the unique key. The index cannot used without a complete key. Searching an index of all fields in a table requires a SELECT expression specifying a value for every field in the table. Such expressions are not very common. Much more common are queries on one or two fields (e.g. "Subject or Sender contains"), which would not use the index because you can't form a complete key from such expressions. Also, indexing free-form text fields like "subject" or even an email address is a dubious proposition because you're usually not searching for an exact match. It's usually a fuzzy match that you want -- subject LIKE "%database%" -- and I don't think indexes are of any help there. I have not taken any real measurements, but my guess is the index is only causing additional and unnecessary overhead as Romuald stated.
Hmm, as I said I wasn't an expert, so I believe what you guys say. But when the indices were same as the query that we pass, it was faster. Specially the message info reads. Just do a small test and take these then.
(In reply to comment #5) > Hmm, as I said I wasn't an expert, so I believe what you guys say. But when the > indices were same as the query that we pass, it was faster. Specially the > message info reads. Just do a small test and take these then. > http://www.sqlite.org/optoverview.html In some instances LIKE queries can use an index. I'm in agreement with Romuald and Matthew that the index on every table named SINDEX-table that is an index for *every* column in the table is likely just bloat. Copied below from the above url ( there's more on sqlite query optimization on the page if you're interested ) -- the last 3-4 paragraphs are pertinent to the conversation about this specific index. 1.0 WHERE clause analysis The WHERE clause on a query is broken up into "terms" where each term is separated from the others by an AND operator. All terms of the WHERE clause are analyzed to see if they can be satisfied using indices. Terms that cannot be satisfied through the use of indices become tests that are evaluated against each row of the relevant input tables. No tests are done for terms that are completely satisfied by indices. Sometimes one or more terms will provide hints to indices but still must be evaluated against each row of the input tables. The analysis of a term might cause new "virtual" terms to be added to the WHERE clause. Virtual terms can be used with indices to restrict a search. But virtual terms never generate code that is tested against input rows. To be usable by an index a term must be of one of the following forms: column = expression column > expression column >= expression column < expression column <= expression expression = column expression > column expression >= column expression < column expression <= column column IN (expression-list) column IN (subquery) column IS NULL If an index is created using a statement like this: CREATE INDEX idx_ex1 ON ex1(a,b,c,d,e,...,y,z); Then the index might be used if the initial columns of the index (columns a, b, and so forth) appear in WHERE clause terms. All index columns must be used with the = or IN operators except for the right-most column which can use inequalities. For the right-most column of an index that is used, there can be up to two inequalities that must sandwich the allowed values of the column between two extremes. It is not necessary for every column of an index to appear in a WHERE clause term in order for that index to be used. But there can not be gaps in the columns of the index that are used. Thus for the example index above, if there is no WHERE clause term that constraints column c, then terms that constraint columns a and b can be used with the index but not terms that constraint columns d through z. Similarly, no index column will be used (for indexing purposes) that is to the right of a column that is constrained only by inequalities. For the index above and WHERE clause like this: ... WHERE a=5 AND b IN (1,2,3) AND c>12 AND d='hello' Only columns a, b, and c of the index would be usable. The d column would not be usable because it occurs to the right of c and c is constrained only by inequalities.
(In reply to comment #6) > (In reply to comment #5) > > Hmm, as I said I wasn't an expert, so I believe what you guys say. But when the > > indices were same as the query that we pass, it was faster. Specially the > > message info reads. Just do a small test and take these then. > > > > http://www.sqlite.org/optoverview.html > > In some instances LIKE queries can use an index. > I'm in agreement with Romuald and Matthew that the index on every table named > SINDEX-table that is an index for *every* column in the table is likely just > bloat. Copied below from the above url ( there's more on sqlite query > optimization on the page if you're interested ) -- the last 3-4 paragraphs are > pertinent to the conversation about this specific index. > > 1.0 WHERE clause analysis > > The WHERE clause on a query is broken up into "terms" where each term is > separated from the others by an AND operator. > > All terms of the WHERE clause are analyzed to see if they can be satisfied > using indices. Terms that cannot be satisfied through the use of indices become > tests that are evaluated against each row of the relevant input tables. No > tests are done for terms that are completely satisfied by indices. Sometimes > one or more terms will provide hints to indices but still must be evaluated > against each row of the input tables. > > The analysis of a term might cause new "virtual" terms to be added to the WHERE > clause. Virtual terms can be used with indices to restrict a search. But > virtual terms never generate code that is tested against input rows. > > To be usable by an index a term must be of one of the following forms: > > > column = expression > column > expression > column >= expression > column < expression > column <= expression > expression = column > expression > column > expression >= column > expression < column > expression <= column > column IN (expression-list) > column IN (subquery) > column IS NULL > > If an index is created using a statement like this: > > CREATE INDEX idx_ex1 ON ex1(a,b,c,d,e,...,y,z); > > Then the index might be used if the initial columns of the index (columns a, b, > and so forth) appear in WHERE clause terms. All index columns must be used with > the = or IN operators except for the right-most column which can use > inequalities. For the right-most column of an index that is used, there can be > up to two inequalities that must sandwich the allowed values of the column > between two extremes. > > It is not necessary for every column of an index to appear in a WHERE clause > term in order for that index to be used. But there can not be gaps in the > columns of the index that are used. Thus for the example index above, if there > is no WHERE clause term that constraints column c, then terms that constraint > columns a and b can be used with the index but not terms that constraint > columns d through z. Similarly, no index column will be used (for indexing > purposes) that is to the right of a column that is constrained only by > inequalities. For the index above and WHERE clause like this: > > ... WHERE a=5 AND b IN (1,2,3) AND c>12 AND d='hello' > > Only columns a, b, and c of the index would be usable. The d column would not > be usable because it occurs to the right of c and c is constrained only by > inequalities. > I haven't looked at the indices on the sqlite tables, so this may be a moot comment -- In general, you should only create indexes on a table to match your sql query parameters.
Thanks for the pointer, that's some good reading there!
(In reply to comment #8) > Thanks for the pointer, that's some good reading there! > If anyone has specific questions, the sqlite mailing list is very responsive.
I tried the patch and I'm totally for it. My test on a 60K+ messages folder showed the folder selecting is significantly quicker. Before the patch the folder selection, or update on the new mail arrival, meant few seconds (slightly less than 10 I guess) chewing my HDD. With the patch no chewing any more, only 1 or 2 seconds on higher CPU usage, which is nothing important, with compare of the without-patch behaviour. Also, with this patch the folders.db is less than half in size (after the vacuum). I thought of adding a UID index on the table, as we use this one in a WHERE clause, but I realized that I do not notice any difference with and without it. Thus no such thing. I'm so brave to mark the patch as accepted. :)
That's proof enough for me. Let's get this in 2.28.
Created commit 10752ca in eds master (2.29.1+) Created commit 2009001 in eds gnome-2-28 (2.27.91+) Note that for the first selection/creation of a folder, when the drop command is in effect, it takes some time until the drop is done, but after that it's noticeably better.