GNOME Bugzilla – Bug 699597
Autocomplete using GAL is slow
Last modified: 2014-10-09 12:24:23 UTC
If I switch to the Contacts pane in the UI and do a simple 'Name contains...' search in the GAL it's very fast. But an 'Any field contains...' search takes about 12 seconds. That's expected, since the latter is searching fields that aren't in the database summary. However: if I enter a partial name/address into the To: field in the composer, it takes the *same* 12 seconds to come up with a list of possible completions. That's less acceptable. I'm guessing that the autocomplete query is trying to match on some field that *isn't* in the sqlitedb summary, and thus it's falling back to the slow search that builds *all* the VCARDs and runs the search over them all. Whatever the default autocomplete in the composer is doing, we should make sure those fields are indexed...
The search is of the form ' (or (beginswith "nickname" "asd") (beginswith "email" "asd") (beginswith "full_name" "asd") (beginswith "file_as" "asd") )' It looks like the GAL summary contains 'email_1', 'email_2', 'email_3' and 'email_4' fields, but no 'email'. That's why it falls back to a full search.
Is the problem clear or still unsolved ? Looks obvious to me... the summary was created with email_[1-4] fields stored in the summary, instead of with the multivalued 'email' field. The query asks for the multivalued 'email' field, which is not in the summary... so it falls back on parsing the full vcards to do the search. Is this undesirable ? sure. Are new summaries created with this email_[1-4] fields ? if so, that should probably be changed to use the new default summary format. Is this happening specifically for SQLite databases which were created in previous versions and then upgraded ? In any case, to improve performance of newer searches performed on older db's, perhaps some hack should be added to the SQLite ? i.e.: if (query == "email") { check (email_1 & email_2 & email_3 & email_4); } But even then, it would be incorrect, because the multivalued "email" field can refer to other email addresses which are not email_1, email_2, email_3 or email_4. I.e. the 'email' field can refer to any email... emails can be expressed in vcards in any of the following ways: EMAIL:my@address.com EMAIL;TYPE=work:my@address.com EMAIL;TYPE=home,farm:my@address.com EMAIL;TYPE=grasshopper:grass@hopper.com
Yes, this was originally generated with an older version of Evolution and then upgraded. It looks like this: CREATE TABLE 'afe69797-f428-4fb8-bc52-b5bdb3559d8a:Global Address List' ( uid TEXT PRIMARY KEY, rev TEXT, file_as TEXT, nickname TEXT, full_name TEXT, given_name TEXT, family_name TEXT, email_1 TEXT, email_2 TEXT, email_3 TEXT, email_4 TEXT, is_list INTEGER, list_show_addresses INTEGER, wants_html INTEGER, vcard TEXT, bdata TEXT); Is there no migration path on upgrade to fix this? If I remove the contacts.db file and allow it to be recreated, it looks like this: CREATE TABLE 'afe69797-f428-4fb8-bc52-b5bdb3559d8a:Global Address List' ( uid TEXT PRIMARY KEY, Rev TEXT, file_as TEXT, nickname TEXT, full_name TEXT, given_name TEXT, family_name TEXT, is_list INTEGER, list_show_addresses INTEGER, wants_html INTEGER, vcard TEXT, bdata TEXT); We just use e_book_backend_sqlitedb_new() to create it. How come it ended up with *no* email fields this time? And once we fix this so it actually has the expected 'email' field, I'm not sure it'll get any *data* in it. We still only ever populate the E_CONTACT_EMAIL_1 field for individuals (although it looks like we do fill in E_CONTACT_EMAIL for distribution lists... which means that once all this is fixed, a search for 'dwoodhou' might end up listing every distribution list that I'm on?) But one thing at a time... do we just to implement a migration path for the sqlitedb to update the summary fields, and why isn't it getting the right fields even when it's newly-created? Oops, that's two :)
OK, that "fixed" it. It's now trying to use sqlite for this query instead of falling back to the "slow" method of generating all the vcards and analysing them, which took all of 12 seconds. It is now a *lot* slower — evolution-addressbook-factory is eating 100% of CPU time and doesn't seem to be making a lot of progress at all.
+ Trace 231950
FWIW if I take a copy of contacts.db and run that same query using sqlite3 from the command line, I get much the same behaviour: $ time sqlite3 contacts2.db "SELECT DISTINCT summary.uid, vcard, bdata FROM 'afe69797-f428-4fb8-bc52-b5bdb3559d8a:Global Address List' AS summary, 'afe69797-f428-4fb8-bc52-b5bdb3559d8a:Global Address List_lists' AS multi WHERE ( (summary.nickname IS NOT NULL AND (summary.nickname LIKE 'dwoodhou%')) OR (summary.uid = multi.uid AND multi.field = 'email' AND (multi.value LIKE 'dwoodhou%')) OR (summary.full_name IS NOT NULL AND summary.full_name LIKE 'dwoodhou%') OR (summary.family_name IS NOT NULL AND summary.family_name LIKE 'dwoodhou%') OR (summary.given_name IS NOT NULL AND summary.given_name LIKE 'dwoodhou%') OR (summary.nickname IS NOT NULL AND summary.nickname LIKE 'dwoodhou%') OR (summary.file_as IS NOT NULL AND (summary.file_as LIKE 'dwoodhou%')) )" Now I have *two* CPUs pegged in sqlite code (on different files), and neither of them appears to be making any progress.
Aha, the problem here is the search query. It currently looks like this: SELECT DISTINCT summary.uid, vcard, bdata FROM 'afe69797-f428-4fb8-bc52-b5bdb3559d8a:Global Address List' AS summary, 'afe69797-f428-4fb8-bc52-b5bdb3559d8a:Global Address List_lists' AS multi WHERE ( (summary.nickname IS NOT NULL AND (summary.nickname LIKE 'dwoodhou%')) OR (summary.uid = multi.uid AND multi.field = 'email' AND (multi.value LIKE 'dwoodhou%')) OR (summary.full_name IS NOT NULL AND summary.full_name LIKE 'dwoodhou%') OR (summary.family_name IS NOT NULL AND summary.family_name LIKE 'dwoodhou%') OR (summary.given_name IS NOT NULL AND summary.given_name LIKE 'dwoodhou%') OR (summary.nickname IS NOT NULL AND summary.nickname LIKE 'dwoodhou%') OR (summary.file_as IS NOT NULL AND (summary.file_as LIKE 'dwoodhou%')) ); However, that's massively inefficient, and also wrong. What it does is build a data set consisting of every possible combination of one row from the 'summary' table, and one row from the 'multi' table. In my case, there are about 216022 rows in each. So that's a data set of 46 milliard combinations. Then it looks for the one where 'summary.nickname like dwoodhou%' and finds 216022 of them. It output 216022 *identical* results (uid, vcard, bdata). And then the 'DISTINCT' kicks in and it filters them all back down to a single line of output. (I use the past tense here with artistic licence. None of these queries has *actually* completed yet; they've been eating CPU for the last hour or two.) This is not only stunningly inefficient, but it's also broken. In the case where there are *no* entries in the 'multi' table, the search will operate on *no* data and find nothing. sqlite> create table foo (uid text primary key, name text); sqlite> create table foo_lists (uid text, email text); sqlite> insert into foo VALUES('alice', 'alice'); sqlite> insert into foo VALUES('bob', 'bob'); sqlite> insert into foo VALUES('charlie', 'charlie'); sqlite> select * from foo as foo, foo_lists as foo_lists; sqlite> What you actually want here is a LEFT OUTER JOIN: sqlite> select * from foo LEFT OUTER JOIN foo_lists ON foo.uid = foo_lists.uid; alice|alice|| bob|bob|| charlie|charlie|| sqlite> sqlite> insert into foo_lists values('alice', 'alice2@example.com'); sqlite> insert into foo_lists values('alice', 'alice@example.com'); sqlite> select * from foo LEFT OUTER JOIN foo_lists ON foo.uid = foo_lists.uid; alice|alice|alice|alice2@example.com alice|alice|alice|alice@example.com bob|bob|| charlie|charlie|| sqlite> This is the query you want: SELECT DISTINCT summary.uid, vcard, bdata FROM 'afe69797-f428-4fb8-bc52-b5bdb3559d8a:Global Address List' as summary LEFT OUTER JOIN 'afe69797-f428-4fb8-bc52-b5bdb3559d8a:Global Address List_lists' as multi ON summary.uid = multi.uid WHERE ( (summary.nickname IS NOT NULL AND (summary.nickname LIKE 'dwoodhou%')) OR (multi.field = 'email' AND (multi.value LIKE 'dwoodhou%')) OR (summary.full_name IS NOT NULL AND summary.full_name LIKE 'dwoodhou%') OR (summary.family_name IS NOT NULL AND summary.family_name LIKE 'dwoodhou%') OR (summary.given_name IS NOT NULL AND summary.given_name LIKE 'dwoodhou%') OR (summary.nickname IS NOT NULL AND summary.nickname LIKE 'dwoodhou%') OR (summary.file_as IS NOT NULL AND (summary.file_as LIKE 'dwoodhou%')) );
Created attachment 244423 [details] [review] Fix sqlite to use JOIN This should fix it.
By dropping all the 'foo IS NOT NULL' checks which seem redundant, I can drop the query time even further. It's about 1.16 seconds with them, and drops to about 1.06 seconds without.
Still not sure about the original upgrade problem though. Looking back at my original contacts.db file before I moved it out of the way to let it get recreated, it *did* have the 'afe69797-f428-4fb8-bc52-b5bdb3559d8a:Global Address List_lists' table. Just nothing *in* it. Despite the fact that the GAL has been updated a number of times since the Evolution upgrade, and new entries have been inserted. Why didn't the table get populated?
Created attachment 244496 [details] [review] Patch fixed to apply OUTER LEFT JOIN on all related queries This is essentially David Woodhouse's patch modified so that is also works for fetching uids (modified all related queries to use the OUTER LEFT JOIN). With this patch the test cases pass in tests/libebook/client/
(In reply to comment #9) > Still not sure about the original upgrade problem though. Looking back at my > original contacts.db file before I moved it out of the way to let it get > recreated, it *did* have the 'afe69797-f428-4fb8-bc52-b5bdb3559d8a:Global > Address List_lists' table. Just nothing *in* it. Despite the fact that the GAL > has been updated a number of times since the Evolution upgrade, and new entries > have been inserted. > > Why didn't the table get populated? This is because create_contacts_table() is a little bit messy for now, there are a hand full of requirements and to get it right/clean, it really needs some refactoring... In particular, notice how create_contacts_table() calls introspect_summary() after the 'CREATE TABLE IF NOT EXISTS' statements, this has the side effect of creating the '_lists' table even if it's not going to be used. https://git.gnome.org/browse/evolution-data-server/tree/addressbook/libedata-book/e-book-backend-sqlitedb.c#n882 The introspect_summary() function will load up the internal state of EBookBackendSqliteDB so that it knows which fields the SQLite is really populated with (i.e. the fields which were in the summary when the addressbook was first created). Since your addressbook was originally created without the 'email' field and with the 'email_1', 'email_2', 'email_3' and 'email_4' fields in the summary, that's what it will use. Also, note that even though the default schema has changed since 3.8, there is not a schema upgrade... (we don't really have any indication as to whether the old schema is explicitly configured and should be kept, or if we should for some reason drop the old schema and use a new one). However there is a renormalization process at upgrade time (this doesnt change the schema but does change the stored data which needs to be stored in lower case with no accents, so that we can use case sensitive LIKE statements).
(In reply to comment #11) > (we don't really have any indication > as to whether the old schema is explicitly configured and should be > kept, or if we should for some reason drop the old schema and use a > new one). Apart from the explicit list of fields which is passed to e_book_backend_sqlitedb_new_internal(), you mean?
Comment on attachment 244496 [details] [review] Patch fixed to apply OUTER LEFT JOIN on all related queries I've committed the SQL patch. Leaving the bug open for the original problem — we need to fix migration so that it doesn't end up doing the full query.
From IRC discussion, we have agreed that the way forward is to check for the old email_[1234] fields in the summary, and to upgrade the summary to use the 'emails' multivalue field if we find them. <tristan> I would say, first get to the point in create_contacts_table() *after* introspecting the summary setup... and then if the summary contains email_[1-4] fields, and no mention of 'email' in the multivalues introspected list... then do the migration... and update the multivalues field to mention 'email' so that it's properly introspected on the next go
We've run into some issues with this change. It would seem that while using OUTER LEFT JOIN for extremely huge addressbooks improves performance dramatically (as the former inner join did not make sense in this case). It has actually had negative impacts on regular sized addressbooks, such as those used for personal email accounts of contact lists on hand phones. This is because while using the OUTER LEFT JOIN statement, not all rows on the right hand table of the JOIN statement are guaranteed to exist in the resultint set... Because of the way the tables are now joined, the indexes in the multi-valued table (on the 'value' column) are no longer of any use. # # Exact email search with OUTER LEFT JOIN # # Note full table scan # SELECT DISTINCT summary.uid, vcard, bdata FROM 'folder_id' AS summary LEFT OUTER JOIN 'folder_id_lists' AS multi ON summary.uid = multi.uid WHERE (multi.field = 'email' AND (multi.value = 'micheal@jackson.com')) DEBUG BEGIN: NAME: 'selectid' VALUE: 0 NAME: 'order' VALUE: 0 NAME: 'from' VALUE: 0 NAME: 'detail' VALUE: SCAN TABLE folder_id AS summary (~1000000 rows) DEBUG END DEBUG BEGIN: NAME: 'selectid' VALUE: 0 NAME: 'order' VALUE: 1 NAME: 'from' VALUE: 1 NAME: 'detail' VALUE: SEARCH TABLE folder_id_lists AS multi USING INDEX VALINDEX (value=?) (~2 rows) DEBUG END DEBUG BEGIN: NAME: 'selectid' VALUE: 0 NAME: 'order' VALUE: 0 NAME: 'from' VALUE: 0 NAME: 'detail' VALUE: USE TEMP B-TREE FOR DISTINCT DEBUG END DEBUG STATEMENT END # # Exact email search with regular inner join # # Note binary search on VALINDEX # SELECT DISTINCT summary.uid, vcard, bdata FROM 'folder_id' AS summary, 'folder_id_lists' AS multi WHERE (summary.uid = multi.uid AND multi.field = 'email' AND (multi.value = 'micheal@jackson.com')) DEBUG BEGIN: NAME: 'selectid' VALUE: 0 NAME: 'order' VALUE: 0 NAME: 'from' VALUE: 1 NAME: 'detail' VALUE: SEARCH TABLE folder_id_lists AS multi USING INDEX VALINDEX (value=?) (~2 rows) DEBUG END DEBUG BEGIN: NAME: 'selectid' VALUE: 0 NAME: 'order' VALUE: 1 NAME: 'from' VALUE: 0 NAME: 'detail' VALUE: SEARCH TABLE folder_id AS summary USING INDEX sqlite_autoindex_folder_id_1 (uid=?) (~1 rows) DEBUG END DEBUG BEGIN: NAME: 'selectid' VALUE: 0 NAME: 'order' VALUE: 0 NAME: 'from' VALUE: 0 NAME: 'detail' VALUE: USE TEMP B-TREE FOR DISTINCT DEBUG END DEBUG STATEMENT END: Success A suggestion has been made to change this query so that a nested select is used, at least in the case where the query constraints are only interested in the multi-valued table. The suggested patch makes such queries run as follows: SELECT DISTINCT summary.uid, vcard, bdata FROM ( SELECT DISTINCT multi.uid FROM 'folder_id_lists' AS multi WHERE (multi.field = 'email' AND multi.value = 'micheal@jackson.com') ) AS multi LEFT OUTER JOIN 'folder_id' AS summary ON summary.uid = multi.uid This is an awkward statement for our query generator but will suffice as an optimization at least for cases where only the multi-valued table is in use. I'd like to measure the results of this properly before proceeding, in the meanwhile I'm open to comments or suggestions on alternative ways to optimize this. Note that SQLite does not support FULL JOINs
I wont close this just yet, as I'm not sure what GAL is at all (I suspect it has something to do with EWS, but then it would be EWS, right ? not GAL ?) Anyway, some explanation of what's happened in master which should be of help: o EBookBackendSqliteDB is deprecated in favor of the new rewrite; EBookSqlite. To fix this, you will need to port code to use EBookSqlite o EBookSqlite automatically configures the 'email' attribute-list contact field in an upgraded database previously created with EBookBackendSqliteDB It does so for addressbooks which are found to have been using the default configuration previously. The check to do this is a bit silly but should work fine, we check if email_[1-4] was in the summary configuration and if so, we replace that with 'email' o Queries now have a new form: SELECT <data> FROM <summary_table> AS summary JOIN <email_list_table> AS email_list ON +email_list.uid = summary.uid WHERE <constraints> The JOIN statement above will be repeated (or omitted) for each table which needs to be in context for the <constraints> The '+' in the JOIN statement ensures that email_list.uid is NOT eligable to participate in the indexing which will later occur in <constraints>... typically prefix indexing which would occur on the email_list.value column. This is the correct technique that I was recommended to use by the experts on sqlite-users mailing list. Benchmarks of the results with the new code base can be found here: https://people.gnome.org/~tvb/eds-benchmarks-30-11-2013/ There are two additional optimizations which can help out if these queries start to become slow when, you know... David decides to upgrade his addressbook from having only 220,000 contacts to about 5 or 20 million contacts. Those are: o Instead of constraints formed as WHERE <column> LIKE 'foo%' They can be formed as: WHERE <column> BETWEEN 'foo' AND 'foo}' The above is of course inappropriate for EDS as it makes assumptions on what kind of input will be given to queries, but it's worth noting. Another possible optimization is, instead of using the above '+' statement, to create indexes on the attribute lists table as such: CREATE INDEX 'field_uid_value' ON <email_list_table> (uid, value) By creating this special index, SQLite will create an index which combines both email_list.uid and email_list.value This would probably be faster than only indexing on email_list.value, you know, for addressbooks with millions and millions of contacts ;-)
GAL is the EWS Global Address List — the company database. I've ported it to EBookSqlite and done some testing. Firstly, in response to your comment #15: That's broken. You *needed* the OUTER LEFT JOIN. I've committed a (failing) test case which demonstrates the problem: https://git.gnome.org/browse/evolution-data-server/commit/?id=5f9f5b52 Brief explanation: an inner join will only include results where there is a *match* between the tables. If a record in folder_id doesn't have a corresponding record in folder_id_email_list, then it won't be seen. In that commit you can see some performance analysis of the join approaches, and basically they *all* suck. The way to write this query and get it to work fast is by using UNION, so the searches which are on simple non-list fields in the folder_id table can just work directly on the folder_id table. And use the indices on it (which we really ought to create, btw). And as an added bonus, if you're now *only* using the 'folder_id JOIN folder_id_email_list' construct for looking up email addresses, then you *can* use an inner join: SELECT DISTINCT summary.uid, summary.vcard, summary.bdata FROM 'folder_id' AS summary JOIN 'folder_id_email_list' AS email_list ON +email_list.uid = summary.uid WHERE (email_list.value IS NOT NULL AND email_list.value LIKE 'dave%') UNION SELECT summary.uid, summary.vcard, summary.bdata FROM 'folder_id' AS summary WHERE ((summary.nickname IS NOT NULL AND summary.nickname LIKE 'dave%') OR ((summary.full_name IS NOT NULL AND summary.full_name LIKE 'dave%') OR (summary.family_name IS NOT NULL AND summary.family_name LIKE 'dave%') OR (summary.given_name IS NOT NULL AND summary.given_name LIKE 'dave%') OR (summary.nickname IS NOT NULL AND summary.nickname LIKE 'dave%')) OR (summary.file_as IS NOT NULL AND summary.file_as LIKE 'dave%'));
*** Bug 736516 has been marked as a duplicate of this bug. ***
Created attachment 286092 [details] [review] Hack to use UNION where it's simple This patch is a hack to make it use UNION for queries which are sufficiently similar to the addressbook autocompletion. If all the constraints are ORed together, and if all the substantive constraints are 'beginswith' (although that could be relaxed actually), and if it's using both auxiliary tables *and* fields in the primary 'folder_id' table, then render the query using UNION instead. It's a hack, but it does give me a *dramatic* speedup for autocompletion in the UI. Left here as a proof of concept and a reminder for when we're in a position to fully do the performance analysis again...
A further comment about the LEFT OUTER JOIN, as I'm trying to summarise our collected knowledge about this until we have time to address it in detail and run the benchmark suites... Note that the LEFT OUTER JOIN fixes the problem of omitting results where the record in folder_id doesn't *have* any overlap with the auxiliary table (i.e. no email or no phone number, depending on the search). There are some queries where that wasn't a problem. For example, one 'hot' query might be finding a vcard which contains a given phone number. And you *don't* have to use an outer join for that; the inner join is perfectly sufficient since you obviously aren't going to *want* to return any records that don't have a phone number associated with them. So alongside the aux_mask in the PreflightContext, you could also have a boolean which indicates whether any of the 'basic' fields from folder_id have been used in queries. If *not*, then you can get away with an inner join — which may address many of the performance issues you cared about in comment #15? Only if you're querying both the basic fields *and* the auxiliary fields do you need the OUTER LEFT JOIN. (And in fact not even in some of those cases but it rapidly gets complex for diminishing returns to get more specific.)
Created attachment 286230 [details] [review] Add missing indices This adds the missing indices on family_name, given_name, nickname and file_as. Is this the right way to do it?
At http://git.infradead.org/users/dwmw2/evolution-data-server.git there are three patches, implementing the suggestions in comments #21, #20 and #19 in that order: EBookSqlite: Add indices on family_name, nickname, given_name and file_as EBookSqlite: Fix queries to use LEFT JOIN where appropriate EBookSqlite: Use UNION for autocomplete queries I've fixed up the phonebook-benchmarks suite to work with recent EDS and added the autocomplete query to it. I've run with three parallel versions: 3.12, "master" which has only the LEFT JOIN fix, and "custom" which has the UNION special case too. The results seem fairly promising — going back to LEFT JOIN obviously slows things down in some cases, but it's *correct* while the inner join was broken. And the UNION trick fixes up the majority of real-world queries that needed LEFT JOIN anyway, to be orders of magnitude faster than they were.
Pushed to master as commit 40b6d7f.