Bug 699597 – Autocomplete using GAL is slow

After an evaluation, GNOME has moved from Bugzilla to GitLab. Learn more about GitLab.
No new issues can be reported in GNOME Bugzilla anymore.
To report an issue in a GNOME project, go to GNOME GitLab.
Do not go to GNOME Gitlab for: Bluefish, Doxygen, GnuCash, GStreamer, java-gnome, LDTP, NetworkManager, Tomboy.

Bug 699597 - Autocomplete using GAL is slow


Summary:	Autocomplete using GAL is slow


Status:	RESOLVED FIXED

Product:	evolution-data-server
Classification:	Platform
Component:	Contacts
Version:	3.12.x (obsolete)
Hardware:	Other Linux

Importance:	Normal normal
Target Milestone:	---
Assigned To:	evolution-addressbook-maintainers
QA Contact:	Evolution QA team

URL:
Whiteboard:

Duplicates:	736516 (view as bug list)
Depends on:
Blocks:

Reported:	2013-05-03 16:03 UTC by David Woodhouse
Modified:	2014-10-09 12:24 UTC

See Also:
GNOME target:	---
GNOME version:	---

Attachments
Fix sqlite to use JOIN (995 bytes, patch) 2013-05-16 15:14 UTC, David Woodhouse	none	Details \| Review
Patch fixed to apply OUTER LEFT JOIN on all related queries (2.38 KB, patch) 2013-05-17 09:59 UTC, Tristan Van Berkom	committed	Details \| Review
Hack to use UNION where it's simple (5.63 KB, patch) 2014-09-12 20:40 UTC, David Woodhouse	none	Details \| Review
Add missing indices (2.67 KB, patch) 2014-09-15 19:47 UTC, David Woodhouse	none	Details \| Review

Description David Woodhouse 2013-05-03 16:03:08 UTC

If I switch to the Contacts pane in the UI and do a simple 'Name contains...' search in the GAL it's very fast. But an 'Any field contains...' search takes about 12 seconds. That's expected, since the latter is searching fields that aren't in the database summary.

However: if I enter a partial name/address into the To: field in the composer, it takes the *same* 12 seconds to come up with a list of possible completions.
That's less acceptable.

I'm guessing that the autocomplete query is trying to match on some field that *isn't* in the sqlitedb summary, and thus it's falling back to the slow search that builds *all* the VCARDs and runs the search over them all.

Whatever the default autocomplete in the composer is doing, we should make sure those fields are indexed...

Comment 1 David Woodhouse 2013-05-16 12:02:30 UTC

The search is of the form ' (or (beginswith "nickname"  "asd") (beginswith "email"  "asd") (beginswith "full_name"  "asd") (beginswith "file_as"  "asd") )'

It looks like the GAL summary contains 'email_1', 'email_2', 'email_3' and 'email_4' fields, but no 'email'. That's why it falls back to a full search.

Comment 2 Tristan Van Berkom 2013-05-16 12:25:45 UTC

Is the problem clear or still unsolved ?

Looks obvious to me... the summary was created with email_[1-4] fields
stored in the summary, instead of with the multivalued 'email' field.

The query asks for the multivalued 'email' field, which is not in the
summary... so it falls back on parsing the full vcards to do the search.

Is this undesirable ? sure.

Are new summaries created with this email_[1-4] fields ? if so, that
should probably be changed to use the new default summary format.

Is this happening specifically for SQLite databases which were created
in previous versions and then upgraded ?

In any case, to improve performance of newer searches performed on
older db's, perhaps some hack should be added to the SQLite ?

i.e.:

if (query == "email") { 
   check (email_1 & email_2 & email_3 & email_4);
}

But even then, it would be incorrect, because the multivalued "email"
field can refer to other email addresses which are not email_1, email_2, email_3 or email_4.

I.e. the 'email' field can refer to any email... emails can be expressed
in vcards in any of the following ways:

EMAIL:my@address.com
EMAIL;TYPE=work:my@address.com
EMAIL;TYPE=home,farm:my@address.com
EMAIL;TYPE=grasshopper:grass@hopper.com

Comment 3 David Woodhouse 2013-05-16 13:07:39 UTC

Yes, this was originally generated with an older version of Evolution and then upgraded. It looks like this:

CREATE TABLE 'afe69797-f428-4fb8-bc52-b5bdb3559d8a:Global Address List' ( uid TEXT PRIMARY KEY, rev TEXT, file_as TEXT, nickname TEXT, full_name TEXT, given_name TEXT, family_name TEXT, email_1 TEXT, email_2 TEXT, email_3 TEXT, email_4 TEXT, is_list INTEGER, list_show_addresses INTEGER, wants_html INTEGER, vcard TEXT, bdata TEXT);

Is there no migration path on upgrade to fix this?

If I remove the contacts.db file and allow it to be recreated, it looks like this:

CREATE TABLE 'afe69797-f428-4fb8-bc52-b5bdb3559d8a:Global Address List' ( uid TEXT PRIMARY KEY, Rev TEXT, file_as TEXT, nickname TEXT, full_name TEXT, given_name TEXT, family_name TEXT, is_list INTEGER, list_show_addresses INTEGER, wants_html INTEGER, vcard TEXT, bdata TEXT);

We just use e_book_backend_sqlitedb_new() to create it. How come it ended up with *no* email fields this time?

And once we fix this so it actually has the expected 'email' field, I'm not sure it'll get any *data* in it. We still only ever populate the E_CONTACT_EMAIL_1 field for individuals (although it looks like we do fill in E_CONTACT_EMAIL for distribution lists... which means that once all this is fixed, a search for 'dwoodhou' might end up listing every distribution list that I'm on?)

But one thing at a time... do we just to implement a migration path for the sqlitedb to update the summary fields, and why isn't it getting the right fields even when it's newly-created? Oops, that's two :)

Comment 4 David Woodhouse 2013-05-16 13:56:06 UTC

OK, that "fixed" it. It's now trying to use sqlite for this query instead of falling back to the "slow" method of generating all the vcards and analysing them, which took all of 12 seconds.

It is now a *lot* slower — evolution-addressbook-factory is eating 100% of CPU time and doesn't seem to be making a lot of progress at all.

+ Trace 231950

#0 __memcpy_ssse3_back
at ../sysdeps/x86_64/multiarch/memcpy-ssse3-back.S line 2649
#1 memcpy
at /usr/include/bits/string3.h line 51
#2 sqlite3VdbeMemGrow
at sqlite3.c line 57670
#3 sqlite3VdbeMemMakeWriteable
at sqlite3.c line 57702
#4 sqlite3VdbeExec
at sqlite3.c line 65174
#5 sqlite3Step
at sqlite3.c line 62394
#6 sqlite3_step
at sqlite3.c line 62468
#7 sqlite3_exec
#8 book_backend_sql_exec_real
#9 book_backend_sql_exec
#10 book_backend_sqlitedb_search_query
#11 e_book_backend_sqlitedb_search
#12 fetch_from_offline
#13 e_book_backend_ews_start_view
at e-book-backend-ews.c line 2676
#14 bookview_start_thread
at e-data-book-view.c line 235
#15 g_thread_proxy
at gthread.c line 798
#16 start_thread
at pthread_create.c line 308
#17 clone
at ../sysdeps/unix/sysv/linux/x86_64/clone.S line 113

Comment 5 David Woodhouse 2013-05-16 13:57:16 UTC

FWIW if I take a copy of contacts.db and run that same query using sqlite3 from the command line, I get much the same behaviour:

$ time  sqlite3 contacts2.db  "SELECT DISTINCT summary.uid, vcard, bdata FROM 'afe69797-f428-4fb8-bc52-b5bdb3559d8a:Global Address List' AS summary, 'afe69797-f428-4fb8-bc52-b5bdb3559d8a:Global Address List_lists' AS multi WHERE ( (summary.nickname IS NOT NULL AND (summary.nickname LIKE 'dwoodhou%')) OR (summary.uid = multi.uid AND multi.field = 'email' AND (multi.value LIKE 'dwoodhou%')) OR (summary.full_name IS NOT NULL AND summary.full_name LIKE 'dwoodhou%') OR (summary.family_name IS NOT NULL AND summary.family_name LIKE 'dwoodhou%') OR (summary.given_name IS NOT NULL AND summary.given_name LIKE 'dwoodhou%') OR (summary.nickname IS NOT NULL AND summary.nickname LIKE 'dwoodhou%') OR (summary.file_as IS NOT NULL AND (summary.file_as LIKE 'dwoodhou%')) )"


Now I have *two* CPUs pegged in sqlite code (on different files), and neither of them appears to be making any progress.

Comment 6 David Woodhouse 2013-05-16 15:03:31 UTC

Aha, the problem here is the search query. It currently looks like this:


SELECT DISTINCT summary.uid, vcard, bdata
FROM 'afe69797-f428-4fb8-bc52-b5bdb3559d8a:Global Address List' AS summary,
     'afe69797-f428-4fb8-bc52-b5bdb3559d8a:Global Address List_lists' AS multi
WHERE ( (summary.nickname IS NOT NULL AND (summary.nickname LIKE 'dwoodhou%'))
     OR (summary.uid = multi.uid AND multi.field = 'email' AND (multi.value LIKE 'dwoodhou%')) 
     OR (summary.full_name IS NOT NULL AND summary.full_name LIKE 'dwoodhou%')
     OR (summary.family_name IS NOT NULL AND summary.family_name LIKE 'dwoodhou%')
     OR (summary.given_name IS NOT NULL AND summary.given_name LIKE 'dwoodhou%')
     OR (summary.nickname IS NOT NULL AND summary.nickname LIKE 'dwoodhou%')
     OR (summary.file_as IS NOT NULL AND (summary.file_as LIKE 'dwoodhou%'))
 );


However, that's massively inefficient, and also wrong. What it does is build a data set consisting of every possible combination of one row from the 'summary' table, and one row from the 'multi' table. In my case, there are about 216022 rows in each. So that's a data set of 46 milliard combinations.

Then it looks for the one where 'summary.nickname like dwoodhou%' and finds 216022 of them. It output 216022 *identical* results (uid, vcard, bdata). And then the 'DISTINCT' kicks in and it filters them all back down to a single line of output.

(I use the past tense here with artistic licence. None of these queries has *actually* completed yet; they've been eating CPU for the last hour or two.)

This is not only stunningly inefficient, but it's also broken. In the case where there are *no* entries in the 'multi' table, the search will operate on *no* data and find nothing.

sqlite> create table foo (uid text primary key, name text);
sqlite> create table foo_lists (uid text, email text);
sqlite> insert into foo VALUES('alice', 'alice');
sqlite> insert into foo VALUES('bob', 'bob');
sqlite> insert into foo VALUES('charlie', 'charlie');
sqlite> select * from foo as foo, foo_lists as foo_lists;
sqlite>

What you actually want here is a LEFT OUTER JOIN:

sqlite> select * from foo LEFT OUTER JOIN foo_lists ON foo.uid = foo_lists.uid;
alice|alice||
bob|bob||
charlie|charlie||
sqlite>

sqlite> insert into foo_lists values('alice', 'alice2@example.com');
sqlite> insert into foo_lists values('alice', 'alice@example.com');
sqlite> select * from foo LEFT OUTER JOIN foo_lists ON foo.uid = foo_lists.uid;
alice|alice|alice|alice2@example.com
alice|alice|alice|alice@example.com
bob|bob||
charlie|charlie||
sqlite>

This is the query you want:

SELECT DISTINCT summary.uid, vcard, bdata
FROM 'afe69797-f428-4fb8-bc52-b5bdb3559d8a:Global Address List' as summary LEFT OUTER JOIN 
     'afe69797-f428-4fb8-bc52-b5bdb3559d8a:Global Address List_lists' as multi ON summary.uid = multi.uid
WHERE ( (summary.nickname IS NOT NULL AND (summary.nickname LIKE 'dwoodhou%'))
     OR (multi.field = 'email' AND (multi.value LIKE 'dwoodhou%')) 
     OR (summary.full_name IS NOT NULL AND summary.full_name LIKE 'dwoodhou%')
     OR (summary.family_name IS NOT NULL AND summary.family_name LIKE 'dwoodhou%')
     OR (summary.given_name IS NOT NULL AND summary.given_name LIKE 'dwoodhou%')
     OR (summary.nickname IS NOT NULL AND summary.nickname LIKE 'dwoodhou%')
     OR (summary.file_as IS NOT NULL AND (summary.file_as LIKE 'dwoodhou%'))
 );

Comment 7 David Woodhouse 2013-05-16 15:14:29 UTC

Created attachment 244423 [details] [review]
Fix sqlite to use JOIN

This should fix it.

Comment 8 David Woodhouse 2013-05-16 15:45:42 UTC

By dropping all the 'foo IS NOT NULL' checks which seem redundant, I can drop the query time even further. It's about 1.16 seconds with them, and drops to about 1.06 seconds without.

Comment 9 David Woodhouse 2013-05-16 15:58:48 UTC

Still not sure about the original upgrade problem though. Looking back at my original contacts.db file before I moved it out of the way to let it get recreated, it *did* have the 'afe69797-f428-4fb8-bc52-b5bdb3559d8a:Global Address List_lists' table. Just nothing *in* it. Despite the fact that the GAL has been updated a number of times since the Evolution upgrade, and new entries have been inserted.

Why didn't the table get populated?

Comment 10 Tristan Van Berkom 2013-05-17 09:59:36 UTC

Created attachment 244496 [details] [review]
Patch fixed to apply OUTER LEFT JOIN on all related queries

This is essentially David Woodhouse's patch modified so that is
also works for fetching uids (modified all related queries to use
the OUTER LEFT JOIN).

With this patch the test cases pass in tests/libebook/client/

Comment 11 Tristan Van Berkom 2013-05-17 10:14:58 UTC

(In reply to comment #9)
> Still not sure about the original upgrade problem though. Looking back at my
> original contacts.db file before I moved it out of the way to let it get
> recreated, it *did* have the 'afe69797-f428-4fb8-bc52-b5bdb3559d8a:Global
> Address List_lists' table. Just nothing *in* it. Despite the fact that the GAL
> has been updated a number of times since the Evolution upgrade, and new entries
> have been inserted.
> 
> Why didn't the table get populated?

This is because create_contacts_table() is a little bit messy for now,
there are a hand full of requirements and to get it right/clean, it
really needs some refactoring...

In particular, notice how create_contacts_table() calls introspect_summary()
after the 'CREATE TABLE IF NOT EXISTS' statements, this has the side effect
of creating the '_lists' table even if it's not going to be used.

https://git.gnome.org/browse/evolution-data-server/tree/addressbook/libedata-book/e-book-backend-sqlitedb.c#n882

The introspect_summary() function will load up the internal state of
EBookBackendSqliteDB so that it knows which fields the SQLite is really
populated with (i.e. the fields which were in the summary when the addressbook
was first created).

Since your addressbook was originally created without the 'email' field
and with the 'email_1', 'email_2', 'email_3' and 'email_4' fields in the
summary, that's what it will use.

Also, note that even though the default schema has changed since 3.8,
there is not a schema upgrade... (we don't really have any indication
as to whether the old schema is explicitly configured and should be
kept, or if we should for some reason drop the old schema and use a
new one).

However there is a renormalization process at upgrade time (this doesnt
change the schema but does change the stored data which needs to be stored
in lower case with no accents, so that we can use case sensitive LIKE 
statements).

Comment 12 David Woodhouse 2013-05-17 10:20:58 UTC

(In reply to comment #11)
>  (we don't really have any indication
> as to whether the old schema is explicitly configured and should be
> kept, or if we should for some reason drop the old schema and use a
> new one).

Apart from the explicit list of fields which is passed to e_book_backend_sqlitedb_new_internal(), you mean?

Comment 13 David Woodhouse 2013-05-20 13:21:49 UTC

Comment on attachment 244496 [details] [review]
Patch fixed to apply OUTER LEFT JOIN on all related queries

I've committed the SQL patch. Leaving the bug open for the original problem — we need to fix migration so that it doesn't end up doing the full query.

Comment 14 David Woodhouse 2013-05-20 13:23:33 UTC

From IRC discussion, we have agreed that the way forward is to check for the old email_[1234] fields in the summary, and to upgrade the summary to use the 'emails' multivalue field if we find them.

 <tristan> I would say, first get to the point in create_contacts_table() *after* introspecting the summary setup... and then if the summary contains email_[1-4] fields, and no mention of 'email' in the multivalues introspected list... then do the migration... and update the multivalues field to mention 'email' so that it's properly introspected on the next go

Comment 15 Tristan Van Berkom 2013-11-13 09:57:53 UTC

We've run into some issues with this change.

It would seem that while using OUTER LEFT JOIN for extremely huge
addressbooks improves performance dramatically (as the former inner
join did not make sense in this case).

It has actually had negative impacts on regular sized addressbooks,
such as those used for personal email accounts of contact lists
on hand phones.

This is because while using the OUTER LEFT JOIN statement, not
all rows on the right hand table of the JOIN statement are guaranteed
to exist in the resultint set... Because of the way the tables are
now joined, the indexes in the multi-valued table (on the 'value' column)
are no longer of any use.

#
# Exact email search with OUTER LEFT JOIN
#
# Note full table scan
#
SELECT DISTINCT summary.uid, vcard, bdata FROM 'folder_id' AS summary LEFT OUTER JOIN 'folder_id_lists' AS multi ON summary.uid = multi.uid WHERE (multi.field = 'email' AND (multi.value = 'micheal@jackson.com'))
  DEBUG BEGIN:
    NAME: 'selectid' VALUE: 0
    NAME: 'order' VALUE: 0
    NAME: 'from' VALUE: 0
    NAME: 'detail' VALUE: SCAN TABLE folder_id AS summary (~1000000 rows)
  DEBUG END
  DEBUG BEGIN:
    NAME: 'selectid' VALUE: 0
    NAME: 'order' VALUE: 1
    NAME: 'from' VALUE: 1
    NAME: 'detail' VALUE: SEARCH TABLE folder_id_lists AS multi USING INDEX VALINDEX (value=?) (~2 rows)
  DEBUG END
  DEBUG BEGIN:
    NAME: 'selectid' VALUE: 0
    NAME: 'order' VALUE: 0
    NAME: 'from' VALUE: 0
    NAME: 'detail' VALUE: USE TEMP B-TREE FOR DISTINCT
  DEBUG END
DEBUG STATEMENT END

#
# Exact email search with regular inner join
#
# Note binary search on VALINDEX
#
SELECT DISTINCT summary.uid, vcard, bdata FROM 'folder_id' AS summary, 'folder_id_lists' AS multi WHERE (summary.uid = multi.uid AND multi.field = 'email' AND (multi.value = 'micheal@jackson.com'))
  DEBUG BEGIN:
    NAME: 'selectid' VALUE: 0
    NAME: 'order' VALUE: 0
    NAME: 'from' VALUE: 1
    NAME: 'detail' VALUE: SEARCH TABLE folder_id_lists AS multi USING INDEX VALINDEX (value=?) (~2 rows)
  DEBUG END
  DEBUG BEGIN:
    NAME: 'selectid' VALUE: 0
    NAME: 'order' VALUE: 1
    NAME: 'from' VALUE: 0
    NAME: 'detail' VALUE: SEARCH TABLE folder_id AS summary USING INDEX sqlite_autoindex_folder_id_1 (uid=?) (~1 rows)
  DEBUG END
  DEBUG BEGIN:
    NAME: 'selectid' VALUE: 0
    NAME: 'order' VALUE: 0
    NAME: 'from' VALUE: 0
    NAME: 'detail' VALUE: USE TEMP B-TREE FOR DISTINCT
  DEBUG END
DEBUG STATEMENT END: Success

A suggestion has been made to change this query so that a nested
select is used, at least in the case where the query constraints 
are only interested in the multi-valued table.

The suggested patch makes such queries run as follows:

SELECT 
        DISTINCT summary.uid, vcard, bdata 
FROM 
        (
          SELECT 
                DISTINCT multi.uid 
          FROM 
                'folder_id_lists' AS multi 
          WHERE 
                (multi.field = 'email' AND
                 multi.value = 'micheal@jackson.com')
        ) AS multi 
LEFT OUTER JOIN 
        'folder_id' AS summary ON summary.uid = multi.uid

This is an awkward statement for our query generator but will
suffice as an optimization at least for cases where only
the multi-valued table is in use.

I'd like to measure the results of this properly before
proceeding, in the meanwhile I'm open to comments or
suggestions on alternative ways to optimize this.

Note that SQLite does not support FULL JOINs

Comment 16 Tristan Van Berkom 2013-12-01 14:12:45 UTC

I wont close this just yet, as I'm not sure what GAL is at all
(I suspect it has something to do with EWS, but then it would be
EWS, right ? not GAL ?)

Anyway, some explanation of what's happened in master which should
be of help:

  o EBookBackendSqliteDB is deprecated in favor of the new
    rewrite; EBookSqlite.

    To fix this, you will need to port code to use EBookSqlite

  o EBookSqlite automatically configures the 'email' attribute-list
    contact field in an upgraded database previously created
    with EBookBackendSqliteDB

    It does so for addressbooks which are found to have been
    using the default configuration previously.

    The check to do this is a bit silly but should work fine,
    we check if email_[1-4] was in the summary configuration
    and if so, we replace that with 'email'

  o Queries now have a new form:

    SELECT <data> FROM <summary_table> AS summary
    JOIN <email_list_table> AS email_list ON +email_list.uid = summary.uid
    WHERE <constraints>

    The JOIN statement above will be repeated (or omitted) for each 
    table which needs to be in context for the <constraints>

    The '+' in the JOIN statement ensures that email_list.uid is
    NOT eligable to participate in the indexing which will later
    occur in <constraints>... typically prefix indexing which would
    occur on the email_list.value column.

    This is the correct technique that I was recommended to use
    by the experts on sqlite-users mailing list.


Benchmarks of the results with the new code base can be found here:
  https://people.gnome.org/~tvb/eds-benchmarks-30-11-2013/

There are two additional optimizations which can help out if these
queries start to become slow when, you know... David decides to upgrade
his addressbook from having only 220,000 contacts to about 5 or 20 million
contacts.

Those are:

  o Instead of constraints formed as

        WHERE <column> LIKE 'foo%'

    They can be formed as:

        WHERE <column> BETWEEN 'foo' AND 'foo}'

    The above is of course inappropriate for EDS as
    it makes assumptions on what kind of input will
    be given to queries, but it's worth noting.

    Another possible optimization is, instead of using
    the above '+' statement, to create indexes on the attribute
    lists table as such:

    CREATE INDEX 'field_uid_value' ON <email_list_table> (uid, value)

    By creating this special index, SQLite will create an index
    which combines both email_list.uid and email_list.value 

    This would probably be faster than only indexing on email_list.value,
    you know, for addressbooks with millions and millions of contacts ;-)

Comment 17 David Woodhouse 2014-09-12 20:35:50 UTC

GAL is the EWS Global Address List — the company database. I've ported it to EBookSqlite and done some testing.

Firstly, in response to your comment #15: That's broken. You *needed* the OUTER LEFT JOIN. I've committed a (failing) test case which demonstrates the problem:

https://git.gnome.org/browse/evolution-data-server/commit/?id=5f9f5b52

Brief explanation: an inner join will only include results where there is a *match* between the tables. If a record in folder_id doesn't have a corresponding record in folder_id_email_list, then it won't be seen.

In that commit you can see some performance analysis of the join approaches, and basically they *all* suck. The way to write this query and get it to work fast is by using UNION, so the searches which are on simple non-list fields in the folder_id table can just work directly on the folder_id table. And use the indices on it (which we really ought to create, btw).

And as an added bonus, if you're now *only* using the 'folder_id JOIN folder_id_email_list' construct for looking up email addresses, then you *can* use an inner join:

SELECT DISTINCT summary.uid, summary.vcard, summary.bdata
  FROM  'folder_id' AS summary
  JOIN  'folder_id_email_list' AS email_list
  ON    +email_list.uid = summary.uid
  WHERE (email_list.value IS NOT NULL AND email_list.value LIKE 'dave%')
UNION
SELECT summary.uid, summary.vcard, summary.bdata
  FROM  'folder_id' AS summary
  WHERE ((summary.nickname IS NOT NULL AND summary.nickname LIKE 'dave%')
      OR ((summary.full_name IS NOT NULL AND summary.full_name LIKE 'dave%')
       OR (summary.family_name IS NOT NULL AND summary.family_name LIKE 'dave%')
       OR (summary.given_name IS NOT NULL AND summary.given_name LIKE 'dave%')
       OR (summary.nickname IS NOT NULL AND summary.nickname LIKE 'dave%'))
      OR (summary.file_as IS NOT NULL AND summary.file_as LIKE 'dave%'));

Comment 18 David Woodhouse 2014-09-12 20:36:29 UTC

*** Bug 736516 has been marked as a duplicate of this bug. ***

Comment 19 David Woodhouse 2014-09-12 20:40:16 UTC

Created attachment 286092 [details] [review]
Hack to use UNION where it's simple

This patch is a hack to make it use UNION for queries which are sufficiently similar to the addressbook autocompletion.

If all the constraints are ORed together, and if all the substantive constraints are 'beginswith' (although that could be relaxed actually), and if it's using both auxiliary tables *and* fields in the primary 'folder_id' table, then render the query using UNION instead.

It's a hack, but it does give me a *dramatic* speedup for autocompletion in the UI. Left here as a proof of concept and a reminder for when we're in a position to fully do the performance analysis again...

Comment 20 David Woodhouse 2014-09-12 21:20:45 UTC

A further comment about the LEFT OUTER JOIN, as I'm trying to summarise our collected knowledge about this until we have time to address it in detail and run the benchmark suites...

Note that the LEFT OUTER JOIN fixes the problem of omitting results where the record in folder_id doesn't *have* any overlap with the auxiliary table (i.e. no email or no phone number, depending on the search).

There are some queries where that wasn't a problem. For example, one 'hot' query might be finding a vcard which contains a given phone number. And you *don't* have to use an outer join for that; the inner join is perfectly sufficient since you obviously aren't going to *want* to return any records that don't have a phone number associated with them.

So alongside the aux_mask in the PreflightContext, you could also have a boolean which indicates whether any of the 'basic' fields from folder_id have been used in queries. If *not*, then you can get away with an inner join — which may address many of the performance issues you cared about in comment #15? Only if you're querying both the basic fields *and* the auxiliary fields do you need the OUTER LEFT JOIN. (And in fact not even in some of those cases but it rapidly gets complex for diminishing returns to get more specific.)

Comment 21 David Woodhouse 2014-09-15 19:47:20 UTC

Created attachment 286230 [details] [review]
Add missing indices

This adds the missing indices on family_name, given_name, nickname and file_as.
Is this the right way to do it?

Comment 22 David Woodhouse 2014-09-25 23:25:50 UTC

At http://git.infradead.org/users/dwmw2/evolution-data-server.git there are three patches, implementing the suggestions in comments #21, #20 and #19 in that order:

      EBookSqlite: Add indices on family_name, nickname, given_name and file_as
      EBookSqlite: Fix queries to use LEFT JOIN where appropriate
      EBookSqlite: Use UNION for autocomplete queries

I've fixed up the phonebook-benchmarks suite to work with recent EDS and added the autocomplete query to it. I've run with three parallel versions: 3.12, "master" which has only the LEFT JOIN fix, and "custom" which has the UNION special case too.

The results seem fairly promising — going back to LEFT JOIN obviously slows things down in some cases, but it's *correct* while the inner join was broken. And the UNION trick fixes up the majority of real-world queries that needed LEFT JOIN anyway, to be orders of magnitude faster than they were.

Comment 23 David Woodhouse 2014-10-09 12:24:23 UTC

Pushed to master as commit 40b6d7f.