Bug 772522 – Search stops working after several tries

After an evaluation, GNOME has moved from Bugzilla to GitLab. Learn more about GitLab.
No new issues can be reported in GNOME Bugzilla anymore.
To report an issue in a GNOME project, go to GNOME GitLab.
Do not go to GNOME Gitlab for: Bluefish, Doxygen, GnuCash, GStreamer, java-gnome, LDTP, NetworkManager, Tomboy.

Bug 772522 - Search stops working after several tries


Summary:	Search stops working after several tries


Status:	RESOLVED FIXED

Product:	geary
Classification:	Other
Component:	search
Version:	master
Hardware:	Other Linux

Importance:	Urgent normal
Target Milestone:	0.11.3
Assigned To:	Michael Gratton
QA Contact:	Geary Maintainers

URL:
Whiteboard:

Depends on:
Blocks:

Reported:	2016-10-06 17:04 UTC by Gautier Pelloux-Prayer
Modified:	2016-12-15 08:06 UTC

See Also:
GNOME target:	---
GNOME version:	---

Attachments
geary search hanging logs (19.45 KB, text/plain) 2016-10-06 17:04 UTC, Gautier Pelloux-Prayer		Details
full threads backtrace (25.36 KB, text/plain) 2016-10-07 12:57 UTC, Gautier Pelloux-Prayer		Details
Workaround for the problem that may mis-index your messages (2.04 KB, patch) 2016-10-13 03:53 UTC, Michael Gratton	none	Details \| Review
Proposed workaround for master/geary-0.11 (5.71 KB, patch) 2016-10-14 01:19 UTC, Michael Gratton	none	Details \| Review

Description Gautier Pelloux-Prayer 2016-10-06 17:04:43 UTC

Created attachment 337083 [details]
geary search hanging logs

Here are the steps, using a gmail account:

0) Select your gmail account inbox for instance
1) Type <Ctrl>+S
2) Search for a random word -> emails are filtered matching the given word
3) Remove the word -> all most recent emails reappear in the conversations view
4) Repeat steps 2/3 until step 3 fails: all most recent emails do NOT reappear, and the spinner in bottom right corner of the left sidebar is spinning infinitely.
5) Change folder, or do whatever else: nothing happens, as is Geary service died somehow...

In logs, I can see:

> [deb] 18:55:00 50,896683 db-connection.vala:407: Connection.exec_transaction: Unable to commit transaction: (Connection.exec /home/XXX/.local/share/geary/XXX@gmail.com/geary.db) [err=5] - database is locked (COMMIT TRANSACTION)
> [deb] 18:56:00 59,714167 db-database.vala:278: Warning: unable to open database connection to /home/XXX/.local/share/geary/XXX@gmail.com/geary.db, cancelling AsyncJob: (Connection.exec /home/XXX/.local/share/geary/XXX@gmail.com/geary.db) [err=5] - database is locked (PRAGMA synchronous=normal)

For info I am running geary master on Debian, my account folder /home/XXX/.local/share/geary/XXX@gmail.com is 280MB and quite old.

What can I do for help?

Comment 1 Michael Gratton 2016-10-06 23:28:35 UTC

Hey Gautier,

I see something similar - here it's mainly noticeable when typing additional characters on an existing search. It seems less likely to happen when a background sync isn't running, and seems to happen more often after commit f80f52f, where the delay to start searching after the last key press when down from 250ms to 100ms.

My current theory is that something related to Geary's search isn't properly re-entrant, so when triggering a new search, by say changing the text while an existing search is still running, causes the deadlock. It might be SQLite's FTS implementation, or something that Geary is doing.

Debugging this is made a bit more annoying by Bug 713469, Bug 765515, and that SQLite apparently pauses for a while on first access of the MessageSearchTable FTS table when it first reads it (I found a ref to that in the code somewhere, not sure if its' still relevant though).

Something that might be useful is to run Geary under GDB and interrupt it (Ctrl+C in the GDB terminal) when search has stalled as you describe. Then do a "thread apply all bt" and see what threads are executing calls in SQLite at the time, or are execute search-related functions but are blocked waiting on semaphores or similar.

Comment 2 Gautier Pelloux-Prayer 2016-10-07 12:57:31 UTC

Created attachment 337163 [details]
full threads backtrace

> Something that might be useful is to run Geary under GDB and interrupt it (Ctrl+C in the GDB terminal) when search has stalled as you describe. Then do a "thread apply all bt" and see what threads are executing calls in SQLite at the time, or are execute search-related functions but are blocked waiting on semaphores or similar.

I've like 18 threads stuck in pthread_cond_wait, while 1 thread is stuck within sqlite3 FTS internals, so yeah there is something going one badly here.

In the same time, there is at least one bug in Geary: 

1) clearing the search text does not cancel the last search. Easy to reproduce by typing a single letter matching lots of mails ("r" for me), waiting 5 secs and removing it: the spinner continues as well as the query (but it never finishes).
2) (not sure here) when you modify the search text and the previous search is not finished, the new one is never triggered. Indeed you must type fast enough (<100ms probably)

These 2 items make me wonder if cancellable actually works for:

> imap-db-search-folder.vala:197
> private async void do_search_async(Geary.SearchQuery query, Gee.Collection<Geary.EmailIdentifier>? add_ids, Gee.Collection<ImapDB.SearchEmailIdentifier>? remove_ids, Cancellable? cancellable) throws Error"

I am not much confident with vala so I am unsure how it's supposed to work, and since the vala documentation (http://www.valadoc.org/#!wiki=index) is broken, it makes quite hard the debugging.

Still investigating though, since this bug is quite annoying in order to fully use Geary yet.

Comment 3 Michael Gratton 2016-10-08 14:22:23 UTC

Looking at this a bit, I may have figured out how to reproduce this.

1. Perform a search that causes a "Error highlighting search results" message to be reported in the debug log. The specific text string will vary depending on your database, but if you follow the steps in the description above you should come across it

2. Perform one more search after that - it should hang. Interrupting the process at that point should reveal a thread calling SQLite fts3* methods, and further down that same stack be executing in imap-db-account.vala (for me on line 1340).

I have a feeling that the SQLite error in the first step is causing subsequent FTS queries to hang.

Comment 4 Michael Gratton 2016-10-08 14:24:03 UTC

Oh PS: http://www.valadate.org:8000/ is working a bit better than valadoc.org at the moment.

Comment 5 Gautier Pelloux-Prayer 2016-10-10 11:12:52 UTC

I never have any "Error highlighting search results" error but that's one way to have a deadlock I think.
I executed my problematic request directly within sqlite3 (http://pastebin.com/sUCDxNMC) and it hung for more than 10 minutes (obviously, d* is a pretty vast regex!). Either geary should not search for single characters regex OR maybe we should use a timeout (https://www.sqlite.org/c3ref/busy_timeout.html for instance?) to avoid these endless requests?

Comment 6 Gautier Pelloux-Prayer 2016-10-10 11:28:56 UTC

> maybe we should use a timeout
Nevermind that's already in use and that's the thing that has an issue on my side I believe. Reducing it from 60 sec to 5 sec, I get the same unrecoverable errors:

>  [deb] 13:26:53 2,707434 db-connection.vala:407: Connection.exec_transaction: Unable to commit transaction: (Connection.exec /home/xxx/.local/share/geary/xxx@damsy.net/geary.db) [err=5] - database is locked (COMMIT TRANSACTION)
 [deb] 13:26:54 1,277926 imap-engine-minimal-folder.vala:641: Fetching STATUS for remote Gmail:xxx@gmail.com:INBOX (open_count=1 remote_opened=true) from local
 [deb] 13:26:54 0,001555 imap-engine-minimal-folder.vala:645: Fetching information for remote folder Gmail:xxx@gmail.com:INBOX (open_count=1 remote_opened=true)
 [deb] 13:26:54 0,000049 imap-engine-minimal-folder.vala:649: Opening remote folder INBOX
 [deb] 13:26:54 0,133552 imap-folder.vala:197: INBOX EXISTS 9289
 [deb] 13:26:54 0,000129 imap-folder.vala:232: INBOX RECENT 0
 [deb] 13:26:54 0,000577 imap-engine-minimal-folder.vala:182: Gmail:xxx@gmail.com:INBOX (open_count=1 remote_opened=true): Begin normalizing remote and local folders
 [deb] 13:26:54 0,002025 imap-engine-minimal-folder.vala:288: Gmail:xxx@gmail.com:INBOX (open_count=1 remote_opened=true): No messages added/removed since last opened, normalization completed
 [deb] 13:26:54 0,031774 imap-engine-email-prefetcher.vala:115: Listing all emails needing prefetching in Gmail:xxx@gmail.com:INBOX (open_count=1 remote_opened=true)...
 [deb] 13:26:54 0,005580 imap-engine-email-prefetcher.vala:118: Listed all emails needing prefetching in Gmail:xxx@gmail.com:INBOX (open_count=1 remote_opened=true)
 [deb] 13:26:58 3,570144 db-connection.vala:384: Connection.exec_transaction: unable to BEGIN IMMEDIATE: (Connection.exec /home/xxx/.local/share/geary/xxx@damsy.net/geary.db) [err=5] - database is locked (BEGIN IMMEDIATE)
 [deb] 13:26:58 0,000011 db-transaction-async-job.vala:42: AsyncJob: transaction completed with error: (Connection.exec /home/xxx/.local/share/geary/xxx@damsy.net/geary.db) [err=5] - database is locked (BEGIN IMMEDIATE)
 [deb] 13:26:58 0,038079 db-connection.vala:407: Connection.exec_transaction: Unable to commit transaction: (Connection.exec /home/xxx/.local/share/geary/xxx@damsy.net/geary.db) [err=5] - database is locked (COMMIT TRANSACTION)

I'll try to investigate which code triggers that timeout and try to figure out how to unlock the database properly…

Comment 7 Gautier Pelloux-Prayer 2016-10-11 15:49:22 UTC

OK, I finally investigated this and found somethings:

1) We cannot stop the first sqlite3 FTS3 search: 
    a) sqlite3 busy timeout is for modifiers statement only (UPDATE/INSERT/DELETE), not for SELECT ones.
    b) there is a sqlite3_interrupt() API, which could save us here, but unfortunately it does not interrupt the first FTS3 request from my tests (http://pastebin.com/1EtUGG03)
    c) there is nothing in Vala to abort a long-task thread (here db.exec()) and cancellable does not work since db.exec() is blocking.

BUT

2) I found the reason of my poor performances, or at least a fix: in sqlite, if  I rebuild my FTS3 search table using:

> insert into MessageSearchTable(MessageSearchTable) values('rebuild');

This fix my slow performance problems. By the way, I think it should fix yours too Mike because before running it, I got:

> insert into MessageSearchTable(MessageSearchTable) values('integrity-check');
> database disk image is malformed

So the solution is, I believe somewhat related to database vacuum:

1) we may invoke 'integrity-check' at geary startup for instance, and in case of error we should invoke 'rebuild' and recheck integrity?
2) there is also a 'optimize' command, maybe it would be useful to run it from time to time if there is no integrity check for instance?

Anyway rebuilding my database reduced the request d* time from … eternity? to < 200ms!

Comment 8 Gautier Pelloux-Prayer 2016-10-11 15:58:04 UTC

For info, here's what the rebuild reduces:

> select count(*) from MessageSearchTable_segments; //13932 vs 4118
> select count(*) from MessageSearchTable_segdir; //202 vs 66

Comment 9 Michael Gratton 2016-10-12 05:52:07 UTC

(In reply to Gautier Pelloux-Prayer from comment #7)
> OK, I finally investigated this and found somethings:

Gautier, as we like to say down under, you are a legend! Running the rebuild fixes the issues I was seeing as well.

Now that you point out those commands, it seems from the docs that the custom tokenizer changes Geary makes may be the cause Bug 765515.

From the docs, it also seems that these commands could be long running. So it would be good to get a sense of whether these can be run on a regular basis and/or in the background, or if they should be done infrequently.

The optimize command is explicitly mentioned as being expensive, so may be only worth doing when Geary intermittently executes a VACCUM at startup.

We should ensure that any DB migration scripts that alter the tokenizer also run the rebuild command. It would also be worth adding in a new DB version that does a rebuild for everyone anyway, since a number of users will be suffering this problem.

Finally, it would be good to get some timings of how long it takes to run integrity-check on large databases, to see if it's feasible to run at every startup. It might be worth recruiting people on the mailing list with large databases to time running it.

For me, on an SSD, with a warm cache, it takes from ~0.3s for a 655MB account database through ~2.1s for a 3.5GB database, which might make it a bit too slow.

Comment 10 Federico Bruni 2016-10-12 06:24:16 UTC

Gautier, you rock!
Search became totally unusable for me in the last month (the only thing working was the "search message from this user", introduced by GTK+).
Now it works again.

Here's what the rebuild reduces for me:

> select count(*) from MessageSearchTable_segments; //8554 vs 3357
> select count(*) from MessageSearchTable_segdir; //165 vs 90

Comment 11 Gautier Pelloux-Prayer 2016-10-12 12:28:37 UTC

Hmm, strangely for me today I cannot execute these commands anymore.

> insert into MessageSearchTable(MessageSearchTable) values('integrity-check');
> Error: database disk image is malformed
> insert into MessageSearchTable(MessageSearchTable) values('rebuild');
> Error: database disk image is malformed
> vacuum;
> Error: database disk image is malformed

(done on a backup .db file).

Also my geary.db for Geary is (again!) corrupted, less than 12hours after rebuilding it… it might be due to Geary, or maybe my environment.

I am not sure what's going on yet, so maybe we should wait a bit before asking benchmarks on the mailing list!

Comment 12 Michael Gratton 2016-10-12 22:32:41 UTC

Yeah, you're right. Mine were pretty quicky re-corrupted as well.

A few more data points below. These were made by doing a rebuild, starting Geary, making the observation, then quitting Geary and doing an integrity-check.

1. The problem doesn't seem to be Geary.ImapDB.Account::populate_search_table_async - commenting that out still sometimes yields a corrupt MessageSearchTable.

2. I have not observed MST becoming corrupt if no new messages are received.

3. I have observed MST becoming corrupt after some new messages are received, but not others:
  a) One message I received that did show a corrupt MST afterwards was a reasonably lengthy HTML spam. Forwarding that to another account afterwards also yielded the other account's DB becoming corrupted.
  b) I also sent myself a few short, terse, text/plain messages that after receiving them did not end up with a corrupt MST.

Comment 13 Michael Gratton 2016-10-12 22:42:35 UTC

4. I just received some more HTML email that didn't yield a corrupt MST, so it's not HTML that is the problem.

I'll see if I can work out what's up with that particular message in 3a.

Comment 14 Michael Gratton 2016-10-13 03:53:30 UTC

Created attachment 337549 [details] [review]
Workaround for the problem that may mis-index your messages

Okay, think I've found the source of the bug.

For most messages appended to a folder, one of the UPDATE statements (for the subject, from or cc/bcc) in Geary.ImapDB.Folder::do_merge_email_in_search_table() will cause the FTS table to be corrupted. This makes it seem like a SQLite bug, and I've reported it to the SQLite developers.

A workaround is to instead of performing the UPDATEs, just drop the row and re-INSERT it. The attached patch does just that, however it may cause messages to be incorrectly-indexed, so I would suggest only using it to verify it fixes the problem for you at the moment.

I'm not too familiar with this part of the engine codebase, but strangely it seems that when a message is appended to a folder by the server, after Geary adds it to the FTS table it then also updates it once or twice as well. When adding the message for this first time, all FTS fields are blank, and they are subsequently filled in by latter updates. It would be good to work out if some of these can be avoided to reduce the file I/O overhead.

Comment 15 Michael Gratton 2016-10-14 01:19:22 UTC

Created attachment 337679 [details] [review]
Proposed workaround for master/geary-0.11

I'm proposing this as the official workaround to prevent the FTS table being corrupted.

Guys, can you try applying this and let me know if stops the FTS table from being corrupted further? Note it won't fix an already-corrupt table (we'll need a DB migration for that) so don't forget to manually run a rebuild command after applying this patch and rebuilding.

Comment 16 Gautier Pelloux-Prayer 2016-10-14 12:10:48 UTC

> Guys, can you try applying this and let me know if stops the FTS table from being corrupted further?

So far so good, tested it since yesterday and none of my accounts got corrupted yet! I'll inform you if that changes.

Comment 17 Federico Bruni 2016-10-17 08:37:22 UTC

Same here: tested since Friday on two accounts and none got corrupted.

Comment 18 Michael Gratton 2016-10-21 21:19:49 UTC

Thanks guys. Commit b375a41 and commit 3e721f pushed to master, with the workaround and a DB upgrade, respectively.

I'll merge this to geary-0.11 after that's had some testing there.

Comment 19 Michael Gratton 2016-12-15 08:06:29 UTC

These have been merged to geary-0.11 as commit b9b69b9, commit 358afb7 and commit 97cd173.