GNOME Bugzilla – Bug 640054
sqlite summary hang ...
Last modified: 2011-09-13 07:06:43 UTC
I get this fairly regularly - as you can see, the stability of 2.32 has improved, as is visible in the huge thread number (due to the IMAP-X thread-joining leak, no doubt):
+ Trace 225605
all other threads quiescent.

(gdb) thread 2323
(gdb) finish
...
Run till exit from #0  0xb68d0191 in sqlite3_exec () from /usr/lib/libsqlite3.so.0
cdb_sql_exec (db=0xa80945c0, stmt=0xa6987a88 "UPDATE 'Sent_version' SET version='2'", error=0x0) at camel-db.c:372
372             while (ret == SQLITE_BUSY || ret == SQLITE_LOCKED || ret == -1) {
(gdb) finish
Run till exit from #0  cdb_sql_exec (db=0xa80945c0, stmt=0xa6987a88 "UPDATE 'Sent_version' SET version='2'", error=0x0) at camel-db.c:372
(evolution:15846): libebook-DEBUG: GDBus connection is closed, remote peer vanished

This simply never completes ... It seems that:

Run till exit from #0  0xb68d0111 in sqlite3_exec () from /usr/lib/libsqlite3.so.0
cdb_sql_exec (db=0xa80945c0, stmt=0xa6987a88 "UPDATE 'Sent_version' SET version='2'", error=0x0) at camel-db.c:372
372             while (ret == SQLITE_BUSY || ret == SQLITE_LOCKED || ret == -1) {
(gdb) l
367             gint ret = -1;
368
369             d(g_print("Camel SQL Exec:\n%s\n", stmt));
370
371             ret = sqlite3_exec(db, stmt, 0, 0, &errmsg);
372             while (ret == SQLITE_BUSY || ret == SQLITE_LOCKED || ret == -1) {
373                     if (errmsg) {
374                             sqlite3_free (errmsg);
375                             errmsg = NULL;
376                     }
(gdb) p ret
$1 = 5

i.e. SQLITE_BUSY is constantly returned here. Which is rather odd - I'm only running one evo, and it can only be busy with itself: is there some lock ordering problem? Or an sqlite bug?

(gdb) thread 2339 has the same problem: SQLITE_BUSY coming back from sqlite3_exec in a tight loop.
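For reference, SQLITE_BUSY (5) does not need a second process: two connections to the same database file within one process are enough, and without a busy handler sqlite3_exec() returns SQLITE_BUSY immediately, so a retry loop like the one above just spins. A minimal standalone sketch (hypothetical file name, plain sqlite3 API, not Camel code):

/* Minimal sketch: two connections to the same database in one process.
 * Hypothetical example, not Camel code. Build with -lsqlite3. */
#include <sqlite3.h>
#include <stdio.h>

int main (void)
{
	sqlite3 *a = NULL, *b = NULL;
	char *err = NULL;
	int ret;

	sqlite3_open ("test.db", &a);   /* hypothetical file name */
	sqlite3_open ("test.db", &b);

	sqlite3_exec (a, "CREATE TABLE IF NOT EXISTS t (v INT)", NULL, NULL, NULL);

	/* Connection 'a' opens a write transaction and keeps it open ... */
	sqlite3_exec (a, "BEGIN IMMEDIATE", NULL, NULL, NULL);

	/* ... so connection 'b' gets SQLITE_BUSY (5) for its write, and will
	 * keep getting it for as long as 'a' holds the transaction - which is
	 * what the loop in cdb_sql_exec() is spinning on. */
	ret = sqlite3_exec (b, "UPDATE t SET v = 2", NULL, NULL, &err);
	printf ("ret = %d (%s)\n", ret, err ? err : "ok");
	sqlite3_free (err);

	sqlite3_exec (a, "COMMIT", NULL, NULL, NULL);
	sqlite3_close (b);
	sqlite3_close (a);
	return 0;
}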
This really does look like a sqlite bug (to me); I'm using sqlite3 from the latest openSUSE:11.3 OBS repo. Here we are with debugging symbols. The hang always ends up with "Moving messages ..." and "Storing folder ..." in the status bar - I guess these two conflict / race. Is it possible hackers don't see this because they cheat with:

export SQLITE_TRANSLOCK=1

If so - they should not! ;-)
+ Trace 225741
Thread 752 (Thread 0xa11fcb70 (LWP 11677))
FWIW, when I can find time I'd like to redesign this part of Camel such that each folders.db has a single dedicated transaction processing thread with exclusive access to the database. Database commands from other threads would then be serialized into a single command queue. That would eliminate database deadlocks and I think help simplify Camel logic.
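A rough sketch of that design, assuming hypothetical types and names (this is not the actual Camel API): one worker thread owns the sqlite handle and drains a command queue, so every other thread only ever pushes work and never touches the database directly.

/* Sketch of a per-folders.db dispatcher thread; hypothetical names, not the
 * actual Camel API. The worker thread is the only one touching 'db'. */
#include <glib.h>
#include <sqlite3.h>

typedef struct {
	const gchar *sql;     /* statement to run on the worker thread */
	gint ret;             /* sqlite3_exec() result */
	gboolean done;
	GMutex lock;
	GCond cond;
} DbCommand;

typedef struct {
	sqlite3 *db;          /* owned exclusively by the worker thread */
	GAsyncQueue *queue;   /* holds DbCommand* items */
	GThread *worker;
} DbDispatcher;

static gpointer
db_worker (gpointer data)
{
	DbDispatcher *dispatcher = data;

	for (;;) {            /* a real impl. would push a sentinel to stop */
		DbCommand *cmd = g_async_queue_pop (dispatcher->queue);

		cmd->ret = sqlite3_exec (dispatcher->db, cmd->sql, NULL, NULL, NULL);

		g_mutex_lock (&cmd->lock);
		cmd->done = TRUE;
		g_cond_signal (&cmd->cond);
		g_mutex_unlock (&cmd->lock);
	}
	return NULL;
}

/* Callable from any thread: serializes the statement through the queue and
 * waits for the worker thread to execute it. */
static gint
db_exec_serialized (DbDispatcher *dispatcher, const gchar *sql)
{
	DbCommand cmd = { sql, 0, FALSE, };
	gint ret;

	g_mutex_init (&cmd.lock);
	g_cond_init (&cmd.cond);

	g_async_queue_push (dispatcher->queue, &cmd);

	g_mutex_lock (&cmd.lock);
	while (!cmd.done)
		g_cond_wait (&cmd.cond, &cmd.lock);
	g_mutex_unlock (&cmd.lock);

	ret = cmd.ret;
	g_cond_clear (&cmd.cond);
	g_mutex_clear (&cmd.lock);
	return ret;
}

The dispatcher would be created once per folders.db, roughly via g_async_queue_new() and g_thread_new(); SELECTs could be funnelled through the same queue with a callback attached, so reads and writes can no longer interleave at the sqlite level.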
I see this in openSUSE Factory with 2.32.1 - basically it's immediate as soon as you try to filter anything.
While getting started on the sqlite cache for the address book, I started by working on this bug. http://www.sqlite.org/faq.html#q5 clearly states that there can be multiple readers, but all instances should ensure there is only one writer at any point. We were not adhering to that: we use two connections (cdb_r, cdb_w) to the db, which allowed reading while something was being written to the db, causing the BUSY error to be returned on some operations. In the patch I have made, we use a RW lock to allow parallel reads while excluding all other operations when a thread wants to write to the db file. I have not changed the ABI, so this patch could be backported to older versions. The lock in CamelDB is unused and cdb_w just points to cdb_r.
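A minimal sketch of that scheme, using GLib's GRWLock purely for illustration (the actual patch adds its own lock to CamelDB; cdb_read/cdb_write here are hypothetical wrappers, not the real Camel API):

/* Sketch: single connection plus a process-wide readers/writer lock. */
#include <glib.h>
#include <sqlite3.h>

typedef struct {
	sqlite3 *db;     /* single connection: cdb_w simply aliases cdb_r */
	GRWLock rwlock;
} CamelDBLike;

static gint
cdb_read (CamelDBLike *cdb, const gchar *sql,
          int (*callback) (void *, int, char **, char **), gpointer data)
{
	gint ret;

	/* Many readers may run in parallel ... */
	g_rw_lock_reader_lock (&cdb->rwlock);
	ret = sqlite3_exec (cdb->db, sql, callback, data, NULL);
	g_rw_lock_reader_unlock (&cdb->rwlock);

	return ret;
}

static gint
cdb_write (CamelDBLike *cdb, const gchar *sql)
{
	gint ret;

	/* ... while a writer excludes readers and other writers, so sqlite
	 * never sees a read racing a write on the same database file. */
	g_rw_lock_writer_lock (&cdb->rwlock);
	ret = sqlite3_exec (cdb->db, sql, NULL, NULL, NULL);
	g_rw_lock_writer_unlock (&cdb->rwlock);

	return ret;
}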
Created attachment 184583 [details] [review] Should fix the bug.

The patch also removes trans_lock, which was not necessary. It would be nice if someone else could also test the patch.
nice catch :-) the patch at least looks simple and 'correct'; not closing the db twice during finalize looks nice too. I'll try to build some test packages when I have 11.4 installed.
Well - I ran fine with this for some time - and then got the hang again: this time on openSUSE 11.4 with:

evolution-2.32.1-9.12.1.i586
libcamel-1_2-19-2.32.1-1.i586

the latter patched with a tweaked version of your patch (to cope with the whitespace changes there). It hung with two stores happening concurrently [ in the status bar ]:
+ Trace 226517
and all other threads quiescent. Of course the fact that we have "thread 1231" is really still a bad sign - we are failing to join threads [ still, still, still ], and leaking address space - which in the end will cause excessively weird behaviour.
Created attachment 184797 [details] [review] same patch for 2.32 ...
By any chance, is there any other process, like tracker, accessing the db?
That is a good question; and I have no idea :-) I have:

ldd /usr/lib/evolution/2.32/plugins/liborg-freedesktop-Tracker-evolution-plugin.so
...
        libcamel-provider-1.2.so.19 => /usr/lib/libcamel-provider-1.2.so.19 (0xb769b000)
        libcamel-1.2.so.19 => /usr/lib/libcamel-1.2.so.19 (0xb7614000)

Which at least in theory could be doing this. I suspect that doing some cross-process RW locking to work well with tracker is going to be exciting, to say the least. But I'll investigate that next time I get a hang.
I just got this again with the rwlock patch:
+ Trace 226611
Thread 21200 (Thread 0xa14fdb70 (LWP 27299))
all other threads quiescent / normal. I poked at the tracker log, .local/share/tracker/tracker-store.log, and saw:

06 Apr 2011, 09:54:33: Tracker: ---> [213568|evolution|17210] Failed, 1.1: DROP GRAPH is not supported
06 Apr 2011, 09:54:36: Tracker: ---> [213569|evolution|17210] Failed, 1.1: DROP GRAPH is not supported
06 Apr 2011, 09:54:42: Tracker: ---> [213570|evolution|17210] Failed, 1.1: DROP GRAPH is not supported
06 Apr 2011, 09:55:21: Tracker: ---> [213571|evolution|17210] Failed, 1.1: DROP GRAPH is not supported
05 Apr 2011, 09:55:21: Tracker: ---> [213572|evolution|17210] Failed, 1.1: DROP GRAPH is not supported

It seems to poke it several times each minute - creating ample scope for a nasty interaction. Then again - looking at the above - it strongly appears (to me) that we have two threads inside write paths concurrently: a "COMMIT" and an "UPDATE" - so are you certain that we are not just missing some locking? (Or am I misunderstanding the SQL?)
Created attachment 185271 [details] deadlock on startup

I also get a deadlock on startup with this (which is somewhat irritating - making it hard to work ;-). I attach the trace as an attachment, since the in-line trace magic seems to discard multiple threads [ argh! ]
Created attachment 185275 [details] [review] updated patch for 2.32.1 omitting deadlock ...

The attached updated patch for 2.32 fixes the deadlock caused by taking a read lock while owning the write lock. I guess we really want to be able to recursively re-take the write / read lock here as well. That introduces a nasty EStaticRWLock - which should lose the 'static' IMHO, so that it is a nicely encapsulated impl. detail, and be moved into some e-library-of-fun-things I guess ;-) Anyhow - that means I can read my mail again ;-) HTH.
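For illustration, this is roughly the kind of write-owner-aware wrapper meant above: it records which thread holds the write lock so that the same thread can take a nested read or write lock without deadlocking itself. All names are hypothetical and it glosses over details the real EStaticRWLock would need.

/* Sketch of a re-entrant readers/writer lock; hypothetical names. */
#include <glib.h>

typedef struct {
	GRWLock rwlock;
	GThread *writer;      /* thread currently holding the write lock */
	guint write_depth;    /* nesting depth for that thread */
} EReentrantRWLock;

static void
ere_write_lock (EReentrantRWLock *l)
{
	if (l->writer == g_thread_self ()) {
		l->write_depth++;              /* nested write lock: just count it */
		return;
	}
	g_rw_lock_writer_lock (&l->rwlock);
	l->writer = g_thread_self ();
	l->write_depth = 1;
}

static void
ere_write_unlock (EReentrantRWLock *l)
{
	/* assumes lock/unlock pairs happen on the same thread */
	if (--l->write_depth == 0) {
		l->writer = NULL;
		g_rw_lock_writer_unlock (&l->rwlock);
	}
}

static void
ere_read_lock (EReentrantRWLock *l)
{
	/* The write lock already excludes everyone else, so a reader that is
	 * the current writer must not block on the read lock - that is the
	 * self-deadlock the updated patch avoids. */
	if (l->writer == g_thread_self ())
		return;
	g_rw_lock_reader_lock (&l->rwlock);
}

static void
ere_read_unlock (EReentrantRWLock *l)
{
	if (l->writer == g_thread_self ())
		return;
	g_rw_lock_reader_unlock (&l->rwlock);
}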
Created attachment 185513 [details] lock ordering deadlock ...

So - another deadlock - this time (I guess) a simple lock ordering problem, due to a confused lock hierarchy, I suspect.
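As a generic illustration of what such a lock-ordering deadlock looks like (these are not the actual Camel locks, just a minimal sketch): if two threads take the same pair of locks in opposite orders, each can end up waiting forever for the lock the other holds.

/* Generic lock-ordering deadlock sketch; hypothetical lock names. */
#include <glib.h>

static GMutex summary_lock;
static GMutex db_lock;

static gpointer
thread_a (gpointer data)
{
	g_mutex_lock (&summary_lock);
	g_mutex_lock (&db_lock);      /* blocks if B already holds db_lock */
	/* ... work ... */
	g_mutex_unlock (&db_lock);
	g_mutex_unlock (&summary_lock);
	return NULL;
}

static gpointer
thread_b (gpointer data)
{
	g_mutex_lock (&db_lock);
	g_mutex_lock (&summary_lock); /* blocks if A already holds summary_lock:
	                               * each thread now waits on the other */
	/* ... work ... */
	g_mutex_unlock (&summary_lock);
	g_mutex_unlock (&db_lock);
	return NULL;
}

The usual fix is to pick one lock hierarchy and always acquire the locks in that order.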
So - to confirm; switching back to a pristine (un-patched) evolution - and with *no* tracker anything on the system, I get:
+ Trace 227237
Thread 2229 (Thread 0xa20f6b70 (LWP 26183))
Of course - it is entirely possible that this will complete - however it started:

Tue May 24 14:24:01 BST 2011

and it is now:

Tue May 24 14:38:25 BST 2011

14+ minutes seems a bit long for doing a 'Move' at the same time as a 'Store' ;-) Then again, in this case I just foolishly switched folders and it's generating a message list (or trying to), so we get some CPU usage; normally I'd expect it to be at ~0% in this state.
(In reply to comment #14)
> Created an attachment (id=185513) [details]
> lock ordering deadlock ...
>
> So - another deadlock - this time (I guess) a simple lock ordering problem, due
> to a confused lock hierarchy I suspect.

Found it, but the issue looks to be deeper than a lock sequence. There seems to be confusion in the locking, between protecting the loaded_infos and freezing them for a while when they are re-loaded at intervals. Looking at the usage atm to come up with a good solution.
Created attachment 188578 [details] [review] Fixes deadlock at comment #14.
(In reply to comment #11)
> I just got this again with the rwlock patch:

I am able to see a read happening while a write is going on. Will provide a patch for the same.
Created attachment 188650 [details] [review] Updated patch.

commit c5348d470881ed5da94462f23e354c70c5b2f4cb
Author: Chenthill Palanisamy <pchenthill@novell.com>
Date:   Thu May 26 15:56:07 2011 +0530

    Bug 640054 - CamelDB: do not read the db while a transaction is in progress.
    Fix the dead-lock caused due to transaction (DB WRITE_LOCK) and summary lock.

commit 72eab52a8bd9d31da0ed1e2fa2f53597d714c66a
Author: Chenthill Palanisamy <pchenthill@novell.com>
Date:   Thu May 26 13:39:30 2011 +0530

    CamelDB: Ensure that begin_transaction is called before adding queries to a transaction

commit 9dd50efc2b7b242fa51bb5b09f86db327c37f81d
Author: Chenthill Palanisamy <pchenthill@novell.com>
Date:   Thu May 26 13:25:32 2011 +0530

    CamelFolderSummary: Remove undefined function camel_folder_summary_header_load from header file

commit 9d944e69ed19f0ff6322914fe88ab7e2261818ae
Author: Chenthill Palanisamy <pchenthill@novell.com>
Date:   Thu May 26 13:23:49 2011 +0530

    CamelDB: Use camel_db_select while retrieving the folder version

commit 78b0894ed791a2f171f784056d3bdd392b7ac7c2
Author: Chenthill Palanisamy <pchenthill@novell.com>
Date:   Wed May 25 13:10:53 2011 +0530

    CamelDB: use a RW lock to prevent reads while write is in progress

These are incremental commits on master. It would be cleaner if the transactions were controlled by camel-db.c and the unnecessary CamelMIR(FIR)->message-info translations were removed.
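To illustrate the pattern the "begin_transaction before adding queries" commit enforces, here is a minimal sketch in plain sqlite3 calls (hypothetical helper, not the Camel API): a batch of writes only ever runs inside an explicit transaction.

/* Sketch: wrap a batch of writes in an explicit transaction. Hypothetical
 * helper, plain sqlite3 API for illustration. */
#include <sqlite3.h>

static int
store_folder_changes (sqlite3 *db, const char **updates, int n_updates)
{
	int i, ret;

	/* BEGIN IMMEDIATE takes the write lock up front, so either the whole
	 * batch runs, or we back off cleanly with SQLITE_BUSY. */
	ret = sqlite3_exec (db, "BEGIN IMMEDIATE", NULL, NULL, NULL);
	if (ret != SQLITE_OK)
		return ret;

	for (i = 0; i < n_updates; i++) {
		ret = sqlite3_exec (db, updates[i], NULL, NULL, NULL);
		if (ret != SQLITE_OK) {
			sqlite3_exec (db, "ROLLBACK", NULL, NULL, NULL);
			return ret;
		}
	}

	return sqlite3_exec (db, "COMMIT", NULL, NULL, NULL);
}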
Created attachment 188654 [details] [review] patch that applies to gnome-2-32 branch (haven't tested on 2.32 though)
Created attachment 188655 [details] [review] One more case where a transaction was not initiated before writing into the db, in the vee folder.

Patch that applies to 2.32. Patch pushed to master.
Let me know if they work fine. Will get the patches committed to the gnome-3-0 branch as well.
Thus far, no recurrence after a day of use with just the #20 patch :-) Nice work! (testing ongoing though ...)
Patches pushed to gnome-3-0 branch.
I suppose this is behind bug #651823
Ho hum - unfortunately, I get this deadlock with Chen's full patch - essentially the same as comment #12 - hopefully my work in comment #13 is useful for fixing it. Deadlock:
+ Trace 227403
Then please try the change from bug #651823.
Created attachment 189382 [details] [review] patch that applies to gnome-2-32 branch

gnome-3-0 branch:
http://git.gnome.org/browse/evolution-data-server/commit/?h=gnome-3-0&id=fd286b57ecaf5d68cf15c4852c1b2922cb5e0370

master:
http://git.gnome.org/browse/evolution-data-server/commit/?id=6bc75c47f516301364d653140fcf82193d0a2d52
http://git.gnome.org/browse/evolution-data-server/commit/?id=00f20c9b4c129b5e0fcf11c19b67588ae03e2a88
http://git.gnome.org/browse/evolution-data-server/commit/?id=2d7a7779bdc4161f2bd4cea1d301ca43ebd53e96
*** Bug 651823 has been marked as a duplicate of this bug. ***
Closing the bug.