GNOME Bugzilla – Bug 640054
sqlite summary hang ...
Last modified: 2011-09-13 07:06:43 UTC
I get this fairly regularly - as you can see, the stability of 2.32 has improved, as is visible in the huge thread number (due to the IMAP-X thread-joining leak, no doubt):
+ Trace 225605
all other threads quiescent.

(gdb) thread 2323
(gdb) finish
...
Run till exit from #0  0xb68d0191 in sqlite3_exec () from /usr/lib/libsqlite3.so.0
cdb_sql_exec (db=0xa80945c0, stmt=0xa6987a88 "UPDATE 'Sent_version' SET version='2'", error=0x0) at camel-db.c:372
372             while (ret == SQLITE_BUSY || ret == SQLITE_LOCKED || ret == -1) {
(gdb) finish
Run till exit from #0  cdb_sql_exec (db=0xa80945c0, stmt=0xa6987a88 "UPDATE 'Sent_version' SET version='2'", error=0x0) at camel-db.c:372
(evolution:15846): libebook-DEBUG: GDBus connection is closed, remote peer vanished

This simply never completes ... It seems that:

Run till exit from #0  0xb68d0111 in sqlite3_exec () from /usr/lib/libsqlite3.so.0
cdb_sql_exec (db=0xa80945c0, stmt=0xa6987a88 "UPDATE 'Sent_version' SET version='2'", error=0x0) at camel-db.c:372
372             while (ret == SQLITE_BUSY || ret == SQLITE_LOCKED || ret == -1) {
(gdb) l
367             gint ret = -1;
368
369             d(g_print("Camel SQL Exec:\n%s\n", stmt));
370
371             ret = sqlite3_exec(db, stmt, 0, 0, &errmsg);
372             while (ret == SQLITE_BUSY || ret == SQLITE_LOCKED || ret == -1) {
373                     if (errmsg) {
374                             sqlite3_free (errmsg);
375                             errmsg = NULL;
376                     }
(gdb) p ret
$1 = 5

i.e. SQLITE_BUSY is constantly returned here. Which is rather odd - I'm only running one evo, and it can only be busy with itself: is there some lock ordering problem? Or an sqlite bug?

(gdb) thread 2339 has the same problem: SQLITE_BUSY coming back from sqlite3_exec in a tight loop.
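For reference, SQLITE_BUSY (5) does not need a second process: two connections to the same database file within one process are enough, and without a busy handler sqlite3_exec() returns SQLITE_BUSY immediately, so a retry loop like the one above just spins. A minimal standalone sketch (hypothetical file name, plain sqlite3 API, not Camel code):

/* Minimal sketch: two connections to the same database in one process.
 * Hypothetical example, not Camel code. Build with -lsqlite3. */
#include <sqlite3.h>
#include <stdio.h>

int main (void)
{
	sqlite3 *a = NULL, *b = NULL;
	char *err = NULL;
	int ret;

	sqlite3_open ("test.db", &a);   /* hypothetical file name */
	sqlite3_open ("test.db", &b);

	sqlite3_exec (a, "CREATE TABLE IF NOT EXISTS t (v INT)", NULL, NULL, NULL);

	/* Connection 'a' opens a write transaction and keeps it open ... */
	sqlite3_exec (a, "BEGIN IMMEDIATE", NULL, NULL, NULL);

	/* ... so connection 'b' gets SQLITE_BUSY (5) for its write, and will
	 * keep getting it for as long as 'a' holds the transaction - which is
	 * what the loop in cdb_sql_exec() is spinning on. */
	ret = sqlite3_exec (b, "UPDATE t SET v = 2", NULL, NULL, &err);
	printf ("ret = %d (%s)\n", ret, err ? err : "ok");
	sqlite3_free (err);

	sqlite3_exec (a, "COMMIT", NULL, NULL, NULL);
	sqlite3_close (b);
	sqlite3_close (a);
	return 0;
}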
This really does look like a sqlite bug (to me); I'm using sqlite3 from the latest openSUSE:11.3 OBS repo. Here we are with debugging symbols. The hang always ends up with "Moving messages ..." and "Storing folder ..." in the status bar - I guess these two conflict / race. Is it possible hackers don't see this because they cheat with:

export SQLITE_TRANSLOCK=1

If so - they should not! ;-)
+ Trace 225741
Thread 752 (Thread 0xa11fcb70 (LWP 11677))
FWIW, when I can find time I'd like to redesign this part of Camel such that each folders.db has a single dedicated transaction processing thread with exclusive access to the database. Database commands from other threads would then be serialized into a single command queue. That would eliminate database deadlocks and I think help simplify Camel logic.
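A rough sketch of that design, assuming hypothetical types and names (this is not the actual Camel API): one worker thread owns the sqlite handle and drains a command queue, so every other thread only ever pushes work and never touches the database directly.

/* Sketch of a per-folders.db dispatcher thread; hypothetical names, not the
 * actual Camel API. The worker thread is the only one touching 'db'. */
#include <glib.h>
#include <sqlite3.h>

typedef struct {
	const gchar *sql;     /* statement to run on the worker thread */
	gint ret;             /* sqlite3_exec() result */
	gboolean done;
	GMutex lock;
	GCond cond;
} DbCommand;

typedef struct {
	sqlite3 *db;          /* owned exclusively by the worker thread */
	GAsyncQueue *queue;   /* holds DbCommand* items */
	GThread *worker;
} DbDispatcher;

static gpointer
db_worker (gpointer data)
{
	DbDispatcher *dispatcher = data;

	for (;;) {            /* a real impl. would push a sentinel to stop */
		DbCommand *cmd = g_async_queue_pop (dispatcher->queue);

		cmd->ret = sqlite3_exec (dispatcher->db, cmd->sql, NULL, NULL, NULL);

		g_mutex_lock (&cmd->lock);
		cmd->done = TRUE;
		g_cond_signal (&cmd->cond);
		g_mutex_unlock (&cmd->lock);
	}
	return NULL;
}

/* Callable from any thread: serializes the statement through the queue and
 * waits for the worker thread to execute it. */
static gint
db_exec_serialized (DbDispatcher *dispatcher, const gchar *sql)
{
	DbCommand cmd = { sql, 0, FALSE, };
	gint ret;

	g_mutex_init (&cmd.lock);
	g_cond_init (&cmd.cond);

	g_async_queue_push (dispatcher->queue, &cmd);

	g_mutex_lock (&cmd.lock);
	while (!cmd.done)
		g_cond_wait (&cmd.cond, &cmd.lock);
	g_mutex_unlock (&cmd.lock);

	ret = cmd.ret;
	g_cond_clear (&cmd.cond);
	g_mutex_clear (&cmd.lock);
	return ret;
}

The dispatcher would be created once per folders.db, roughly via g_async_queue_new() and g_thread_new(); SELECTs could be funnelled through the same queue with a callback attached, so reads and writes can no longer interleave at the sqlite level.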
I see this in openSUSE Factory with 2.32.1 - basically it's immediate as soon as you try to filter anything.
While getting started on the sqlite cache for the address book, I started by working on this bug. http://www.sqlite.org/faq.html#q5 clearly states that there can be multiple readers, but all instances should ensure there is only one writer at any point. We were not adhering to that: we use two connections (cdb_r, cdb_w) to the db, which allowed reading while something was being written to the db, causing the BUSY error to be returned on some operations. In the patch I have made, we use a RW lock to allow parallel reads while excluding all other operations when a thread wants to write to the db file. I have not changed the ABI, so this patch could be backported to older versions. The lock in CamelDB is unused and cdb_w just points to cdb_r.
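A minimal sketch of that scheme, using GLib's GRWLock purely for illustration (the actual patch adds its own lock to CamelDB; cdb_read/cdb_write here are hypothetical wrappers, not the real Camel API):

/* Sketch: single connection plus a process-wide readers/writer lock. */
#include <glib.h>
#include <sqlite3.h>

typedef struct {
	sqlite3 *db;     /* single connection: cdb_w simply aliases cdb_r */
	GRWLock rwlock;
} CamelDBLike;

static gint
cdb_read (CamelDBLike *cdb, const gchar *sql,
          int (*callback) (void *, int, char **, char **), gpointer data)
{
	gint ret;

	/* Many readers may run in parallel ... */
	g_rw_lock_reader_lock (&cdb->rwlock);
	ret = sqlite3_exec (cdb->db, sql, callback, data, NULL);
	g_rw_lock_reader_unlock (&cdb->rwlock);

	return ret;
}

static gint
cdb_write (CamelDBLike *cdb, const gchar *sql)
{
	gint ret;

	/* ... while a writer excludes readers and other writers, so sqlite
	 * never sees a read racing a write on the same database file. */
	g_rw_lock_writer_lock (&cdb->rwlock);
	ret = sqlite3_exec (cdb->db, sql, NULL, NULL, NULL);
	g_rw_lock_writer_unlock (&cdb->rwlock);

	return ret;
}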
Created attachment 184583 [details] [review] Should fix the bug.

The patch also removes trans_lock, which was not necessary. It would be nice if someone else could also test the patch.
nice catch :-) the patch at least looks simple and 'correct'; not closing the db twice during finalize looks nice too. I'll try to build some test packages when I have 11.4 installed.
Well - I ran fine with this for some time - and then got the hang again: this time on openSUSE 11.4 with:

evolution-2.32.1-9.12.1.i586
libcamel-1_2-19-2.32.1-1.i586

the latter patched with a tweaked version of your patch (to cope with the whitespace changes there). It hung with two stores happening concurrently [ in the status bar ]:
+ Trace 226517
and all other threads quiescent. Of course the fact that we have "thread 1231" is really still a bad sign - we are failing to join threads [ still, still, still ], and leaking address space - which in the end will cause excessively weird behaviour.
Created attachment 184797 [details] [review] same patch for 2.32 ...
By any chance, is there any other process, like tracker, accessing the db?
That is a good question; and I have no idea :-) I have:

ldd /usr/lib/evolution/2.32/plugins/liborg-freedesktop-Tracker-evolution-plugin.so
...
        libcamel-provider-1.2.so.19 => /usr/lib/libcamel-provider-1.2.so.19 (0xb769b000)
        libcamel-1.2.so.19 => /usr/lib/libcamel-1.2.so.19 (0xb7614000)

Which at least in theory could be doing this. I suspect that doing some cross-process RW locking to work well with tracker is going to be exciting, to say the least. But I'll investigate that next time I get a hang.
I just got this again with the rwlock patch:
+ Trace 226611
Thread 21200 (Thread 0xa14fdb70 (LWP 27299))
all other threads quiescent / normal. I poked at the tracker log, .local/share/tracker/tracker-store.log, and saw:

06 Apr 2011, 09:54:33: Tracker: ---> [213568|evolution|17210] Failed, 1.1: DROP GRAPH is not supported
06 Apr 2011, 09:54:36: Tracker: ---> [213569|evolution|17210] Failed, 1.1: DROP GRAPH is not supported
06 Apr 2011, 09:54:42: Tracker: ---> [213570|evolution|17210] Failed, 1.1: DROP GRAPH is not supported
06 Apr 2011, 09:55:21: Tracker: ---> [213571|evolution|17210] Failed, 1.1: DROP GRAPH is not supported
05 Apr 2011, 09:55:21: Tracker: ---> [213572|evolution|17210] Failed, 1.1: DROP GRAPH is not supported

It seems to poke it several times each minute - creating ample scope for a nasty interaction. Then again - looking at the above - it strongly appears (to me) that we have two threads inside write paths concurrently: a "COMMIT" and an "UPDATE" - so are you certain that we are not just missing some locking? (Or am I misunderstanding the SQL?)
Created attachment 185271 [details] deadlock on startup

I also get a deadlock on startup with this (which is somewhat irritating - making it hard to work ;-). I attach the trace as an attachment, since the in-line trace magic seems to discard multiple threads [ argh! ]
Created attachment 185275 [details] [review] updated patch for 2.32.1 omitting deadlock ...

The attached updated patch for 2.32 fixes the deadlock caused by taking a read lock while owning the write lock. I guess we really want to be able to recursively re-take the write / read lock here as well. That introduces a nasty EStaticRWLock - which should lose the 'static' IMHO, so that it is a nicely encapsulated impl. detail, and be moved into some e-library-of-fun-things I guess ;-) Anyhow - that means I can read my mail again ;-) HTH.
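For illustration, this is roughly the kind of write-owner-aware wrapper meant above: it records which thread holds the write lock so that the same thread can take a nested read or write lock without deadlocking itself. All names are hypothetical and it glosses over details the real EStaticRWLock would need.

/* Sketch of a re-entrant readers/writer lock; hypothetical names. */
#include <glib.h>

typedef struct {
	GRWLock rwlock;
	GThread *writer;      /* thread currently holding the write lock */
	guint write_depth;    /* nesting depth for that thread */
} EReentrantRWLock;

static void
ere_write_lock (EReentrantRWLock *l)
{
	if (l->writer == g_thread_self ()) {
		l->write_depth++;              /* nested write lock: just count it */
		return;
	}
	g_rw_lock_writer_lock (&l->rwlock);
	l->writer = g_thread_self ();
	l->write_depth = 1;
}

static void
ere_write_unlock (EReentrantRWLock *l)
{
	/* assumes lock/unlock pairs happen on the same thread */
	if (--l->write_depth == 0) {
		l->writer = NULL;
		g_rw_lock_writer_unlock (&l->rwlock);
	}
}

static void
ere_read_lock (EReentrantRWLock *l)
{
	/* The write lock already excludes everyone else, so a reader that is
	 * the current writer must not block on the read lock - that is the
	 * self-deadlock the updated patch avoids. */
	if (l->writer == g_thread_self ())
		return;
	g_rw_lock_reader_lock (&l->rwlock);
}

static void
ere_read_unlock (EReentrantRWLock *l)
{
	if (l->writer == g_thread_self ())
		return;
	g_rw_lock_reader_unlock (&l->rwlock);
}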
Created attachment 185513 [details] lock ordering deadlock ...

So - another deadlock - this time (I guess) a simple lock ordering problem, due to a confused lock hierarchy, I suspect.
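As a generic illustration of what such a lock-ordering deadlock looks like (these are not the actual Camel locks, just a minimal sketch): if two threads take the same pair of locks in opposite orders, each can end up waiting forever for the lock the other holds.

/* Generic lock-ordering deadlock sketch; hypothetical lock names. */
#include <glib.h>

static GMutex summary_lock;
static GMutex db_lock;

static gpointer
thread_a (gpointer data)
{
	g_mutex_lock (&summary_lock);
	g_mutex_lock (&db_lock);      /* blocks if B already holds db_lock */
	/* ... work ... */
	g_mutex_unlock (&db_lock);
	g_mutex_unlock (&summary_lock);
	return NULL;
}

static gpointer
thread_b (gpointer data)
{
	g_mutex_lock (&db_lock);
	g_mutex_lock (&summary_lock); /* blocks if A already holds summary_lock:
	                               * each thread now waits on the other */
	/* ... work ... */
	g_mutex_unlock (&summary_lock);
	g_mutex_unlock (&db_lock);
	return NULL;
}

The usual fix is to pick one lock hierarchy and always acquire the locks in that order.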
So - to confirm; switching back to a pristine (un-patched) evolution - and with *no* tracker anything on the system, I get:
+ Trace 227237
Thread 2229 (Thread 0xa20f6b70 (LWP 26183))
Of course - it is entirely possible that this will complete - however it started:

Tue May 24 14:24:01 BST 2011

and it is now:

Tue May 24 14:38:25 BST 2011

14+ minutes seems a bit long for doing a 'Move' at the same time as a 'Store' ;-) Then again, in this case I just foolishly switched folders and it's generating a message list (or trying to), so we get some CPU usage; normally I'd expect it to be at ~0% in this state.
(In reply to comment #14)
> Created an attachment (id=185513) [details]
> lock ordering deadlock ...
>
> So - another deadlock - this time (I guess) a simple lock ordering problem, due
> to a confused lock hierarchy I suspect.

Found it, but the issue looks to be deeper than a lock sequence. There seems to be confusion in the locking, between protecting the loaded_infos and freezing them for a while when they are re-loaded at intervals. Looking at the usage atm to come up with a good solution.
Created attachment 188578 [details] [review] Fixes deadlock at comment #14.
(In reply to comment #11)
> I just got this again with the rwlock patch:

I am able to see a read happening while a write is going on. Will provide a patch for the same.
Created attachment 188650 [details] [review] Updated patch.

commit c5348d470881ed5da94462f23e354c70c5b2f4cb
Author: Chenthill Palanisamy <pchenthill@novell.com>
Date:   Thu May 26 15:56:07 2011 +0530

    Bug 640054 - CamelDB: do not read the db while a transaction is in progress.
    Fix the dead-lock caused due to transaction (DB WRITE_LOCK) and summary lock.

commit 72eab52a8bd9d31da0ed1e2fa2f53597d714c66a
Author: Chenthill Palanisamy <pchenthill@novell.com>
Date:   Thu May 26 13:39:30 2011 +0530

    CamelDB: Ensure that begin_transaction is called before adding queries to a transaction

commit 9dd50efc2b7b242fa51bb5b09f86db327c37f81d
Author: Chenthill Palanisamy <pchenthill@novell.com>
Date:   Thu May 26 13:25:32 2011 +0530

    CamelFolderSummary: Remove undefined function camel_folder_summary_header_load from header file

commit 9d944e69ed19f0ff6322914fe88ab7e2261818ae
Author: Chenthill Palanisamy <pchenthill@novell.com>
Date:   Thu May 26 13:23:49 2011 +0530

    CamelDB: Use camel_db_select while retrieving the folder version

commit 78b0894ed791a2f171f784056d3bdd392b7ac7c2
Author: Chenthill Palanisamy <pchenthill@novell.com>
Date:   Wed May 25 13:10:53 2011 +0530

    CamelDB: use a RW lock to prevent reads while write is in progress

These are incremental commits on master. It would be cleaner if the transactions were controlled by camel-db.c and the unnecessary CamelMIR(FIR)->message-info translations were removed.
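To illustrate the pattern the "begin_transaction before adding queries" commit enforces, here is a minimal sketch in plain sqlite3 calls (hypothetical helper, not the Camel API): a batch of writes only ever runs inside an explicit transaction.

/* Sketch: wrap a batch of writes in an explicit transaction. Hypothetical
 * helper, plain sqlite3 API for illustration. */
#include <sqlite3.h>

static int
store_folder_changes (sqlite3 *db, const char **updates, int n_updates)
{
	int i, ret;

	/* BEGIN IMMEDIATE takes the write lock up front, so either the whole
	 * batch runs, or we back off cleanly with SQLITE_BUSY. */
	ret = sqlite3_exec (db, "BEGIN IMMEDIATE", NULL, NULL, NULL);
	if (ret != SQLITE_OK)
		return ret;

	for (i = 0; i < n_updates; i++) {
		ret = sqlite3_exec (db, updates[i], NULL, NULL, NULL);
		if (ret != SQLITE_OK) {
			sqlite3_exec (db, "ROLLBACK", NULL, NULL, NULL);
			return ret;
		}
	}

	return sqlite3_exec (db, "COMMIT", NULL, NULL, NULL);
}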
Created attachment 188654 [details] [review] patch that applies to gnome-2-32 branch (haven't tested on 2.32 though)
Created attachment 188655 [details] [review] One more case where a transaction was not initiated before writing into the db, in the vee folder.

Patch that applies to 2.32. Patch pushed to master.
Let me know if they work fine. Will get the patches committed to the gnome-3-0 branch as well.
Thus far, no recurrence after a day of use with just the #20 patch :-) Nice work! (testing ongoing though ...)
Patches pushed to gnome-3-0 branch.
I suppose this is behind bug #651823
Ho hum - unfortunately, I get this deadlock with Chen's full patch - essentially the same as comment #12 - hopefully my work in comment #13 is useful for fixing it. Deadlock:
+ Trace 227403
Then please try the change from bug #651823.
Created attachment 189382 [details] [review] patch that applies to gnome-2-32 branch

gnome-3-0 branch:
http://git.gnome.org/browse/evolution-data-server/commit/?h=gnome-3-0&id=fd286b57ecaf5d68cf15c4852c1b2922cb5e0370

master:
http://git.gnome.org/browse/evolution-data-server/commit/?id=6bc75c47f516301364d653140fcf82193d0a2d52
http://git.gnome.org/browse/evolution-data-server/commit/?id=00f20c9b4c129b5e0fcf11c19b67588ae03e2a88
http://git.gnome.org/browse/evolution-data-server/commit/?id=2d7a7779bdc4161f2bd4cea1d301ca43ebd53e96
*** Bug 651823 has been marked as a duplicate of this bug. ***
Closing the bug.