GNOME Bugzilla – Bug 363102
E-D-S crash (and potential data corruption?) due to thread race condition
Last modified: 2007-07-01 18:36:48 UTC
As I said on the Evolution hackers mailing list, I am running nightly tests of my SyncEvolution with different versions of Evolution. E-D-S crashes regularly during these tests. After investigating with valgrind it looked like a use-after-free problem, but the stack backtraces did not make sense until I realized that the thread replying to CORBA commands was deleting components *while* the main thread was saving them in the idle callback. See below for the stack backtraces of all threads when the main thread crashes. There is an idle_save_mutex, but it only prevents concurrent saves, not modifications while a save is in progress. Several solutions come to mind:
- lock the mutex inside the high-level e_cal_backend_* functions
- lock the mutex inside the low-level functions like remove_component
- serialize idle saving and CORBA requests by having the save executed as a CORBA call (not sure exactly how that could be done...)
Any comments?
Trace 76951: Thread 1 (Thread -1220008000 (LWP 9919))
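To make the described race concrete, here is a minimal sketch of the two code paths involved. The names (save_idle_cb, remove_component_unlocked, the GMutex/GHashTable globals) are simplified stand-ins, not the actual E-D-S source: the idle callback takes idle_save_mutex, but the modification path does not, so both threads can touch the shared component store at the same time.

    #include <glib.h>

    static GMutex     *idle_save_mutex;   /* only guards against concurrent saves */
    static GHashTable *comp_uid_hash;     /* shared component store, owns its values */

    /* Main thread: idle callback walks the shared store and writes it out. */
    static gboolean
    save_idle_cb (gpointer data)
    {
        g_mutex_lock (idle_save_mutex);
        /* ... iterate comp_uid_hash and serialize each component; nothing
         * stops the CORBA-reply thread from freeing entries meanwhile ... */
        g_mutex_unlock (idle_save_mutex);
        return FALSE;
    }

    /* CORBA-reply thread: removes (and frees) a component without taking
     * idle_save_mutex, so it can run in parallel with save_idle_cb. */
    static void
    remove_component_unlocked (const char *uid)
    {
        g_hash_table_remove (comp_uid_hash, uid);  /* value destroy frees it */
    }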
Created attachment 75471 [details] [review] patch which adds mutex locking to E-D-S 1.8.1 This patch adds mutex locking to prevent concurrent access to the shared data by different threads. As I said on the Evolution hackers list, this seems to solve the problem for me: the SyncEvolution nightly tests have completed fine for several days with this patch applied, whereas without it they failed almost every day. Please review and apply. I do not claim any copyright on these minimal changes, so I don't think it is necessary to go through the normal process for real patches. I also have a version of this patch for 1.6.3 which I'll upload momentarily.
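For reference, this is roughly the shape of the change: a simplified sketch written against the field names mentioned in this bug (priv->icalcomp, priv->comp_uid_hash, idle_save_mutex, here assumed to be a plain GMutex *), not the actual diff in attachment 75471. The point is that the modification path now takes the same lock as the idle save, so a save can no longer observe half-removed components.

    static void
    remove_component_locked (ECalBackendFilePrivate *priv, const char *uid)
    {
        icalcomponent *icalcomp;

        g_mutex_lock (priv->idle_save_mutex);

        icalcomp = g_hash_table_lookup (priv->comp_uid_hash, uid);
        if (icalcomp) {
            /* unlink from the toplevel VCALENDAR, then drop the hash entry */
            icalcomponent_remove_component (priv->icalcomp, icalcomp);
            g_hash_table_remove (priv->comp_uid_hash, uid);
        }

        g_mutex_unlock (priv->idle_save_mutex);
    }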
Created attachment 75472 [details] [review] adds mutex locking to E-D-S 1.6.3 This is the version of the mutex locking patch that I have adapted to E-D-S 1.6.3. It compiles and passed the SyncEvolution calendar test several times without problems. Please review and apply...
There is a synchronization lock already available in the ECalBackendSync class which is used to synchronize the e_cal_backend_sync_* calls from the clients. It is set up in e_cal_backend_file_init. I also suspect there may be a corruption in remove_component, as the same data is freed in both places: icalcomponent_remove_component (priv->icalcomp, icalcomp); and g_hash_table_remove (priv->comp_uid_hash, uid);
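For illustration, the suspected double free would look roughly like this. This is only a guess at the failure mode with invented surrounding code, not the real backend: icalcomponent_remove_component only unlinks the child, so the double free would happen if the component is freed explicitly and the hash table also owns it through a destroy notify.

    #include <glib.h>
    #include <libical/ical.h>

    static void
    suspected_double_free (GHashTable *comp_uid_hash,   /* value destroy assumed to be icalcomponent_free */
                           icalcomponent *toplevel,
                           icalcomponent *icalcomp,
                           const char *uid)
    {
        icalcomponent_remove_component (toplevel, icalcomp); /* only unlinks the child */
        icalcomponent_free (icalcomp);                       /* first free */

        /* If comp_uid_hash was created with icalcomponent_free as its value
         * destroy function, removing the entry frees icalcomp a second time. */
        g_hash_table_remove (comp_uid_hash, uid);
    }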
> There is a synchronization lock already available in the ECalBackendSync class
> which is used to synchronize the e_cal_backend_sync_* calls from the clients.

I'm not sure I understand this. Does it imply that the race condition I described should already have been avoided? I'm sure that it happened: the gdb stack backtrace shows that threads 1 and 2 are active at the same time on the same data, and the idle loop simply does not obtain this ECalBackendSync lock. Or do you mean that the idle_save_mutex shouldn't have been added in the first place and the synchronization lock should be used instead? I don't know enough about this code to have an educated opinion, but that kind of sounds right to me.

> I also suspect there may be a corruption in remove_component,
> as the same data is freed in both places

Perhaps, but it did not cause the problem that I was seeing.
Yes, I agree, there should be a lock to prevent concurrent access to priv->icalcomp between the idle thread and the others. Please modify the patch so that it protects just the critical region.
> Please modify the patch so that it
> protects just the critical region.

I don't know the code well enough to properly identify all the critical regions. When adding the mutex locking I followed the motto "better safe than sorry". Why do you prefer more fine-grained locking? IMHO the mutex locking itself won't affect performance much compared to the IPC overhead and the actual ical calls. I also expect the periods during which idle saving blocks concurrent read access to be very short, so there is not much to be gained from finer-grained locking.
I can think of another alternative in this case. You could leave the locks which you have put in. The synchronization mutex can then be turned off using e_cal_backend_sync_set_lock (backend, FALSE) to avoid the redundancy. One thing which needs to be ensured here is that all the virtual functions called through the e-cal-backend* interface are locked properly. Perhaps you could test the patch with the sync mutex turned off and provide the results; I will also test the same. Is that fine?
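A rough sketch of that alternative, assuming the private struct carries the idle_save_rmutex mentioned later in this bug (this is not the committed code, just the shape of it): the generic ECalBackendSync lock is switched off in the backend's init function, and the idle save plus every virtual method protect the shared state with the backend's own recursive mutex instead.

    static void
    e_cal_backend_file_init_sketch (ECalBackendFile *cbfile)
    {
        ECalBackendFilePrivate *priv = cbfile->priv;

        /* per-backend recursive mutex, taken by the idle save and by
         * every virtual method that touches priv->icalcomp */
        g_static_rec_mutex_init (&priv->idle_save_rmutex);

        /* drop the coarse lock around every e_cal_backend_sync_* call,
         * since the methods now do their own locking */
        e_cal_backend_sync_set_lock (E_CAL_BACKEND_SYNC (cbfile), FALSE);
    }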
This bug seems very likely to be the cause of bug 319076, which has been dup'ed many times. Good news for Palm users.
Created attachment 79071 [details] [review] replaces e_cal backend locking with explicit locking of the idle_save_rmutex

Sorry for the late reply, I haven't had time for this before the holidays.

chenthill wrote:
> I can think of another alternative in this case. You could leave the locks
> which you have put in. The synchronization mutex can then be turned off using
> e_cal_backend_sync_set_lock (backend, FALSE) to avoid the redundancy. One thing
> which needs to be ensured here is that all the virtual functions called through
> the e-cal-backend* interface are locked properly.

Attached is a patch on top of the older E-D-S 1.8.1 patch which does exactly that:
- e_cal_backend_sync_set_lock() is called with FALSE instead of TRUE
- locking of idle_save_rmutex is added to all e_cal implementation functions unless the function is obviously reentrant

Please note that I have added labels+goto to ensure proper cleanup even in the error cases. I did that because I got tired of adding mutex unlocking to every function exit; it also allowed removing some other duplicated cleanup code.

> Perhaps you could test the
> patch with the sync mutex turned off and provide the results; I will also test
> the same. Is that fine?

The nightly testing has now run fine twice with E-D-S and e_cal_backend_sync_set_lock(backend, FALSE). I also had Evolution open while running tests which add/remove items, without problems.
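The label+goto cleanup pattern mentioned above looks roughly like this (a simplified sketch; the function body is invented, only the locking/cleanup pattern is the point): every early exit jumps to a single label that releases idle_save_rmutex, so no code path can return with the mutex held.

    static char *
    get_object_as_string_sketch (ECalBackendFilePrivate *priv, const char *uid)
    {
        icalcomponent *icalcomp;
        char *object = NULL;

        g_static_rec_mutex_lock (&priv->idle_save_rmutex);

        icalcomp = g_hash_table_lookup (priv->comp_uid_hash, uid);
        if (!icalcomp)
            goto out;               /* every exit funnels through one unlock */

        object = g_strdup (icalcomponent_as_ical_string (icalcomp));

     out:
        g_static_rec_mutex_unlock (&priv->idle_save_rmutex);
        return object;
    }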
chenthill: Is there any update on this? I presume the bottleneck is your time to review the patch(es). This bug is causing a fair bit of grief for PDA users (see the duplicates of bug 319076), so it would be great to get it committed.
This bug still appears to be present in 1.9.6
Sorry for the delay. The patch looks good, thanks for the patch! Committed the patch to gnome-2-16 and CVS HEAD.
*** Bug 319076 has been marked as a duplicate of this bug. ***