GNOME Bugzilla – Bug 363102
E-D-S crash (and potential data corruption?) due to thread race condition
Last modified: 2007-07-01 18:36:48 UTC
As I said on the Evolution hackers mailing list, I am running nightly tests of my SyncEvolution with different versions of Evolution. E-D-S crashes regularly during these tests. After investigating with valgrind it looked like a use-after-free problem, but the stack backtraces did not make sense until I realized that the thread replying to CORBA commands was deleting components *while* the main thread was saving them in the idle callback. See below for the stack backtraces of all threads when the main thread crashes. There is an idle_save_mutex, but it only prevents concurrent saves, not modifications while a save is in progress. Several solutions come to mind:
- lock the mutex inside the high-level e_cal_backend_* functions
- lock the mutex inside the low-level functions like remove_component
- serialize idle saving and CORBA requests by having the save executed as a CORBA call (not sure exactly how that could be done...)
Any comments?
Trace 76951: Thread 1 (Thread -1220008000 (LWP 9919))
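To make the described race concrete, here is a minimal sketch of the two code paths involved. The names (save_idle_cb, remove_component_unlocked, the GMutex/GHashTable globals) are simplified stand-ins, not the actual E-D-S source: the idle callback takes idle_save_mutex, but the modification path does not, so both threads can touch the shared component store at the same time.

    #include <glib.h>

    static GMutex     *idle_save_mutex;   /* only guards against concurrent saves */
    static GHashTable *comp_uid_hash;     /* shared component store, owns its values */

    /* Main thread: idle callback walks the shared store and writes it out. */
    static gboolean
    save_idle_cb (gpointer data)
    {
        g_mutex_lock (idle_save_mutex);
        /* ... iterate comp_uid_hash and serialize each component; nothing
         * stops the CORBA-reply thread from freeing entries meanwhile ... */
        g_mutex_unlock (idle_save_mutex);
        return FALSE;
    }

    /* CORBA-reply thread: removes (and frees) a component without taking
     * idle_save_mutex, so it can run in parallel with save_idle_cb. */
    static void
    remove_component_unlocked (const char *uid)
    {
        g_hash_table_remove (comp_uid_hash, uid);  /* value destroy frees it */
    }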
Created attachment 75471 [details] [review] patch which adds mutex locking to E-D-S 1.8.1 This patch adds mutex locking to prevent concurrent access to the shared data by different threads. As I said on the Evolution hackers list, this seems to solve the problem for me: the SyncEvolution nightly tests have completed fine for several days with this patch applied, whereas without it they failed almost every day. Please review and apply. I do not claim any copyright on these minimal changes, so I don't think it is necessary to go through the normal process for real patches. I also have a version of this patch for 1.6.3 which I'll upload momentarily.
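For reference, this is roughly the shape of the change: a simplified sketch written against the field names mentioned in this bug (priv->icalcomp, priv->comp_uid_hash, idle_save_mutex, here assumed to be a plain GMutex *), not the actual diff in attachment 75471. The point is that the modification path now takes the same lock as the idle save, so a save can no longer observe half-removed components.

    static void
    remove_component_locked (ECalBackendFilePrivate *priv, const char *uid)
    {
        icalcomponent *icalcomp;

        g_mutex_lock (priv->idle_save_mutex);

        icalcomp = g_hash_table_lookup (priv->comp_uid_hash, uid);
        if (icalcomp) {
            /* unlink from the toplevel VCALENDAR, then drop the hash entry */
            icalcomponent_remove_component (priv->icalcomp, icalcomp);
            g_hash_table_remove (priv->comp_uid_hash, uid);
        }

        g_mutex_unlock (priv->idle_save_mutex);
    }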
Created attachment 75472 [details] [review] adds mutex locking to E-D-S 1.6.3 This is the version of the mutex locking patch that I have adapted to E-D-S 1.6.3. It compiles and passed the SyncEvolution calendar test several times without problems. Please review and apply...
There is a synchronization lock already available in the ECalBackendSync class which is used to synchronize the e_cal_backend_sync_* calls from the clients. It is set up in e_cal_backend_file_init. I also suspect there may be a corruption in remove_component, as the same data is freed in both places: icalcomponent_remove_component (priv->icalcomp, icalcomp); and g_hash_table_remove (priv->comp_uid_hash, uid);
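For illustration, the suspected double free would look roughly like this. This is only a guess at the failure mode with invented surrounding code, not the real backend: icalcomponent_remove_component only unlinks the child, so the double free would happen if the component is freed explicitly and the hash table also owns it through a destroy notify.

    #include <glib.h>
    #include <libical/ical.h>

    static void
    suspected_double_free (GHashTable *comp_uid_hash,   /* value destroy assumed to be icalcomponent_free */
                           icalcomponent *toplevel,
                           icalcomponent *icalcomp,
                           const char *uid)
    {
        icalcomponent_remove_component (toplevel, icalcomp); /* only unlinks the child */
        icalcomponent_free (icalcomp);                       /* first free */

        /* If comp_uid_hash was created with icalcomponent_free as its value
         * destroy function, removing the entry frees icalcomp a second time. */
        g_hash_table_remove (comp_uid_hash, uid);
    }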
> There is a synchronization lock already available in the ECalBackendSync class
> which is used to synchronize the e_cal_backend_sync_* calls from the clients.

I'm not sure I understand this. Does it imply that the race condition I described should already have been avoided? I'm sure that it happened: the gdb stack backtrace shows that threads 1 and 2 are active at the same time on the same data, and the idle loop simply does not obtain this ECalBackendSync lock. Or do you mean that the idle_save_mutex shouldn't have been added in the first place and the synchronization lock should be used instead? I don't know enough about this code to have an educated opinion, but that kind of sounds right to me.

> I also suspect there may be a corruption in remove_component,
> as the same data is freed in both places

Perhaps, but it did not cause the problem that I was seeing.
Yes, I agree, there should be a lock to prevent concurrent access to priv->icalcomp between the idle thread and the others. Please modify the patch so that it protects just the critical region.
> Please modify the patch so that it
> protects just the critical region.

I don't know the code well enough to properly identify all the critical regions. When adding the mutex locking I followed the motto "better safe than sorry". Why do you prefer more fine-grained locking? IMHO the mutex locking itself won't affect performance much compared to the IPC overhead and the actual ical calls. I also expect the periods during which idle saving blocks concurrent read access to be very short, so there is not much to be gained from finer-grained locking.
I can think of another alternative in this case. You could leave the locks which you have put in. The synchronization mutex can then be turned off using e_cal_backend_sync_set_lock (backend, FALSE) to avoid the redundancy. One thing which needs to be ensured here is that all the virtual functions called through the e-cal-backend* interface are locked properly. Perhaps you could test the patch with the sync mutex turned off and provide the results; I will also test the same. Is that fine?
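A rough sketch of that alternative, assuming the private struct carries the idle_save_rmutex mentioned later in this bug (this is not the committed code, just the shape of it): the generic ECalBackendSync lock is switched off in the backend's init function, and the idle save plus every virtual method protect the shared state with the backend's own recursive mutex instead.

    static void
    e_cal_backend_file_init_sketch (ECalBackendFile *cbfile)
    {
        ECalBackendFilePrivate *priv = cbfile->priv;

        /* per-backend recursive mutex, taken by the idle save and by
         * every virtual method that touches priv->icalcomp */
        g_static_rec_mutex_init (&priv->idle_save_rmutex);

        /* drop the coarse lock around every e_cal_backend_sync_* call,
         * since the methods now do their own locking */
        e_cal_backend_sync_set_lock (E_CAL_BACKEND_SYNC (cbfile), FALSE);
    }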
This bug seems very likely to be the cause of bug 319076, which has been dup'ed many times. Good news for Palm users.
Created attachment 79071 [details] [review] replaces e_cal backend locking with explicit locking of the idle_save_rmutex

Sorry for the late reply, I haven't had time for this before the holidays.

chenthill wrote:
> I can think of another alternative in this case. You could leave the locks
> which you have put in. The synchronization mutex can then be turned off using
> e_cal_backend_sync_set_lock (backend, FALSE) to avoid the redundancy. One thing
> which needs to be ensured here is that all the virtual functions called through
> the e-cal-backend* interface are locked properly.

Attached is a patch on top of the older E-D-S 1.8.1 patch which does exactly that:
- e_cal_backend_sync_set_lock() is called with FALSE instead of TRUE
- locking of idle_save_rmutex is added to all e_cal implementation functions unless the function is obviously reentrant

Please note that I have added labels+goto to ensure proper cleanup even in the error cases. I did that because I got tired of adding mutex unlocking to every function exit; it also allowed removing some other duplicated cleanup code.

> Perhaps you could test the
> patch with the sync mutex turned off and provide the results; I will also test
> the same. Is that fine?

The nightly testing has now run fine twice with E-D-S and e_cal_backend_sync_set_lock(backend, FALSE). I also had Evolution open while running tests which add/remove items, without problems.
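The label+goto cleanup pattern mentioned above looks roughly like this (a simplified sketch; the function body is invented, only the locking/cleanup pattern is the point): every early exit jumps to a single label that releases idle_save_rmutex, so no code path can return with the mutex held.

    static char *
    get_object_as_string_sketch (ECalBackendFilePrivate *priv, const char *uid)
    {
        icalcomponent *icalcomp;
        char *object = NULL;

        g_static_rec_mutex_lock (&priv->idle_save_rmutex);

        icalcomp = g_hash_table_lookup (priv->comp_uid_hash, uid);
        if (!icalcomp)
            goto out;               /* every exit funnels through one unlock */

        object = g_strdup (icalcomponent_as_ical_string (icalcomp));

     out:
        g_static_rec_mutex_unlock (&priv->idle_save_rmutex);
        return object;
    }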
chenthill: Is there any update on this? I presume the bottleneck is your time to review the patch(es). This bug is causing a fair bit of grief for PDA users (see the duplicates of bug 319076), so it would be great to get it committed.
This bug still appears to be present in 1.9.6
Sorry for the delay. The patch looks good, thanks for the patch! Committed the patch to gnome-2-16 and CVS HEAD.
*** Bug 319076 has been marked as a duplicate of this bug. ***