GNOME Bugzilla – Bug 557613
evolution crashed with SIGSEGV in try_open_e_book_cb()
Last modified: 2011-01-03 04:19:10 UTC
The bug was originally reported at https://bugs.launchpad.net/ubuntu/+source/evolution/+bug/287136: "While starting Evolution and it was refreshing my folders it crashed. I was switching messages at the time that it crashed."
+ Trace 208594
The bug is new in 2.24.1 and is due to the change in http://bugzilla.gnome.org/show_bug.cgi?id=364542
Not sure if that's the issue, but the change has:
    + if (e_book_async_open (book, only_if_exists, try_open_e_book_cb, &data) != FALSE) {
    +         e_flag_free (flag);
    +         g_set_error (error, E_BOOK_ERROR, E_BOOK_ERROR_OTHER_ERROR, "Failed to call e_book_async_open.");
    +         return FALSE;
    + }
and try_open_e_book_cb() uses "e_flag_set (data->flag);", which can run after the e_flag_free() there.
the ubuntu bug already has 8 duplicates
I think I see a potential race here:
    + while (canceled = camel_operation_cancel_check (NULL), !canceled && !e_flag_is_set (flag)) {
    +         GTimeVal wait;
    +
    +         g_get_current_time (&wait);
    +         g_time_val_add (&wait, 250000); /* waits 250ms */
    +
    +         e_flag_timed_wait (flag, &wait);
    + }
    +
    + e_flag_free (flag);
    +
    + if (canceled) {
    +         g_set_error (error, E_BOOK_ERROR, E_BOOK_ERROR_CANCELLED, "Operation has been canceled.");
    +         e_book_cancel_async_op (book, NULL);
    +         return FALSE;
    + }
If the Camel operation is cancelled, we (a) free the EFlag, and then (b) cancel the EBook operation. In the time between (a) and (b), try_open_e_book_cb() may get called, whereby it would be attempting to set a free'd EFlag. I'd recommend placing the e_flag_free() _after_ the block where we cancel the EBook operation.
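The reordering suggested above can be modelled in plain C without GLib. This is only a sketch: `Flag`, `Book`, `open_cb`, `cancel_op` and `flag_free` are hypothetical stand-ins for EFlag, EBook, try_open_e_book_cb(), e_book_cancel_async_op() and e_flag_free(), and it assumes that a cancelled operation never delivers its callback. The model is single-threaded and only records whether a callback arriving between the two steps would have touched a freed flag.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for EFlag and EBook; not the real e-d-s types. */
typedef struct { bool is_set; bool freed; } Flag;
typedef struct { Flag *flag; bool op_pending; } Book;

/* Counts callbacks that touched a flag after it was "freed". */
static int stale_accesses = 0;

/* Models try_open_e_book_cb(): sets the flag the waiter sleeps on. */
static void open_cb(Book *book)
{
    assert(book != NULL && book->flag != NULL);
    if (!book->op_pending)
        return;                      /* assumed: cancelled ops deliver nothing */
    if (book->flag->freed)
        stale_accesses++;            /* in real code this is a use-after-free */
    else
        book->flag->is_set = true;
    book->op_pending = false;
}

/* Models e_book_cancel_async_op(). */
static void cancel_op(Book *book) { book->op_pending = false; }

/* Models e_flag_free(); only marks the flag so the sketch stays well defined. */
static void flag_free(Flag *flag) { flag->freed = true; }

/* Original order: free first, cancel second. */
static int cancel_path_buggy(bool late_cb)
{
    Flag flag = { false, false };
    Book book = { &flag, true };
    stale_accesses = 0;
    flag_free(&flag);                /* (a) */
    if (late_cb)
        open_cb(&book);              /* callback fires between (a) and (b) */
    cancel_op(&book);                /* (b) */
    return stale_accesses;
}

/* Proposed order: cancel first, free second. */
static int cancel_path_fixed(bool late_cb)
{
    Flag flag = { false, false };
    Book book = { &flag, true };
    stale_accesses = 0;
    cancel_op(&book);
    if (late_cb)
        open_cb(&book);              /* ignored: operation already cancelled */
    flag_free(&flag);
    return stale_accesses;
}
```

With a callback arriving between the two steps, `cancel_path_buggy(true)` records one stale access and `cancel_path_fixed(true)` records none. The fix is only as good as the assumption that cancellation suppresses the callback.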
Created attachment 121202: Proposed patch
I haven't been able to reproduce the crash yet to test whether this fixes it, but either way it can't hurt. Also, if try_open_e_book() fails we should always emit the warning, not just for debugging.
I tried the attached patch and it didn't fix the problem. I still get consistent crashes when opening the Preferences dialog (as detailed in https://bugs.launchpad.net/ubuntu/+source/evolution/+bug/287423). I also added some debug statements which allow me to say that try_open_e_book was called once when evo was starting and that function ran to completion. Then, when I opened the Preferences window the function was called again, but the crash happened before the e_flag_free() call (at the end of that function) was reached.
Could it be a bug on e_book_async_open?
FWIW, I backed out the last change to addressbook/libebook/e-book.c (http://bzr-playground.gnome.org/evolution-data-server/trunk/revision/7751) in the hope that the bug could've been introduced there but it doesn't seem to be the case. I'm still getting consistent crashes.
I still haven't seen a backtrace that actually shows the segfault, so this is mostly guesswork.
I followed the instructions on https://wiki.ubuntu.com/DebuggingProgramCrash to get the backtrace, but since I get constant segfaults I could try other things to get you one which actually shows the segfault.
Created attachment 121272: Revised patch
Found another error in the logic. try_open_e_book_cb() is still called when the EBook operation is cancelled, but by then try_open_e_book() has exited and the closure points to garbage. try_open_e_book_cb() needs to check the status and return immediately if the status is CANCELLED, without touching the closure.
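The status check described above could look roughly like the following plain-C sketch. `Status`, `Closure` and `open_cb` are hypothetical names standing in for the EBook status enum, the callback's closure struct and try_open_e_book_cb(); the real signatures are not shown in this report, so treat the shapes here as assumptions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical status codes; the real enum lives in libebook. */
typedef enum { STATUS_OK, STATUS_CANCELLED, STATUS_OTHER_ERROR } Status;

/* Hypothetical closure shared between the waiter and the callback. */
typedef struct {
    bool   flag_set;   /* models e_flag_set (data->flag) */
    Status result;     /* status handed back to the waiter */
} Closure;

/* Models try_open_e_book_cb(). Returns true if the closure was touched. */
static bool open_cb(Status status, Closure *closure)
{
    /* On cancel the waiter may already have returned, so the closure
     * pointer may be dangling: bail out before dereferencing it. */
    if (status == STATUS_CANCELLED)
        return false;

    assert(closure != NULL);
    closure->result = status;
    closure->flag_set = true;
    return true;
}
```

Passing a NULL closure together with `STATUS_CANCELLED` is safe here precisely because the early return happens before any dereference. Returning early also means the flag is never set, so this only works if the waiting side stops waiting on its own once it has cancelled.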
Matthew, here's the output with the debug statements you asked me to add:
    (evolution:19455): evolution-mail-WARNING **: try_open_e_book: Thread: 0x9d32be8
    (evolution:19455): evolution-mail-WARNING **: Can't get contacts: Operation has been canceled.
    (evolution:19455): evolution-mail-WARNING **: try_open_e_book_cb: Thread: 0x9644538
    Segmentation fault
Also, when I added the return right after g_set_error() in try_open_e_book_cb() (as suggested by Sebastien), evo stopped crashing. But it also stopped displaying mails (the progress bar indicated the message was being formatted but the formatting would never finish).
Okay, that's consistent with what I've observed. try_open_e_book_cb() always runs in Thread 1 because it's an idle callback (GTK main loop calls it when there are no higher priority tasks). And try_open_e_book() is called from a different thread. That's important because otherwise it would deadlock. All of the backtraces I've seen show the application in try_open_e_book_cb() but not try_open_e_book(), which means try_open_e_book() has already exited. That's bad news, because the closure for try_open_e_book_cb() is allocated on try_open_e_book()'s call stack. If try_open_e_book() exits, the pointer to the closure becomes garbage and if the callback tries to use it (like say for calling e_flag_set()) then the application goes boom. So that seems to be the key question for this bug: how is it that try_open_e_book() is exiting before try_open_e_book_cb() runs?
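One common way to keep a closure valid regardless of which side finishes first is to move it from the caller's stack to the heap and give the waiter and the callback one reference each. The sketch below illustrates that idea in plain C11; it is not what any of the attached patches do, and `Closure`, `closure_new`, `closure_unref`, `open_cb` and `live_closures` are all hypothetical names.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical heap-allocated closure with one reference per party. */
typedef struct {
    atomic_int  refs;   /* waiter + callback */
    atomic_bool done;   /* models the EFlag being set */
} Closure;

/* Plain int: the demo itself is single-threaded. */
static int live_closures = 0;

static Closure *closure_new(void)
{
    Closure *c = malloc(sizeof *c);
    if (c == NULL)
        abort();
    atomic_init(&c->refs, 2);        /* one for the waiter, one for the callback */
    atomic_init(&c->done, false);
    live_closures++;
    return c;
}

static void closure_unref(Closure *c)
{
    assert(c != NULL);
    /* atomic_fetch_sub returns the previous value: 1 means we were last. */
    if (atomic_fetch_sub(&c->refs, 1) == 1) {
        live_closures--;
        free(c);
    }
}

/* Models try_open_e_book_cb(): safe even if the waiter is long gone. */
static void open_cb(Closure *c)
{
    atomic_store(&c->done, true);
    closure_unref(c);
}

/* The waiter gives up first, then the callback arrives late.
 * Returns whether the closure was still alive when the callback ran. */
static bool waiter_exits_first(void)
{
    Closure *c = closure_new();
    closure_unref(c);                       /* waiter drops its reference */
    bool alive = (live_closures == 1);      /* callback's reference keeps it alive */
    open_cb(c);                             /* last unref frees the closure */
    return alive;
}
```

In this arrangement it no longer matters whether try_open_e_book() exits before the callback runs: whichever side drops the last reference frees the memory. The cost is that the callback must be guaranteed to run exactly once, otherwise the closure leaks.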
Created attachment 121349: Revised patch
Guilherme, I had a thought after posting that last comment. Back out all previous patches for this bug and try this one, if you would.
The latest patch seems to fix it. I have tried opening the Preferences dialog at least 10 times and it hasn't crashed. Before, it would crash at least 50% of the time I opened it. I'll keep trying and will post an update on monday. Thanks a lot, Matthew.
Commit it Matt.
Committed to trunk (revision 36715) and gnome-2-24 (revision 36716).
*** Bug 558281 has been marked as a duplicate of this bug. ***
Unfortunately, I see that this can still happen, though very rarely. I see this sequence:
    try_open_e_book: new data 0x7fc7fcff7220 and flag 0x7fc7fcf14b20
    try_open_e_book: canceled on flag 0x7fc7fcf14b20
    try_open_e_book: free data 0x7fc7fcff7220 and flag 0x7fc7fcf14b20
    try_open_e_book_cb: 0x7fc7fcff7220
Thus, even though the cancel on the EBook is called, the open callback is called anyway and accesses freed memory.
Let's see how this works with eds-dbus, then recheck.
*** Bug 580806 has been marked as a duplicate of this bug. ***
last dupe in 2.26.1
Clarifying summary. The problem is try_open_e_book_cb(), not e_flag_set().
I don't remember ever hitting this crash with 2.24, but I'm hitting it about once a day with 2.26.1.1. Is there any hope for fixing this in 2.26?
(In reply to comment #23)
> I don't remember ever hitting this crash with 2.24, but I'm hitting it about
> once a day with 2.26.1.1.
If I'm lucky, that is; it happened 3-4 times within 90 minutes today... FWIW, one thing that seems to increase the chance of hitting it is being 'ahead of' message formatting, e.g. hitting the delete key or selecting a different message while the message is still being formatted.
Created attachment 134724: proposed evo hack
For evolution. It should be harder to reproduce with this hack, though I'm still waiting for eds-dbus, as I hope it will fix this by itself. But we also support 2.26, hence this hack.
I'm running with the hack patch applied now, and it didn't crash once yesterday. So far, so good, I'll report back if I hit it again...
*** Bug 583774 has been marked as a duplicate of this bug. ***
*** Bug 584280 has been marked as a duplicate of this bug. ***
Unfortunately, I still seem to be hitting this crash with the patch. I'm actually not sure it's still crashing in try_open_e_book_cb, since I don't have the evolution debugging symbols installed ATM, but the rest of the backtrace looks the same. The mutex pointer is NULL in __pthread_mutex_lock().
I wonder how you compile evolution, with a patch, and you do not have debug info there. I know you can turn that off when compiling, but somehow I suppose you do not have it turned off. Let's see with updated trace.
(In reply to comment #30)
> I wonder how you compile evolution, with a patch, and you do not have debug
> info there.
I'm using patched Debian packages, and I didn't have evolution-dbg installed.
*** Bug 584906 has been marked as a duplicate of this bug. ***
*** Bug 587107 has been marked as a duplicate of this bug. ***
Created attachment 137543: Alternative hack
We're still crashing due to stale mutex pointers. This alternative patch avoids the problem in TryOpenEBookStruct itself by using just an atomic counter instead of the mutex; it may be slightly more robust, but I still just got a crash:
+ Trace 216243
To solve this, I guess the TryOpenEBookStruct access would need to be protected by a mutex in a longer lived data structure.
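A minimal sketch of what "a mutex in a longer lived data structure" might mean: the pending request lives in a static table guarded by a static mutex, and the callback is handed an integer id instead of a pointer into the waiter's stack. Everything here (`Slot`, `pending_register`, `pending_complete`, `pending_withdraw`, the fixed table size) is hypothetical and not taken from any patch on this bug; id wrap-around is ignored for brevity.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

#define MAX_PENDING 8

/* Hypothetical long-lived table of pending open requests. */
typedef struct { bool in_use; bool done; unsigned id; } Slot;

static Slot slots[MAX_PENDING];
static unsigned next_id = 1;                 /* 0 is reserved for "no slot" */
static pthread_mutex_t slots_lock = PTHREAD_MUTEX_INITIALIZER;

/* Waiter side: register a request; returns its id, or 0 if the table is full. */
static unsigned pending_register(void)
{
    unsigned id = 0;
    pthread_mutex_lock(&slots_lock);
    for (int i = 0; i < MAX_PENDING; i++) {
        if (!slots[i].in_use) {
            slots[i].in_use = true;
            slots[i].done = false;
            slots[i].id = id = next_id++;
            break;
        }
    }
    pthread_mutex_unlock(&slots_lock);
    return id;
}

/* Callback side: mark the request done. Returns false if the waiter
 * already withdrew it, in which case nothing is touched. */
static bool pending_complete(unsigned id)
{
    bool delivered = false;
    pthread_mutex_lock(&slots_lock);
    for (int i = 0; i < MAX_PENDING; i++) {
        if (slots[i].in_use && slots[i].id == id) {
            slots[i].done = true;
            delivered = true;
            break;
        }
    }
    pthread_mutex_unlock(&slots_lock);
    return delivered;
}

/* Waiter side: withdraw the request (finished or cancelled).
 * Returns whether the callback had completed it. */
static bool pending_withdraw(unsigned id)
{
    bool done = false;
    pthread_mutex_lock(&slots_lock);
    for (int i = 0; i < MAX_PENDING; i++) {
        if (slots[i].in_use && slots[i].id == id) {
            done = slots[i].done;
            slots[i].in_use = false;
            break;
        }
    }
    pthread_mutex_unlock(&slots_lock);
    return done;
}
```

Because the table and its mutex outlive every call to the waiter, a callback that arrives after a cancel finds no matching id and simply returns, instead of locking a mutex that no longer exists.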
As I understand it, the main cause is that try_open_e_book_cb() is called even after the EBook operation has been cancelled. I hoped eds-dbus would fix this too, as a side effect, but it seems we should take care of it before then.
Bad news: I cannot reproduce this reliably on current master. It happens from time to time, but not on demand, so I cannot verify my previous hypothesis.
*** Bug 587919 has been marked as a duplicate of this bug. ***
*** Bug 588749 has been marked as a duplicate of this bug. ***
I just attached a patch to bug #397265 which rewrites these bits a bit. This can still happen, but if the first fetch survives, then the rest should be working properly onwards.
(In reply to comment #39)
> I just attached a patch to bug #397265 which rewrites these bits a bit. This
> can still happen, but if the first fetch survives, then the rest should be
> working properly onwards.
Sounds interesting, but the patch doesn't apply to 2.26.3.
*** Bug 589577 has been marked as a duplicate of this bug. ***
(In reply to comment #40)
> Sounds interesting, but the patch doesn't apply to 2.26.3.
That may be. It was created for git master.
*** Bug 589966 has been marked as a duplicate of this bug. ***
*** Bug 590123 has been marked as a duplicate of this bug. ***
*** Bug 590115 has been marked as a duplicate of this bug. ***
*** Bug 591160 has been marked as a duplicate of this bug. ***
*** Bug 591179 has been marked as a duplicate of this bug. ***
*** Bug 591717 has been marked as a duplicate of this bug. ***
*** Bug 592266 has been marked as a duplicate of this bug. ***
*** Bug 592553 has been marked as a duplicate of this bug. ***
*** Bug 593265 has been marked as a duplicate of this bug. ***
*** Bug 578780 has been marked as a duplicate of this bug. ***
*** Bug 593670 has been marked as a duplicate of this bug. ***
*** Bug 593827 has been marked as a duplicate of this bug. ***
*** Bug 593914 has been marked as a duplicate of this bug. ***
*** Bug 593984 has been marked as a duplicate of this bug. ***
*** Bug 594159 has been marked as a duplicate of this bug. ***
*** Bug 594590 has been marked as a duplicate of this bug. ***
*** Bug 595046 has been marked as a duplicate of this bug. ***
*** Bug 595140 has been marked as a duplicate of this bug. ***
*** Bug 595168 has been marked as a duplicate of this bug. ***
How many more duplicates will this report need to accumulate before something is done about it? :(
*** Bug 595423 has been marked as a duplicate of this bug. ***
*** Bug 595815 has been marked as a duplicate of this bug. ***
Bump the version to 2.27.9x, someone?
*** Bug 596574 has been marked as a duplicate of this bug. ***
*** Bug 596594 has been marked as a duplicate of this bug. ***
*** Bug 596610 has been marked as a duplicate of this bug. ***
Created attachment 144486: set flag after confirming the operation is cancelled.
It would be nice if someone try this patch to see if it fixes the crash.
Created attachment 144487: check with the right enum value [only for master]
(In reply to comment #69)
> It would be nice if someone try this patch to see if it fixes the crash.
From quick testing, it appears that the patch may indeed fix the crash, but that the preview pane hangs with 'Formatting Message...' instead - bug 568332 again?
Already encountered another preview pane hang, though different this time - an old message was being displayed, and the status bar was spinning with 'Verifying message...' or so. The preview pane hangs seem to occur more often than the crashes, so I'm afraid this cure seems worse than the disease.
Michel, do you have a backtrace of the hang, please? It'll help to catch the issue.
(In reply to comment #73)
> Michel, do you have a backtrace of the hang, please?
No, sorry, and I've switched back to the hack patch now. Also, next week I'm going on vacation for the rest of the month, so I may not get around to getting a backtrace for a while.
*** Bug 597658 has been marked as a duplicate of this bug. ***
*** Bug 597753 has been marked as a duplicate of this bug. ***
*** Bug 597897 has been marked as a duplicate of this bug. ***
*** Bug 598080 has been marked as a duplicate of this bug. ***
*** Bug 598398 has been marked as a duplicate of this bug. ***
*** Bug 598886 has been marked as a duplicate of this bug. ***
*** Bug 598894 has been marked as a duplicate of this bug. ***
*** Bug 599062 has been marked as a duplicate of this bug. ***
I also came across the hang mentioned in comment #72, but eds showed no trace of executing book_open. It looked like the response was somehow missed and evo was waiting indefinitely. All the code paths were notifying the response, though, so I was not able to identify the real problem. But with the fix for bug 397625, this does not happen, as the addressbook is opened only once. Michel, it would be nice if you have a go with the patch at bug 397625 and let us know if it's fixed. At least with my testing I don't face this crasher anymore.
*** Bug 599915 has been marked as a duplicate of this bug. ***
*** Bug 600024 has been marked as a duplicate of this bug. ***
(In reply to comment #83)
> Michel, it would be nice if you have a go with the patch at bug 397625 and let
> us know if it's fixed. At least with my testing I don't face this crasher
> anymore.
I've been running with commit 57712e8456024c5be983f1d934a648034e577208 from the gnome-2-28 branch on top of 2.28.1, and it's looking good so far - not a single crash yet, even though I'm not trying to avoid the problem. Of course, given the random component of the crashes, it'll take some time to gain confidence it's really fixed. But for now I think we can assume it is, and I'll report back if I hit a crash again.
Michel, thanks for the feedback. I am closing this bug now; if you face it again, please re-open it.
Actually, I saw this recently, a couple of times on start. It either crashes or survives (luckily it usually survives). Something more needs to be done.
*** Bug 600966 has been marked as a duplicate of this bug. ***
(In reply to comment #69)
> Created an attachment (id=144486)
> set flag after confirming the operation is cancelled.
>
> It would be nice if someone try this patch to see if it fixes the crash.
I just tried this, and I saw that the cancel can fail, which can then trigger the invalid operation. Please commit it. Thanks.
*** Bug 601062 has been marked as a duplicate of this bug. ***
Pushed to stable and master.
Closing it.
*** Bug 601994 has been marked as a duplicate of this bug. ***
*** Bug 602394 has been marked as a duplicate of this bug. ***
*** Bug 603624 has been marked as a duplicate of this bug. ***
*** Bug 604030 has been marked as a duplicate of this bug. ***
*** Bug 604147 has been marked as a duplicate of this bug. ***
*** Bug 604400 has been marked as a duplicate of this bug. ***
*** Bug 604609 has been marked as a duplicate of this bug. ***
*** Bug 605521 has been marked as a duplicate of this bug. ***
*** Bug 605482 has been marked as a duplicate of this bug. ***
*** Bug 606114 has been marked as a duplicate of this bug. ***
It seems not everything got fixed. I still see this, though really rarely. Note that the reply had been received, but no thread is waiting for it (in the try_open_e_book function). It happens during the first fetch (I removed this boring thread and others from the trace), when I change folders quickly or similar. I'm not sure, as I'm not able to reproduce it reliably. I'm only noting it here; let's see whether someone else is seeing it too.
    0x000000393fc0e9dd in waitpid () from /lib64/libpthread.so.0
+ Trace 219939
Thread 1 (Thread 0x7f148b25e800 (LWP 8020))
*** Bug 607038 has been marked as a duplicate of this bug. ***
*** Bug 608567 has been marked as a duplicate of this bug. ***
*** Bug 582472 has been marked as a duplicate of this bug. ***
Bug 638533 has traces similar to comment #104