GNOME Bugzilla – Bug 445309
crash in camel_certdb_save at camel-certdb.c:371
Last modified: 2008-04-13 16:41:06 UTC
What were you doing when the application crashed?
Closing Evolution using 'Quit'

Distribution: Fedora release 7 (Moonshine)
Gnome Release: 2.18.0 2007-03-23 (Red Hat, Inc)
BugBuddy Version: 2.18.0
System: Linux 2.6.21-1.3194.fc7 #1 SMP Wed May 23 22:47:07 EDT 2007 x86_64
X Vendor: The X.Org Foundation
X Vendor Release: 10300000
Selinux: Enforcing
Accessibility: Disabled
GTK+ Theme: Clearlooks
Icon Theme: Fedora
Memory status: size: 653873152 vsize: 653873152 resident: 77918208 share: 22032384 rss: 77918208 rss_rlim: 18446744073709551615
CPU usage: start_time: 1181254403 rtime: 1426 utime: 1325 stime: 101 cutime:1 cstime: 1 timeout: 0 it_real_value: 0 frequency: 100

Backtrace was generated from '/usr/bin/evolution'
Using host libthread_db library "/lib64/libthread_db.so.1".
[Thread debugging using libthread_db enabled]
[New Thread 46912496389728 (LWP 28963)]
[New Thread 1094719824 (LWP 29143)]
[New Thread 1157925200 (LWP 28996)]
0x0000003f24a0d89f in waitpid () from /lib64/libpthread.so.0
+ Trace 139215
Thread 2 (Thread 1094719824 (LWP 29143))
----------- .xsession-errors ---------------------
** Message: volume = 0
Xlib: extension "SHAPE" missing on display ":0.0".
Xlib: extension "SHAPE" missing on display ":0.0".
Xlib: extension "SHAPE" missing on display ":0.0".
** Message: drive = 0
** Message: volume = 0
** Message: drive = 0
** Message: volume = 0
** Message: drive = 0
** Message: volume = 0
** Message: drive = 0
** Message: volume = 0
Xlib: extension "SHAPE" missing on display ":0.0".
Xlib: extension "SHAPE" missing on display ":0.0".
Xlib: extension "SHAPE" missing on display ":0.0".
--------------------------------------------------
*** Bug 445386 has been marked as a duplicate of this bug. ***
*** Bug 445472 has been marked as a duplicate of this bug. ***
*** Bug 446335 has been marked as a duplicate of this bug. ***
*** Bug 444560 has been marked as a duplicate of this bug. ***
*** Bug 446808 has been marked as a duplicate of this bug. ***
*** Bug 447379 has been marked as a duplicate of this bug. ***
*** Bug 448059 has been marked as a duplicate of this bug. ***
*** Bug 448594 has been marked as a duplicate of this bug. ***
*** Bug 448678 has been marked as a duplicate of this bug. ***
*** Bug 448681 has been marked as a duplicate of this bug. ***
*** Bug 448746 has been marked as a duplicate of this bug. ***
only fedora reports so far...
*** Bug 448781 has been marked as a duplicate of this bug. ***
*** Bug 448811 has been marked as a duplicate of this bug. ***
*** Bug 448871 has been marked as a duplicate of this bug. ***
*** Bug 449165 has been marked as a duplicate of this bug. ***
*** Bug 449352 has been marked as a duplicate of this bug. ***
*** Bug 449424 has been marked as a duplicate of this bug. ***
*** Bug 449448 has been marked as a duplicate of this bug. ***
*** Bug 449452 has been marked as a duplicate of this bug. ***
*** Bug 449557 has been marked as a duplicate of this bug. ***
*** Bug 442067 has been marked as a duplicate of this bug. ***
*** Bug 449846 has been marked as a duplicate of this bug. ***
*** Bug 449896 has been marked as a duplicate of this bug. ***
*** Bug 449990 has been marked as a duplicate of this bug. ***
*** Bug 450735 has been marked as a duplicate of this bug. ***
*** Bug 451318 has been marked as a duplicate of this bug. ***
*** Bug 451582 has been marked as a duplicate of this bug. ***
*** Bug 451736 has been marked as a duplicate of this bug. ***
*** Bug 451757 has been marked as a duplicate of this bug. ***
*** Bug 452433 has been marked as a duplicate of this bug. ***
*** Bug 453140 has been marked as a duplicate of this bug. ***
*** Bug 452635 has been marked as a duplicate of this bug. ***
*** Bug 445550 has been marked as a duplicate of this bug. ***
*** Bug 453974 has been marked as a duplicate of this bug. ***
*** Bug 454077 has been marked as a duplicate of this bug. ***
*** Bug 455772 has been marked as a duplicate of this bug. ***
*** Bug 455773 has been marked as a duplicate of this bug. ***
*** Bug 455780 has been marked as a duplicate of this bug. ***
*** Bug 458051 has been marked as a duplicate of this bug. ***
*** Bug 456765 has been marked as a duplicate of this bug. ***
*** Bug 457746 has been marked as a duplicate of this bug. ***
*** Bug 458805 has been marked as a duplicate of this bug. ***
*** Bug 455213 has been marked as a duplicate of this bug. ***
*** Bug 455332 has been marked as a duplicate of this bug. ***
Dear reporter, thank you for your bug report. It would be helpful if you could install a glibc-debug package, reproduce the bug, and attach the stacktrace here. That way we may be able to determine why fsync() is crashing. Thanks in advance!
(In reply to comment #46)
> Dear reporter, thank you for your bug report.
> It would be helpful if you can install a glibc-debug package, reproduce the bug
> and attach the stacktrace here.
>
> That way we maybe can determine, why fsync() is about to crash.
>
> Thanks in advance!

Hi, I added glibc-debuginfo-common, glib-debuginfo, glib2-debuginfo, and glibc-debuginfo. I didn't see a plain glibc-debug package for Fedora 7. Of course, since I added those packages it will probably work just fine now.

BTW, I'm running into situations while using Evolution where it stops working. Sometimes right in the middle of composing a message, and often while it starts up. Is there a way to cause it to dump to see what is going on or do I need to set up a debug environment and fire up gdb? I used to develop code, however I haven't done that in many years. What I do in those situations is send it a HUP signal and restart it; it then asks me if I want to recover the message and everything is fine, for a while.

This issue will probably end up being another bug to track. I'm just not sure how to bring this up as an issue. Thanks.
Hi.
(In reply to comment #47)
> Is there a way to cause it to dump to see what is going on or do I
> need to set up a debug environment and fire up gdb?

You could attach a debugger (i.e. gdb) to the evolution process and generate a backtrace. You don't need a special debug environment. But please file a bug for each issue or write to the evolution-list first :)

Cheers.
*** Bug 450163 has been marked as a duplicate of this bug. ***
*** Bug 456200 has been marked as a duplicate of this bug. ***
*** Bug 456295 has been marked as a duplicate of this bug. ***
*** Bug 459892 has been marked as a duplicate of this bug. ***
*** Bug 461575 has been marked as a duplicate of this bug. ***
*** Bug 462143 has been marked as a duplicate of this bug. ***
*** Bug 462241 has been marked as a duplicate of this bug. ***
*** Bug 462374 has been marked as a duplicate of this bug. ***
*** Bug 461768 has been marked as a duplicate of this bug. ***
*** Bug 461799 has been marked as a duplicate of this bug. ***
*** Bug 463261 has been marked as a duplicate of this bug. ***
Please note that there are much better stacktraces, e.g. in bug 445386.
+ Trace 157173
My $0.02 on this: certdb hooks g_atexit(), which the file descriptor handling may do as well. So certdb then tries to write to an invalid file descriptor and crashes. Just wild thoughts though... Moving from Evo to e-d-s.
*** Bug 467649 has been marked as a duplicate of this bug. ***
*** Bug 466708 has been marked as a duplicate of this bug. ***
*** Bug 465692 has been marked as a duplicate of this bug. ***
this is currently the worst e-d-s crasher, adding gnome-2.20 target.
I see only Fedora crashers/dupes and nothing else. I don't see any way the code can crash wrt trunk. I would love to see if anybody can prove me wrong with a code review :)
This may or may not be relevant, but the Fedora evolution-data-server package is configured with: --enable-file-locking=fcntl --enable-dot-locking=no Might help to reproduce the problem on non-Fedora distros.
If someone could post or find a stacktrace for this bug that includes debugging info for glibc, that would be very helpful.

    yum install glibc-debuginfo
(In reply to comment #60)
> my 0.02$ to this is, that certdb hooks g_atexit() which the file descriptor
> maybe do as well. So certdb then tries to write to an invalid filedescriptor
> and crashes. Just wild thoughts though...

If the file descriptor is invalid, fsync() should simply return -1 with errno set appropriately, not crash the program. Still, I suspect it may be related to calling fsync() in an atexit() callback.

Not only are all the dupes from Fedora, all but the very first dupe (bug #445386) are from Fedora 7. The first dupe is Fedora 8 Development, filed in early June. It makes me wonder if perhaps there was a glitch in glibc that got fixed shortly after Fedora 7 was released. I'll sift through ChangeLogs and look for clues.
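For anyone who wants to check the fsync() claim above on their own system, here is a minimal sketch. The helper names fsync_result() and check_bad_fd_is_reported() are made up for illustration and are not part of Camel or glibc; the only real API used is POSIX fsync().

```c
#include <assert.h>
#include <errno.h>
#include <unistd.h>

/* Hypothetical helper: returns 0 if fsync() succeeded on fd,
 * otherwise the errno value that fsync() reported. */
int fsync_result(int fd)
{
    if (fsync(fd) == 0)
        return 0;
    assert(errno != 0);   /* a failing fsync() is expected to set errno */
    return errno;
}

/* Hypothetical self-check: -1 is never an open descriptor, so the call
 * should fail with EBADF rather than crash the process. */
void check_bad_fd_is_reported(void)
{
    assert(fsync_result(-1) == EBADF);
}
```

If this behaves as POSIX describes, an invalid descriptor alone would not explain a crash inside fsync(), which is consistent with the suspicion that something else is going on during exit.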
It's true that I haven't had this bug for quite a long time. Perhaps it was fixed in a glibc update, as Matthew thought.
*** Bug 473638 has been marked as a duplicate of this bug. ***
*** Bug 472163 has been marked as a duplicate of this bug. ***
*** Bug 470893 has been marked as a duplicate of this bug. ***
*** Bug 470412 has been marked as a duplicate of this bug. ***
*** Bug 469858 has been marked as a duplicate of this bug. ***
Given that all the dupes are from Fedora users, and all but one are Fedora 7 (including the most recent dupes that Tobias marked), I'm going to move this downstream. It should not block the Evolution 2.12 release. Closing this as NOTGNOME. Please refer to: http://bugzilla.redhat.com/show_bug.cgi?id=278171
*** Bug 474039 has been marked as a duplicate of this bug. ***
*** Bug 474216 has been marked as a duplicate of this bug. ***
*** Bug 474582 has been marked as a duplicate of this bug. ***
*** Bug 474834 has been marked as a duplicate of this bug. ***
*** Bug 475014 has been marked as a duplicate of this bug. ***
*** Bug 475032 has been marked as a duplicate of this bug. ***
hmm, also see bug 347997 and bug 475277! perhaps not NOTGNOME...
*** Bug 475585 has been marked as a duplicate of this bug. ***
*** Bug 475845 has been marked as a duplicate of this bug. ***
*** Bug 477341 has been marked as a duplicate of this bug. ***
*** Bug 475277 has been marked as a duplicate of this bug. ***
Name    : evolution
Product : Fedora 7
Version : 2.10.3
Release : 4.fc7

This update fixes a couple bugs:
- Evolution fails to close after an IMAP alert has been received.
- Combo boxes under "Automatic Contacts" are malfunctioning.

I think the latest Evolution version in Fedora fixes this issue.
*** Bug 478884 has been marked as a duplicate of this bug. ***
*** Bug 479832 has been marked as a duplicate of this bug. ***
*** Bug 480020 has been marked as a duplicate of this bug. ***
*** Bug 480346 has been marked as a duplicate of this bug. ***
*** Bug 481504 has been marked as a duplicate of this bug. ***
*** Bug 481719 has been marked as a duplicate of this bug. ***
*** Bug 482303 has been marked as a duplicate of this bug. ***
Please note a great stacktrace in bug 482303 as well.
bug 482303 has: glibc-2.6-4 evolution-2.10.3-4.fc7 evolution-data-server-1.10.3.1-2.fc7
*** Bug 483124 has been marked as a duplicate of this bug. ***
*** Bug 485895 has been marked as a duplicate of this bug. ***
*** Bug 485840 has been marked as a duplicate of this bug. ***
*** Bug 486160 has been marked as a duplicate of this bug. ***
*** Bug 487088 has been marked as a duplicate of this bug. ***
*** Bug 486904 has been marked as a duplicate of this bug. ***
*** Bug 487654 has been marked as a duplicate of this bug. ***
*** Bug 486930 has been marked as a duplicate of this bug. ***
*** Bug 488671 has been marked as a duplicate of this bug. ***
*** Bug 489640 has been marked as a duplicate of this bug. ***
*** Bug 490290 has been marked as a duplicate of this bug. ***
*** Bug 492075 has been marked as a duplicate of this bug. ***
*** Bug 492432 has been marked as a duplicate of this bug. ***
*** Bug 492647 has been marked as a duplicate of this bug. ***
*** Bug 490598 has been marked as a duplicate of this bug. ***
*** Bug 492839 has been marked as a duplicate of this bug. ***
*** Bug 494432 has been marked as a duplicate of this bug. ***
*** Bug 494932 has been marked as a duplicate of this bug. ***
*** Bug 495713 has been marked as a duplicate of this bug. ***
*** Bug 495407 has been marked as a duplicate of this bug. ***
*** Bug 496452 has been marked as a duplicate of this bug. ***
reopen. bug 491988 is from debian 2.18.
*** Bug 491988 has been marked as a duplicate of this bug. ***
Just to note that I'm using now Evolution 2.12 from Fedora 8 and I do not have this bug anymore.
*** Bug 498913 has been marked as a duplicate of this bug. ***
*** Bug 500088 has been marked as a duplicate of this bug. ***
*** Bug 500545 has been marked as a duplicate of this bug. ***
*** Bug 501274 has been marked as a duplicate of this bug. ***
no Evolution 2.12/GNOME 2.20 reports yet. removing gnome-target milestone.
*** Bug 502068 has been marked as a duplicate of this bug. ***
*** Bug 502259 has been marked as a duplicate of this bug. ***
*** Bug 502547 has been marked as a duplicate of this bug. ***
*** Bug 502953 has been marked as a duplicate of this bug. ***
*** Bug 503258 has been marked as a duplicate of this bug. ***
*** Bug 503429 has been marked as a duplicate of this bug. ***
*** Bug 500532 has been marked as a duplicate of this bug. ***
This just happened again using Evolution 2.12.2 on GNOME 2.20.2. This is my backtrace (on Fedora 8):

Using host libthread_db library "/lib/libthread_db.so.1".
[Thread debugging using libthread_db enabled]
[New Thread -1208580336 (LWP 7699)]
[New Thread -1251656816 (LWP 7746)]
0x00110402 in __kernel_vsyscall ()
+ Trace 181803
Thread 2 (Thread -1251656816 (LWP 7746))
*** Bug 503523 has been marked as a duplicate of this bug. ***
*** Bug 503744 has been marked as a duplicate of this bug. ***
*** Bug 505565 has been marked as a duplicate of this bug. ***
*** Bug 505522 has been marked as a duplicate of this bug. ***
The only thing I can think of to suggest that would likely knock this out is to change the procedure to write certificates to a string buffer and then dump the buffer to a file in one shot, rather than writing certificates directly to an open file stream and then moving that temporary file into place. Should be fairly straightforward to implement but it _will_ break Camel's API slightly, though it's not a part that I think anything outside of CamelCertDB is actually using.
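A rough sketch of the "serialize to a buffer, then write once" idea from the comment above, in plain POSIX C. Everything here is an assumption for illustration: struct cert_record, serialize_records() and write_file_atomically() are hypothetical names, the tab-separated format is invented, and this is not CamelCertDB's actual on-disk format or API. The sketch also still goes through a temporary file plus rename(), which the comment may or may not have intended to keep.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical record standing in for one certificate entry. */
struct cert_record { const char *issuer; const char *subject; };

/* Serialize every record into one heap buffer ("issuer\tsubject\n" each).
 * Returns NULL on allocation failure; the caller frees the buffer. */
char *serialize_records(const struct cert_record *recs, size_t n, size_t *out_len)
{
    size_t len = 0;
    for (size_t i = 0; i < n; i++)
        len += strlen(recs[i].issuer) + 1 + strlen(recs[i].subject) + 1;

    char *buf = malloc(len + 1);          /* +1 for sprintf's trailing NUL */
    if (buf == NULL)
        return NULL;

    char *p = buf;
    for (size_t i = 0; i < n; i++)
        p += sprintf(p, "%s\t%s\n", recs[i].issuer, recs[i].subject);

    assert(p == buf + len);               /* size computed above must match */
    *out_len = len;
    return buf;
}

/* Write buf to "<path>~", sync it, then rename it over path.
 * Returns 0 on success, -1 on any failure. */
int write_file_atomically(const char *path, const char *buf, size_t len)
{
    char tmp[4096];
    int n = snprintf(tmp, sizeof tmp, "%s~", path);
    if (n < 0 || (size_t) n >= sizeof tmp)
        return -1;

    int fd = open(tmp, O_WRONLY | O_CREAT | O_TRUNC, 0600);
    if (fd == -1)
        return -1;

    size_t off = 0;
    while (off < len) {
        ssize_t w = write(fd, buf + off, len - off);
        if (w == -1) {
            if (errno == EINTR)
                continue;                 /* interrupted, retry */
            close(fd);
            unlink(tmp);
            return -1;
        }
        off += (size_t) w;
    }

    if (fsync(fd) == -1) { close(fd); unlink(tmp); return -1; }
    if (close(fd) == -1) { unlink(tmp); return -1; }
    if (rename(tmp, path) == -1) { unlink(tmp); return -1; }
    return 0;
}
```

The point of the split is that all per-certificate work happens in memory, so the window in which a file descriptor is open during shutdown shrinks to a single write/sync/rename sequence.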
*** Bug 508872 has been marked as a duplicate of this bug. ***
*** Bug 508880 has been marked as a duplicate of this bug. ***
We have seen invalid file descriptor warnings when testing Modest and Tinymail based E-mail clients with valgrind. Those warnings were about the cert-db handling too. May I suggest testing this problem with valgrind and putting your findings in comments here?
*** Bug 509829 has been marked as a duplicate of this bug. ***
the missing piece from bug 508638:
+ Trace 185778
*** Bug 508638 has been marked as a duplicate of this bug. ***
Hmm, strange, most of the other stack traces show the crash in fsync(). Anyway, here's the code for open() from glibc in Fedora 8:

    __extern_always_inline int
    open (__const char *__path, int __oflag, ...)
    {
      if (__va_arg_pack_len () > 1)
        __open_too_many_args ();

      if (__builtin_constant_p (__oflag))
        {
          if ((__oflag & O_CREAT) != 0 && __va_arg_pack_len () < 1)
            {
              __open_missing_mode ();
              return __open_2 (__path, __oflag);
            }
          return __open_alias (__path, __oflag, __va_arg_pack ());
        }

      if (__va_arg_pack_len () < 1)
        return __open_2 (__path, __oflag);

      return __open_alias (__path, __oflag, __va_arg_pack ());
    }

Doesn't shed as much light as I'd hoped, but it gives me something else to search for. Tracing beyond this seems to take us into the kernel.
+ Trace 185790
Thread 1 (Thread 0xb66136c0 (LWP 12550))
$4 = (FILE *) 0x836aeb0
(gdb) p *out
$5 = {_flags = -72536956,
  _IO_read_ptr = 0xb5ea8000 "\200\201�O=SERVICES2,OU=Organizational CA�CN=midro.wal.novell.com,OU=IS&T,O=Novell,L=Waltham,ST=Massachussets,C=US\221wal-3.novell.com�**********************************************\203",
  _IO_read_end = 0xb5ea8000 "\200\201�O=SERVICES2,OU=Organizational CA�CN=midro.wal.novell.com,OU=IS&T,O=Novell,L=Waltham,ST=Massachussets,C=US\221wal-3.novell.com�**********************************************\203",
  _IO_read_base = 0xb5ea8000 "\200\201�O=SERVICES2,OU=Organizational CA�CN=midro.wal.novell.com,OU=IS&T,O=Novell,L=Waltham,ST=Massachussets,C=US\221wal-3.novell.com�**********************************************\203",
  _IO_write_base = 0xb5ea8000 "\200\201�O=SERVICES2,OU=Organizational CA�CN=midro.wal.novell.com,OU=IS&T,O=Novell,L=Waltham,ST=Massachussets,C=US\221wal-3.novell.com�**********************************************\203",
  _IO_write_ptr = 0xb5ea8000 "\200\201�O=SERVICES2,OU=Organizational CA�CN=midro.wal.novell.com,OU=IS&T,O=Novell,L=Waltham,ST=Massachussets,C=US\221wal-3.novell.com�**********************************************\203",
  _IO_write_end = 0xb5ea9000 "\177ELF\001\001\001",
  _IO_buf_base = 0xb5ea8000 "\200\201�O=SERVICES2,OU=Organizational CA�CN=midro.wal.novell.com,OU=IS&T,O=Novell,L=Waltham,ST=Massachussets,C=US\221wal-3.novell.com�**********************************************\203",
  _IO_buf_end = 0xb5ea9000 "\177ELF\001\001\001",
  _IO_save_base = 0x0, _IO_backup_base = 0x0, _IO_save_end = 0x0,
  _markers = 0x0, _chain = 0xb24101c8, _fileno = 16, _flags2 = 0,
  _old_offset = 142054648, _cur_column = 0, _vtable_offset = 0 '\0',
  _shortbuf = "\b", _lock = 0x836af48, _offset = -1, __pad1 = 0x1,
  __pad2 = 0x836af54, __pad3 = 0x0, __pad4 = 0x0, __pad5 = 141397416,
  _mode = -1,
  _unused2 = "�ȥ\b�<\033\t\000\000\000\000\000\000\000\000pɥ\b", '\0' <repeats 12 times>, "!\000\000\000�\020�\b"}
(gdb) p fileno(out)
$6 = 16

I have masked some of the data. So, I got this today.
I have debugged more. *out is valid and fileno is right. I still don't know why fsync() crashes. Should we change to just sync()?
I was comparing camel_certdb_save() against GLib's g_file_set_contents(). GLib doesn't call fflush() or fsync() explicitly. It just does fwrite() followed by fclose(), and I believe fclose() should flush and sync the file for you. The Single UNIX Specification v3 says:

"The fclose() function shall cause the stream pointed to by stream to be flushed and the associated file to be closed."

Still, I suspect removing the fflush() and fsync() calls would only move the problem elsewhere. And it still doesn't explain why open() is crashing.

This brings me back to my note in comment #69:

"If the file descriptor is invalid, fsync() should simply return -1 with errno set appropriately, not crash the program. Still, I suspect it may be related to calling fsync() in an atexit() callback."

camel_certdb_save() gets called from camel_shutdown(), which is an atexit() callback. Maybe camel_shutdown() should be public and we should require calling it explicitly before exiting the process. Evolution calls camel_init() from mail_session_init(). Perhaps we need a mail_session_shutdown() that calls camel_shutdown()?
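To make the "public shutdown plus atexit() fallback" idea above concrete, here is a small sketch. The names lib_init(), lib_shutdown() and lib_teardown_runs() are hypothetical stand-ins, not the real camel_init()/camel_shutdown(); the only library call relied on is the standard atexit().

```c
#include <assert.h>
#include <stdlib.h>

enum lib_state { LIB_UNUSED, LIB_RUNNING, LIB_STOPPED };

static enum lib_state state = LIB_UNUSED;
static int teardown_runs;      /* how many times the real teardown ran */

/* Public shutdown: safe to call explicitly and again from atexit(). */
void lib_shutdown(void)
{
    if (state != LIB_RUNNING)
        return;                /* never started, or already shut down */
    state = LIB_STOPPED;
    teardown_runs++;
    assert(teardown_runs == 1);   /* teardown must never run twice */
    /* ... save state and release libraries here ... */
}

/* One-shot init that also registers the shutdown as a fallback.
 * Returns 0 on success, -1 if already used or atexit() failed. */
int lib_init(void)
{
    if (state != LIB_UNUSED)
        return -1;
    if (atexit(lib_shutdown) != 0)
        return -1;
    state = LIB_RUNNING;
    return 0;
}

/* Accessor so callers (or tests) can observe the teardown count. */
int lib_teardown_runs(void)
{
    return teardown_runs;
}
```

With this shape, an application can call lib_shutdown() at a well-defined point before exit() starts tearing things down, and the atexit() registration becomes a harmless no-op in that case.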
Matt, I'm fine with trying it. It definitely isn't going to do any more harm. We can have a clean shutdown path that way.
fclose() calls fflush(), which flushes the FILE* buffers, but they might not get sync'd to disk, which can only be done via fsync().
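The distinction drawn above (stdio flush versus a sync to disk) can be sketched as follows. flush_sync_close() is a hypothetical helper written for this thread, not a Camel or GLib function; it only uses the standard fflush(), fileno(), fsync() and fclose() calls.

```c
#include <assert.h>
#include <stdio.h>
#include <unistd.h>

/* Hypothetical helper: push stdio buffers to the kernel, ask the kernel
 * to commit the file to storage, then close the stream.
 * Returns 0 if every step succeeded, -1 otherwise. */
int flush_sync_close(FILE *fp)
{
    int rc = 0;

    assert(fp != NULL);

    if (fflush(fp) == EOF)                   /* user-space buffer -> kernel */
        rc = -1;
    if (rc == 0 && fsync(fileno(fp)) == -1)  /* kernel cache -> storage */
        rc = -1;
    if (fclose(fp) == EOF)                   /* always close the stream */
        rc = -1;

    return rc;
}
```

Dropping the fsync() step would leave only the fflush() that fclose() already performs, which is the behavior the previous comment warns about.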
*** Bug 510447 has been marked as a duplicate of this bug. ***
I think I made some actual investigative progress on this tonight.

My idea in comment #148 did not fix the problem. If anything, it made it EASIER to reproduce. By making camel_shutdown() public and explicitly calling it earlier in the shutdown process (before exit() begins), I was able to reproduce the crash fairly frequently by simply starting Evolution, waiting for it to start refreshing folders on an SSL-enabled IMAP account, then closing Evolution before it finished.

The source of the crash is not fsync(). It's PR_Lock(), called from _pt_thread_death_internal() in the "primordial" thread. (NSPR defines the "primordial" thread as the thread from which PR_Init() was called.) fsync() seems to be the catalyst for a race between _pt_thread_death_internal() and PR_Cleanup().

I'll let the code speak for itself. Note the comment.

    PR_IMPLEMENT(PRStatus) PR_Cleanup(void)
    {
        PRThread *me = PR_GetCurrentThread();
        ...
        if (me->state & PT_THREAD_PRIMORD)
        {
            ...
            /*
             * I am not sure if it's safe to delete the cv and lock here,
             * since there may still be "system" threads around. If this
             * call isn't immediately prior to exiting, then there's a
             * problem.
             */
            if (0 == pt_book.system)
            {
                PR_DestroyCondVar(pt_book.cv); pt_book.cv = NULL;
                PR_DestroyLock(pt_book.ml); pt_book.ml = NULL;
            }
            ...
        }
    }

    static void _pt_thread_death_internal(void *arg, PRBool callDestructors)
    {
        PRThread *thred = (PRThread*)arg;

        if (thred->state & (PT_THREAD_FOREIGN|PT_THREAD_PRIMORD))
        {
            PR_Lock(pt_book.ml);
            ...
            PR_Unlock(pt_book.ml);
        }
        ...
    }

PR_Cleanup() is called BEFORE camel_certdb_save() in camel_shutdown(). I think simply swapping the order of these calls might do the trick. Indeed, with the calls swapped and after many tries, I've not been able to reproduce the crash.

It might also explain the reported Win32 problems. From camel_shutdown():

    #if defined (HAVE_NSS) && !defined (G_OS_WIN32)
        /* For some reason we get into trouble on Win32 if we call these.
         * But they shouldn't be necessary as the process is exiting anyway?
         */
        NSS_Shutdown ();

        PR_Cleanup ();
    #endif /* HAVE_NSS */

Footnote: Milan was right about MailComponent never being finalized. In fact none of the EvolutionComponents are being finalized. EShellView is also leaking a reference. I've yet to pin down the source of the leak but I suspect it's somewhere in the shell.
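The ordering argument above can be modelled in a few lines with POSIX threads primitives. This is only a single-threaded sketch of the ordering, not a reproduction of the NSPR race: book_lock, book_init(), save_state(), book_cleanup() and shutdown_in_safe_order() are all hypothetical names, with save_state() standing in for any late work that still takes the shared lock and book_cleanup() standing in for a cleanup that destroys it. Unlike PR_Lock(), the stand-in checks for a missing lock and returns -1 instead of crashing.

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>

static pthread_mutex_t *book_lock;   /* shared lock, NULL once destroyed */

/* Create the shared lock. Returns 0 on success, -1 on failure. */
int book_init(void)
{
    book_lock = malloc(sizeof *book_lock);
    if (book_lock == NULL)
        return -1;
    if (pthread_mutex_init(book_lock, NULL) != 0) {
        free(book_lock);
        book_lock = NULL;
        return -1;
    }
    return 0;
}

/* Late shutdown work that needs the lock. Returns -1 if the lock is
 * already gone (code that does not check would dereference NULL here). */
int save_state(void)
{
    if (book_lock == NULL)
        return -1;
    pthread_mutex_lock(book_lock);
    /* ... write out state while holding the lock ... */
    pthread_mutex_unlock(book_lock);
    return 0;
}

/* Destroy the shared lock; nothing may use it afterwards. */
void book_cleanup(void)
{
    if (book_lock == NULL)
        return;
    pthread_mutex_destroy(book_lock);
    free(book_lock);
    book_lock = NULL;
}

/* Safe order: finish everything that needs the lock, then destroy it. */
int shutdown_in_safe_order(void)
{
    int rc = save_state();
    book_cleanup();
    assert(book_lock == NULL);
    return rc;
}
```

Calling save_state() after book_cleanup() is the analogue of the original call order, and in this sketch it reports failure rather than faulting.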
Created attachment 103303
Proposed patch

If I'm right about everything above, I think this should do it. Note that I've removed the Win32 workaround since I have a theory about what the problem may have been. It will be interesting to see if it crops up again in the future, assuming we ever get Evolution working on Win32 again.
Matt, your theory seems fine to me. I think the only way we could verify it completely is to commit and test it. I would say commit it ASAP, give it a week or two, and watch for it in Bugzilla.
Committed to trunk (revision #8399). Everyone please keep an eye out for this in 2.21.90 or later!
*** Bug 511475 has been marked as a duplicate of this bug. ***
*** Bug 513469 has been marked as a duplicate of this bug. ***
*** Bug 513874 has been marked as a duplicate of this bug. ***
*** Bug 514110 has been marked as a duplicate of this bug. ***
*** Bug 514423 has been marked as a duplicate of this bug. ***
*** Bug 514770 has been marked as a duplicate of this bug. ***
*** Bug 515479 has been marked as a duplicate of this bug. ***
*** Bug 516310 has been marked as a duplicate of this bug. ***
*** Bug 516551 has been marked as a duplicate of this bug. ***
*** Bug 516783 has been marked as a duplicate of this bug. ***
*** Bug 517528 has been marked as a duplicate of this bug. ***
*** Bug 513477 has been marked as a duplicate of this bug. ***
*** Bug 518658 has been marked as a duplicate of this bug. ***
*** Bug 510655 has been marked as a duplicate of this bug. ***
*** Bug 515189 has been marked as a duplicate of this bug. ***
*** Bug 519601 has been marked as a duplicate of this bug. ***
*** Bug 520396 has been marked as a duplicate of this bug. ***
*** Bug 520727 has been marked as a duplicate of this bug. ***
*** Bug 521447 has been marked as a duplicate of this bug. ***
*** Bug 521485 has been marked as a duplicate of this bug. ***
*** Bug 347997 has been marked as a duplicate of this bug. ***