GNOME Bugzilla – Bug 523463
Core dump in gmain.c:2482:IA__g_main_context_check()
Last modified: 2010-01-15 03:56:23 UTC
Steps to reproduce:
I'm not sure how to reproduce this. I built glib, libsoup, libbonobo, libgtkhtml, evolution-data-server, evolution, evolution-exchange, and evolution-webcal as checked out from SVN trunk today at about 2pm EST. I was trying to accept a meeting request from Exchange when Evolution dumped core. Examining the core dump led me to a glib problem.

I'm using glib SVN:

$ svn info
Path: .
URL: http://svn.gnome.org/svn/glib/trunk
Repository Root: http://svn.gnome.org/svn/glib
Repository UUID: 5bbd4a9e-d125-0410-bf1d-f987e7eefc80
Revision: 6738
Node Kind: directory
Schedule: normal
Last Changed Author: sbacher
Last Changed Rev: 6737
Last Changed Date: 2008-03-19 10:07:00 -0400 (Wed, 19 Mar 2008)

Stack trace:
Core was generated by `/opt/evo/bin/evolution --component=mail'.
Program terminated with signal 11, Segmentation fault.
+ Trace 192892
(gdb) p n_fds
$2 = 7
(gdb) p i
$3 = 6
(gdb) p *context->poll_records->next->next->next->next->next->next
$4 = {fd = 0x86f0334, next = 0x0, priority = 0}

Other information:
Examining the source here, it seems like there is a serious bug in this loop:

2478      pollrec = context->poll_records;
2479      i = 0;
2480      while (i < n_fds)
2481        {
2482          if (pollrec->fd->events)
2483            {
2484              pollrec->fd->revents = fds[i].revents;
2485              i++;
2486            }
2487          pollrec = pollrec->next;
2488        }

The while() statement uses a condition of i < n_fds to determine when to exit, but i is not always incremented: it's only incremented when pollrec->fd->events is true. Because we don't check to see if pollrec is NULL, it's easy to walk off the end of this linked list, which is what happened here.

Note in the gdb output above that n_fds is 7, i is 6, and there are in fact 7 records in the context->poll_records linked list. But somewhere along the line we must have skipped one of the i++ statements, while still performing the pollrec = pollrec->next statement.

I don't know the code well enough to know if the answer is to add another test to the while condition:

    while (i < n_fds && pollrec)

or to move the pollrec = pollrec->next inside the if-statement along with i++, or what.
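[Editorial note] To make the failure mode concrete, here is a minimal standalone sketch. The PollFD/PollRec structs are simplified stand-ins for glib's GPollFD and internal GPollRec (invented for illustration), and the setup hand-simulates an events field having been cleared after the query step counted the records:

    /* Standalone reproduction sketch (NOT glib source).  Shows the check
     * loop walking off the list when a record's events field is zero. */
    #include <stdio.h>

    typedef struct { int fd; short events; short revents; } PollFD;
    typedef struct PollRec { PollFD *fd; struct PollRec *next; } PollRec;

    int main (void)
    {
      PollFD  fds[2] = { { 3, 1, 0 }, { 4, 1, 0 } };  /* filled by "query" */
      PollFD  a = { 3, 1, 0 };
      PollFD  b = { 4, 0, 0 };      /* events cleared AFTER query counted it */
      PollRec rec_b = { &b, NULL };
      PollRec rec_a = { &a, &rec_b };

      int      n_fds = 2;           /* query saw both records with events != 0 */
      PollRec *pollrec = &rec_a;
      int      i = 0;

      while (i < n_fds)             /* same shape as the gmain.c loop */
        {
          if (pollrec->fd->events)
            {
              pollrec->fd->revents = fds[i].revents;
              i++;
            }
          pollrec = pollrec->next;  /* after rec_b this becomes NULL ...   */
        }                           /* ... and the next pollrec->fd
                                     * dereference is the segfault above.  */
      return 0;
    }

After rec_b is skipped (events == 0), i stays at 1 while pollrec advances to NULL, so the next iteration dereferences a NULL pollrec: exactly the state captured in the gdb output ($3 = 6 with 7 records and a NULL next).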
I did a search in Bugzilla for IA__g_main_context_check and I found a lot of bugs where something (usually Evolution) dumped core in this function, right around this line. I also checked the Subversion history via "Annotate" and this code was apparently added in SVN revision 943, on Tue Dec 5 20:45:33 2000 UTC and it doesn't seem like it has changed since. Either I don't understand the code, or there's been a bug here for quite a while.
whoo.. you should reference the duplicates you found.
I just did a search on IA__g_main_context_check and got a response of 24 bugs. I believe all but one or two are duplicates of this bug; here is the list of bugs I think are duplicates. I can't mark them as such unfortunately. Bug #448467
+ Trace 192899
Adding a NULL check would be harmless, but might be fixing the symptom. In a single-threaded program all should be fine: g_main_context_query() counts the fds in exactly the way that g_main_context_check() expects. However, in a multi-threaded environment I am not so sure that things are right. For example, have a look at g_main_context_iterate(). Between the call to g_main_context_query() and the call to g_main_context_check(), what is preventing the context from being updated? The lock is specifically not held. And g_main_context_poll() might cause changes.
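[Editorial note] For readers unfamiliar with gmain.c, here is the window being described, as illustrative pseudocode only; it is a condensation of the sequence Owen names, not the actual g_main_context_iterate() body (LOCK_CONTEXT/UNLOCK_CONTEXT are the real locking macros in gmain.c, everything else is abbreviated):

    /* Illustrative pseudocode of the ordering -- NOT glib source. */
    LOCK_CONTEXT (context);
    n_fds = g_main_context_query (context, max_priority, &timeout, fds, n_fds);
    UNLOCK_CONTEXT (context);       /* the lock is deliberately dropped here */

    /* Window: another thread can now mutate an existing poll record's
     * fd->events.  (Adding/removing records sets poll_changed and is
     * detected; flipping events on an existing record is not.) */
    g_main_context_poll (context, timeout, max_priority, fds, n_fds);

    /* ... back under the lock eventually ... */
    g_main_context_check (context, max_priority, fds, n_fds);
    /* check() walks poll_records assuming the same events values that
     * query() saw; if one was cleared in the window, the counting loop
     * runs off the end of the list. */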
Yes, that's what I thought as well (adding the NULL check would just mask the real problem).

Looking at this code, it appears to test context->poll_changed and, if it's set, return early saying "If the set of poll file descriptors changed, bail out and let the main loop rerun". The poll_changed member is reset after we count the number of fds in g_main_context_query(), and it's set whenever we add or remove items in the poll_records list. So, it seems pretty safe that we won't get to this loop if records have been added to or removed from the list.

However, what we AREN'T safe against here is one of the existing records having its fd->events field reset between the query function (which counts them) and the check function (which walks them). If that happens, the loop will walk off the end and you'll get a core dump.

It will require a good bit more spelunking to determine when that field could be reset, and what the right way is to guard against it. One option would be to ignore any value that would have been assigned to a record whose events field is now empty; that would involve rewriting the problematic loop like this:

    pollrec = context->poll_records;
    i = 0;
    while (i < n_fds)
      {
        if (pollrec->fd->events)
          pollrec->fd->revents = fds[i].revents;
        i++;
        pollrec = pollrec->next;
      }

This would fix the core dump, but I don't know whether or not it's the right thing to do.
Created attachment 107767 [details] [review]
Insufficient fix for this bug

I've attached a patch which implements my last comment. It seems correct to me, or at least as correct as can be accomplished in this situation. The only alternative I can see is to perform another sweep of the records list to make sure they are all present, and if not, fail the same way we fail when the records list has changed. We can't wait until we find the error, since by then we'll have processed some but not all of the fds, which doesn't seem like a good idea (but, I know the code only very superficially).

Of course, in a multi-threaded environment just locking the records list doesn't guarantee that the records it points to won't be modified in another thread; I don't know if we can/should bother locking all those records as well, or something.
The patch is certainly not correct as it stands without the corresponding change to g_main_context_query(): you'll a) get out of sync and b) read past the end of the fds array if there are any pollrecs with zero ->events. If the corresponding change to g_main_context_query() were made, the patch would make more sense. But I'd really like to know in what circumstances ->events is being mutated, to make sure that diagnosis is correct. Adding a poll record and then changing ->events later was certainly not an anticipated usage of the API.
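[Editorial note] A worked example of the out-of-sync failure Owen describes, using a hypothetical list (the records A/B/C and their ordering are invented for illustration):

    /* Hypothetical poll_records list:  B(events==0) -> A(events!=0) -> C(events!=0)
     *
     * The unmodified g_main_context_query() still skips B, so n_fds == 2
     * and fds[] holds slots for A and C only: fds[0] = A, fds[1] = C.
     *
     * The patched check loop advances i on EVERY record:
     *   i == 0: B has events == 0, assignment skipped, but i advances anyway
     *   i == 1: A gets fds[1].revents -- but fds[1] is C's slot, not A's
     *   i == 2: loop exits (i == n_fds); C is never updated at all
     *
     * So query() and check() must agree on which records occupy fds[],
     * which is why the query() side has to change too. */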
Hi Owen. You're right; I was concentrating on not walking off the linked list and didn't pay enough attention to the fds array. I'll have to go back and look at g_main_context_query(). I do agree with you that even if the patch were correct, it feels like a workaround rather than a fully-grokked solution to the problem. However, I'd definitely need to dig a lot further into the code to understand the way events and revents are used, and what areas of the code might be changing them. The records stuff is nice in that it's wholly contained in gmain.c and therefore easy to understand. It's probably more efficient for someone who's actually written a line of glib code to look into the issue :-). But, given the number of bugs that have been filed related to this problem, I think it's important to get to the bottom of it.
Comment on attachment 107767 [details] [review]
Insufficient fix for this bug

See Owen Taylor's comment #7; this patch isn't completely correct.
I don't think digging into the GLib code would tell you anything about the usage of the events field... the GLib usage is all as expected when I originally wrote the code. But if you look at:

http://svn.gnome.org/viewvc/ORBit2/trunk/linc2/src/linc-source.c?view=markup

(this is the CORBA implementation that Evolution uses internally), and in particular at linc_source_set_condition(), you'll see that it does change the events field on the fly. In light of that, tracking down the exact sequence of rare conditions between threads to figure out the crash isn't really necessary... I think we can go ahead and adjust the GLib code to try to be robust against this (ab)use.

If you extended the "if (pollfd->events)" removal to cover g_main_context_query() as well, then the downsides I see are:

- A tiny amount of performance loss, as pollfd records with events=0 will be passed to the kernel.
- If you have a pollfd with events=0 that is in an error state (e.g., has an invalid FD in it), then GLib will spin at 100% CPU, where previously it would not.

I don't see much of an alternative within the constraints of how the code is structured and exposed publicly, so I'd be OK with such a patch.
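[Editorial note] The pattern Owen describes can be expressed with public GLib API. A minimal sketch follows; the DemoSource type and the demo_* names are invented for illustration, but g_source_new(), g_source_add_poll(), and GPollFD are the real API involved. linc_source_set_condition() in ORBit2 does the moral equivalent of demo_source_set_condition() below:

    #include <glib.h>

    typedef struct {
      GSource source;
      GPollFD pollfd;
    } DemoSource;

    static gboolean
    demo_prepare (GSource *source, gint *timeout)
    {
      *timeout = -1;   /* no timeout; we only care about the fd */
      return FALSE;
    }

    static gboolean
    demo_check (GSource *source)
    {
      return ((DemoSource *) source)->pollfd.revents != 0;
    }

    static gboolean
    demo_dispatch (GSource *source, GSourceFunc callback, gpointer user_data)
    {
      ((DemoSource *) source)->pollfd.revents = 0;
      return TRUE;     /* keep the source installed */
    }

    static GSourceFuncs demo_funcs = {
      demo_prepare, demo_check, demo_dispatch, NULL
    };

    GSource *
    demo_source_new (int fd)
    {
      DemoSource *ds = (DemoSource *) g_source_new (&demo_funcs,
                                                    sizeof (DemoSource));
      ds->pollfd.fd = fd;
      ds->pollfd.events = G_IO_IN;
      ds->pollfd.revents = 0;
      g_source_add_poll ((GSource *) ds, &ds->pollfd);
      return (GSource *) ds;
    }

    /* The problematic part: mutating events AFTER the fd is registered.
     * If another thread does this between g_main_context_query() and
     * g_main_context_check(), the pre-fix check loop loses count
     * (especially when cond is 0). */
    void
    demo_source_set_condition (DemoSource *ds, GIOCondition cond)
    {
      ds->pollfd.events = cond;
    }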
(BTW, I forgot to mention: good work in tracking this down! it's presumably been causing the occasional evolution crash for years.)
*** Bug 482819 has been marked as a duplicate of this bug. ***
grepping through my jhbuild source directory, I found some code in avahi that appears to do this too: avahi-glib/glib-watch.c:watch_update().
*** Bug 491945 has been marked as a duplicate of this bug. ***
*** Bug 495622 has been marked as a duplicate of this bug. ***
*** Bug 527295 has been marked as a duplicate of this bug. ***
*** Bug 527716 has been marked as a duplicate of this bug. ***
Paul, can you paste here the complete stack trace from evolution when it appears again? I have a feeling it's not a glib issue, but I may be wrong.
I DID put a complete stack trace from evolution in the description. It's only 5 frames deep, but the last frame is main(), just as you'd expect. Also, if you look at the various bugs that have been filed as duplicates of this bug, you'll see they all have identical stack traces at the top where the core happens, but some of them are from completely different applications, including rhythmbox etc. And check the discussion between myself and Owen Taylor in this bug: I'm pretty confident (as is Owen) that this is a bug in glib and not specific to Evo. Why do you think it's not a glib issue? There seems to be PLENTY of evidence in this bug report showing that it is. I'd like to get around to creating a correct patch for this, but I just haven't found the time. It's actually not that easy to know how to fix this bug unless you have a lot of understanding of how the glib event loop works, which I definitely don't have.
I meant "thread apply all bt", not only "bt". It shows all threads, which is what I call a complete stack trace. :) I've seen some crashers which show the signal handler in the main thread, but where the crash was actually caused in another thread. I'm just wondering how much this is related to address book lookups, which some of your possible duplicates involve too. I read this bug report only briefly, and I looked into your patch, but I also don't know glib well enough to have any real idea about it :) Still, there is obviously some possibility of running over the boundary in the while loop you changed.
OK, see below for an all-thread bt:

(gdb) thread apply all bt
+ Trace 195863
Thread 1 (process 25310)
Paul, would you mind submitting a revised patch following Owen's suggestions? Then Owen or someone can review and hopefully approve it so we can close this. Nice work, btw. I've seen similar stack traces myself time and again from Evolution and it never occurred to me it could be a GLib/ORBit issue.
OK, I'll make some time tonight to get into this.
Created attachment 109762 [details] [review]
Second attempt at a patch for glib

Sorry for the delay, but I had to get some sleep and it took a long time to recompile the latest Evolution (I hadn't done it in a while). I've tested the new Evo with this change applied to glib and it seems to work OK, although obviously I don't know how exactly to trigger the bug so I can't really say. I think this is what Owen was requesting in his comments. Please let me know if I got something wrong.

Sorry about the extra whitespace changes; I have Emacs set up to clean up whitespace automatically, and there is a LOT of extra whitespace in this file. I removed most of it from the patch before I posted it.
*** Bug 448876 has been marked as a duplicate of this bug. ***
*** Bug 448467 has been marked as a duplicate of this bug. ***
Owen, does the patch look like what you were expecting ?
Comment on attachment 109762 [details] [review]
Second attempt at a patch for glib

This patch is getting closer, but isn't quite there... as it currently stands, it will leave entries in the fds[] array with uninitialized events and revents fields. I would suggest simply removing the pollrec->fd->events check from g_main_context_query(). Also, I'd like to see a comment to prevent this from being broken again by someone optimizing it later. Say:

  /* We need to include entries with fd->events == 0 in the array because
   * otherwise if the application changes fd->events behind our back and
   * makes it non-zero, we'll be out of sync when we check the fds[] array.
   * (Changing fd->events after adding an FD wasn't an anticipated use of
   * this API, but it occurs in practice.) */
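[Editorial note] Putting Owen's two suggestions together, the fixed code would plausibly look like the sketch below. This is a reconstruction for illustration, not the committed patch; the variable names follow the snippets quoted earlier in this bug:

    /* In g_main_context_query(): fill fds[] for EVERY poll record, with no
     * events check, so query and check always agree on the record <-> slot
     * pairing. */
    pollrec = context->poll_records;
    while (pollrec && n_poll < n_fds)
      {
        /* We need to include entries with fd->events == 0 in the array
         * because otherwise if the application changes fd->events behind
         * our back and makes it non-zero, we'll be out of sync when we
         * check the fds[] array. */
        fds[n_poll].fd = pollrec->fd->fd;
        fds[n_poll].events = pollrec->fd->events;
        fds[n_poll].revents = 0;
        n_poll++;
        pollrec = pollrec->next;
      }

    /* In g_main_context_check(): advance i on every record, so a record
     * whose events field was zeroed mid-iteration no longer desynchronizes
     * the walk or runs it off the end of the list. */
    pollrec = context->poll_records;
    i = 0;
    while (pollrec && i < n_fds)
      {
        if (pollrec->fd->events)
          pollrec->fd->revents = fds[i].revents;
        i++;
        pollrec = pollrec->next;
      }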
*** Bug 535995 has been marked as a duplicate of this bug. ***
Is it likely that bug #533225 and its duplicates are caused by this?
*** Bug 551030 has been marked as a duplicate of this bug. ***
Created attachment 118198 [details] [review]
another patch

Does this match what you want to see, Owen?
Comment on attachment 118198 [details] [review]
another patch

That looks good to me.
2008-09-09  Matthias Clasen  <mclasen@redhat.com>

        Bug 523463 – Core dump in gmain.c:2482:IA__g_main_context_check()

        * glib/gmain.c (g_main_context_check): Be robust against setting
        event fields on the fly, as e.g. happens in linc. Tracked down by
        Paul Smith, fix proposed by Owen Taylor.
*** Bug 518922 has been marked as a duplicate of this bug. ***
Thanks Matthias. I just couldn't find the time to learn glib (which I've never actually used at all for anything) and get the patch right. Especially since, for some odd reason, I never seem to hit this bug anymore (I don't think I've had an Evo crash related to this since May or so); I wonder if the Evo guys reworked the code to avoid the situation, or if I was just lucky. Still, it will be excellent to get this resolved.
*** Bug 554210 has been marked as a duplicate of this bug. ***
*** Bug 519444 has been marked as a duplicate of this bug. ***
*** Bug 478585 has been marked as a duplicate of this bug. ***
*** Bug 480380 has been marked as a duplicate of this bug. ***
*** Bug 492041 has been marked as a duplicate of this bug. ***
*** Bug 495591 has been marked as a duplicate of this bug. ***
*** Bug 503690 has been marked as a duplicate of this bug. ***
*** Bug 504882 has been marked as a duplicate of this bug. ***
*** Bug 507058 has been marked as a duplicate of this bug. ***
*** Bug 509548 has been marked as a duplicate of this bug. ***
*** Bug 509639 has been marked as a duplicate of this bug. ***
*** Bug 510696 has been marked as a duplicate of this bug. ***
*** Bug 514369 has been marked as a duplicate of this bug. ***
*** Bug 555479 has been marked as a duplicate of this bug. ***
*** Bug 556923 has been marked as a duplicate of this bug. ***
*** Bug 558573 has been marked as a duplicate of this bug. ***
*** Bug 558861 has been marked as a duplicate of this bug. ***
*** Bug 558862 has been marked as a duplicate of this bug. ***
*** Bug 559565 has been marked as a duplicate of this bug. ***
*** Bug 541639 has been marked as a duplicate of this bug. ***
*** Bug 562463 has been marked as a duplicate of this bug. ***
*** Bug 556999 has been marked as a duplicate of this bug. ***
*** Bug 568306 has been marked as a duplicate of this bug. ***
*** Bug 551257 has been marked as a duplicate of this bug. ***
*** Bug 534060 has been marked as a duplicate of this bug. ***
*** Bug 552337 has been marked as a duplicate of this bug. ***
*** Bug 554231 has been marked as a duplicate of this bug. ***
*** Bug 554449 has been marked as a duplicate of this bug. ***
*** Bug 556525 has been marked as a duplicate of this bug. ***
*** Bug 569256 has been marked as a duplicate of this bug. ***
*** Bug 572375 has been marked as a duplicate of this bug. ***
*** Bug 485489 has been marked as a duplicate of this bug. ***
*** Bug 510370 has been marked as a duplicate of this bug. ***
*** Bug 572472 has been marked as a duplicate of this bug. ***
*** Bug 574262 has been marked as a duplicate of this bug. ***
*** Bug 574997 has been marked as a duplicate of this bug. ***
*** Bug 575716 has been marked as a duplicate of this bug. ***
*** Bug 577148 has been marked as a duplicate of this bug. ***
*** Bug 578156 has been marked as a duplicate of this bug. ***
*** Bug 577665 has been marked as a duplicate of this bug. ***
*** Bug 581069 has been marked as a duplicate of this bug. ***
*** Bug 581471 has been marked as a duplicate of this bug. ***
*** Bug 581646 has been marked as a duplicate of this bug. ***
*** Bug 582709 has been marked as a duplicate of this bug. ***
*** Bug 479996 has been marked as a duplicate of this bug. ***
*** Bug 484068 has been marked as a duplicate of this bug. ***
*** Bug 418069 has been marked as a duplicate of this bug. ***
*** Bug 553618 has been marked as a duplicate of this bug. ***