GNOME Bugzilla – Bug 711090
periodic failure of spawn-multithreaded async testcase
Last modified: 2014-06-17 11:16:06 UTC
Try while true; do ./spawn-multithreaded ; done it'll get stuck eventually, with this backtrace:
+ Trace 232676
Thread 1 (Thread 0x7f8552b88740 (LWP 16607))
This is the nightmare case. We have one outstanding item on unix_child_watches: $5 = {source = {callback_data = 0x7f8530001770, callback_funcs = 0x7f8552f01a20 <g_source_callback_funcs>, source_funcs = 0x7f8552f01980 <g_child_watch_funcs>, ref_count = 1, context = 0x7f85300008c0, priority = 0, flags = 1, source_id = 1, poll_fds = 0x0, prev = 0x0, next = 0x0, name = 0x0, priv = 0x7f8530001480}, pid = 16655, child_status = 0, child_exited = 0} and sure enough, a zombie standing by, waiting to be collected: desrt 16655 0.0 0.0 0 0 pts/0 Z+ 10:19 0:00 [test-spawn-echo] <defunct> ie: a SIGCHLD is getting lost somewhere along the way....
Kinda obvious what is happening from reading the code... dispatch_unix_signals() basically does this: { ACQUIRE_LOCK (); if (child_pending) { handle... } child_pending = FALSE; RELEASE_LOCK(); } but the signal handler doesn't take the lock. so it's possible that we get 'child_pending = TRUE' from another thread between the check and the clear in dispatch_unix_signals(). I guess this is why people like pipes for this problem.... but I'd realllly like to avoid using a pipe here.
This is a regression from bug 704322.
Created attachment 258470 [details] [review] Revert "glib-unix: fix handling of multiple signal source for the same signal" This reverts commit be2c7b83c4a9c9d3aa76b1499c27ab19e0f4e470. This revert alone is enough to make the bug go away, but the following commit should remove even the theoretical possibility.
Created attachment 258471 [details] [review] unix signals: Use atomics to remove thin race There is an extremely thin race between checking and clearing a flag during while an additional unix signal could be delivered. Use atomics to close the gap.
Created attachment 258472 [details] [review] Fix races in unix signal dispatch Fix some races introduced in be2c7b83c4a9c9d3aa76b1499c27ab19e0f4e470 while keeping the property that multiple handlers for the same unix signal all get dispatched.
I'd prefer to commit the first two patches because I don't believe that multiple handlers on the same signal is useful (and in fact I think we should just forbid it). If I can be convinced that multiple handlers for the same signal are useful then we can take the last patch instead.
Review of attachment 258470 [details] [review]: I observed people doing this in real-world code and it was in fact an intentional part of the design: glib/glib-unix.c:g_unix_signal_source_new * Note that unlike the UNIX default, all sources which have created a * watch will be dispatched, regardless of which underlying thread * invoked g_unix_signal_source_new(). So if the patch to retain the functionality and fix the race is so small, can we just go with that please?
Review of attachment 258471 [details] [review]: This looks right to me, thanks!
Review of attachment 258472 [details] [review]: Right, OK.
Created attachment 258479 [details] [review] Fix races in unix signal dispatch Ok. We can do it your way... :) This fixes an additional bug that was present in the last commit (and has existed for a while, in fact). Could use a careful re-reading to make sure I got it right.
Review of attachment 258479 [details] [review]: Argh, I thought we'd landed this. Thanks for doing the patch. ::: glib/gmain.c @@ +4824,2 @@ /* clear this first incase another one arrives while we're processing */ any_unix_signal_pending = FALSE; What about this variable? I think we need to atomic cmpxcg it too; or alternatively just eliminate it and traverse the pending array each time.
Created attachment 264373 [details] [review] gmain: Remove unnecessary, non-atomic "any_unix_signal_pending" variable The use of this variable was racy; it could be set by the signal handler, but then immediately unset in the handler thread. It's not much of an optimization either; we don't have *that* many signals, nor does the worker thread wake up often. (At the moment).
Ok, re-reviewing this I think the extra boolean variable is not racy. Attachment 258479 [details] pushed as 76584e7 - Fix races in unix signal dispatch
This broke FreeBSD: gmain.c:4838:18: error: '_GStaticAssertCompileTimeAssertion_22' declared as an array with a negative size pending[i] = g_atomic_int_compare_and_exchange (&unix_signal_pending[i], TRUE, FALSE); ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ It's caused by sig_atomic_t not being equal to int on FreeBSD (and therefore our atomic integer operations cannot be safely used on it).
Bug 671997 for the original justification, but fwiw, I think we can safely assume that "int" writes will always be machine-instruction atomic on all systems that we care about. We kinda assume that in g_atomic_int_set() which is nothing more than a set of an integer and a barrier. I think reverting the patch from bug 671997 (and maybe adding some comment) is the sanest approach.
Created attachment 265177 [details] [review] Revert "gmain: Use sig_atomic_t for list of pending Unix signals" This reverts commit 6833385c5a7a35e22bb549ad0e7f390031952945. This also partially reverts 190891042d9830fb095bf50220555384cae7f823 which added the define to config.h.win32.in. Add a comment to the site of the variable declarations about why we feel that using 'int' is safe. Doing this patch made me suspect that the atomic-compare-and-exchange approach in this bug is not strictly race-free in the case that we emulate our atomic operations, in which case this could occur: - start g_atomic_int_compare_and_exchange() - acquire lock - compare old value, see that it is what we expect **signal** - handler overwrites the value that we are compare-and-exchanging **return** - overwrite value - unlock However, we only ever request an atomic exchange from TRUE to FALSE and the handler only ever writes TRUE. In this case, assuming int is atomic (as we do) and using 'volatile' in the pointer type of the emulated function in gatomic to prevent compiler-reordering keeps us on the right side of this theoretical problem.
Review of attachment 265177 [details] [review]: Your comment above says the bug is FreeBSD's sig_atomic_t is not an int (is it a long?). But then your later comment is talking about the interaction with emulated atomic operations when compiled with LLVM, as is the case of FreeBSD? (Right? If so, ouch!) I'm a little uncertain about just entirely dropping sig_atomic_t...signals are highly special. I vaguely recall reading something to the effect of certain hardware architectures with Linux actually needing it. It feels to me like the real bug here is the atomic ops emulation, we should basically never be hitting that case. Alternatively, if this is really about the *combination* of non-int sig_atomic_t and emulated atomics, perhaps we should hard require just an atomic integer test-and-set, and implement just that one for LLVM?
FreeBSD + LLVM has normal atomic operations, properly supported. That's not the problem here. The problem is simply that FreeBSD's sig_atomic_t is not an int -- it's a long on x86_64. I'm not totally sure that long is guaranteed to be atomic cross-processor on amd64, but that's sort of not the point here. The guarantee is only that a read or a write will occur in one processor instruction (ie: will not be divided by a signal occurring, delivered via the same thread). The headers that define these types are doing it in pretty boring ways: #ifdef __amd64__ typedef long sig_atomic_t; or #ifdef __i386__ typedef int sig_atomic_t; so there's really no magic required here. The only architecture that I can bring to mind where int is non-atomic is AVR (ie: arduino) which is an 8bit processor that emulates 16bit ints in software. I'm going to state as a simple fact: int is sufficiently atomic for any system that we care about. As for atomic ops emulation, my comment was actually attempting to outline why I believe it is _not_ a problem in our case. But that sort of brings me to another point: if emulated atomic operations are working OK for us (with no lock taken on the writer side) then we're basically saying that normal reads and writes (appropriately guarded by volatile, with care taken about order) would be fine here. The commit above: """ There is an extremely thin race between checking and clearing a flag during while an additional unix signal could be delivered. Use atomics to close the gap. """ was actually FALSE. There was never a problem here to begin with. The problem was only introduced with Giovanni's commit that moved the set-to-FALSE to below the loop, unconditionally. Here's my logic, and I'm pretty sure it stands up, assuming volatile is in use: - reads and write to int are atomic on every platform we care about - the signal handler will only ever transition a field from FALSE to TRUE - the dispatch code (in it's original state, before sig_atomic_t, g_atomic or Giovanni's patch) will only ever transition a field from TRUE to FALSE and it will only do so while also dispatching a GSource for handling therefore: - an arriving signal will get dispatched at least once because it definitely sets the variable to TRUE and the dispatch code will run whenever the variable is TRUE. There is nowhere that sets the variable to FALSE except while dispatching. The key to the regression that was introduced by Giovanni's patch was that it was setting the variable to FALSE even in cases that it didn't do a dispatch - an arriving signal will never be dispatched more than once because only one thread (the GLib worker) is responsible for dispatches. It will see the TRUE value there only once and will set it back to FALSE. It is not possible that the TRUE value is seen twice. - multiple arrivals of the same signal get merged and that's OK (since the kernel may do this without us having the chance to notice anyway) - cases where we install or remove a handler just as a signal are arriving can win or lose that race either way and we don't care; we use a lock when scanning that list, so no matter what, we never crash So the long and the short of it is that we're very very much in the clear here, and we have a lot of choice about how to deal with this. We can use sig_atomic_t or not. We can use atomics or not. We just can't use sig_atomic_t *and* atomics because of the type mismatch. Without knowledge of how sig_atomic_t is implemented, my preferred approach is to use it. Since I know that it's really only an int (or worse, a long), I'm actually inclined toward just using an int. Just to really drive the point across, I think it would be totally appropriate to use volatile gboolean (ie: int) here, without atomics. One thing that is very clear: I need to write a large comment explaining why this is safe and telling people not to touch it.
Created attachment 265193 [details] [review] unix signals: stop using atomics They are not required here. See the discussion in the bug report. Another approach. This is more strictly correct on systems that may define sig_atomic_t to weird things. I'm pretty sure this is an academic debate, but we may as well be on the correct side of the theoretical.
Comment on attachment 265193 [details] [review] unix signals: stop using atomics Attachment 265193 [details] pushed as 1867fc2 - unix signals: stop using atomics Pushing this since it unbreaks the build. Would still appreciate a review for correctness, though...
(In reply to comment #21) > (From update of attachment 265193 [details] [review]) > Attachment 265193 [details] pushed as 1867fc2 - unix signals: stop using atomics > > Pushing this since it unbreaks the build. By "the build" here you mean "one make check test on FreeBSD" I believe... > Would still appreciate a > review for correctness, though... This bit makes sense to me: > The key to the regression that was introduced by > Giovanni's patch was that it was setting the variable to FALSE even in > cases that it didn't do a dispatch
*** Bug 731771 has been marked as a duplicate of this bug. ***