Bug 711090 – periodic failure of spawn-multithreaded async testcase

After an evaluation, GNOME has moved from Bugzilla to GitLab. Learn more about GitLab.
No new issues can be reported in GNOME Bugzilla anymore.
To report an issue in a GNOME project, go to GNOME GitLab.
Do not go to GNOME Gitlab for: Bluefish, Doxygen, GnuCash, GStreamer, java-gnome, LDTP, NetworkManager, Tomboy.

Bug 711090 - periodic failure of spawn-multithreaded async testcase


Summary:	periodic failure of spawn-multithreaded async testcase


Status:	RESOLVED FIXED

Product:	glib
Classification:	Platform
Component:	general
Version:	unspecified
Hardware:	Other All

Importance:	Normal normal
Target Milestone:	---
Assigned To:	gtkdev
QA Contact:	gtkdev

URL:
Whiteboard:

Duplicates:	731771 (view as bug list)
Depends on:
Blocks:

Reported:	2013-10-29 17:22 UTC by Allison Karlitskaya (desrt)
Modified:	2014-06-17 11:16 UTC

See Also:
GNOME target:	---
GNOME version:	---

Attachments
Revert "glib-unix: fix handling of multiple signal source for the same signal" (4.67 KB, patch) 2013-10-29 18:05 UTC, Allison Karlitskaya (desrt)	reviewed	Details \| Review
unix signals: Use atomics to remove thin race (1.45 KB, patch) 2013-10-29 18:05 UTC, Allison Karlitskaya (desrt)	accepted-commit_now	Details \| Review
Fix races in unix signal dispatch (2.16 KB, patch) 2013-10-29 18:14 UTC, Allison Karlitskaya (desrt)	accepted-commit_now	Details \| Review
Fix races in unix signal dispatch (2.95 KB, patch) 2013-10-29 18:38 UTC, Allison Karlitskaya (desrt)	committed	Details \| Review
gmain: Remove unnecessary, non-atomic "any_unix_signal_pending" variable (1.80 KB, patch) 2013-12-16 22:58 UTC, Colin Walters	none	Details \| Review
Revert "gmain: Use sig_atomic_t for list of pending Unix signals" (2.77 KB, patch) 2014-01-02 18:17 UTC, Allison Karlitskaya (desrt)	reviewed	Details \| Review
unix signals: stop using atomics (1.32 KB, patch) 2014-01-02 21:46 UTC, Allison Karlitskaya (desrt)	committed	Details \| Review

Description Allison Karlitskaya (desrt) 2013-10-29 17:22:24 UTC

Try

while true; do ./spawn-multithreaded ; done 


it'll get stuck eventually, with this backtrace:

+ Trace 232676

Thread 1 (Thread 0x7f8552b88740 (LWP 16607))

#0 pthread_join
at pthread_join.c line 92
#1 g_system_thread_wait
at gthread-posix.c line 1174
#2 g_thread_join
at gthread.c line 964
#3 multithreaded_test_run
at spawn-multithreaded.c line 52
#4 test_spawn_async_multithreaded
at spawn-multithreaded.c line 212
#5 test_case_run
at gtestutils.c line 2067
#6 g_test_run_suite_internal
at gtestutils.c line 2127
#7 g_test_run_suite_internal
at gtestutils.c line 2138
#8 g_test_run_suite
at gtestutils.c line 2189
#9 g_test_run
at gtestutils.c line 1508
#10 main
at spawn-multithreaded.c line 238

Comment 1 Allison Karlitskaya (desrt) 2013-10-29 17:27:53 UTC

This is the nightmare case.

We have one outstanding item on unix_child_watches:

$5 = {source = {callback_data = 0x7f8530001770, callback_funcs = 0x7f8552f01a20 <g_source_callback_funcs>, 
    source_funcs = 0x7f8552f01980 <g_child_watch_funcs>, ref_count = 1, context = 0x7f85300008c0, priority = 0, 
    flags = 1, source_id = 1, poll_fds = 0x0, prev = 0x0, next = 0x0, name = 0x0, priv = 0x7f8530001480}, pid = 16655, 
  child_status = 0, child_exited = 0}

and sure enough, a zombie standing by, waiting to be collected:

desrt    16655  0.0  0.0      0     0 pts/0    Z+   10:19   0:00 [test-spawn-echo] <defunct>


ie: a SIGCHLD is getting lost somewhere along the way....

Comment 2 Allison Karlitskaya (desrt) 2013-10-29 17:38:53 UTC

Kinda obvious what is happening from reading the code...

dispatch_unix_signals() basically does this:

{
  ACQUIRE_LOCK ();
  if (child_pending)
    {
      handle...
    }
  child_pending = FALSE;
  RELEASE_LOCK();
}


but the signal handler doesn't take the lock.

so it's possible that we get 'child_pending = TRUE' from another thread between the check and the clear in dispatch_unix_signals().

I guess this is why people like pipes for this problem.... but I'd realllly like to avoid using a pipe here.

Comment 3 Allison Karlitskaya (desrt) 2013-10-29 17:51:24 UTC

This is a regression from bug 704322.

Comment 4 Allison Karlitskaya (desrt) 2013-10-29 18:05:30 UTC

Created attachment 258470 [details] [review]
Revert "glib-unix: fix handling of multiple signal source for the same signal"

This reverts commit be2c7b83c4a9c9d3aa76b1499c27ab19e0f4e470.


This revert alone is enough to make the bug go away, but the following
commit should remove even the theoretical possibility.

Comment 5 Allison Karlitskaya (desrt) 2013-10-29 18:05:35 UTC

Created attachment 258471 [details] [review]
unix signals: Use atomics to remove thin race

There is an extremely thin race between checking and clearing a flag
during while an additional unix signal could be delivered.  Use atomics
to close the gap.

Comment 6 Allison Karlitskaya (desrt) 2013-10-29 18:14:01 UTC

Created attachment 258472 [details] [review]
Fix races in unix signal dispatch

Fix some races introduced in be2c7b83c4a9c9d3aa76b1499c27ab19e0f4e470
while keeping the property that multiple handlers for the same unix
signal all get dispatched.

Comment 7 Allison Karlitskaya (desrt) 2013-10-29 18:14:53 UTC

I'd prefer to commit the first two patches because I don't believe that multiple handlers on the same signal is useful (and in fact I think we should just forbid it).

If I can be convinced that multiple handlers for the same signal are useful then we can take the last patch instead.

Comment 8 Colin Walters 2013-10-29 18:17:21 UTC

Review of attachment 258470 [details] [review]:

I observed people doing this in real-world code and it was in fact an intentional part of the design:

glib/glib-unix.c:g_unix_signal_source_new

 * Note that unlike the UNIX default, all sources which have created a
 * watch will be dispatched, regardless of which underlying thread
 * invoked g_unix_signal_source_new().

So if the patch to retain the functionality and fix the race is so small, can we just go with that please?

Comment 9 Colin Walters 2013-10-29 18:18:04 UTC

Review of attachment 258471 [details] [review]:

This looks right to me, thanks!

Comment 10 Colin Walters 2013-10-29 18:23:35 UTC

Review of attachment 258472 [details] [review]:

Right, OK.

Comment 11 Allison Karlitskaya (desrt) 2013-10-29 18:38:19 UTC

Created attachment 258479 [details] [review]
Fix races in unix signal dispatch

Ok.  We can do it your way... :)

This fixes an additional bug that was present in the last commit (and has
existed for a while, in fact).  Could use a careful re-reading to make sure I
got it right.

Comment 12 Colin Walters 2013-12-09 16:19:18 UTC

Review of attachment 258479 [details] [review]:

Argh, I thought we'd landed this.  Thanks for doing the patch.

::: glib/gmain.c
@@ +4824,2 @@
   /* clear this first incase another one arrives while we're processing */
   any_unix_signal_pending = FALSE;

What about this variable?  I think we need to atomic cmpxcg it too; or alternatively just eliminate it and traverse the pending array each time.

Comment 13 Colin Walters 2013-12-16 22:58:31 UTC

Created attachment 264373 [details] [review]
gmain: Remove unnecessary, non-atomic "any_unix_signal_pending" variable

The use of this variable was racy; it could be set by the signal
handler, but then immediately unset in the handler thread.

It's not much of an optimization either; we don't have *that* many
signals, nor does the worker thread wake up often.  (At the moment).

Comment 14 Colin Walters 2014-01-02 00:28:38 UTC

Ok, re-reviewing this I think the extra boolean variable is not racy.

Attachment 258479 [details] pushed as 76584e7 - Fix races in unix signal dispatch

Comment 15 Allison Karlitskaya (desrt) 2014-01-02 17:52:37 UTC

This broke FreeBSD:

gmain.c:4838:18: error: '_GStaticAssertCompileTimeAssertion_22' declared as an array with a negative size
    pending[i] = g_atomic_int_compare_and_exchange (&unix_signal_pending[i], TRUE, FALSE);
                 ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~


It's caused by sig_atomic_t not being equal to int on FreeBSD (and therefore our atomic integer operations cannot be safely used on it).

Comment 16 Allison Karlitskaya (desrt) 2014-01-02 18:07:37 UTC

Bug 671997 for the original justification, but fwiw, I think we can safely assume that "int" writes will always be machine-instruction atomic on all systems that we care about.  We kinda assume that in g_atomic_int_set() which is nothing more than a set of an integer and a barrier.

I think reverting the patch from bug 671997 (and maybe adding some comment) is the sanest approach.

Comment 17 Allison Karlitskaya (desrt) 2014-01-02 18:17:59 UTC

Created attachment 265177 [details] [review]
Revert "gmain: Use sig_atomic_t for list of pending Unix signals"

This reverts commit 6833385c5a7a35e22bb549ad0e7f390031952945.

This also partially reverts 190891042d9830fb095bf50220555384cae7f823
which added the define to config.h.win32.in.

Add a comment to the site of the variable declarations about why we feel
that using 'int' is safe.



Doing this patch made me suspect that the atomic-compare-and-exchange
approach in this bug is not strictly race-free in the case that we
emulate our atomic operations, in which case this could occur:


 - start g_atomic_int_compare_and_exchange()
 - acquire lock
 - compare old value, see that it is what we expect
**signal**
 - handler overwrites the value that we are compare-and-exchanging
**return**
 - overwrite value
 - unlock

However, we only ever request an atomic exchange from TRUE to FALSE and
the handler only ever writes TRUE.  In this case, assuming int is atomic
(as we do) and using 'volatile' in the pointer type of the emulated
function in gatomic to prevent compiler-reordering keeps us on the right
side of this theoretical problem.

Comment 18 Colin Walters 2014-01-02 19:03:23 UTC

Review of attachment 265177 [details] [review]:

Your comment above says the bug is FreeBSD's sig_atomic_t is not an int (is it a long?).

But then your later comment is talking about the interaction with emulated atomic operations when compiled with LLVM, as is the case of FreeBSD?  (Right?  If so, ouch!)

I'm a little uncertain about just entirely dropping sig_atomic_t...signals are highly special.  I vaguely recall reading something to the effect of certain hardware architectures with Linux actually needing it.

It feels to me like the real bug here is the atomic ops emulation, we should basically never be hitting that case.

Alternatively, if this is really about the *combination* of non-int sig_atomic_t and emulated atomics, perhaps we should hard require just an atomic integer test-and-set, and implement just that one for LLVM?

Comment 19 Allison Karlitskaya (desrt) 2014-01-02 20:43:15 UTC

FreeBSD + LLVM has normal atomic operations, properly supported. That's not the problem here. The problem is simply that FreeBSD's sig_atomic_t is not an int -- it's a long on x86_64.

I'm not totally sure that long is guaranteed to be atomic cross-processor on amd64, but that's sort of not the point here. The guarantee is only that a read or a write will occur in one processor instruction (ie: will not be divided by a signal occurring, delivered via the same thread).

The headers that define these types are doing it in pretty boring ways:

#ifdef __amd64__
typedef long sig_atomic_t;

#ifdef __i386__
typedef int sig_atomic_t;

so there's really no magic required here.

The only architecture that I can bring to mind where int is non-atomic is AVR (ie: arduino) which is an 8bit processor that emulates 16bit ints in software. I'm going to state as a simple fact: int is sufficiently atomic for any system that we care about.

As for atomic ops emulation, my comment was actually attempting to outline why I believe it is _not_ a problem in our case. But that sort of brings me to another point: if emulated atomic operations are working OK for us (with no lock taken on the writer side) then we're basically saying that normal reads and writes (appropriately guarded by volatile, with care taken about order) would be fine here.

The commit above:

"""

There is an extremely thin race between checking and clearing a flag
during while an additional unix signal could be delivered. Use atomics
to close the gap.

"""

was actually FALSE. There was never a problem here to begin with. The problem was only introduced with Giovanni's commit that moved the set-to-FALSE to below the loop, unconditionally.

Here's my logic, and I'm pretty sure it stands up, assuming volatile is in use:

- reads and write to int are atomic on every platform we care about

- the signal handler will only ever transition a field from FALSE to TRUE

- the dispatch code (in it's original state, before sig_atomic_t, g_atomic or
Giovanni's patch) will only ever transition a field from TRUE to FALSE and
it will only do so while also dispatching a GSource for handling

therefore:

- an arriving signal will get dispatched at least once because it definitely
sets the variable to TRUE and the dispatch code will run whenever the
variable is TRUE. There is nowhere that sets the variable to FALSE except
while dispatching. The key to the regression that was introduced by
Giovanni's patch was that it was setting the variable to FALSE even in
cases that it didn't do a dispatch

- an arriving signal will never be dispatched more than once because only
one thread (the GLib worker) is responsible for dispatches. It will see
the TRUE value there only once and will set it back to FALSE. It is not
possible that the TRUE value is seen twice.

- multiple arrivals of the same signal get merged and that's OK (since the
kernel may do this without us having the chance to notice anyway)

- cases where we install or remove a handler just as a signal are arriving
can win or lose that race either way and we don't care; we use a lock
when scanning that list, so no matter what, we never crash

So the long and the short of it is that we're very very much in the clear here, and we have a lot of choice about how to deal with this. We can use sig_atomic_t or not. We can use atomics or not. We just can't use sig_atomic_t *and* atomics because of the type mismatch.

Without knowledge of how sig_atomic_t is implemented, my preferred approach is to use it. Since I know that it's really only an int (or worse, a long), I'm actually inclined toward just using an int. Just to really drive the point across, I think it would be totally appropriate to use volatile gboolean (ie: int) here, without atomics. One thing that is very clear: I need to write a large comment explaining why this is safe and telling people not to touch it.

Comment 20 Allison Karlitskaya (desrt) 2014-01-02 21:46:15 UTC

Created attachment 265193 [details] [review]
unix signals: stop using atomics

They are not required here.  See the discussion in the bug report.

Another approach.  This is more strictly correct on systems that may
define sig_atomic_t to weird things.  I'm pretty sure this is an
academic debate, but we may as well be on the correct side of the
theoretical.

Comment 21 Allison Karlitskaya (desrt) 2014-01-03 01:39:11 UTC

Comment on attachment 265193 [details] [review]
unix signals: stop using atomics

Attachment 265193 [details] pushed as 1867fc2 - unix signals: stop using atomics

Pushing this since it unbreaks the build.  Would still appreciate a
review for correctness, though...

Comment 22 Colin Walters 2014-01-03 16:08:30 UTC

(In reply to comment #21)
> (From update of attachment 265193 [details] [review])
> Attachment 265193 [details] pushed as 1867fc2 - unix signals: stop using atomics
> 
> Pushing this since it unbreaks the build.  

By "the build" here you mean "one make check test on FreeBSD" I believe...

> Would still appreciate a
> review for correctness, though...

This bit makes sense to me:

> The key to the regression that was introduced by
> Giovanni's patch was that it was setting the variable to FALSE even in
>  cases that it didn't do a dispatch

Comment 23 Stef Walter 2014-06-17 11:16:06 UTC

*** Bug 731771 has been marked as a duplicate of this bug. ***