GNOME Bugzilla – Bug 791754
gdbus-peer test intermittently fails: assertion 'source->ref_count > 0' failed
Last modified: 2018-01-09 01:19:43 UTC
Steps to reproduce: run gdbus-peer test in a loop like this: for x in `seq 1 100`; do make -C .../gio/tests check-TESTS TESTS=gdbus-peer || break; done Eventually it crashes. The backtrace resembles Bug #725503 and Bug #781601, but I already have the fix for Bug #781601 in my GLib and it's still failing. Opening this as a separate bug because merging duplicates is easier than disentangling separate issues within one bug. (gdb) thread apply all bt
+ Trace 238272
Thread 1 (Thread 0x7f457baba840 (LWP 19620))
In the main thread (thread 1), the cancellable is cancelled, which sets the ready time of the source. Meanwhile, in the service (thread 6), the cancellable source is in the process of being finalized.
It’s odd that the (source->ref_count > 0) precondition in g_source_set_ready_time() should fail, given that around the source_funcs->finalize() call, there is now an increment/decrement pair for that ref count. Could this be a race on the ref count, since g_source_set_ready_time() is accessing the ref count without locking the context first? The GMainContext lock is what protects a GSource’s ref count.
*** Bug 725503 has been marked as a duplicate of this bug. ***
(In reply to Philip Withnall from comment #2) > It’s odd that the (source->ref_count > 0) precondition in > g_source_set_ready_time() should fail, given that around the > source_funcs->finalize() call, there is now an increment/decrement pair for > that ref count. It could be thread 1 (the cancelling thread) seeing a stale value for the refcount, since as you say there is no locking or atomic access here. However, as far as I can see, it's worse than that: there is nothing to prevent thread 1 from entering g_cancellable_cancel() during the window in which thread 6 has entered g_source_unref_internal() and done the last-unref, but not yet incremented the refcount to call cancellable_source_finalize(). Commit 281e3010 narrowed that window, but didn't eliminate it. Further, there doesn't seem to be anything GCancellable *can* do to eliminate that window, because the first time GCancellable learns that its source is being destroyed is during finalize(), which is too late: there has already been a time during which it would have been an error to set the ready time. I tried giving GCancellable a list of borrowed references to its sources, temporarily promoting them to owned references during g_cancellable_cancel(), and clearing the list in cancellable_source_finalize() (with all accesses to that list done under the cancellable's lock). However, I don't think that approach actually solves anything, because the point of no return is before cancellable_source_finalize(): if g_cancellable_cancel() is invoked in thread 1 while thread 6 is between the first "source->ref_count--" and the "if (source->source_funcs->finalize)" of g_source_unref_internal(), then it would be invalid for thread 1 to increase the refcount and resurrect the object. (But this wouldn't currently even be detected, because g_source_ref and g_source_unref make no assertions about the refcount.) I also tried adding a new g_source_ref_if_not_destroyed() and calling that from cancellable_source_cancelled(), but it looks as though g_source_unref() in cancellable_source_cancelled() is a bad idea, because it can deadlock: if the cancelling thread ends up owning the last ref, then it waits forever for the condition variable, which is only going to be triggered later (in the same thread!), on return to g_cancellable_cancel(). Perhaps combining those two ideas could work - g_cancellable_cancel() could use g_source_ref_if_not_destroyed() to ensure that cancellable_source_cancelled() doesn't own the last ref to a source, then somehow arrange to avoid deadlocking with itself while releasing those refs?
This is Debian bug https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=884654
Created attachment 365740 [details] [review] cancellable: Don't assert if finalization races with cancellation Commit 281e3010 narrowed the race between GCancellable::cancelled and GCancellableSource's finalize(), but did not prevent it: there was nothing to stop cancellation from occurring after the refcount drops to 0, but before g_source_unref_internal() bumps it back up to 1 to run finalize(). GCancellable cannot be expected to detect that situation, because the only way it has to detect last-unref is finalize(), but in that situation finalize() hasn't happened yet. Add a new private function that tolerates being called here. --- This is the least bad solution I've been able to come up with so far. It remains to be seen whether this is actually reliable - I'll need to run gdbus-peer a few hundred times in a loop to be sure. Another possibility would be to change g_source_set_ready_time() to turn it into effectively this function: remove the assertion about the refcount, and remove the "if (source->priv->ready_time == ready_time) return" which I don't think is really safe against concurrent modifications of ready_time.
(In reply to Simon McVittie from comment #6) > This is the least bad solution I've been able to come up with so far. It > remains to be seen whether this is actually reliable - I'll need to run > gdbus-peer a few hundred times in a loop to be sure. After 1583 repetitions it failed with a different assertion failure, for which I'll open a separate bug: ERROR: gdbus-peer - Bail out! GLib-GObject-FATAL-WARNING: invalid uninstantiatable type '(null)' in cast to 'GDBusServer' so I think this is reasonably reliable.
(In reply to Simon McVittie from comment #7) > After 1583 repetitions it failed with a different assertion failure, for > which I'll open a separate bug: > > ERROR: gdbus-peer - Bail out! GLib-GObject-FATAL-WARNING: invalid > uninstantiatable type '(null)' in cast to 'GDBusServer' Bug #791787
Review of attachment 365740 [details] [review]: ::: glib/gmain.c @@ +1872,3 @@ + * The same as `g_source_set_ready_time (source, 0)`, except that it is + * OK to call it on a #GSource that has no more references, as long as + * the finalize() function has not yet been called. Maybe add a reference to this bug here? @@ +1886,3 @@ + LOCK_CONTEXT (context); + + source->priv->ready_time = 0; Why omit the `if (source->priv->ready_time == 0); return` check? This could lead to spurious conditional_wakeup() calls.
Colin, got any thoughts on this approach?
(In reply to Philip Withnall from comment #9) > ::: glib/gmain.c > @@ +1872,3 @@ > + * The same as `g_source_set_ready_time (source, 0)`, except that it is > + * OK to call it on a #GSource that has no more references, as long as > + * the finalize() function has not yet been called. > > Maybe add a reference to this bug here? Makes sense. > @@ +1886,3 @@ > + LOCK_CONTEXT (context); > + > + source->priv->ready_time = 0; > > Why omit the `if (source->priv->ready_time == 0); return` check? This could > lead to spurious conditional_wakeup() calls. If we put it back in, then it needs to be under the lock; otherwise there's no guarantee that we're seeing the real ready time. (Arguably g_source_set_ready_time() ought to have the same fix, but I'm a little wary about undoing someone's careful optimization...) Any thoughts on having a GLIB_PRIVATE_CALL that is essentially a slightly more thread-safe but less use-after-free-resistant version of g_source_set_ready_time(), vs. just modifying g_source_set_ready_time() to be less use-after-free-resistant?
(In reply to Simon McVittie from comment #11) > (In reply to Philip Withnall from comment #9) > > ::: glib/gmain.c > > @@ +1872,3 @@ > > + * The same as `g_source_set_ready_time (source, 0)`, except that it is > > + * OK to call it on a #GSource that has no more references, as long as > > + * the finalize() function has not yet been called. > > > > Maybe add a reference to this bug here? > > Makes sense. > > > @@ +1886,3 @@ > > + LOCK_CONTEXT (context); > > + > > + source->priv->ready_time = 0; > > > > Why omit the `if (source->priv->ready_time == 0); return` check? This could > > lead to spurious conditional_wakeup() calls. > > If we put it back in, then it needs to be under the lock; otherwise there's > no guarantee that we're seeing the real ready time. (Arguably > g_source_set_ready_time() ought to have the same fix, but I'm a little wary > about undoing someone's careful optimization...) > > Any thoughts on having a GLIB_PRIVATE_CALL that is essentially a slightly > more thread-safe but less use-after-free-resistant version of > g_source_set_ready_time(), vs. just modifying g_source_set_ready_time() to > be less use-after-free-resistant? Put the check in to g_source_set_ready_now() (under the lock), and move the check in g_source_set_ready_time() to also be under the lock. There’s no evidence of g_source_set_ready_time() being highly optimised: it hasn’t been changed since its original implementation in bug #657729 (commit 768574635dcb69e91a2b749467ef3c75ed16579f).
Created attachment 365941 [details] [review] [1] g_source_set_ready_time: Move no-op fast-path under the lock If we don't take the lock, then we don't have the necessary "happens before" relationships to avoid this situation: * source->priv->ready_time was equal to ready_time until recently * another thread has set source->priv->ready_time to a different value * that write hasn't become visible to this thread yet * result: we should reset the ready_time, but we don't
Created attachment 365942 [details] [review] [option A, 2/2] cancellable: Don't assert if finalization races with cancellation Commit 281e3010 narrowed the race between GCancellable::cancelled and GCancellableSource's finalize(), but did not prevent it: there was nothing to stop cancellation from occurring after the refcount drops to 0, but before g_source_unref_internal() bumps it back up to 1 to run finalize(). GCancellable cannot be expected to detect that situation, because the only way it has to detect last-unref is finalize(), but in that situation finalize() hasn't happened yet. Instead of detecting last-unref, relax the precondition a little to make it detect finalization: priv is only poisoned (set to NULL) after the finalize() function has been called, so we can assume that GCancellable has already seen finalize() by then. --- Perhaps this version is an acceptable way to avoid needing a separate GLIB_PRIVATE_CALL by relaxing the precondition checks slightly? If not, I'll attach a different version soon, which is more similar to what I previously attached (there will be a time delay while I run 1000 iterations to try to reproduce the crash).
Created attachment 365943 [details] [review] [option B, 2/3] g_source_set_ready_time: Factor out set_ready_time_internal() This is the same thing, but without the precondition checks.
Created attachment 365944 [details] [review] [option B, 3/3] cancellable: Don't assert if finalization races with cancellation Commit 281e3010 narrowed the race between GCancellable::cancelled and GCancellableSource's finalize(), but did not prevent it: there was nothing to stop cancellation from occurring after the refcount drops to 0, but before g_source_unref_internal() bumps it back up to 1 to run finalize(). GCancellable cannot be expected to detect that situation, because the only way it has to detect last-unref is finalize(), but in that situation finalize() hasn't happened yet. Add a new private function that tolerates being called here. --- If Attachment #365942 [details] isn't considered acceptable, maybe this one is. It has only been tested briefly so far; 1000 runs of gdbus-peer in a loop are still going.
(In reply to Simon McVittie from comment #16) > Created attachment 365944 [details] [review] > > If Attachment #365942 [details] isn't considered acceptable, maybe this one > is. It has only been tested briefly so far; 1000 runs of gdbus-peer in a > loop are still going. This version passed 540 iterations before crashing with Bug #791787, so I think we can say that either option A or option B is reasonably stable.
Review of attachment 365941 [details] [review]: ++
Review of attachment 365942 [details] [review]: I think I prefer this over option B, since it doesn’t introduce the new private function. ::: gio/gcancellable.c @@ +647,3 @@ +/* + * We can't guarantee that the source still has references, so we can't + * use g_source_set_ready_time() - it might be in the window between The code does actually go on to call g_source_set_ready_time(source, 0), so this bit of the comment isn’t quite correct.
Created attachment 366395 [details] [review] [2] cancellable: Don't assert if finalization races with cancellation --- Functionally the same as Attachment #365942 [details], but with improved comments. In particular I fixed the untrue statement that Philip pointed out.
(In reply to Philip Withnall from comment #19) > I think I prefer this over option B, since it doesn’t introduce the new > private function. Yeah, me too. > The code does actually go on to call g_source_set_ready_time(source, 0), so > this bit of the comment isn’t quite correct. Sorry, that was left over from a previous attempt. I've reviewed and re-worded both comments to be clearer about what properties we're relying on where.
Review of attachment 366395 [details] [review]: shipit Please also push both patches to the glib-2-54 branch.
Thanks, committed to master as a4686b8e and 7f3bfcb8. Cherry-picked to glib-2-54 as b6629ef8 and ffb60646.
Should be fixed in 2.55.1 and 2.54.3.
Only gave this a cursory review, but LGTM.