After an evaluation, GNOME has moved from Bugzilla to GitLab. Learn more about GitLab.
No new issues can be reported in GNOME Bugzilla anymore.
To report an issue in a GNOME project, go to GNOME GitLab.
Do not go to GNOME Gitlab for: Bluefish, Doxygen, GnuCash, GStreamer, java-gnome, LDTP, NetworkManager, Tomboy.
Bug 791754 - gdbus-peer test intermittently fails: assertion 'source->ref_count > 0' failed
gdbus-peer test intermittently fails: assertion 'source->ref_count > 0' failed
Status: RESOLVED FIXED
Product: glib
Classification: Platform
Component: gdbus
unspecified
Other Linux
: Normal normal
: ---
Assigned To: gtkdev
gtkdev
: 725503 (view as bug list)
Depends on:
Blocks:
 
 
Reported: 2017-12-18 19:45 UTC by Simon McVittie
Modified: 2018-01-09 01:19 UTC
See Also:
GNOME target: ---
GNOME version: ---


Attachments
cancellable: Don't assert if finalization races with cancellation (4.18 KB, patch)
2017-12-19 11:02 UTC, Simon McVittie
none Details | Review
[1] g_source_set_ready_time: Move no-op fast-path under the lock (1.48 KB, patch)
2017-12-24 16:59 UTC, Simon McVittie
committed Details | Review
[option A, 2/2] cancellable: Don't assert if finalization races with cancellation (2.92 KB, patch)
2017-12-24 17:02 UTC, Simon McVittie
none Details | Review
[option B, 2/3] g_source_set_ready_time: Factor out set_ready_time_internal() (1.39 KB, patch)
2017-12-24 17:06 UTC, Simon McVittie
none Details | Review
[option B, 3/3] cancellable: Don't assert if finalization races with cancellation (3.94 KB, patch)
2017-12-24 17:07 UTC, Simon McVittie
none Details | Review
[2] cancellable: Don't assert if finalization races with cancellation (3.10 KB, patch)
2018-01-05 19:51 UTC, Simon McVittie
committed Details | Review

Description Simon McVittie 2017-12-18 19:45:09 UTC
Steps to reproduce: run gdbus-peer test in a loop like this:

for x in `seq 1 100`; do make -C .../gio/tests check-TESTS TESTS=gdbus-peer || break; done

Eventually it crashes. The backtrace resembles Bug #725503 and Bug #781601, but I already have the fix for Bug #781601 in my GLib and it's still failing. Opening this as a separate bug because merging duplicates is easier than disentangling separate issues within one bug.

(gdb) thread apply all bt

Thread 1 (Thread 0x7f457baba840 (LWP 19620))

  • #0 _g_log_abort
    at /home/smcv/src/glib/glib/gmessages.c line 583
  • #1 g_logv
    at /home/smcv/src/glib/glib/gmessages.c line 1391
  • #2 g_log
    at /home/smcv/src/glib/glib/gmessages.c line 1432
  • #3 g_return_if_fail_warning
    at /home/smcv/src/glib/glib/gmessages.c line 2752
  • #4 g_source_set_ready_time
    at /home/smcv/src/glib/glib/gmain.c line 1845
  • #5 cancellable_source_cancelled
    at /home/smcv/src/glib/gio/gcancellable.c line 653
  • #6 g_cclosure_marshal_VOID__VOID
    at /home/smcv/src/glib/gobject/gmarshal.c line 875
  • #7 g_closure_invoke
    at /home/smcv/src/glib/gobject/gclosure.c line 804
  • #8 signal_emit_unlocked_R
    at /home/smcv/src/glib/gobject/gsignal.c line 3635
  • #9 g_signal_emit_valist
    at /home/smcv/src/glib/gobject/gsignal.c line 3391
  • #10 g_signal_emit
    at /home/smcv/src/glib/gobject/gsignal.c line 3447
  • #11 g_cancellable_cancel
    at /home/smcv/src/glib/gio/gcancellable.c line 508
  • #12 set_active
    at /home/smcv/src/glib/gio/gsocketservice.c line 163
  • #13 g_socket_service_stop
    at /home/smcv/src/glib/gio/gsocketservice.c line 299
  • #14 g_dbus_server_stop
    at /home/smcv/src/glib/gio/gdbusserver.c line 613
  • #15 test_nonce_tcp
    at /home/smcv/src/glib/gio/tests/gdbus-peer.c line 1355
  • #16 test_case_run
    at /home/smcv/src/glib/glib/gtestutils.c line 2255
  • #17 g_test_run_suite_internal
    at /home/smcv/src/glib/glib/gtestutils.c line 2339
  • #18 g_test_run_suite_internal
    at /home/smcv/src/glib/glib/gtestutils.c line 2351
  • #19 g_test_run_suite
    at /home/smcv/src/glib/glib/gtestutils.c line 2426
  • #20 g_test_run
    at /home/smcv/src/glib/glib/gtestutils.c line 1692
  • #21 main
    at /home/smcv/src/glib/gio/tests/gdbus-peer.c line 1759

Comment 1 Simon McVittie 2017-12-18 19:54:29 UTC
In the main thread (thread 1), the cancellable is cancelled, which sets the ready time of the source. Meanwhile, in the service (thread 6), the cancellable source is in the process of being finalized.
Comment 2 Philip Withnall 2017-12-19 09:36:10 UTC
It’s odd that the (source->ref_count > 0) precondition in g_source_set_ready_time() should fail, given that around the source_funcs->finalize() call, there is now an increment/decrement pair for that ref count.

Could this be a race on the ref count, since g_source_set_ready_time() is accessing the ref count without locking the context first? The GMainContext lock is what protects a GSource’s ref count.
Comment 3 Philip Withnall 2017-12-19 09:36:42 UTC
*** Bug 725503 has been marked as a duplicate of this bug. ***
Comment 4 Simon McVittie 2017-12-19 10:33:10 UTC
(In reply to Philip Withnall from comment #2)
> It’s odd that the (source->ref_count > 0) precondition in
> g_source_set_ready_time() should fail, given that around the
> source_funcs->finalize() call, there is now an increment/decrement pair for
> that ref count.

It could be thread 1 (the cancelling thread) seeing a stale value for the refcount, since as you say there is no locking or atomic access here.

However, as far as I can see, it's worse than that: there is nothing to prevent thread 1 from entering g_cancellable_cancel() during the window in which thread 6 has entered g_source_unref_internal() and done the last-unref, but not yet incremented the refcount to call cancellable_source_finalize(). Commit 281e3010 narrowed that window, but didn't eliminate it.

Further, there doesn't seem to be anything GCancellable *can* do to eliminate that window, because the first time GCancellable learns that its source is being destroyed is during finalize(), which is too late: there has already been a time during which it would have been an error to set the ready time.

I tried giving GCancellable a list of borrowed references to its sources, temporarily promoting them to owned references during g_cancellable_cancel(), and clearing the list in cancellable_source_finalize() (with all accesses to that list done under the cancellable's lock). However, I don't think that approach actually solves anything, because the point of no return is before cancellable_source_finalize(): if g_cancellable_cancel() is invoked in thread 1 while thread 6 is between the first "source->ref_count--" and the "if (source->source_funcs->finalize)" of g_source_unref_internal(), then it would be invalid for thread 1 to increase the refcount and resurrect the object. (But this wouldn't currently even be detected, because g_source_ref and g_source_unref make no assertions about the refcount.)

I also tried adding a new g_source_ref_if_not_destroyed() and calling that from cancellable_source_cancelled(), but it looks as though g_source_unref() in cancellable_source_cancelled() is a bad idea, because it can deadlock: if the cancelling thread ends up owning the last ref, then it waits forever for the condition variable, which is only going to be triggered later (in the same thread!), on return to g_cancellable_cancel().

Perhaps combining those two ideas could work - g_cancellable_cancel() could use g_source_ref_if_not_destroyed() to ensure that cancellable_source_cancelled() doesn't own the last ref to a source, then somehow arrange to avoid deadlocking with itself while releasing those refs?
Comment 5 Simon McVittie 2017-12-19 10:58:11 UTC
This is Debian bug https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=884654
Comment 6 Simon McVittie 2017-12-19 11:02:40 UTC
Created attachment 365740 [details] [review]
cancellable: Don't assert if finalization races with cancellation

Commit 281e3010 narrowed the race between GCancellable::cancelled and
GCancellableSource's finalize(), but did not prevent it: there was
nothing to stop cancellation from occurring after the refcount drops
to 0, but before g_source_unref_internal() bumps it back up to 1 to
run finalize().

GCancellable cannot be expected to detect that situation, because the
only way it has to detect last-unref is finalize(), but in that
situation finalize() hasn't happened yet. Add a new private function
that tolerates being called here.

---

This is the least bad solution I've been able to come up with so far. It remains to be seen whether this is actually reliable - I'll need to run gdbus-peer a few hundred times in a loop to be sure.

Another possibility would be to change g_source_set_ready_time() to turn it into effectively this function: remove the assertion about the refcount, and remove the "if (source->priv->ready_time == ready_time) return" which I don't think is really safe against concurrent modifications of ready_time.
Comment 7 Simon McVittie 2017-12-19 14:43:15 UTC
(In reply to Simon McVittie from comment #6)
> This is the least bad solution I've been able to come up with so far. It
> remains to be seen whether this is actually reliable - I'll need to run
> gdbus-peer a few hundred times in a loop to be sure.

After 1583 repetitions it failed with a different assertion failure, for which I'll open a separate bug:

ERROR: gdbus-peer - Bail out! GLib-GObject-FATAL-WARNING: invalid uninstantiatable type '(null)' in cast to 'GDBusServer'

so I think this is reasonably reliable.
Comment 8 Simon McVittie 2017-12-19 14:47:36 UTC
(In reply to Simon McVittie from comment #7)
> After 1583 repetitions it failed with a different assertion failure, for
> which I'll open a separate bug:
> 
> ERROR: gdbus-peer - Bail out! GLib-GObject-FATAL-WARNING: invalid
> uninstantiatable type '(null)' in cast to 'GDBusServer'

Bug #791787
Comment 9 Philip Withnall 2017-12-21 11:08:17 UTC
Review of attachment 365740 [details] [review]:

::: glib/gmain.c
@@ +1872,3 @@
+ * The same as `g_source_set_ready_time (source, 0)`, except that it is
+ * OK to call it on a #GSource that has no more references, as long as
+ * the finalize() function has not yet been called.

Maybe add a reference to this bug here?

@@ +1886,3 @@
+    LOCK_CONTEXT (context);
+
+  source->priv->ready_time = 0;

Why omit the `if (source->priv->ready_time == 0); return` check? This could lead to spurious conditional_wakeup() calls.
Comment 10 Philip Withnall 2017-12-21 11:09:04 UTC
Colin, got any thoughts on this approach?
Comment 11 Simon McVittie 2017-12-21 11:25:35 UTC
(In reply to Philip Withnall from comment #9)
> ::: glib/gmain.c
> @@ +1872,3 @@
> + * The same as `g_source_set_ready_time (source, 0)`, except that it is
> + * OK to call it on a #GSource that has no more references, as long as
> + * the finalize() function has not yet been called.
> 
> Maybe add a reference to this bug here?

Makes sense.

> @@ +1886,3 @@
> +    LOCK_CONTEXT (context);
> +
> +  source->priv->ready_time = 0;
> 
> Why omit the `if (source->priv->ready_time == 0); return` check? This could
> lead to spurious conditional_wakeup() calls.

If we put it back in, then it needs to be under the lock; otherwise there's no guarantee that we're seeing the real ready time. (Arguably g_source_set_ready_time() ought to have the same fix, but I'm a little wary about undoing someone's careful optimization...)

Any thoughts on having a GLIB_PRIVATE_CALL that is essentially a slightly more thread-safe but less use-after-free-resistant version of g_source_set_ready_time(), vs. just modifying g_source_set_ready_time() to be less use-after-free-resistant?
Comment 12 Philip Withnall 2017-12-21 11:50:15 UTC
(In reply to Simon McVittie from comment #11)
> (In reply to Philip Withnall from comment #9)
> > ::: glib/gmain.c
> > @@ +1872,3 @@
> > + * The same as `g_source_set_ready_time (source, 0)`, except that it is
> > + * OK to call it on a #GSource that has no more references, as long as
> > + * the finalize() function has not yet been called.
> > 
> > Maybe add a reference to this bug here?
> 
> Makes sense.
> 
> > @@ +1886,3 @@
> > +    LOCK_CONTEXT (context);
> > +
> > +  source->priv->ready_time = 0;
> > 
> > Why omit the `if (source->priv->ready_time == 0); return` check? This could
> > lead to spurious conditional_wakeup() calls.
> 
> If we put it back in, then it needs to be under the lock; otherwise there's
> no guarantee that we're seeing the real ready time. (Arguably
> g_source_set_ready_time() ought to have the same fix, but I'm a little wary
> about undoing someone's careful optimization...)
> 
> Any thoughts on having a GLIB_PRIVATE_CALL that is essentially a slightly
> more thread-safe but less use-after-free-resistant version of
> g_source_set_ready_time(), vs. just modifying g_source_set_ready_time() to
> be less use-after-free-resistant?

Put the check in to g_source_set_ready_now() (under the lock), and move the check in g_source_set_ready_time() to also be under the lock.

There’s no evidence of g_source_set_ready_time() being highly optimised: it hasn’t been changed since its original implementation in bug #657729 (commit 768574635dcb69e91a2b749467ef3c75ed16579f).
Comment 13 Simon McVittie 2017-12-24 16:59:51 UTC
Created attachment 365941 [details] [review]
[1] g_source_set_ready_time: Move no-op fast-path under the lock

If we don't take the lock, then we don't have the necessary
"happens before" relationships to avoid this situation:

* source->priv->ready_time was equal to ready_time until recently
* another thread has set source->priv->ready_time to a different value
* that write hasn't become visible to this thread yet
* result: we should reset the ready_time, but we don't
Comment 14 Simon McVittie 2017-12-24 17:02:29 UTC
Created attachment 365942 [details] [review]
[option A, 2/2] cancellable: Don't assert if finalization races with cancellation

Commit 281e3010 narrowed the race between GCancellable::cancelled and
GCancellableSource's finalize(), but did not prevent it: there was
nothing to stop cancellation from occurring after the refcount drops
to 0, but before g_source_unref_internal() bumps it back up to 1 to
run finalize().
                                                     
GCancellable cannot be expected to detect that situation, because the
only way it has to detect last-unref is finalize(), but in that
situation finalize() hasn't happened yet.
                                                      
Instead of detecting last-unref, relax the precondition a little
to make it detect finalization: priv is only poisoned (set to NULL)
after the finalize() function has been called, so we can assume that
GCancellable has already seen finalize() by then.

---

Perhaps this version is an acceptable way to avoid needing a separate GLIB_PRIVATE_CALL by relaxing the precondition checks slightly?

If not, I'll attach a different version soon, which is more similar to what I previously attached (there will be a time delay while I run 1000 iterations to try to reproduce the crash).
Comment 15 Simon McVittie 2017-12-24 17:06:16 UTC
Created attachment 365943 [details] [review]
[option B, 2/3] g_source_set_ready_time: Factor out set_ready_time_internal()

This is the same thing, but without the precondition checks.
Comment 16 Simon McVittie 2017-12-24 17:07:59 UTC
Created attachment 365944 [details] [review]
[option B, 3/3] cancellable: Don't assert if finalization races with cancellation

Commit 281e3010 narrowed the race between GCancellable::cancelled and
GCancellableSource's finalize(), but did not prevent it: there was
nothing to stop cancellation from occurring after the refcount drops
to 0, but before g_source_unref_internal() bumps it back up to 1 to
run finalize().
                                                     
GCancellable cannot be expected to detect that situation, because the
only way it has to detect last-unref is finalize(), but in that
situation finalize() hasn't happened yet. Add a new private function
that tolerates being called here.

---

If Attachment #365942 [details] isn't considered acceptable, maybe this one is. It has only been tested briefly so far; 1000 runs of gdbus-peer in a loop are still going.
Comment 17 Simon McVittie 2017-12-24 17:26:18 UTC
(In reply to Simon McVittie from comment #16)
> Created attachment 365944 [details] [review]
> 
> If Attachment #365942 [details] isn't considered acceptable, maybe this one
> is. It has only been tested briefly so far; 1000 runs of gdbus-peer in a
> loop are still going.

This version passed 540 iterations before crashing with Bug #791787, so I think we can say that either option A or option B is reasonably stable.
Comment 18 Philip Withnall 2018-01-03 11:45:17 UTC
Review of attachment 365941 [details] [review]:

++
Comment 19 Philip Withnall 2018-01-03 11:50:33 UTC
Review of attachment 365942 [details] [review]:

I think I prefer this over option B, since it doesn’t introduce the new private function.

::: gio/gcancellable.c
@@ +647,3 @@
+/*
+ * We can't guarantee that the source still has references, so we can't
+ * use g_source_set_ready_time() - it might be in the window between

The code does actually go on to call g_source_set_ready_time(source, 0), so this bit of the comment isn’t quite correct.
Comment 20 Simon McVittie 2018-01-05 19:51:32 UTC
Created attachment 366395 [details] [review]
[2] cancellable: Don't assert if finalization races with  cancellation

---

Functionally the same as Attachment #365942 [details], but with improved comments. In particular I fixed the untrue statement that Philip pointed out.
Comment 21 Simon McVittie 2018-01-05 19:52:50 UTC
(In reply to Philip Withnall from comment #19)
> I think I prefer this over option B, since it doesn’t introduce the new
> private function.

Yeah, me too.

> The code does actually go on to call g_source_set_ready_time(source, 0), so
> this bit of the comment isn’t quite correct.

Sorry, that was left over from a previous attempt. I've reviewed and re-worded both comments to be clearer about what properties we're relying on where.
Comment 22 Philip Withnall 2018-01-05 20:23:38 UTC
Review of attachment 366395 [details] [review]:

shipit

Please also push both patches to the glib-2-54 branch.
Comment 23 Simon McVittie 2018-01-05 21:39:08 UTC
Thanks, committed to master as a4686b8e and 7f3bfcb8. Cherry-picked to glib-2-54 as b6629ef8 and ffb60646.
Comment 24 Simon McVittie 2018-01-05 21:40:18 UTC
Should be fixed in 2.55.1 and 2.54.3.
Comment 25 Colin Walters 2018-01-06 19:36:31 UTC
Only gave this a cursory review, but LGTM.