After an evaluation, GNOME has moved from Bugzilla to GitLab. Learn more about GitLab.
No new issues can be reported in GNOME Bugzilla anymore.
To report an issue in a GNOME project, go to GNOME GitLab.
Do not go to GNOME Gitlab for: Bluefish, Doxygen, GnuCash, GStreamer, java-gnome, LDTP, NetworkManager, Tomboy.
Bug 651133 - race condition in GDBusConnection's emit_signal_instance_in_idle_cb
race condition in GDBusConnection's emit_signal_instance_in_idle_cb
Status: RESOLVED FIXED
Product: glib
Classification: Platform
Component: gdbus
2.28.x
Other Linux
: Normal normal
: ---
Assigned To: David Zeuthen (not reading bugmail)
gtkdev
Depends on:
Blocks:
 
 
Reported: 2011-05-26 10:34 UTC by Simon McVittie
Modified: 2011-08-11 09:12 UTC
See Also:
GNOME target: ---
GNOME version: ---


Attachments
Patch (9.72 KB, patch)
2011-05-26 13:17 UTC, David Zeuthen (not reading bugmail)
none Details | Review

Description Simon McVittie 2011-05-26 10:34:15 UTC
If:

* you subscribe to a D-Bus signal from a thread whose thread-default main
  context is M
* you unsubscribe from that D-Bus signal in thread A
* ... at a time when M is currently being dispatched in thread B

then signal delivery in thread B can race with the unsubscription in thread A:

* the message handler thread adds an idle to M

* emit_signal_instance_in_idle_cb starts in thread B

* thread B locks the connection, checks that the subscription still exists,
  and unlocks the connection

* thread A calls g_dbus_connection_signal_unsubscribe()

* thread B calls the callback even though thread A just unsubscribed

* the callback's expectations are violated

Here's a concrete example where this is a problem, seen in libtracker-sparql:

* thread A is a worker thread in a thread pool, which only performs blocking
  operations and so will never run a main loop

* M is the global default main context

* thread B is the main thread, running a main loop for M

* thread A creates a GDBusProxy
  - which connects to NameOwnerChanged, with itself as user_data;
    the thread has not changed its thread-default main context,
    since it does not ever expect to iterate a main context,
    so callbacks are scheduled to be run in M

* thread A calls a blocking method

* thread A unrefs the GDBusProxy, which is finalized
  - in its finalize() it unsubscribes from NameOwnerChanged

* simultaneously, thread B runs emit_signal_instance_in_idle_cb
  to process a NameOwnerChanged emission

If this race condition can't be fixed, then something should document which of the things done by libtracker-sparql is not allowed.
Comment 1 Simon McVittie 2011-05-26 11:04:11 UTC
Looking at GDBusProxy, it doesn't seem to have any locking, so the answer to the concrete example must be one of these options:

* GDBusProxy is buggy, it should be MT-safe

* libtracker-sparql is buggy, it should only use a GDBusProxy from one thread,
  and not while that thread's thread-default main context is being dispatched
  in another thread; GDBusProxy should document that limitation
Comment 2 David Zeuthen (not reading bugmail) 2011-05-26 12:24:46 UTC
(In reply to comment #1)
> Looking at GDBusProxy, it doesn't seem to have any locking, so the answer to
> the concrete example must be one of these options:
> 
> * GDBusProxy is buggy, it should be MT-safe
> 
> * libtracker-sparql is buggy, it should only use a GDBusProxy from one thread,
>   and not while that thread's thread-default main context is being dispatched
>   in another thread; GDBusProxy should document that limitation

We do want some locking in GDBusProxy - it should be possible to use from multiple threads. I will add this right now and document it.

But I don't think that answers the general question posed in comment 0.
Comment 3 David Zeuthen (not reading bugmail) 2011-05-26 12:32:35 UTC
(In reply to comment #0)
> If:
> 
> * you subscribe to a D-Bus signal from a thread whose thread-default main
>   context is M
> * you unsubscribe from that D-Bus signal in thread A
> * ... at a time when M is currently being dispatched in thread B
> 
> then signal delivery in thread B can race with the unsubscription in thread A:
> 
> * the message handler thread adds an idle to M
> 
> * emit_signal_instance_in_idle_cb starts in thread B
> 
> * thread B locks the connection, checks that the subscription still exists,
>   and unlocks the connection
> 
> * thread A calls g_dbus_connection_signal_unsubscribe()
> 
> * thread B calls the callback even though thread A just unsubscribed
> 
> * the callback's expectations are violated

Right - if you unsubscribe a signal handler from another thread than where it will deliver the callback then this is possible. There's really no way to fix this - e.g. thread A could be calling g_dbus_connection_signal_unsubscribe() just before the machine code instruction to do the callback is issued.

We can add this to the docs but it's more or less just how threads work, isn't it?

> Here's a concrete example where this is a problem, seen in libtracker-sparql:
> 
> * thread A is a worker thread in a thread pool, which only performs blocking
>   operations and so will never run a main loop
> 
> * M is the global default main context
> 
> * thread B is the main thread, running a main loop for M
> 
> * thread A creates a GDBusProxy
>   - which connects to NameOwnerChanged, with itself as user_data;
>     the thread has not changed its thread-default main context,
>     since it does not ever expect to iterate a main context,
>     so callbacks are scheduled to be run in M
> 
> * thread A calls a blocking method
> 
> * thread A unrefs the GDBusProxy, which is finalized
>   - in its finalize() it unsubscribes from NameOwnerChanged
> 
> * simultaneously, thread B runs emit_signal_instance_in_idle_cb
>   to process a NameOwnerChanged emission
> 
> If this race condition can't be fixed, then something should document which of
> the things done by libtracker-sparql is not allowed.

Ideally we should be able to fix this in GDBusProxy - I'm not entirely sure that is possible without adding some kind of global variable to keep track of what have been recently freed. Thoughts?
Comment 4 Simon McVittie 2011-05-26 13:02:16 UTC
(In reply to comment #3)
> Ideally we should be able to fix this in GDBusProxy - I'm not entirely sure
> that is possible without adding some kind of global variable to keep track of
> what have been recently freed. Thoughts?

For the specific case of the three signal subscriptions (NameOwnerChanged, PropertiesChanged and the general signal subscriber), I'd suggest something like this:

* instead of a borrowed GDBusProxy*, make the user_data a GDBusProxy**
  containing a weak ref (with g_object_add_weak_pointer)

* in the callback, try to get and ref the GDBusProxy, and hold a ref around
  the callback; if we couldn't, skip the callback

However, that second step is harder than it sounds, because I'm not aware of a way to get the weak pointer and ref the object in an atomic way. Perhaps we could do some trickery involving g_object_weak_ref() and a lock?

telepathy-glib has a TpWeakRef struct which encapsulates a weak ref, but isn't thread-safe (for the same reason); if we can make it thread-safe, it might be worth contributing to gobject or gio as a boxed type?
Comment 5 Dan Winship 2011-05-26 13:02:50 UTC
GCancellable tries to be clever about avoiding this race condition for you, and... well, bug 586432, bug 587300, bug 642968, bug 650252.

IMHO the fix (here and in GCancellable) is to just document that the race condition inevitably exists, and how to write your code so as to cope with it. (ie, what the GCancellable docs used to say before the introduction of g_cancellable_connect/disconnect: http://git.gnome.org/browse/glib/commit/gio/gcancellable.c?id=efce76ce677f98c72f748e435ff693ef12a44dce)
Comment 6 David Zeuthen (not reading bugmail) 2011-05-26 13:17:31 UTC
Created attachment 188666 [details] [review]
Patch

(In reply to comment #4)
> (In reply to comment #3)
> > Ideally we should be able to fix this in GDBusProxy - I'm not entirely sure
> > that is possible without adding some kind of global variable to keep track of
> > what have been recently freed. Thoughts?
> 
> For the specific case of the three signal subscriptions (NameOwnerChanged,
> PropertiesChanged and the general signal subscriber), I'd suggest something
> like this:
> 
> * instead of a borrowed GDBusProxy*, make the user_data a GDBusProxy**
>   containing a weak ref (with g_object_add_weak_pointer)
> 
> * in the callback, try to get and ref the GDBusProxy, and hold a ref around
>   the callback; if we couldn't, skip the callback
> 
> However, that second step is harder than it sounds, because I'm not aware of a
> way to get the weak pointer and ref the object in an atomic way. Perhaps we
> could do some trickery involving g_object_weak_ref() and a lock?
> 
> telepathy-glib has a TpWeakRef struct which encapsulates a weak ref, but isn't
> thread-safe (for the same reason); if we can make it thread-safe, it might be
> worth contributing to gobject or gio as a boxed type?

Yeah, I ended up doing something like this which avoids weak-refs by doing my own locking... 'make check' passes and I think it's generally safe but would appreciate some extra eyeballs.. thoughts?
Comment 7 Simon McVittie 2011-05-26 13:21:29 UTC
(In reply to comment #5)
> IMHO the fix (here and in GCancellable) is to just document that the race
> condition inevitably exists, and how to write your code so as to cope with it.
> (ie, what the GCancellable docs used to say before the introduction of
> g_cancellable_connect/disconnect:
> http://git.gnome.org/browse/glib/commit/gio/gcancellable.c?id=efce76ce677f98c72f748e435ff693ef12a44dce)

Interesting. That advice (essentially, "have the signal connection itself hold a ref, and unref it in your GDestroyNotify") is useful in most cases, but in this particular case we're dealing with self-referential user data: GDBusProxy connects to some signals, but the GDBusProxy itself shouldn't stay alive just because it connected the signals, because that would make it immortal. Hence my suggestion of using weak refs, but then we just have a less likely race, AFAICS.

One possible solution would be for the signal connection to only have a weak ref, but for the idle used to deliver an individual signal to have a strong ref - but that would require a version of g_dbus_connection_signal_subscribe() that knows how to ref its user_data. g_dbus_connection_signal_subscribe_closure()? (But I don't really understand GClosure or its thread-safety characteristics either.)
Comment 8 Simon McVittie 2011-05-26 13:25:55 UTC
Review of attachment 188666 [details] [review]:

Looks OK to me (but I'm not a GObject committer). This is basically TpWeakRef, but with explicit code in dispose() instead of a weak ref :-)
Comment 9 David Zeuthen (not reading bugmail) 2011-05-26 13:31:29 UTC
(In reply to comment #8)
> Review of attachment 188666 [details] [review]:
> 
> Looks OK to me (but I'm not a GObject committer). This is basically TpWeakRef,
> but with explicit code in dispose() instead of a weak ref :-)

Yeah, it's pretty much that - thanks for reviewing. Committed

 http://git.gnome.org/browse/glib/commit/?id=7e0f890e3811c23d331079c3e878d0c1df4ae282
Comment 10 David Zeuthen (not reading bugmail) 2011-05-26 14:00:49 UTC
(In reply to comment #2)
> We do want some locking in GDBusProxy - it should be possible to use from
> multiple threads. I will add this right now and document it.

Did this, see

 http://git.gnome.org/browse/glib/commit/?id=c0f4a63c89332ee18c1ddf1fe48fe04b16b27fa3
Comment 11 Simon McVittie 2011-08-11 09:12:35 UTC
(FYI, this fix wasn't complete: stress-testing it revealed Bug #656282, Bug #656039 and Bug #656173. I've attached patches to all of those.)