GNOME Bugzilla – Bug 651133
race condition in GDBusConnection's emit_signal_instance_in_idle_cb
Last modified: 2011-08-11 09:12:35 UTC
If: * you subscribe to a D-Bus signal from a thread whose thread-default main context is M * you unsubscribe from that D-Bus signal in thread A * ... at a time when M is currently being dispatched in thread B then signal delivery in thread B can race with the unsubscription in thread A: * the message handler thread adds an idle to M * emit_signal_instance_in_idle_cb starts in thread B * thread B locks the connection, checks that the subscription still exists, and unlocks the connection * thread A calls g_dbus_connection_signal_unsubscribe() * thread B calls the callback even though thread A just unsubscribed * the callback's expectations are violated Here's a concrete example where this is a problem, seen in libtracker-sparql: * thread A is a worker thread in a thread pool, which only performs blocking operations and so will never run a main loop * M is the global default main context * thread B is the main thread, running a main loop for M * thread A creates a GDBusProxy - which connects to NameOwnerChanged, with itself as user_data; the thread has not changed its thread-default main context, since it does not ever expect to iterate a main context, so callbacks are scheduled to be run in M * thread A calls a blocking method * thread A unrefs the GDBusProxy, which is finalized - in its finalize() it unsubscribes from NameOwnerChanged * simultaneously, thread B runs emit_signal_instance_in_idle_cb to process a NameOwnerChanged emission If this race condition can't be fixed, then something should document which of the things done by libtracker-sparql is not allowed.
Looking at GDBusProxy, it doesn't seem to have any locking, so the answer to the concrete example must be one of these options: * GDBusProxy is buggy, it should be MT-safe * libtracker-sparql is buggy, it should only use a GDBusProxy from one thread, and not while that thread's thread-default main context is being dispatched in another thread; GDBusProxy should document that limitation
(In reply to comment #1) > Looking at GDBusProxy, it doesn't seem to have any locking, so the answer to > the concrete example must be one of these options: > > * GDBusProxy is buggy, it should be MT-safe > > * libtracker-sparql is buggy, it should only use a GDBusProxy from one thread, > and not while that thread's thread-default main context is being dispatched > in another thread; GDBusProxy should document that limitation We do want some locking in GDBusProxy - it should be possible to use from multiple threads. I will add this right now and document it. But I don't think that answers the general question posed in comment 0.
(In reply to comment #0) > If: > > * you subscribe to a D-Bus signal from a thread whose thread-default main > context is M > * you unsubscribe from that D-Bus signal in thread A > * ... at a time when M is currently being dispatched in thread B > > then signal delivery in thread B can race with the unsubscription in thread A: > > * the message handler thread adds an idle to M > > * emit_signal_instance_in_idle_cb starts in thread B > > * thread B locks the connection, checks that the subscription still exists, > and unlocks the connection > > * thread A calls g_dbus_connection_signal_unsubscribe() > > * thread B calls the callback even though thread A just unsubscribed > > * the callback's expectations are violated Right - if you unsubscribe a signal handler from another thread than where it will deliver the callback then this is possible. There's really no way to fix this - e.g. thread A could be calling g_dbus_connection_signal_unsubscribe() just before the machine code instruction to do the callback is issued. We can add this to the docs but it's more or less just how threads work, isn't it? > Here's a concrete example where this is a problem, seen in libtracker-sparql: > > * thread A is a worker thread in a thread pool, which only performs blocking > operations and so will never run a main loop > > * M is the global default main context > > * thread B is the main thread, running a main loop for M > > * thread A creates a GDBusProxy > - which connects to NameOwnerChanged, with itself as user_data; > the thread has not changed its thread-default main context, > since it does not ever expect to iterate a main context, > so callbacks are scheduled to be run in M > > * thread A calls a blocking method > > * thread A unrefs the GDBusProxy, which is finalized > - in its finalize() it unsubscribes from NameOwnerChanged > > * simultaneously, thread B runs emit_signal_instance_in_idle_cb > to process a NameOwnerChanged emission > > If this race condition can't be fixed, then something should document which of > the things done by libtracker-sparql is not allowed. Ideally we should be able to fix this in GDBusProxy - I'm not entirely sure that is possible without adding some kind of global variable to keep track of what have been recently freed. Thoughts?
(In reply to comment #3) > Ideally we should be able to fix this in GDBusProxy - I'm not entirely sure > that is possible without adding some kind of global variable to keep track of > what have been recently freed. Thoughts? For the specific case of the three signal subscriptions (NameOwnerChanged, PropertiesChanged and the general signal subscriber), I'd suggest something like this: * instead of a borrowed GDBusProxy*, make the user_data a GDBusProxy** containing a weak ref (with g_object_add_weak_pointer) * in the callback, try to get and ref the GDBusProxy, and hold a ref around the callback; if we couldn't, skip the callback However, that second step is harder than it sounds, because I'm not aware of a way to get the weak pointer and ref the object in an atomic way. Perhaps we could do some trickery involving g_object_weak_ref() and a lock? telepathy-glib has a TpWeakRef struct which encapsulates a weak ref, but isn't thread-safe (for the same reason); if we can make it thread-safe, it might be worth contributing to gobject or gio as a boxed type?
GCancellable tries to be clever about avoiding this race condition for you, and... well, bug 586432, bug 587300, bug 642968, bug 650252. IMHO the fix (here and in GCancellable) is to just document that the race condition inevitably exists, and how to write your code so as to cope with it. (ie, what the GCancellable docs used to say before the introduction of g_cancellable_connect/disconnect: http://git.gnome.org/browse/glib/commit/gio/gcancellable.c?id=efce76ce677f98c72f748e435ff693ef12a44dce)
Created attachment 188666 [details] [review] Patch (In reply to comment #4) > (In reply to comment #3) > > Ideally we should be able to fix this in GDBusProxy - I'm not entirely sure > > that is possible without adding some kind of global variable to keep track of > > what have been recently freed. Thoughts? > > For the specific case of the three signal subscriptions (NameOwnerChanged, > PropertiesChanged and the general signal subscriber), I'd suggest something > like this: > > * instead of a borrowed GDBusProxy*, make the user_data a GDBusProxy** > containing a weak ref (with g_object_add_weak_pointer) > > * in the callback, try to get and ref the GDBusProxy, and hold a ref around > the callback; if we couldn't, skip the callback > > However, that second step is harder than it sounds, because I'm not aware of a > way to get the weak pointer and ref the object in an atomic way. Perhaps we > could do some trickery involving g_object_weak_ref() and a lock? > > telepathy-glib has a TpWeakRef struct which encapsulates a weak ref, but isn't > thread-safe (for the same reason); if we can make it thread-safe, it might be > worth contributing to gobject or gio as a boxed type? Yeah, I ended up doing something like this which avoids weak-refs by doing my own locking... 'make check' passes and I think it's generally safe but would appreciate some extra eyeballs.. thoughts?
(In reply to comment #5) > IMHO the fix (here and in GCancellable) is to just document that the race > condition inevitably exists, and how to write your code so as to cope with it. > (ie, what the GCancellable docs used to say before the introduction of > g_cancellable_connect/disconnect: > http://git.gnome.org/browse/glib/commit/gio/gcancellable.c?id=efce76ce677f98c72f748e435ff693ef12a44dce) Interesting. That advice (essentially, "have the signal connection itself hold a ref, and unref it in your GDestroyNotify") is useful in most cases, but in this particular case we're dealing with self-referential user data: GDBusProxy connects to some signals, but the GDBusProxy itself shouldn't stay alive just because it connected the signals, because that would make it immortal. Hence my suggestion of using weak refs, but then we just have a less likely race, AFAICS. One possible solution would be for the signal connection to only have a weak ref, but for the idle used to deliver an individual signal to have a strong ref - but that would require a version of g_dbus_connection_signal_subscribe() that knows how to ref its user_data. g_dbus_connection_signal_subscribe_closure()? (But I don't really understand GClosure or its thread-safety characteristics either.)
Review of attachment 188666 [details] [review]: Looks OK to me (but I'm not a GObject committer). This is basically TpWeakRef, but with explicit code in dispose() instead of a weak ref :-)
(In reply to comment #8) > Review of attachment 188666 [details] [review]: > > Looks OK to me (but I'm not a GObject committer). This is basically TpWeakRef, > but with explicit code in dispose() instead of a weak ref :-) Yeah, it's pretty much that - thanks for reviewing. Committed http://git.gnome.org/browse/glib/commit/?id=7e0f890e3811c23d331079c3e878d0c1df4ae282
(In reply to comment #2) > We do want some locking in GDBusProxy - it should be possible to use from > multiple threads. I will add this right now and document it. Did this, see http://git.gnome.org/browse/glib/commit/?id=c0f4a63c89332ee18c1ddf1fe48fe04b16b27fa3
(FYI, this fix wasn't complete: stress-testing it revealed Bug #656282, Bug #656039 and Bug #656173. I've attached patches to all of those.)