GNOME Bugzilla – Bug 705395
deadlock as calling g_cancellable_disconnect inside cancelled handler
Last modified: 2018-05-24 15:35:30 UTC
Created attachment 250758 [details] [review] move cancelled handler code out to main context The following deadlock happens. Patch to fix it (based on alike evolution code) is attached. I moved more (ie the whole code inside the handler).
+ Trace 232332
Thread 3 (Thread 0x7fffe3c05700 (LWP 2177))
Thread 1 (Thread 0x7ffff7fc1a00 (LWP 1605))
Interesting. For the record, here are the steps to reproduce this: - Run the goa-daemon under GDB. - Create a Kerberos account in GOA. - Delete the account from the panel. - Do a "thread apply all bt" in GDB and you will see that the daemon is deadlocked with this backtrace..
Review of attachment 250758 [details] [review]: Looks good, apart from a few minor issues and the bugzilla URL is missing from the commit message. Great work! ::: src/goaidentity/goaalarm.c @@ +257,3 @@ } + Extra newline. @@ +265,3 @@ + clear_scheduled_wakeups (self); + + return FALSE; G_SOURCE_REMOVE is better because it is more readable. @@ +266,3 @@ + + return FALSE; +} Missing a newline. @@ +282,3 @@ + async_alarm_cancel_idle_cb, + g_object_ref (self), + (GDestroyNotify) g_object_unref); This could have been in one line, and there is no need to cast g_object_unref. @@ +284,3 @@ + (GDestroyNotify) g_object_unref); + g_source_attach (idle_source, main_context); + g_source_unref (idle_source); You should use spaces, not tabs.
Created attachment 253405 [details] [review] goaidentity: Fix deadlock in goaalarm on_cancelled Fixes the minor style issues in the previous patch.
Many thanks for tracking this down.
i think this patch isn't quite right. It will cause clear_scheduled_wakeups to get run from the main thread which I don't think it's set up to do.
maybe we could nullify the timeout.source right away, but not unref until it hits the main thread. then check in the on_timeout_source_ready function if the source on the priv data is the same as the callback source before proceeding. or maybe we should fix glib not to deadlock ?
So i talked to alex and we've come up with a plan to fix the deadlock in glib. I'm going to do a little confirmation that the plan will pan out and then write the patch and he's going to review it. I'd suggest we revert attachment 253405 [details] [review] since it's not correct.
perhaps we should come up with a workaround for the stable branch of g-o-a though.
(one last rapid fire message) Just to spell things out...the plan is to replace the cancelled_running boolean state variable with a cancelled_running_thread GThread * state variable. Then all cases where we do while (priv->cancelled_running) get changed to while (priv->cancelled_running_thread != NULL && priv->cancelled_running_thread != g_thread_self ()) that way if the current thread gets to one of those code paths it will skip over it (which, we think, turns out to be the right thing to do)
I think the solution in comment 9 will work, *if* gsignal guarantees that disconnecting a signal handler from an instance+signal that is currently being emitted guarantees that the disconnected signal handler will not be emitted. I *believe* this is true, but we need to verify that. Otherwise signal handler FOO could disconnect an unrelated signal handler BAR (which should be ok with the above) and have BAR be called after its been disconnected.
I haven't had a chance to write a test case to confirm that behavior yet, but looking through the code it seemed to be the case. I'll attach the in progress patch i have laying around for now, just so it doesn't get lost.
Created attachment 254778 [details] [review] gcancellable: allow g_cancellable_disconnect from "cancelled" handler on same thread g_cancellable_disconnect will wait until any pending "cancelled" handlers finish. This is useful because disconnecting a handler can have the side-effect of freeing data that the cancelled handler may rely on. Unfortunately, the code used to enforce this synchronization between "cancelled" handlers and g_cancellable_disconnect will also cause deadlock if the cancelled handler itself calls g_cancellable_disconect. Obviously, if g_cancellable_disconnect is explicitly called by a "cancelled" handler, then the "cancelled" handler is shouldering the responsibility of not using any data that may be freed by disconnection. Also, g_cancellable_disconnect can be called in unexpected places by lower layers in the code (for instance as a result of g_source_destroy). In practice, this means it's easy for deadlocks to inadvertently crop up when using "cancelled" handlers. For these reasons, it would be good to fix the deadlock. This commit prevents the deadlock by allowing foregoing synchronization, if a pending "cancelled" handler is in the same thread as the g_cancellabale_disconnnect call.
Quick update..I reverted attachment 253405 [details] [review] last weekend before the code freeze to be on the safe side. But, I started looking into this in a little more detail today. A few things I've found: - attachment 254778 [details] [review] definitely prevents the deadlock - attachment 253405 [details] [review] with a small tweak to use the proper context also prevents the deadlock and 100% cpu issue Next steps: 1) I'm going to un-revert the patch, apply the small tweak to make it use the proper context, and push it to the gnome-3-8 branch. We're in hard code freeze now, so I won't push it to master. 2) I'm going update the glib patch to have a test case and reattach it here. I'm hoping we can get that patch in before GLib's release on monday. GLib isn't held to hard code freeze like goa is, and fixing it in GLib is more "right" anyway, imo.
(In reply to comment #10) > I think the solution in comment 9 will work, *if* gsignal guarantees that > disconnecting a signal handler from an instance+signal that is currently being > emitted guarantees that the disconnected signal handler will not be emitted. I > *believe* this is true, but we need to verify that. I pushed this test to glib master: https://git.gnome.org/browse/glib/commit/?id=29ef8217665fa1cd000e4668e01dbd638ab61f5e It demonstrates the behavior is indeed what we expect So, I think we're in the clear. So does attachment 254778 [details] [review] look good to go in?
I've pushed the fix for the patch to gnome-3-8 here: https://git.gnome.org/browse/gnome-online-accounts/commit/?h=gnome-3-8&id=64aab6a7345e7ae92148fbcd54f66a33244b99fc Moving this bug to glib since all that's left is the glib patch.
Review of attachment 254778 [details] [review]: There is another possible deadlock in the case where you call g_cancellable_connect() on an already cancelled cancellable. We then call the callback directly with the mutex held. Should be easy to avoid though by just unlocking earlier. ::: gio/gcancellable.c @@ +597,3 @@ + * Note, it is safe to call this function from a #GCancellable::cancelled + * signal handler running in the same thread. + * We should document that this is only safe since 2.38
Created attachment 255548 [details] [review] gcancellable: allow g_cancellable_disconnect from "cancelled" handler on same thread g_cancellable_disconnect will wait until any pending "cancelled" handlers finish. This is useful because disconnecting a handler can have the side-effect of freeing data that the cancelled handler may rely on. Unfortunately, the code used to enforce this synchronization between "cancelled" handlers and g_cancellable_disconnect will also cause deadlock if the cancelled handler itself calls g_cancellable_disconect. Obviously, if g_cancellable_disconnect is explicitly called by a "cancelled" handler, then the "cancelled" handler is shouldering the responsibility of not using any data that may be freed by disconnection. Also, g_cancellable_disconnect can be called in unexpected places by lower layers in the code (for instance as a result of g_source_destroy). In practice, this means it's easy for deadlocks to inadvertently crop up when using "cancelled" handlers. For these reasons, it would be good to fix the deadlock. This commit prevents the deadlock by allowing foregoing synchronization, if a pending "cancelled" handler is in the same thread as the g_cancellabale_disconnnect call.
Attachment 254778 [details] pushed as 83605e2 - gcancellable: allow g_cancellable_disconnect from "cancelled" handler on same thread Attachment 255548 [details] pushed as 83605e2 - gcancellable: allow g_cancellable_disconnect from "cancelled" handler on same thread
Pushed a version with the doc changes and early unlock in g_cancellable_connect().
Kinda nice how you landed this during hard code freeze on the day of the final tarballs due... My experience with this code in the past is that it's a maze of potential deadlocks. How sure are we that this change causes no regressions?
After a long discussion on IRC I am convinced that this code will only modify behaviour in the case of existing deadlocks. There is one possible chance for a real regression here which is that it is now possible for two "cancel" handlers on the same cancellable to run at the same time (in different threads) as a result of _connect(), which was not possible before. I don't expect this to be a problem at all. What does concern me is the introduction of a new supported behaviour: that disconnect(), when called on a handler from itself, should succeed. This makes no sense at all to me and I'd actually prefer a deadlock here, but after studying our use of gsignal I don't think it's possible to make a distinction between disconnecting a handler from itself and from another handler (since once we call _emit, we have no idea which handler is running). This means that the nice guarantee currently in the docs that "Additionally, in the event that a signal handler is currently running, this call will block until the handler has finished." to "disconnect won't return while your handler is running, unless it's running on this thread, and also won't return if anyone else's handler is running", which is somewhat less nice as an API. I'm going to add some notes to the docs about what we consider to be supported and unsupported behaviour here. Specifically, disconnecting ones own handler from itself via the _disconnect API never makes sense, and should not be considered as being supported. In the future, we could theoretically add a gsignal API along the lines of "tell me the ID of the currently-emitting handler on the innermost emission of signal X on instance Y" or even "is handler X on instance Y currently running?" and use that to tell the difference between the two cases. We could even theoretically use this to make the blocking more fine-grained so that we only block if the user's handler is actually the one running, perhaps with a blocking g_signal_handler_disconnect() API.
I take it all back. GClosure actually makes this all very easy.
Okay. GClosure is a no-go because of the fact that people may mix their own g_signal_connect() with our g_cancellable_disconnect(). Therefore I went for the "blocking disconnect" idea in bug 708643. We'll get that committed early next cycle, and deprecate g_cancellable_disconnect(). That makes me feel a lot better about any weird behaviour that it currently has.
but do you feel better about modifying this code the day before a .0 release?
So given commen 24, ryan is reverting the GLIb fix. I've pushed the same g-o-a workaround i pushed to gnome-3-8 to master after getting release-team approval.
Comment on attachment 255548 [details] [review] gcancellable: allow g_cancellable_disconnect from "cancelled" handler on same thread After consulting with the release team we decided to fix this in g-o-a and revert the GLib changes.
We now need to fix this next cycle (which starts now, I guess)... I still like the part of this patch that avoids holding the lock while calling the callback, but I think we no longer need the rest of it -- let's just deprecate g_cancellable_disconnect() and leave it as-is, implementation-wise.
i think rather than leaving it as-is implementation wise, and rather than recommitting attachment 255548 [details] [review] we should make g_cancellable_disconnect be implemented in terms of attachment 255585 [details] [review] from bug 708643
Even if g_signal_disconnect_sync() solves the race on disconnect we can't fully deprecatte the connect/disconnect pair unless we also have a story for g_cancellable_connect() races. I.E. unless you disconnect a handler you're supposed to get one (and only one) callback to your handler, even if the cancellable is already cancelled on connect time. External could could check for canceled before connecting, but that introduces a race condition where the signal may be emitted between checking and connecting.
*** Bug 642968 has been marked as a duplicate of this bug. ***
*** Bug 711006 has been marked as a duplicate of this bug. ***
Created attachment 258328 [details] [review] GCancellable: drop lock for callback during connect() Don't hold the lock when calling the user's callback during g_cancellable_connect() for the case that the cancellable has already fired. Taken from a patch by Alex Larsson.
I think we should commit at least this patch to fix Colin's issue in bug 711006.
Review of attachment 258328 [details] [review]: This looks simple and correct to me. But maybe also worth updating the docs to say e.g.: "Since GLib 2.40, it is now safe to invoke further functions such as g_cancellable_cancel() inside @callback. This lifts a restriction in place for earlier GLib versions which now makes it easier to write cleanup code that unconditionally invokes e.g. g_cancellable_cancel()."
Attachment 258328 [details] pushed as c8aba61 - GCancellable: drop lock for callback during connect()
Not fixed yet. This was just one small part of it.
given it's been almost 4 years, we should consider just pushing attachment 255548 [details] [review] again
(In reply to Ray Strode [halfline] from comment #37) > given it's been almost 4 years, we should consider just pushing attachment > 255548 [details] [review] again What’s still the issue here? It seems like a lot of possibilities were discussed here, and the consensus isn’t clear to me 4 years later.
deadlock if some code calls g_cancellable_disconnect (on any cancellable) down the call stack from a "cancelled" signal handler (on any other cancellable).
-- GitLab Migration Automatic Message -- This bug has been migrated to GNOME's GitLab instance and has been closed from further activity. You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.gnome.org/GNOME/glib/issues/740.