GNOME Bugzilla – Bug 651268
assertion failed: (!g_output_stream_has_pending (ostream)) in GDBus worker thread
Last modified: 2017-11-15 12:58:55 UTC
this report has been filed here: https://bugs.launchpad.net/ubuntu/+source/gnome-power-manager/+bug/740878 gpm seems to be crashing for users after log in, backtrace is: ".
+ Trace 227287
Thread 2 (Thread 1227)
I've seen automated crash reports resembling this backtrace in several non-public apps, and it's in GDBus internal code, so I think it's likely to be a GDBus bug. Could someone with Bugzilla permissions please retitle to something like: assertion failed: (!g_output_stream_has_pending (ostream)) in GDBus worker thread and reassign to Platform/glib/GDBus? So far, I've been unable to reproduce it in a synthetic stress-test, or spot what the problem is from examination of the source code. The interesting part of the stack trace is thread 1:
+ Trace 228338
Thread 1 (Thread 1234)
[For those working on the appropriate project, this is NB#271520.] One theory I had was: > /* called in private thread shared by all GDBusConnection instances (without > write-lock held) */ > static gboolean > write_message_in_idle_cb (gpointer user_data) > { > GDBusWorker *worker = user_data; > if (worker->num_writes_pending == 0 && !worker->flush_pending) > maybe_write_next_message (worker); > return FALSE; > } > > It checks num_writes_pending and flush_pending without holding a lock: they > might be updated from another thread. However, on closer examination, this doesn't look like a bug. The code is subtle, but seems correct: * the relevant threads come in two categories: the shared GDBus worker thread (I'll call it "the worker thread") and all other threads, including the main thread (I'll call them "user threads") * user threads sometimes look at num_writes_pending and flush_pending, but do not modify them; they take the write_lock while looking at those variables * various functions in the worker thread alter num_writes_pending and flush_pending; they always take the write_lock while altering those variables * write_message_in_idle_cb reads those variables without taking the lock, but this is safe, because user threads never alter them, and write_message_in_idle_cb runs in the worker thread, so the worker thread can't be doing anything else (like concurrent alteration) at the same time Based on drawing a diagram of the call-graph and how the pending state and those variables relate, I can't yet see how this assertion failure can happen, except under pathological circumstances [1], but there must be something I haven't spotted. I'll have another look tomorrow... -------- [1] These are the pathological circumstances I mean: * someone has passed a GIOStream (which is a paired GInputStream and GOutputStream) to g_dbus_connection_new(), and * a user thread is concurrently sending data out of the GOutputStream, bypassing GDBus That would be a silly thing to do, and I don't think it's something we want to support. I doubt gnome-power-manager actually does this, though!
(In reply to comment #2) > Based on drawing a diagram of the call-graph and how the pending state and > those variables relate, I can't yet see how this assertion failure can happen Of course, just after writing that, I spotted a potential problem, then found the fix in git. I believe this is a duplicate of Bug #635626.
(In reply to comment #3) > I believe this is a duplicate of Bug #635626. In fact it can't be the same thing - the reporter of the original Launchpad bug here was running GLib 2.28.4, the systems where I've seen this bug are also based on GLib 2.28.4, and Bug #635626 was fixed in 2.27.5. Perhaps the fix was incomplete?
The crash looks like it's caused by what's happening in the other thread (my backtraces from non-public apps all have the same pattern). The GDBus worker thread is sending a message:
+ Trace 228365
GDBusWorker guards itself against pending write and flush operations (assuming you don't mix data into the same output stream, which would be silly), but it doesn't guard against a pending close. I'm trying to write a test-case for this.
Ugh, Bugzilla's "pretty" stacktrace display broke my comment. What I meant to say (I'll turn it into two separate comments) is: The GDBus worker thread is sending a message:
+ Trace 228366
... and meanwhile, the main thread is finalizing a private GDBusConnection, which results in it closing its IO stream (the actual close is synchronous and relatively fast, although the result won't be reported until the main loop is re-entered):
+ Trace 228367
Created attachment 195924 [details] [review] GDBus test case Both cases in this test fail in the same way as gnome-power-manager and my non-public apps: one case closes the connection just after sending a message, and the other last-unrefs the connection. To reproduce the bug it was necessary to make close() take long enough that the worker thread gets scheduled (I used 1 millisecond), and repeat the test a few times (repeating 100 times reliably triggers the bug on my laptop). I'm working towards a fix, but that's harder... (This test-case demonstrates that there's definitely a GDBus bug - nothing gnome-power-manager-specific here.)
Created attachment 196024 [details] [review] [1/5] GDBusWorker: combine num_writes_pending with flush_pending num_writes_pending was a counter, but it only took values 0 or 1, so make it a boolean: it would never make sense to be trying to write out two messages at the same time (they'd get interleaved). Similarly, we can never be writing and flushing at the same time (that'd mean we were flushing halfway through a message, which would be pointless) so combine it with flush_pending too, calling the result output_pending. Also assert that it takes the expected value whenever we change it, and document the locking discipline used for it, including a subtle case in write_message_in_idle_cb where it's not obvious at first glance why we don't need the lock.
Created attachment 196025 [details] [review] [2/5] GDBusConnection: delegate to the worker to close the stream We can't safely close the output part of the I/O stream until any pending write or flush has been completed. In the worst case, this could lead to an assertion failure in the worker (when the close wins the race) or not closing the stream at all (when the write wins the race). --- This is the fix for this bug as reported.
Created attachment 196027 [details] [review] [3/5] GDBusWorker: annotate more functions with lock/pending state
Created attachment 196028 [details] [review] [4/5] Regression test for closing a stream, slowly, while a message is queued
Created attachment 196029 [details] [review] [5/5] GDBusWorker: as a precaution, access 'stopped' boolean atomically This member is written in _g_dbus_worker_stop from arbitrary threads, and read by the worker thread, so it should be accessed atomically. --- That's all the patches for now; they seem to fix the test in 4/5, at least. Developed against 2.28, but should hopefully apply to 2.30 or master.
I've no gdbus/glib reviewer hat, but all patches seem OK to me, though.
Review of attachment 196025 [details] [review]: Minor comments, looks generally right to me. ::: gio/gdbusconnection.c @@ -520,3 @@ - NULL, /* GCancellable */ - NULL, /* GAsyncReadyCallback */ - NULL); /* userdata */ So...what is responsible for ensuring the stream is closed now? Just the refcount on the socket? If that's the case, why were we explicitly closing it before? Looks like we cancel/unref the worker in dispose(), so this close_async() call should have always been superfluous. ::: gio/gdbusprivate.c @@ +1390,3 @@ + + /* if we want to close the connection, that takes precedence */ + if (worker->close_list != NULL) Hm, "close_list" is a weird name to me. How about "pending_close_data"? Same goes for "flush_list". @@ +1489,3 @@ +/* + * @write_data: (transfer full) (allow-none): + * @close_data: (transfer full) (allow-none): If you're going to make it look like gtk-doc, you might as well add schedule_write_in_worker_thread: to the top too. @@ +1519,3 @@ + (GDestroyNotify) _g_dbus_worker_unref); + g_source_attach (idle_source, shared_thread_data->context); + g_source_unref (idle_source); Hmm, I thought we had centralized the idle creation in a utility function earlier, but I guess not.
Review of attachment 196024 [details] [review]: Makes sense.
Review of attachment 196027 [details] [review]: We clearly lack a nice semi-formal style for lock annotations, but this looks like an improvement at least.
Review of attachment 196028 [details] [review]: ::: gio/tests/gdbus-close-pending.c @@ +304,3 @@ + + g_assert_cmpint (pipe (f->server_to_client), ==, 0); + g_assert_cmpint (pipe (f->client_to_server), ==, 0); g_unix_open_pipe() Also I definitely prefer expressions with side effects to be outside assertions. @@ +382,3 @@ + gconstpointer context) +{ + g_message ("skipped (FIXME: use a Win32 pipe or something)"); Seems like it'd be better to only compile/run this at all on unix via automake conditional rather than having 90% of the file under #ifdef G_OS_UNIX.
Review of attachment 196029 [details] [review]: Sure.
Review of attachment 196025 [details] [review]: ::: gio/gdbusprivate.c @@ +1519,3 @@ + (GDestroyNotify) _g_dbus_worker_unref); + g_source_attach (idle_source, shared_thread_data->context); + g_source_unref (idle_source); Ryan added g_main_context_invoke() for this purpose.
(In reply to comment #15) > So...what is responsible for ensuring the stream is closed now? Just the > refcount on the socket? _g_dbus_worker_stop() in dispose (which was already called) now schedules a close on the stream, in addition to what it already did - so finalize no longer needs to. > Looks like we cancel/unref the worker in dispose(), so this close_async() call > should have always been superfluous. Yes, but relying on the refcount would mean that the stream wasn't actually closed if there was a refleak. > Hm, "close_list" is a weird name to me. How about "pending_close_data"? Perhaps close_attempts? > Hmm, I thought we had centralized the idle creation in a utility function > earlier, but I guess not. Perhaps you did on 2.30, but I'm still targeting 2.28 at the moment. I'll check that everything applies nicely to 2.30 - there'll probably be an additional patch to clean up things that can be cleaned up. (Unless you're no longer fixing 2.28 bugs - but I need a version of these patches that can apply to 2.28 anyway, whether you still want those or not.) (In reply to comment #18) > g_unix_open_pipe() Not on 2.28, but yes. > Also I definitely prefer expressions with side effects to be outside > assertions. Yes; I should fix gdbus-non-socket.c (on which I based this test), too. > Seems like it'd be better to only compile/run this at all on unix via automake > conditional rather than having 90% of the file under #ifdef G_OS_UNIX. The reason I didn't is that there's nothing inherently Unix-specific in this test - it just needs a pipe-to-self implementation, and I don't know the using-the-right-C-runtime subtleties of writing one of those on Windows. gdbus-non-socket.c is the same.
Hey, sorry for not responding sooner - I was at the Plumbers conference last week. Thanks for the patches and also thanks to Colin for the review. Patches generally look good and I'm happy to see how they improve the quality of the code in gio/gdbusprivate.c - the code I originally wrote is pretty cryptic because everything in async such that multiple connections can share the same worker thread - sorry about that! I think it sounds like a great idea to get this into both 2.28 and 2.30. I just tried applying the patches on 2.30 but they didn't cleanly apply... FWIW, I'm not too concerned about using the new glib-unix stuff or g_main_context_invoke() only in 2.29.* from the get go - I think it's more important to get the bug-fixes (and test cases) into the tree and then later we can add patches to make the code 2.29.* only by using e.g. g_main_context_invoke(). Thanks again for the great work!
OK, I now have two branches, for 2.28 and 2.30/master. I'll attach them after some testing. (In reply to comment #18) > g_unix_open_pipe() Fixed in the 2.30 version. > Also I definitely prefer expressions with side effects to be outside > assertions. Fixed in both versions for the test I added, but this seems to be a general problem in GLib tests, which is rather out-of-scope for this bug. You could argue that when the tests report failure as an assertion, it's not a problem if they don't really work with assertions disabled? I have about half a branch to fix the other instances in 2.28, but it got rather long, so I shelved it; I'll revisit it when the changes from this bug have landed. Do people want this fixed in 2.28, 2.30, or only in master? (I suspect only-in-master is best.) > Seems like it'd be better to only compile/run this at all on unix via automake > conditional rather than having 90% of the file under #ifdef G_OS_UNIX. The GDBus tests are Unix-only at the moment anyway, because most of them run under a temporary session bus using dbus-launch. For the moment I made it #error if used on non-Unix; I have half a branch to make it work on Windows, but in my mingw-w64 + Wine environment it still fails, possibly because it's picking up a GIO DLL with this bug not fixed. I have no idea whether it works on real Windows. (In reply to comment #20) > Ryan added g_main_context_invoke() for this purpose. I don't think this is an appropriate change for 2.28 (although the function does exist there), but I'll add a patch to do this in the 2.30 branch. I would personally not apply this to 2.30 either, only to master after merging 2.30, but it's up to you whether you want to apply it anyway. How do stable branches work in GLib? It looks to me as though fixes get applied or cherry-picked separately to all applicable branches, rather than merging stable into master like we do in dbus?
(In reply to comment #23) > OK, I now have two branches, for 2.28 and 2.30/master. I'll attach them after > some testing. Excellent, thanks! > > Also I definitely prefer expressions with side effects to be outside > > assertions. > > Fixed in both versions for the test I added, but this seems to be a general > problem in GLib tests, which is rather out-of-scope for this bug. You could > argue that when the tests report failure as an assertion, it's not a problem if > they don't really work with assertions disabled? > > I have about half a branch to fix the other instances in 2.28, but it got > rather long, so I shelved it; I'll revisit it when the changes from this bug > have landed. Do people want this fixed in 2.28, 2.30, or only in master? (I > suspect only-in-master is best.) I think that it's fine to have side-effects in asserts in the tests - the whole idea is that the g_assert*() family of functions are used in tests so the test program won't work correctly if asserts are disabled. > The GDBus tests are Unix-only at the moment anyway, because most of them run > under a temporary session bus using dbus-launch. Yeah. I would love to see gtester support something like that natively though... > For the moment I made it #error if used on non-Unix; I have half a branch to > make it work on Windows, but in my mingw-w64 + Wine environment it still fails, > possibly because it's picking up a GIO DLL with this bug not fixed. I have no > idea whether it works on real Windows. That's fine - if someone is so inclined they can make it work on Windows. > How do stable branches work in GLib? It looks to me as though fixes get applied > or cherry-picked separately to all applicable branches, rather than merging > stable into master like we do in dbus? Yeah, I think we only cherry-pick only the really important changes / bug-fixes.
(In reply to comment #24) > > The GDBus tests are Unix-only at the moment anyway, because most of them run > > under a temporary session bus using dbus-launch. > > Yeah. I would love to see gtester support something like that natively > though... For that sort of thing, you probably want have gtester support a G_TEST_TRAP_DBUS_DAEMON flag that runs something like 'dbus-daemon --print-address=3 --no-fork' directly, rather than relying on dbus-launch. dbus-launch isn't really the right tool for the job: see <https://bugs.freedesktop.org/show_bug.cgi?id=39196> (where review would also be appreciated).
Created attachment 196410 [details] [review] [1/6 v2 for 2.28] GDBusWorker: combine num_writes_pending with flush_pending num_writes_pending was a counter, but it only took values 0 or 1, so make it a boolean: it would never make sense to be trying to write out two messages at the same time (they'd get interleaved). Similarly, we can never be writing and flushing at the same time (that'd mean we were flushing halfway through a message, which would be pointless) so combine it with flush_pending too, calling the result output_pending. Also assert that it takes the expected value whenever we change it, and document the locking discipline used for it, including a subtle case in write_message_in_idle_cb where it's not obvious at first glance why we don't need the lock. (Having the combined boolean at the top of the block of write-related struct members improves struct packing on 64-bit platforms, by packing read_num_ancillary_messages and output_pending into one word.)
Created attachment 196411 [details] [review] [2/6 v2 for 2.28] GDBusConnection: delegate to the worker to close the stream We can't safely close the output part of the I/O stream until any pending write or flush has been completed. In the worst case, this could lead to an assertion failure in the worker (when the close wins the race) or not closing the stream at all (when the write wins the race). --- I think this incorporates all of Colin's review feedback that is applicable to 2.28: - variables renamed: GList *pending_close_attempts, *pending_flush_attempts; - add "schedule_write_in_worker_thread:" heading to its pseudo-gtkdoc comment Using g_main_context_invoke_full() is not a trivial change, because it calls the callback inline if it can acquire the thread-default main context on behalf of the calling thread. That's not what we want here: it seems we genuinely do need an idle to be used. I did try using g_main_context_invoke_full on the 2.30 branch, and once I'd fixed the obvious deadlock (must not hold worker->write_lock across the g_main_context_invoke_full call), I still saw deadlocks in my gdbus-proxy-threads stress-test. One notable thing about that test is that it uses GDBusProxy synchronously from threads that don't have their own thread-default main context - which seems to be something that's done "in real life", notably in Tracker.
Created attachment 196412 [details] [review] [3/6 v2 for 2.28] GDBusWorker: annotate more functions with lock/pending state Now with locking-discipline annotations on the writing-related struct members, too.
Created attachment 196413 [details] [review] [4/6 v2 for 2.28] Regression test for closing a stream, slowly, while a message is queued
Created attachment 196414 [details] [review] [5/6 v2 for 2.28] GDBusWorker: as a precaution, access 'stopped' boolean atomically This member is written in _g_dbus_worker_stop from arbitrary threads, and read by the worker thread, so it should be accessed atomically.
(In reply to comment #29) > [4/6 v2 for 2.28] Regression test for closing a stream, slowly, while a > message is queued I think this addresses Colin's review feedback, except for g_unix_open_pipe() which isn't available in 2.28.
Created attachment 196416 [details] [review] [6/6 for 2.28] _g_dbus_worker_new: ref worker until after _g_dbus_worker_thread_begin_func Otherwise, if the caller of _g_dbus_worker_new immediately unrefs the worker, calling _g_dbus_worker_thread_begin_func in the worker thread could be a use-after-free. --- New patch: I noticed this problem while rebasing to 2.30. This is a backport.
Created attachment 196418 [details] [review] [1/6 for 2.30] GDBusWorker: combine num_writes_pending with flush_pending The same as the corresponding 2.28 patch, but it deletes num_writes_pending from a different location (the struct packing had already been improved).
Created attachment 196419 [details] [review] [2/6 for 2.30] GDBusConnection: delegate to the worker to close the stream Non-trivial changes relative to the 2.28 version: - _g_dbus_worker_stop no longer needs to unref in an idle, because it now calls _g_dbus_worker_close, which unrefs in an idle anyway: so comit 322e25b35 can essentially be reverted
Created attachment 196420 [details] [review] [3/6 for 2.30] GDBusWorker: annotate more functions with lock/pending state Almost identical to the 2.28 version. I've checked that it's all still relevant.
Created attachment 196422 [details] [review] [4/6 v2 for 2.30] Regression test for closing a stream, slowly, while a message is queued Changes relative to 2.28: the Makefile.am fixes some conflicts, and the test itself uses g_unix_open_pipe().
Created attachment 196423 [details] [review] [5/6 for 2.30] GDBusWorker: as a precaution, access 'stopped' boolean atomically Essentially the same as the 2.28 version; it just fixes some conflicts.
Created attachment 196424 [details] [review] [6/6 for 2.30] _g_dbus_worker_new: hold a ref until we have done the initial read Otherwise, we could use-after-free the GDBusWorker, if its last-unref is immediately after _g_dbus_worker_new returns (before the worker thread does its initial read). --- The code structure in 2.30 differs significantly from 2.28, but the bug and the fix are basically the same as Attachment #196416 [details].
The 2.30 patches apply cleanly to master and still pass tests. The 2.28 patches will conflict badly.
OK, I've committed the 2.30 patches to master and glib-2-30 branches and the 2.28 patches to the glib-2-28 branch. Thanks!
*** Bug 659085 has been marked as a duplicate of this bug. ***