GNOME Bugzilla – Bug 723788
gdbus-close-pending-test: regression test for unlikely race condition has its own unlikely race condition
Last modified: 2014-03-26 12:08:50 UTC
Created attachment 268331 [details] [review] patch, possibly incorrect (Continuing the saga of "can we get GLib 2.38 built on all Debian's architectures simultaneously?") gdbus-close-pending.c is a regression test for Bug #651268, which was that close() in the main thread raced with write() or flush() in the GDBus thread. However, it appears to have its own race condition, so far seen once, on a mips autobuilder: FAIL: gdbus-close-pending ========================= # random seed: R02Sc2942a140521364046646a1a66f824a9 # Start of gdbus tests ** GLib-GIO:ERROR:/«PKGBUILDDIR»/./gio/tests/gdbus-close-pending.c:353:test_once: assertion failed (f->error == NULL): Stream has outstanding operation (g-io-error-quark, 20) # GLib-GIO:ERROR:/«PKGBUILDDIR»/./gio/tests/gdbus-close-pending.c:353:test_once: assertion failed (f->error == NULL): Stream has outstanding operation (g-io-error-quark, 20) Aborted This means that the g_dbus_connection_close_sync() failed with that error. I thought this was a simple case of "the main thread might race with the GDBus thread", but on closer inspection of how I fixed Bug #651268, all the actual I/O should be happening in the GDBus thread now; so I don't actually know how this situation can arise.
Possibly related: Bug #723719, Bug #707912, Bug #723655.
This is on Debian's glib2.0/2.38.2-3. Some analysis: Tracing into the code, this means that g_io_stream_close_finish() failed with the same error in iostream_close_cb(). That call starts in continue_writing() and is guarded by worker->output_pending. The other things done to that IOStream are: * g_input_stream_read_async on the input side, which is *not* guarded by output_pending * g_socket_send_message and polling on the output side (if a socket), which are guarded by output_pending * g_output_stream_write_async on the output side, which is guarded by output_pending * g_output_stream_flush_async on the output side, which is guarded by output_pending Under normal circumstances, the GIOStream is a GSocketConnection, in which the close() and close_async() vfuncs can only fail if g_socket_close() fails, and that never fails with PENDING. In this particular regression test, the GIOStream is a MyIOStream, which encapsulates a pair of pipes, with the output stream wrapped in a class whose close() and close_async() are "slow" - this was the only way I could find to reproduce Bug #651268. It uses the GIOStream base class' close() and close_async(), which does have (and respect) a "pending" flag; so it will fail with PENDING if that flag is set. However, that flag is only set by close() and close_async() (at least in 2.38). Also, the GIOStream close() and close_async() end up calling the sync close() methods on the input and output sides, so if those raise PENDING, so will the GIOStream. So, this failure must indicate that a GIOStream close[_async]() overlapped with another GIOStream close[_async](), or that close() was called on either the GInputStream or GOutputStream while it had a pending operation. In this regression test, only one place closes the GDBusConnection, so the GIOStream close[_async]() can't be overlapping another close[_async](). Writing or flushing the GOutputStream can't overlap its close operation, because they're guarded by worker->output_pending. So that leaves the GInputStream. The GInputStream is a GUnixInputStream wrapping the read end of a pipe. Its close() respects its own pending flag; for it to raise PENDING, it must be pending on a close[_async](), read[_async]() or seek[_async](). The only one of them that GDBusWorker does, other than the non-overlapping close() guarded by worker->output_pending, is the read_async(). The read_async() is not guarded, so we could easily try to start a close() with a read pending. OK, so why doesn't this test *always* fail? Our read_async() implementation is cancellable. _g_dbus_worker_close() cancels it from the main thread, before taking a mutex to enqueue a "please close" request which will be received by the worker thread. If the worker thread is currently writing or flushing, it'll only process that request when that operation finishes; if it's currently idle, it'll only process that request in an idle. So I can well believe that cancelling the cancellable is "nearly always" faster than the write, flush or idle, so in practice the async read has "nearly always" finished by the time we close. I think Mikhail's patch on Bug #723719 might address this. I'll analyze that next.
I've confirmed that Mikhail's patch on Bug #723719 fixes this. *** This bug has been marked as a duplicate of bug 723719 ***