Bug 723788 – gdbus-close-pending-test: regression test for unlikely race condition has its own unlikely race condition

After an evaluation, GNOME has moved from Bugzilla to GitLab. Learn more about GitLab.
No new issues can be reported in GNOME Bugzilla anymore.
To report an issue in a GNOME project, go to GNOME GitLab.
Do not go to GNOME Gitlab for: Bluefish, Doxygen, GnuCash, GStreamer, java-gnome, LDTP, NetworkManager, Tomboy.

Bug 723788 - gdbus-close-pending-test: regression test for unlikely race condition has its own unlikely race condition


Summary:	gdbus-close-pending-test: regression test for unlikely race condition has its...


Status:	RESOLVED DUPLICATE of bug 723719

Product:	glib
Classification:	Platform
Component:	gdbus
Version:	2.38.x
Hardware:	Other Linux

Importance:	Normal normal
Target Milestone:	---
Assigned To:	David Zeuthen (not reading bugmail)
QA Contact:	gtkdev

URL:
Whiteboard:

Depends on:
Blocks:

Reported:	2014-02-06 18:47 UTC by Simon McVittie
Modified:	2014-03-26 12:08 UTC

See Also:
GNOME target:	---
GNOME version:	---

Attachments
patch, possibly incorrect (1.36 KB, patch) 2014-02-06 18:47 UTC, Simon McVittie	none	Details \| Review

Description Simon McVittie 2014-02-06 18:47:02 UTC

Created attachment 268331 [details] [review]
patch, possibly incorrect

(Continuing the saga of "can we get GLib 2.38 built on all Debian's architectures simultaneously?")

gdbus-close-pending.c is a regression test for Bug #651268, which was that close() in the main thread raced with write() or flush() in the GDBus thread.

However, it appears to have its own race condition, so far seen once, on a mips autobuilder:

FAIL: gdbus-close-pending
=========================
# random seed: R02Sc2942a140521364046646a1a66f824a9
# Start of gdbus tests
**
GLib-GIO:ERROR:/«PKGBUILDDIR»/./gio/tests/gdbus-close-pending.c:353:test_once: assertion failed (f->error == NULL): Stream has outstanding operation (g-io-error-quark, 20)
# GLib-GIO:ERROR:/«PKGBUILDDIR»/./gio/tests/gdbus-close-pending.c:353:test_once: assertion failed (f->error == NULL): Stream has outstanding operation (g-io-error-quark, 20)
Aborted

This means that the g_dbus_connection_close_sync() failed with that error.

I thought this was a simple case of "the main thread might race with the GDBus thread", but on closer inspection of how I fixed Bug #651268, all the actual I/O should be happening in the GDBus thread now; so I don't actually know how this situation can arise.

Comment 1 Simon McVittie 2014-02-08 23:50:50 UTC

Possibly related: Bug #723719, Bug #707912, Bug #723655.

Comment 2 Simon McVittie 2014-02-11 12:38:39 UTC

This is on Debian's glib2.0/2.38.2-3.

Some analysis:

Tracing into the code, this means that g_io_stream_close_finish() failed with the same error in iostream_close_cb(). That call starts in continue_writing() and is guarded by worker->output_pending.

The other things done to that IOStream are:

* g_input_stream_read_async on the input side, which is *not* guarded by
  output_pending

* g_socket_send_message and polling on the output side (if a socket),
  which are guarded by output_pending

* g_output_stream_write_async on the output side, which is guarded
  by output_pending

* g_output_stream_flush_async on the output side, which is guarded by
  output_pending

Under normal circumstances, the GIOStream is a GSocketConnection, in which the close() and close_async() vfuncs can only fail if g_socket_close() fails, and that never fails with PENDING.

In this particular regression test, the GIOStream is a MyIOStream, which encapsulates a pair of pipes, with the output stream wrapped in a class whose close() and close_async() are "slow" - this was the only way I could find to reproduce Bug #651268. It uses the GIOStream base class' close() and close_async(), which does have (and respect) a "pending" flag; so it will fail with PENDING if that flag is set. However, that flag is only set by close() and close_async() (at least in 2.38).

Also, the GIOStream close() and close_async() end up calling the sync close() methods on the input and output sides, so if those raise PENDING, so will the GIOStream. So, this failure must indicate that a GIOStream close[_async]() overlapped with another GIOStream close[_async](), or that close() was called on either the GInputStream or GOutputStream while it had a pending operation.

In this regression test, only one place closes the GDBusConnection, so the GIOStream close[_async]() can't be overlapping another close[_async]().

Writing or flushing the GOutputStream can't overlap its close operation, because they're guarded by worker->output_pending. So that leaves the GInputStream.

The GInputStream is a GUnixInputStream wrapping the read end of a pipe. Its close() respects its own pending flag; for it to raise PENDING, it must be pending on a close[_async](), read[_async]() or seek[_async](). The only one of them that GDBusWorker does, other than the non-overlapping close() guarded by worker->output_pending, is the read_async(). The read_async() is not guarded, so we could easily try to start a close() with a read pending.

OK, so why doesn't this test *always* fail? Our read_async() implementation is cancellable. _g_dbus_worker_close() cancels it from the main thread, before taking a mutex to enqueue a "please close" request which will be received by the worker thread. If the worker thread is currently writing or flushing, it'll only process that request when that operation finishes; if it's currently idle, it'll only process that request in an idle. So I can well believe that cancelling the cancellable is "nearly always" faster than the write, flush or idle, so in practice the async read has "nearly always" finished by the time we close.

I think Mikhail's patch on Bug #723719 might address this. I'll analyze that next.

Comment 3 Simon McVittie 2014-02-11 16:13:45 UTC

I've confirmed that Mikhail's patch on Bug #723719 fixes this.

*** This bug has been marked as a duplicate of bug 723719 ***