Bug 662395 – _g_dbus_worker_flush_sync doesn't always flush when it should

After an evaluation, GNOME has moved from Bugzilla to GitLab. Learn more about GitLab.
No new issues can be reported in GNOME Bugzilla anymore.
To report an issue in a GNOME project, go to GNOME GitLab.
Do not go to GNOME Gitlab for: Bluefish, Doxygen, GnuCash, GStreamer, java-gnome, LDTP, NetworkManager, Tomboy.

Bug 662395 - _g_dbus_worker_flush_sync doesn't always flush when it should


Summary:	_g_dbus_worker_flush_sync doesn't always flush when it should


Status:	RESOLVED FIXED

Product:	glib
Classification:	Platform
Component:	gdbus
Version:	2.30.x
Hardware:	Other Linux

Importance:	Normal normal
Target Milestone:	---
Assigned To:	David Zeuthen (not reading bugmail)
QA Contact:	gtkdev

URL:
Whiteboard:

Depends on:
Blocks:

Reported:	2011-10-21 15:36 UTC by Simon McVittie
Modified:	2011-11-23 12:19 UTC

See Also:
GNOME target:	---
GNOME version:	---

Attachments
[1/7] GDBusWorker: distinguish between 3 sorts of output that might be pending (10.70 KB, patch) 2011-11-01 18:21 UTC, Simon McVittie	rejected	Details \| Review
[2/7] GDBusWorker: rename some functions (2.93 KB, patch) 2011-11-01 18:21 UTC, Simon McVittie	none	Details \| Review
[3/7] schedule_write_in_worker_thread: require caller to lock; rename accordingly (2.71 KB, patch) 2011-11-01 18:22 UTC, Simon McVittie	none	Details \| Review
[4/7] GDBusWorker: move flush async op into continue_writing() (6.14 KB, patch) 2011-11-01 18:23 UTC, Simon McVittie	none	Details \| Review
[5/7] _g_dbus_worker_flush_sync: always flush if we need to (5.18 KB, patch) 2011-11-01 18:23 UTC, Simon McVittie	none	Details \| Review
[6/7] DBus tests: factor out TestIOStream, test_pipe and test_bidi_pipe (19.27 KB, patch) 2011-11-01 18:24 UTC, Simon McVittie	none	Details \| Review
[7/7] Add test case for #662395 (13.37 KB, patch) 2011-11-01 18:26 UTC, Simon McVittie	none	Details \| Review
[1/7 v2] GDBusWorker: distinguish between 3 sorts of output that might be pending (10.66 KB, patch) 2011-11-01 19:16 UTC, Simon McVittie	none	Details \| Review
[4/7 v2] GDBusWorker: move flush async op into continue_writing() (6.33 KB, patch) 2011-11-01 19:17 UTC, Simon McVittie	none	Details \| Review

Description Simon McVittie 2011-10-21 15:36:37 UTC

I believe _g_dbus_worker_flush_sync doesn't always flush when it should, although I still need to write a regression test to prove whether this is the case.

I think the two problem situations are:

* Send one or more messages, wait for them to all be sent, then flush.
  The transport should be flushed, but isn't, and flush_sync returns
  immediately.

* Send a message, then flush after the async write has started,
  but before it finishes. If no other messages are pending, the
  transport won't be flushed, and flush_sync will return immediately
  even though that message is still in-flight.

I have some patches which I believe should fix this; I'll work on a regression test next week. The first problem situation should be easy to reproduce with a transport with a large buffer that is only written when you flush, and the second should be easy to reproduce with a transport where writing takes a long time.

Comment 1 David Zeuthen (not reading bugmail) 2011-10-21 17:20:15 UTC

Ah, yeah, I can see how that's wrong (at least the first case).

Comment 2 Simon McVittie 2011-11-01 18:18:42 UTC

(In reply to comment #0)
> * Send one or more messages, wait for them to all be sent, then flush.
>   The transport should be flushed, but isn't, and flush_sync returns
>   immediately.

This is /gdbus/connection/flush/idle in the test I'm about to attach.

> * Send a message, then flush after the async write has started,
>   but before it finishes. If no other messages are pending, the
>   transport won't be flushed, and flush_sync will return immediately
>   even though that message is still in-flight.

This is /gdbus/connection/flush/busy in the test I'm about to attach.

Comment 3 Simon McVittie 2011-11-01 18:21:38 UTC

Created attachment 200430 [details] [review]
[1/7] GDBusWorker: distinguish between 3 sorts of output that might be pending

If the user calls flush_sync() with no messages in the queue, but an
async write call pending, then we ought to flush after that async write
returns (although we don't currently do that). If it was an async close
or flush that was pending, there's no need to flush (again) afterwards.
So, we need to distinguish.

Comment 4 Simon McVittie 2011-11-01 18:21:56 UTC

Created attachment 200431 [details] [review]
[2/7] GDBusWorker: rename some functions

maybe_write_next_message now also closes, and I'm about to make it
consider whether to flush as well, so its name is increasingly
inappropriate. Similarly, write_message_in_idle_cb is a wrapper around
it which could do any of those things.

Comment 5 Simon McVittie 2011-11-01 18:22:25 UTC

Created attachment 200432 [details] [review]
[3/7] schedule_write_in_worker_thread: require caller to lock;  rename accordingly

When we use this function to schedule a flush, it'll be called
with the lock held. Releasing and immediately re-taking the lock would
be pointless.

Comment 6 Simon McVittie 2011-11-01 18:23:27 UTC

Created attachment 200433 [details] [review]
[4/7] GDBusWorker: move flush async op into continue_writing()

This makes it easier to schedule a flush, by putting it on the same code
path as writing and closing.

Also change message_written to expect the lock to be held, since all
that's left in that function either wants to hold the lock or doesn't
care, and it's silly to release the lock immediately before calling
message_written, which just takes it again.

Comment 7 Simon McVittie 2011-11-01 18:23:52 UTC

Created attachment 200434 [details] [review]
[5/7] _g_dbus_worker_flush_sync: always flush if we need to

We didn't previously flush in a couple of cases where we should have
done:
   
* a write is running when flush is called: we should flush after it
  finishes
 
* writes have been made since the last flush, but none are pending or
  running right now: we should flush the underlying transport straight
  away

Comment 8 Simon McVittie 2011-11-01 18:24:53 UTC

Created attachment 200435 [details] [review]
[6/7] DBus tests: factor out TestIOStream, test_pipe and  test_bidi_pipe

These might even make useful public API if they grew a Windows
implementation, but for now they can be Unix-only test API.

Comment 9 Simon McVittie 2011-11-01 18:26:48 UTC

Created attachment 200436 [details] [review]
[7/7] Add test case for #662395

---

Confirmed to fail (with Attachment #200435 [details] applied first) on glib-2-30, but succeed with the other attachments from this bug applied first.

These patches are all relative to glib-2-30 - I'll check whether they apply cleanly to master next. If this bugfix is too intrusive for glib-2-30, please say.

Comment 10 Simon McVittie 2011-11-01 18:34:07 UTC

Review of attachment 200430 [details] [review]:

::: gio/gdbusprivate.c
@@ +1285,3 @@
   g_mutex_lock (data->worker->write_lock);
+  g_assert (data->worker->output_pending == PENDING_WRITE);
+  data->worker->output_pending = PENDING_NONE;

This should still release the lock; there must be a compensating error in a subsequent patch. I'll redo this one.

Comment 11 Simon McVittie 2011-11-01 19:16:01 UTC

Created attachment 200438 [details] [review]
[1/7 v2] GDBusWorker: distinguish between 3 sorts of output that might be pending

If the user calls flush_sync() with no messages in the queue, but an
async write call pending, then we ought to flush after that async write
returns (although we don't currently do that). If it was an async close
or flush that was pending, there's no need to flush (again) afterwards.
So, we need to distinguish.

---

Now without the too-early removal of an unlock.

Comment 12 Simon McVittie 2011-11-01 19:17:19 UTC

Created attachment 200439 [details] [review]
[4/7 v2] GDBusWorker: move flush async op into continue_writing()

This makes it easier to schedule a flush, by putting it on the same code
path as writing and closing.

Also change message_written to expect the lock to be held, since all
that's left in that function either wants to hold the lock or doesn't
care, and it's silly to release the lock immediately before calling
message_written, which just takes it again.

---

v2 removes the unlock that shouldn't have been removed as early as [1/7].

Comment 13 Simon McVittie 2011-11-01 19:18:03 UTC

These also apply to master and pass tests there, with a lot of conflicts caused by g_mutex_lock (x) -> g_mutex_lock (&x) noise.

Comment 14 David Zeuthen (not reading bugmail) 2011-11-14 15:42:38 UTC

Sorry for being slow. I don't have time to review the patches in detail but I trust that they work. Thanks for also writing test cases - it is much appreciated. Please push to both master and glib-2-30. Thanks.

Comment 15 Simon McVittie 2011-11-16 16:29:51 UTC

Thanks. Cosimo Alfarano (KA on IRC) is doing an unofficial review of this to make sure I haven't made any implementation mistakes; assuming I haven't, I'll merge when he's finished that.

Comment 16 Simon McVittie 2011-11-21 18:31:14 UTC

Thanks, fixed in master as a795e563dfec2..4bb411948c9c for 2.31.1, 2.30 as 439f3b8e540a..d09677881b7e463 for 2.30.3.

Comment 17 Matthias Clasen 2011-11-23 06:17:02 UTC

After these commits, I am seeing various gdbus testcases segfault, e.g.

$ ./gdbus-connection
/gdbus/connection/basic: OK
/gdbus/connection/life-cycle: Segmentation fault


git bisect points at your patch no. 4

Comment 18 Simon McVittie 2011-11-23 10:38:50 UTC

(In reply to comment #17)
> After these commits, I am seeing various gdbus testcases segfault

I didn't see this when I ran the tests myself. Could you give me a backtrace for all threads, please?

Is this on master, 2.30 or both?

Which tests segfault and in what proportion of runs?

Comment 19 Simon McVittie 2011-11-23 11:05:14 UTC

This might be the same as Bug #664617, which has a traceback.

Comment 20 Simon McVittie 2011-11-23 11:26:04 UTC

Please try my patch from Bug #664617, which fixes an uninitialized variable that could cause a segfault (although it doesn't do that for me).

I notice the reporter of that bug is on 32-bit - are you? Perhaps I'm not seeing the segfault because I'm using x86-64? I'll try building an i386 GLib.

Comment 21 Matthias Clasen 2011-11-23 11:47:28 UTC

This patch fixes the segfaults that I've seen.
But no, this is 4-way 64bit machine.

Comment 22 Simon McVittie 2011-11-23 12:19:12 UTC

Should be fixed in a8ee10cc7 (2.30.3), 968ef5f5f (2.31.3).