GNOME Bugzilla – Bug 662395
_g_dbus_worker_flush_sync doesn't always flush when it should
Last modified: 2011-11-23 12:19:12 UTC
I believe _g_dbus_worker_flush_sync doesn't always flush when it should, although I still need to write a regression test to prove whether this is the case. I think the two problem situations are: * Send one or more messages, wait for them to all be sent, then flush. The transport should be flushed, but isn't, and flush_sync returns immediately. * Send a message, then flush after the async write has started, but before it finishes. If no other messages are pending, the transport won't be flushed, and flush_sync will return immediately even though that message is still in-flight. I have some patches which I believe should fix this; I'll work on a regression test next week. The first problem situation should be easy to reproduce with a transport with a large buffer that is only written when you flush, and the second should be easy to reproduce with a transport where writing takes a long time.
Ah, yeah, I can see how that's wrong (at least the first case).
(In reply to comment #0) > * Send one or more messages, wait for them to all be sent, then flush. > The transport should be flushed, but isn't, and flush_sync returns > immediately. This is /gdbus/connection/flush/idle in the test I'm about to attach. > * Send a message, then flush after the async write has started, > but before it finishes. If no other messages are pending, the > transport won't be flushed, and flush_sync will return immediately > even though that message is still in-flight. This is /gdbus/connection/flush/busy in the test I'm about to attach.
Created attachment 200430 [details] [review] [1/7] GDBusWorker: distinguish between 3 sorts of output that might be pending If the user calls flush_sync() with no messages in the queue, but an async write call pending, then we ought to flush after that async write returns (although we don't currently do that). If it was an async close or flush that was pending, there's no need to flush (again) afterwards. So, we need to distinguish.
Created attachment 200431 [details] [review] [2/7] GDBusWorker: rename some functions maybe_write_next_message now also closes, and I'm about to make it consider whether to flush as well, so its name is increasingly inappropriate. Similarly, write_message_in_idle_cb is a wrapper around it which could do any of those things.
Created attachment 200432 [details] [review] [3/7] schedule_write_in_worker_thread: require caller to lock; rename accordingly When we use this function to schedule a flush, it'll be called with the lock held. Releasing and immediately re-taking the lock would be pointless.
Created attachment 200433 [details] [review] [4/7] GDBusWorker: move flush async op into continue_writing() This makes it easier to schedule a flush, by putting it on the same code path as writing and closing. Also change message_written to expect the lock to be held, since all that's left in that function either wants to hold the lock or doesn't care, and it's silly to release the lock immediately before calling message_written, which just takes it again.
Created attachment 200434 [details] [review] [5/7] _g_dbus_worker_flush_sync: always flush if we need to We didn't previously flush in a couple of cases where we should have done: * a write is running when flush is called: we should flush after it finishes * writes have been made since the last flush, but none are pending or running right now: we should flush the underlying transport straight away
Created attachment 200435 [details] [review] [6/7] DBus tests: factor out TestIOStream, test_pipe and test_bidi_pipe These might even make useful public API if they grew a Windows implementation, but for now they can be Unix-only test API.
Created attachment 200436 [details] [review] [7/7] Add test case for #662395 --- Confirmed to fail (with Attachment #200435 [details] applied first) on glib-2-30, but succeed with the other attachments from this bug applied first. These patches are all relative to glib-2-30 - I'll check whether they apply cleanly to master next. If this bugfix is too intrusive for glib-2-30, please say.
Review of attachment 200430 [details] [review]: ::: gio/gdbusprivate.c @@ +1285,3 @@ g_mutex_lock (data->worker->write_lock); + g_assert (data->worker->output_pending == PENDING_WRITE); + data->worker->output_pending = PENDING_NONE; This should still release the lock; there must be a compensating error in a subsequent patch. I'll redo this one.
Created attachment 200438 [details] [review] [1/7 v2] GDBusWorker: distinguish between 3 sorts of output that might be pending If the user calls flush_sync() with no messages in the queue, but an async write call pending, then we ought to flush after that async write returns (although we don't currently do that). If it was an async close or flush that was pending, there's no need to flush (again) afterwards. So, we need to distinguish. --- Now without the too-early removal of an unlock.
Created attachment 200439 [details] [review] [4/7 v2] GDBusWorker: move flush async op into continue_writing() This makes it easier to schedule a flush, by putting it on the same code path as writing and closing. Also change message_written to expect the lock to be held, since all that's left in that function either wants to hold the lock or doesn't care, and it's silly to release the lock immediately before calling message_written, which just takes it again. --- v2 removes the unlock that shouldn't have been removed as early as [1/7].
These also apply to master and pass tests there, with a lot of conflicts caused by g_mutex_lock (x) -> g_mutex_lock (&x) noise.
Sorry for being slow. I don't have time to review the patches in detail but I trust that they work. Thanks for also writing test cases - it is much appreciated. Please push to both master and glib-2-30. Thanks.
Thanks. Cosimo Alfarano (KA on IRC) is doing an unofficial review of this to make sure I haven't made any implementation mistakes; assuming I haven't, I'll merge when he's finished that.
Thanks, fixed in master as a795e563dfec2..4bb411948c9c for 2.31.1, 2.30 as 439f3b8e540a..d09677881b7e463 for 2.30.3.
After these commits, I am seeing various gdbus testcases segfault, e.g. $ ./gdbus-connection /gdbus/connection/basic: OK /gdbus/connection/life-cycle: Segmentation fault git bisect points at your patch no. 4
(In reply to comment #17) > After these commits, I am seeing various gdbus testcases segfault I didn't see this when I ran the tests myself. Could you give me a backtrace for all threads, please? Is this on master, 2.30 or both? Which tests segfault and in what proportion of runs?
This might be the same as Bug #664617, which has a traceback.
Please try my patch from Bug #664617, which fixes an uninitialized variable that could cause a segfault (although it doesn't do that for me). I notice the reporter of that bug is on 32-bit - are you? Perhaps I'm not seeing the segfault because I'm using x86-64? I'll try building an i386 GLib.
This patch fixes the segfaults that I've seen. But no, this is 4-way 64bit machine.
Should be fixed in a8ee10cc7 (2.30.3), 968ef5f5f (2.31.3).