GNOME Bugzilla – Bug 583511
race condition means g_main_loop_quit() does not work to exit loop
Last modified: 2011-09-10 02:32:49 UTC
Please describe the problem: If one thread calls g_main_loop_run(), and another calls g_main_loop_quit(), it is possible that g_main_loop_run() will never return, being stuck in poll(). Steps to reproduce: There is a small window of opportunity for glib functions g_main_loop_quit and g_main_loop_wakeup to fail to function. ** Steps To Reproduce: NOTE: it takes 36 hours to reproduce this problem using stress testing, but the addition of a sleep() to the glib source makes it happen every time. 1) In g_main_context_iterate(), put a sleep(10); statement between UNLOCK_CONTEXT, and g_main_context_prepare(). 2) In one thread, do g_main_loop_run(), In a second thread, do g_main_loop_quit() such that it's called 5 seconds later (after g_main_loop_run() was called in the first thread.) The variable poll_waiting will be false when g_main_loop_quit / g_main_loop_wakeup executes and therefore no write() will be done to the wake_up_pipe. When the main loop finally reaches poll(), it will be stuck forever. Actual results: Unless there are other sources, the poll() in the main loop never returns in spite of g_main_loop_quit() having been called. The app is hung. Expected results: g_main_loop_quit() always causes the main loop to exit. Does this happen every time? yes (provided the sleep() call is added to exacerbate the race condition.) Other information:
Isn't it a programmer error to call g_main_loop_quit without having ensured that g_main_loop_run was already called?
g_main_loop_run() has been called; it hasn't necessarily returned yet, but it's been called in this scenario. Since it's being called from a different thread than is calling g_main_loop_quit() it's hard to guarantee the order. If there's to be synchronization required, does it belong in glib or in calling apps?
Just to be clear, in the scenario that demonstrates this bug Thread_A calls g_main_loop_run() *before* Thread_B calls g_main_loop_quit(). Thread_A is inside g_main_context_iterate() and has not yet reached the poll() call; instead it's about to call g_main_context_prepare() when it gets swapped out and Thread_B starts executing inside g_main_loop_quit(). This is a race condition that you won't see unless you call g_main_loop_quit() fairly soon after another thread calls g_main_loop_run(). It's not something the programmer can fix.
(Note I'm not a glib developer, just an observer) I see, that sounds plausible. When I write threaded code interacting with a mainloop, pretty much the only function I use is g_idle_add (or the more lowlevel g_source_attach that it wraps) from threads other than the one running the loop. In this case a possible workaround would be to add an idle to the mainloop, and inside that idle call g_main_loop_quit. But it would be good to fix this bug. I wonder if removing the guard on poll_waiting for the write to wake_up_pipe would work; possibly would result in multiple wakeups in some scenarios?
Note that g_idle_add() also will sometimes fail because g_idle_add invokes g_source_attach(), which then calls g_main_context_wakeup(). Any code that calls g_main_context_wakeup() may fail in a multi-threaded applications. This is why g_main_loop_quit() also fails.
There's no race condition with the idle - the wakeup isn't needed if we haven't run the source prepare functions yet. I think you are right about the possibility of a race condition in the case of quit(). Probably the easiest way to handle it would be to get rid of poll_waiting. (Some discussion at the end of bug 320888.) There are also possibilities for making poll_waiting a count/serial instead of a boolean. If poll_waiting is eliminated, care must be taken that adding a source (say a timeout) *from the main thread* does not cause the next poll to wake up immediately. (Probably OK to require that g_main_context_acquire() is called *before any loop where you do while (!condition) g_main_context_iterate()))
*** Bug 632301 has been marked as a duplicate of this bug. ***
Bug 320888 has a patch that should fix this issue too.