GNOME Bugzilla – Bug 578295
gtester has a race condition
Last modified: 2010-08-08 00:02:44 UTC
every now and then gtester will randomly fail to exit. expected output would be something like

--------------------
TEST: 1bit-mutex... (pid=27615)
/glib/1bit-mutex: OK
PASS: 1bit-mutex
--------------------

but you see this instead:

--------------------
TEST: 1bit-mutex... (pid=27750)
/glib/1bit-mutex: OK
--------------------

and it hangs forever. 'ps f' shows:

27732 pts/4 S+ 0:00 [snip] \_ make check
27733 pts/4 S+ 0:00 [snip] \_ make check-local
27734 pts/4 S+ 0:00 [snip] \_ /bin/bash -c test -z "1bit-mutex" || ../../glib/gtester --verbose 1bit-mutex
27735 pts/4 S+ 0:00 [snip] \_ /home/desrt/code/glib/glib-2.20.1/_build/glib/.libs/lt-gtester --verbose 1bit-mutex
27750 pts/4 Z+ 0:07 [snip] \_ [lt-1bit-mutex] <defunct>

so clearly, the lt-1bit-mutex process has quit and the gtester process missed the SIGCHLD. here's a backtrace of what gtester is doing:
+ Trace 214264
notice that it's watching no fds at all. that's just not cool. there's no way that this process could possibly wake up except by receiving a signal. if the SIGCHLD arrives -just- before the poll syscall is made, the handler runs and returns, and then poll() blocks anyway: the signal is already spent, so nothing is left to wake the poll, and there is no way to back out of the call either (by the time the signal handler returns, control may already have been handed to the libc poll() call).
one more note, though: i decided that maybe i could wake the process by resizing the terminal window (SIGWINCH), but that's actually not enough to wake it anymore either. so it's not merely a case of the SIGCHLD handler failing to wake up the poll(), but a matter of the SIGCHLD handler not running at all....
Sounds like a generic race condition, probably not hit very often outside of things like gtester because most people are using the glib main loop which will have other sources set up. signalfd to the rescue? http://www.kernel.org/doc/man-pages/online/pages/man2/signalfd.2.html
Looks similar to bug 572861
Can't we just launch tests synchronously and avoid this issue altogether? The race condition may still exist but that would avoid it in gtester and I can't see why it needs to launch tests asynchronously. With respect to the race condition, there's bug 398418 tracking it. Let's keep this bug open in case we want to switch to synchronous forks. If not, we can just dup it.
Bug 572861 and this one are the same, but I can't mark bugs as duplicates...
After reading bug 572861, I thought about another workaround, which works: call g_thread_init(NULL) in gtester's main function. However that requires gtester to link to libgthread, which seems to need changes in the build system (maybe move gtester from glib/ to a new tools/), but that's probably too much for a workaround :)
*** Bug 572861 has been marked as a duplicate of this bug. ***
*** Bug 602782 has been marked as a duplicate of this bug. ***
The link-to-gthread solution is the easiest, but unfortunately, it's vaguely impossible. We need glib/ to be compiled before gtester builds, but we also need gtester to be built before entering glib/tests/. Therefore gtester needs to be in glib/. We need glib/ to build gthread/, so glib/ has to come before it. So unless we have a two-pass build system, or do some very serious shake-ups, we can't link gtester to libgthread. I think the best way to fix this is to enable threads without enabling threads, i.e. get the mainloop to think that we've switched threads on, but without requiring linking to libgthread.
As I mentioned in bug 572861, adding a dummy timeout in gtester.c main avoids this, and doesn't require threads. I don't quite understand why it works though. Why is the child watch not working the same way as the main context wake_up_pipe (the SIGCHLD handler always writing to a pipe, waking up the main context) but instead does odd-looking special casing between single and multiple threads?
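One plausible reading of why a dummy timeout helps, sketched in plain C rather than with the GLib main loop: a timeout source gives the poll a finite timeout, so even a completely missed SIGCHLD only delays the next check for the child instead of blocking it forever. This is an assumption about the mechanism, not a description of the GLib internals, and `wait_child_with_poll_timeout` is a hypothetical helper.

```c
#include <assert.h>
#include <errno.h>
#include <poll.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Fork a child that exits with exit_code and wait for it by re-checking
 * after every poll() timeout. Returns the exit status, or -1 on error. */
static int wait_child_with_poll_timeout(int exit_code, int timeout_ms)
{
    pid_t pid, reaped;
    int status = 0;

    /* A negative timeout would mean "block forever", i.e. the hang
     * reported above. */
    assert(timeout_ms >= 0);

    pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0)
        _exit(exit_code);

    for (;;) {
        reaped = waitpid(pid, &status, WNOHANG);
        if (reaped == pid)
            break;
        if (reaped < 0 && errno != EINTR)
            return -1;
        /* No fds are watched, just like the hung gtester, but the
         * finite timeout guarantees the loop comes back around. */
        (void) poll(NULL, 0, timeout_ms);
    }

    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

With a 50 ms timeout, a lost signal costs at most about 50 ms of extra latency. The price is a periodic wake-up that does nothing most of the time, which is why this is a workaround rather than a proper fix.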
Tommi: this is my plan for now. The problem with the wake-up pipe is that it is only enabled in threaded situations.
Okay. Did that. Of course, we should fix this properly. See bug #398418