GNOME Bugzilla – Bug 154827
g_child_watch_add doesn't work if target process has already become zombie
Last modified: 2011-02-18 16:09:33 UTC
g_child_watch_add doesn't work if the child process being monitored terminates too quickly. Or maybe this could be 'if the process no longer exists'. I don't know exactly in which order the events occur... I have a test case ;-)
Created attachment 32368: test case
The test case demonstrates what happens: sometimes the notification occurs, sometimes it doesn't. If I change the program to {"/bin/sleep", "1", NULL}, then it always works.
My tests in pygtk are telling me that the problem is not directly the speed of termination of the child process. I did a test (sorry, in pygtk, no C code) where the child sleeps 1 second before quitting, and the parent waits 2 seconds before calling g_child_watch_add. In this case the callback is never called. Conclusion: g_child_watch_add doesn't work if the target process no longer exists. It is imperative that we fix this, otherwise the API is almost useless.
Without looking at the glib code, I think the following pseudo-code should solve the problem without race conditions:

g_child_watch_add(pid, cb, data):
  1. src = setup_child_watch_notifier(pid, cb, data)
  2. pid1 = waitpid(pid, NULL, WNOHANG)
  3. if (pid1 == pid):  /* child exited */
       destroy_child_watch_notifier(src)
       cb(data)
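Steps 2 and 3 of the pseudo-code can be sketched in plain POSIX C, without GLib. This is only an illustration under stated assumptions: child_exit_cb, record_status, check_child_now and demo_late_check are hypothetical names, not GLib API, and the demo uses waitid() with WNOWAIT just to make sure the child is already a zombie before the check runs.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical callback type; this is not GLib's GChildWatchFunc. */
typedef void (*child_exit_cb)(pid_t pid, int status, void *data);

/* Example callback: stores the raw wait status in *data. */
static void record_status(pid_t pid, int status, void *data)
{
    (void)pid;
    *(int *)data = status;
}

/*
 * Hypothetical stand-in for steps 2 and 3 of the pseudo-code.
 * Returns 1 if the child had already exited (cb was called),
 * 0 if it is still running, -1 on error.
 */
static int check_child_now(pid_t pid, child_exit_cb cb, void *data)
{
    int status = 0;
    pid_t r;

    assert(cb != NULL);
    do {
        r = waitpid(pid, &status, WNOHANG);   /* never blocks */
    } while (r == -1 && errno == EINTR);

    if (r == pid) {
        cb(pid, status, data);
        return 1;
    }
    return (r == 0) ? 0 : -1;
}

/*
 * Demo: fork a child that exits with `code`, let it become a zombie,
 * and only then run the check. Returns the exit code seen by the
 * callback, or -1 on failure.
 */
static int demo_late_check(int code)
{
    siginfo_t info;
    int status = -1;
    pid_t pid = fork();

    if (pid < 0)
        return -1;
    if (pid == 0)
        _exit(code);

    /* Block until the child has exited, but leave it as a zombie. */
    if (waitid(P_PID, (id_t)pid, &info, WEXITED | WNOWAIT) != 0)
        return -1;

    if (check_child_now(pid, record_status, &status) != 1)
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

Calling demo_late_check(3) from a main() should report the child's exit code even though the check starts only after the child has terminated, which is the situation the test case hits.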
I don't understand why the child no longer exists in your example ... with DO_NOT_REAP_CHILD you should get a zombie process until the child watch waits for it. You *CANNOT* reliably wait for a process that no longer exists, because the PID may have been reused for a different process. Plus you can no longer get the exit status. I don't think we should try to make GChildWatch work for the case where the child has exited and been reaped.
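The difference between a zombie and a reaped child can be seen with plain waitpid(). The sketch below is not GLib code; wait_twice is a hypothetical helper, and it assumes SIGCHLD has its default disposition so that the exited child stays around as a zombie. The first wait succeeds and yields the exit status; the second one fails because the process is gone.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Fork a child that exits with `code`, then wait for it twice.
 * *first_code receives the exit code from the first wait, which
 * reaps the zombie. *second_errno receives the errno of the second
 * wait, which is expected to fail.
 * Returns 0 if the demo ran as expected, -1 otherwise.
 */
static int wait_twice(int code, int *first_code, int *second_errno)
{
    int status = 0;
    pid_t pid;

    assert(first_code != NULL && second_errno != NULL);

    pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0)
        _exit(code);

    /* First wait: blocks until the child exits, then reaps it. */
    if (waitpid(pid, &status, 0) != pid || !WIFEXITED(status))
        return -1;
    *first_code = WEXITSTATUS(status);

    /* Second wait: the child was already reaped, so there is
     * nothing left to wait for and no status to collect. */
    errno = 0;
    if (waitpid(pid, &status, WNOHANG) != -1)
        return -1;
    *second_errno = errno;
    return 0;
}
```

With default signal handling the second call should fail with ECHILD, which is why a watch added after the child has been reaped has nothing to report.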
I did a new test, where the child sleeps 1 second before quitting, and the parent waits 10 seconds before calling g_child_watch_add. I did a "ps x" and saw the child zombie process. The child notification callback is never called.
Created attachment 33252: Patch to fix test case in unix
This fixes the problem, on unix. One thing bothers me, though. Notice the commented code:

+/* if (g_child_watch_check (source)) */
+/* g_message("Child %i exited", pid); */

In principle, it should be enough to call g_child_watch_check to fix the problem. However, that function has some weird child_watch_count guard that never allows the function to do its work. Perhaps that is the root of the problem...
2004-11-08  Matthias Clasen  <mclasen@redhat.com>

	* glib/gmain.c: Initialize child_watch_count to 1, so that
	we don't miss the very first child if it exits before we set
	up the child watch. In that case we had previously
	source->count == child_watch_count == 0, causing
	g_child_watch_check() to skip the waitpid() call.
	(#154827, Gustavo Carneiro)

	* glib/gmain.c (g_child_watch_source_init_single)
	(g_child_watch_source_init_multi_threaded): Use sigaction()
	instead of signal(). (#136867, Jonas Jonsson, patch by
	Archana Shah)
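The counter guard described in the ChangeLog entry can be illustrated with a small model. This is a simplified sketch based only on that description, not the actual gmain.c code: struct watch, sigchld_count, watch_check and first_check_sees_exit are hypothetical names, and the counter is set by hand instead of being bumped by a SIGCHLD handler.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stddef.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Simplified model of a child watch; not GLib's real structure. */
struct watch {
    pid_t pid;
    int   count;    /* value of sigchld_count at the last poll */
    int   exited;
    int   status;
};

/* Assumption: stands in for the counter a SIGCHLD handler would
 * increment. Here it is only ever set by hand. */
static int sigchld_count;

/* Poll the child only if the counter advanced since the last poll. */
static int watch_check(struct watch *w)
{
    int count = sigchld_count;

    assert(w != NULL);
    if (!w->exited && w->count < count) {
        if (waitpid(w->pid, &w->status, WNOHANG) > 0)
            w->exited = 1;
        w->count = count;
    }
    return w->exited;
}

/*
 * The child exits before the watch exists, so nothing bumps the
 * counter. Returns 1 if the first check notices the exit, 0 if it
 * does not, -1 on setup failure.
 */
static int first_check_sees_exit(int initial_count)
{
    siginfo_t info;
    struct watch w = {0};
    int seen;

    w.pid = fork();
    if (w.pid < 0)
        return -1;
    if (w.pid == 0)
        _exit(0);

    /* Wait until the child is a zombie, without reaping it. */
    if (waitid(P_PID, (id_t)w.pid, &info, WEXITED | WNOWAIT) != 0)
        return -1;

    sigchld_count = initial_count;
    seen = watch_check(&w);
    if (!seen)
        waitpid(w.pid, NULL, 0);   /* reap the zombie the check missed */
    return seen;
}
```

In this model, starting the counter at 0 makes the guard skip waitpid() for a fresh watch whose own count is also 0, while starting it at 1 lets the first check through, which matches the reasoning given in the ChangeLog.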