GNOME Bugzilla – Bug 136867
child-test is still hanging
Last modified: 2011-02-18 16:09:12 UTC
Much as described in http://bugzilla.gnome.org/show_bug.cgi?id=136539, child-test hangs. However, it's not really the same thing here. When running tests, it's only the first child that exits:

adbjsjn@lilith:glib-2.3.6/tests>./child-test
child 10295 (ttl 10) exited, status 0

A "ps" from another x-terminal:

adbjsjn 10282 10055 0 10:08:56 pts/2 0:00 ...glib/work/main.d/glib-2.
adbjsjn 10032 10014 0 10:01:32 pts/2 0:00 -sh
adbjsjn 10296 10282 0 10:08:56 pts/2 0:00 <defunct>
adbjsjn 10055 10032 0 10:01:33 pts/2 0:00 bash

From gdb:

adbjsjn@lilith:tests/.libs>gdb child-test
Detected 64-bit executable.
Invoking /opt/langtools/bin/gdb64.
HP gdb 3.2 for PA-RISC 2.0 (wide), HP-UX 11.00.
Copyright 1986 - 2001 Free Software Foundation, Inc.
Hewlett-Packard Wildebeest 3.2 (based on GDB) is covered by the
GNU General Public License. Type "show copying" to see the conditions to
change it and/or distribute copies. Type "show warranty" for warranty/support.
..
(gdb) r
Starting program:.../glib-2.3.6/tests/.libs/child-test
[New process 10206]
Detaching after fork from process 10206
[New process 10209]
Detaching after fork from process 10209
[New process 10210]
Detaching after fork from process 10210
child 10209 (ttl 10) exited, status 0

Program received signal SIGINT, Interrupt.
0x800003ffff5dcc74 in _poll_sys+0x2c () from /lib/pa20_64/libc.2
(gdb) Quit
(gdb)

/usr/local/pa64/bin/gcc -v
Reading specs from /usr/local/pa64/lib/gcc-lib/hppa64-hp-hpux11.11/3.3.2/specs
Configured with: /scratch/root/gcc-pkg/3.3.1/hpux-11/gcc-3.3.2/configure --enable-languages=c,c++ --enable-threads=posix --disable-nls --with-gnu-as --with-gnu-ld --with-as=/usr/local/pa64/bin/as --with-ld=/usr/local/pa64/bin/ld --host=hppa64-hp-hpux11.11 --target=hppa64-hp-hpux11.11 --prefix=/usr/local/pa64
Thread model: posix
gcc version 3.3.2
And glib is 2.3.6 ...
The following patch is actually unrelated to this bug report, but it implements a new version of g_child_watch. Could you please try whether it works around your problem as well?

http://bugzilla.gnome.org/showattachment.cgi?attach_id=25511
It DOES work!

gmake[4]: Entering directory `/alcesys/build/garnome-0.30.1/platform/glib/work/main.d/glib-2.3.6/tests'
PASS: atomic-test
PASS: array-test
PASS: cxx-test
whee! created pid: 28126 (ttl 4)
whee! created pid: 28127 (ttl 2)
whee! created pid: 28129 (ttl 5)
whee! created pid: 28128 (ttl 3)
child 28127 (ttl 2) exited, status 0
child 28128 (ttl 3) exited, status 0
child 28126 (ttl 4) exited, status 0
child 28129 (ttl 5) exited, status 0
whee! created pid: 28130 (ttl 2)
whee! created pid: 28131 (ttl 6)
whee! created pid: 28132 (ttl 4)
child 28130 (ttl 2) exited, status 0
child 28132 (ttl 4) exited, status 0
child 28131 (ttl 6) exited, status 0
whee! created pid: 28134 (ttl 2)
whee! created pid: 28133 (ttl 2)
child 28133 (ttl 2) exited, status 0
child 28134 (ttl 2) exited, status 0
PASS: child-test
I have some reservations about Sebastian's new approach, so I'd like to try to get debugging on what is going on here; this isn't a problem with threading, since in 2.3.6, child-test isn't threaded.

Can you add some debugging into:

  g_child_watch_signal_handler()

At the beginning, add:

  write (2, "SIG\n", 4);

And in

  g_child_watch_prepare
  g_child_watch_check
  g_child_watch_dispatch

add

  g_printerr ("prepare: Checking for %d, counts = %d\n",
              ((GChildWatchSource *)source)->pid,
              ((GChildWatchSource *)source)->count,
              child_watch_count,

(Same for check and dispatch, but with check:, dispatch: instead)

And see what that logs? I suspect we have some simple logic error in the code, but I can't figure it out offhand.
> And in
>
>   g_child_watch_prepare
>   g_child_watch_check
>   g_child_watch_dispatch
>
> add
>
>   g_printerr ("prepare: Checking for %d, counts = %d\n",
>               ((GChildWatchSource *)source)->pid,
>               ((GChildWatchSource *)source)->count,
>               child_watch_count,
>
> (Same for check and dispatch, but with check:, dispatch: instead)

No, I couldn't add this. GChildWatchSource doesn't have a count attribute (at least in my glib-2.3.6 tar-ball) and 'child_watch_count' is totally unknown to my compiler. Should it be a gint? Where should I initialize the variable, and where should it be increased/decreased? Should there be something after child_watch_count in the g_printerr() call? I assumed not, but will g_printerr() handle the extra argument when there's only formatting for two of them?

Would really like to get this going on HP-UX, but I'm afraid I can't put in the amount of time needed; I might be able to put in half an hour here, half an hour there .....
Now with glib-2.4.0

So, I've tried this with HP's Ansi C-compiler, same result with original code.

what /usr/bin/cc
/usr/bin/cc:
  $Revision: 92453-07 linker linker crt0.o B.11.16.01 030316 $
  LINT B.11.11.08 CXREF B.11.11.08
  HP92453-01 B.11.11.08 HP C Compiler
  $ PATCH/11.00:PHCO_27774 Oct 3 2002 09:45:59 $

CFLAGS = -Ae +DA2.0W -g

When running the program in gdb, I get this output:

tests/.libs>gdb64 child-test
HP gdb 3.2 for PA-RISC 2.0 (wide), HP-UX 11.00.
Copyright 1986 - 2001 Free Software Foundation, Inc.
Hewlett-Packard Wildebeest 3.2 (based on GDB) is covered by the
GNU General Public License. Type "show copying" to see the conditions to
change it and/or distribute copies. Type "show warranty" for warranty/support.
..
(gdb) r
Starting program: /alcesys/build/garnome-0.30.1/platform/glib/work/main.d/glib-2.4.0/tests/.libs/child-test
[New process 19453]
Detaching after fork from process 19453
[New process 19456]
warning: reading `r3' register: No data
warning: reading `r3' register: No data
Detaching after fork from process 19456
[New process 19457]
warning: reading `r3' register: No data
warning: reading `r3' register: No data
Detaching after fork from process 19457
child 19456 (ttl 10) exited, status 0

Program received signal SIGINT, Interrupt.
0x800003ffff5cac74 in _poll_sys+0x2c () from /lib/pa20_64/libc.2
(gdb) kill
Kill the program being debugged? (y or n) y
(gdb) quit
Created attachment 25733 [details] [review] Debugging patch
I've attached a patch that adds the debugging output as described above. Could you apply this patch and then run child-test (not under gdb; the gdb output is confusing rather than helpful here)?

For comparison, I (on Linux) get:

===
prepare: Checking pid 5413, counts = 0/0
prepare: Checking pid 5414, counts = 0/0
SIG
check: Checking pid 5413, counts = 0/1
check: Checking pid 5414, counts = 0/1
child 5413 (ttl 10) exited, status 0
prepare: Checking pid 5414, counts = 1/1
SIG
check: Checking pid 5414, counts = 1/2
child 5414 (ttl 20) exited, status 0
===
prepare: Checking pid 12255, counts = 0/0
prepare: Checking pid 12256, counts = 0/0
SIG
check: Checking pid 12255, counts = 0/1
check: Checking pid 12256, counts = 0/1
child 12255 (ttl 10) exited, status 0
prepare: Checking pid 12256, counts = 1/1
gmake[4]: *** [check-TESTS] Error 130

Looking with a ps -fu ME from another terminal shows one process (12256) as <defunct>. After the third prepare, nothing happens .... on HP-UX.

This is compiled as from my initial report (gcc). Unless there's any change in result, I won't report the result from HP-UX Ansi C ...
Could you add a line to tests/child-test.c - after sleep(ttl) add: g_printerr ("Exiting, ttl=%d pid=%d\n", ttl, getpid()); And try it again?
./child-test
prepare: Checking pid 10846, counts = 0/0
prepare: Checking pid 10847, counts = 0/0
Exiting ..., ttl=10, pid=10846
SIG
check: Checking pid 10846, counts = 0/1
check: Checking pid 10847, counts = 0/1
child 10846 (ttl 10) exited, status 0
prepare: Checking pid 10847, counts = 1/1
Exiting ..., ttl=20, pid=10847
After some investigation, it looks like this is a very old BSD / SysV compatibility issue: SysV resets handlers installed with signal() after they are called, BSD (and current Linux) doesn't.

Could you, to test this theory, put:

  signal (SIGCHLD, g_child_watch_signal_handler);

as the very last line of g_child_watch_signal_handler() and see if that fixes the problem?

I think the best long-term solution is probably to switch to using sigaction() rather than signal() to install the signal handler.
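For reference, here is a minimal sketch of the sigaction() approach, written as a standalone fragment rather than as GLib code. The names on_sigchld, install_sigchld_handler and spawn_and_reap are hypothetical and do not come from GLib; the sketch only assumes a POSIX system. A handler installed with sigaction() without SA_RESETHAND stays installed after each delivery, so the SysV reset behaviour of signal() does not apply.

```c
#define _POSIX_C_SOURCE 200809L

#include <assert.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

/* Hypothetical counter, bumped once per SIGCHLD delivery. */
static volatile sig_atomic_t sigchld_seen = 0;

static void on_sigchld(int signum)
{
    (void) signum;
    sigchld_seen++;
}

/* Install the handler with sigaction(); it is not reset after delivery. */
int install_sigchld_handler(void)
{
    struct sigaction action;

    action.sa_handler = on_sigchld;
    sigemptyset(&action.sa_mask);
    action.sa_flags = SA_RESTART | SA_NOCLDSTOP;
    return sigaction(SIGCHLD, &action, NULL);
}

/* Fork n short-lived children and reap all of them; returns the number reaped. */
int spawn_and_reap(int n)
{
    sigset_t block, old, wait_mask;
    int started = 0, reaped = 0, i;

    assert(n >= 0);
    if (install_sigchld_handler() != 0)
        return -1;

    /* Keep SIGCHLD blocked except while inside sigsuspend(), so no wakeup is lost. */
    sigemptyset(&block);
    sigaddset(&block, SIGCHLD);
    sigprocmask(SIG_BLOCK, &block, &old);
    wait_mask = old;
    sigdelset(&wait_mask, SIGCHLD);

    for (i = 0; i < n; i++) {
        pid_t pid = fork();
        if (pid == 0) {
            /* Child: exit after a short, staggered delay. */
            struct timespec nap = { 0, (long) (i % 10 + 1) * 50000000L };
            nanosleep(&nap, NULL);
            _exit(0);
        }
        if (pid > 0)
            started++;
    }

    while (reaped < started) {
        pid_t pid;

        /* One SIGCHLD may stand for several exits, so drain them all. */
        while ((pid = waitpid(-1, NULL, WNOHANG)) > 0)
            reaped++;
        if (pid < 0)
            break;
        if (reaped < started)
            sigsuspend(&wait_mask);
    }

    sigprocmask(SIG_SETMASK, &old, NULL);
    return reaped;
}
```

Calling spawn_and_reap(3) from a test program should reap all three children on both BSD-style and SysV-style systems, because the handler never reverts to the default disposition.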
Created attachment 28285 [details]
Program that doesn't work on HP-UX

Yes, it seems to be the case here. The attached program runs fine on Linux (RH) and cygwin but fails on HP-UX (11.23). Very soon (probably next week), I'll have a machine again, and then this problem *WILL* be solved :).
This is happening on Solaris also. There, gnome-terminal does not exit because SIGCHLD is emitted only once. So only one window gets closed when we say 'exit' and the rest all just hang.

There are two possible solutions for this; either of them fixes this bug. One solution is to use sigaction instead of signal. Here is the change:

 g_child_watch_source_init_multi_threaded (void)
 {
   GError *error = NULL;
+  struct sigaction action;

   g_assert (g_thread_supported());

@@ -3630,7 +3631,10 @@ g_child_watch_source_init_multi_threaded
   if (g_thread_create (child_watch_helper_thread, NULL, FALSE, &error) == NULL)
     g_error ("Cannot create a thread to monitor child exit status: %s\n", error->message);
   child_watch_init_state = CHILD_WATCH_INITIALIZED_THREADED;
-  signal (SIGCHLD, g_child_watch_signal_handler);
+  action.sa_handler = g_child_watch_signal_handler ;
+  sigemptyset (&action.sa_mask);
+  action.sa_flags = SA_RESTART | SA_NOCLDSTOP;
+  sigaction (SIGCHLD, &action, NULL);
 }

The other solution is to re-install the signal handler every time the signal is caught. For this, the change that has to be made is:

@@ -3551,6 +3551,8 @@ g_child_watch_signal_handler (int signum
 {
   child_watch_count ++;

+  signal (SIGCHLD, g_child_watch_signal_handler);
+
   if (child_watch_init_state == CHILD_WATCH_INITIALIZED_THREADED)
     {
       write (child_watch_wake_up_pipe[1], "B", 1);
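The second option, re-installing the handler on every delivery, can be illustrated in isolation with the sketch below. It is not GLib code: rearming_handler and deliver_twice_with_rearm are hypothetical names, and SIGUSR1 with raise() stands in for SIGCHLD so that the behaviour is deterministic. On a SysV-style signal(), there is still a short window between the reset and the re-arm during which the default action applies; for SIGCHLD the default is to ignore the signal, so a child exiting in that window could go unnoticed. That window is one reason to prefer sigaction().

```c
#include <assert.h>
#include <signal.h>

/* Hypothetical counter of handled deliveries. */
static volatile sig_atomic_t deliveries = 0;

static void rearming_handler(int signum)
{
    /* Re-install first, to keep the unprotected window as short as possible. */
    signal(signum, rearming_handler);
    deliveries++;
}

/* Deliver SIGUSR1 twice; returns how many deliveries the handler saw. */
int deliver_twice_with_rearm(void)
{
    void (*previous)(int);

    deliveries = 0;
    previous = signal(SIGUSR1, rearming_handler);
    assert(previous != SIG_ERR);
    (void) previous;

    /* Without the re-arm, the second raise() would terminate the
       process on systems where signal() resets the handler. */
    raise(SIGUSR1);
    raise(SIGUSR1);
    return (int) deliveries;
}
```

With the re-arm in place the function is expected to return 2 under either signal() semantics.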
I am attaching both the patches here
Created attachment 30273 [details] [review] Patch with signal () replaced by sigaction ()
Created attachment 30274 [details] [review] Patch which reinstalls signal using signal () call
The "sigaction" patch fixes the problem for gnome-terminal in GNOME 2.6, glib updated to 2.4.7. I didn't try the other patch. The child-test still hangs though. Testing on Solaris 9/SPARC.
I changed glib to use sigaction now. Please reopen if there are still issues.
*** Bug 145597 has been marked as a duplicate of this bug. ***