GNOME Bugzilla – Bug 723743
g_child_watch_add() doesn't check for non-pids
Last modified: 2017-10-12 11:23:32 UTC
While trying to construct a minimal test-case for the g_spawn_* weirdness in <http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=737380>, I discovered that g_child_watch_source_new (p) does not check whether the value of p is "a real PID", or a pseudo-PID like -1; and both g_child_watch_source_new() and dispatch_unix_signals() will call waitpid (source->pid, ., .) regardless. This means that a call to g_child_watch_source_new(), g_child_watch_add_full() or g_child_watch_add() with non-positive pid will result in the behaviour that dispatch_unix_signals() warns about: "Most likely the process is ignoring SIGCHLD, or some other thread is invoking waitpid() with a nonpositive first argument; either behavior can break applications that use g_spawn_sync either directly or indirectly." If use of those functions with pid <= 0 is going to cause problems, shouldn't they g_return_[val_]if_fail (pid > 0) on Unix platforms?
Unfortunately, I don't think this can necessarily explain what's happening to the GApplication test, unless there are bugs in the implementation of g_spawn_async_with_pipes() that are causing it to return with neither an error set, nor a valid child_pid filled in, because it does this: g_spawn_async_with_pipes (NULL, args, NULL, G_SPAWN_DO_NOT_REAP_CHILD, NULL, NULL, &pid, NULL, &data->stdout_pipe, NULL, &error); g_assert_no_error (error); g_child_watch_add (pid, child_quit, data); When I have a compiled GLib tree I'll try inserting a g_assert_cmpint (pid, >, 0); but the only machine where I can reproduce this is a 444 MHz mipsel, and compiling GLib there takes several hours. The test failure was > # Start of gapplication tests > ok 1 /gapplication/no-dbus > ok 2 /gapplication/basic > # GLib-FATAL-WARNING: In call to g_spawn_sync(), exit status of a > child process was requested but ECHILD was received by waitpid(). > Most likely the process is ignoring SIGCHLD, or some other thread is > invoking waitpid() with a nonpositive first argument; either behavior > can break applications that use g_spawn_sync either directly or > indirectly. > (/«PKGBUILDDIR»/debian/build/deb/gio/tests/.libs/lt-gapplication:10577): > GLib-WARNING **: In call to g_spawn_sync(), exit status of a child > process was requested but ECHILD was received by waitpid(). Most > likely the process is ignoring SIGCHLD, or some other thread is > invoking waitpid() with a nonpositive first argument; either behavior > can break applications that use g_spawn_sync either directly or > indirectly. and when I investigated it, I found that it was attempting D-Bus autolaunch (with g_spawn_sync()) in the GDBus thread, in parallel with the main thread executing the code I mentioned above. Bug #723506 is that I don't think it should be attempting D-Bus autolaunch without a $DISPLAY, which works around the test failure.
Patch happily accepted to add the non-negative check that you mention.
(In reply to comment #0) > "Most likely the process is ignoring SIGCHLD, or some other thread is > invoking waitpid() with a nonpositive first argument; either behavior > can break applications that use g_spawn_sync either directly or > indirectly." While looking into adding this I noticed that that message can also appear if you try to watch the same child twice. When dispatch_unix_signals_unlocked() gets to the most recently-added instance of that pid (because we prepend to the list), waitpid() will return that pid and all is fine. When it gets to the older instance, waitpid() fails with ECHILD and you get the warning. The documentation for g_child_watch_add() does say: GLib supports only a single callback per process id so perhaps that should be diagnosed as programmer error too?
(In reply to comment #0) > While trying to construct a minimal test-case for the g_spawn_* weirdness in > <http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=737380> I discovered when I accidentally scp'd the wrong version of gmain.c onto my test machine that this is in fact Bug #711090, which was fixed in master recently but hasn't made it to the stable-branch yet...
Created attachment 268326 [details] [review] g_child_watch_source_new: POSIX pid must be positive If we used a non-positive pid, we'd call waitpid(that_pid, ...) which is exactly the situation this function can't deal with. On Windows, GPid is a HANDLE (pointer), so I don't think the same thing applies.
Created attachment 268327 [details] [review] g_child_watch_new: enforce one-watch-per-pid --- I don't know whether this is going to break anything... but the code doesn't cope unless this assumption is satisfied, and it is documented.
Review of attachment 268326 [details] [review]: Can you add the checks to the convenience APIs as well? It seems like they will segfault if the source_new returns NULL....
Review of attachment 268327 [details] [review]: Not crazy about this due to the linear scan... and it will only catch the multiple-add in a case that termination of the child has not interceded... I wonder if we could as well ask the kernel "is this waitable?" without actually waiting in order to catch the full range of possible bad values to this function... I'm still not sure that we would want to do that. Did this actually bite you?
(In reply to comment #7) > Can you add the checks to the convenience APIs as well? It seems like they > will segfault if the source_new returns NULL.... I added an early-return in g_child_watch_add_full() too. g_child_watch_add() just calls g_child_watch_add_full() with an extra NULL argument, so there's nothing extra to do there. (In reply to comment #8) > Did this actually bite you? No; it was one theory for why a test might be failing on a slow autobuilder. I'd be fine with not checking this... but when there are these additional ways to trigger the g_warning() I quoted, the warning seems a bit misleading. Perhaps it would be better to collect all the things that will break child-watch sources into the g_child_watch_source_new() docs, and change the warning to point to that, maybe even by URL? If you don't like the linear scan, a map { pid => watch } (perhaps a GHashTable?) would make it obvious that there can be at most one watch per pid.
(In reply to comment #9) > Perhaps it would be better to collect all the things that will break > child-watch sources into the g_child_watch_source_new() docs, and change the > warning to point to that, maybe even by URL? Here's a sketch of that (untested, gtk-doc might not like my Markdown): /** * g_child_watch_source_new: ... * On POSIX platforms, the following restrictions apply to this API: * * * @pid must be a child of this process * * @pid must be positive * * the application must not call `waitpid` with a non-positive * first argument, for instance in another thread * * the application must not wait for @pid to exit by any other * mechanism, including `waitpid(pid, ...)` or a second child-watch * source for the same @pid * * the application must not ignore SIGCHILD * * If any of those are not followed, GLib will warn that `ECHILD` was * received by `waitpid`. ... * Since: 2.4 **/ /** * g_child_watch_add_full: ... * GLib supports only a single callback per process id. * On POSIX platforms, the same restrictions mentioned for * g_child_watch_source_new() apply to this function. ... */ /** * g_child_watch_add: ... * GLib supports only a single callback per process id. * On POSIX platforms, the same restrictions mentioned for * g_child_watch_source_new() apply to this function. ... */ else if (pid == -1 && errno == ECHILD) { g_warning ("GChildWatchSource: Exit status of a child process was requested but ECHILD was received by waitpid(). See the g_child_watch_source_new() documentation."); ... /** * g_spawn_sync: ... * If @exit_status is non-%NULL, the platform-specific exit status of * the child is stored there; see the documentation of * g_spawn_check_exit_status() for how to use and interpret this. * Note that it is invalid to pass %G_SPAWN_DO_NOT_REAP_CHILD in * @flags, and on POSIX platforms, the same restrictions as for * g_child_watch_source_new() apply. */ else if (errno == ECHILD) { if (exit_status) { g_warning ("In call to g_spawn_sync(), exit status of a child process was requested but ECHILD was received by waitpid(). See the g_child_watch_source_new() documentation"); } else
Review of attachment 268326 [details] [review]: You're right. Please go ahead.
Review of attachment 268327 [details] [review]: I'm not keen on adding extra data structures that we maintain solely for the purpose of throwing criticals...
(In reply to comment #12) > I'm not keen on adding extra data structures that we maintain solely for the > purpose of throwing criticals... That's reasonable. As an alternative, how about the text I suggested in Comment #10? I'll turn it into a proper patch if someone agrees in principle with the approach.
Review of attachment 268326 [details] [review]: commit 125913e9, should be in 2.39.5
(In reply to Simon McVittie from comment #13) > (In reply to comment #12) > > I'm not keen on adding extra data structures that we maintain solely for the > > purpose of throwing criticals... > > That's reasonable. As an alternative, how about the text I suggested in > Comment #10? I'll turn it into a proper patch if someone agrees in principle > with the approach. Seems like a reasonable approach. Patch away at your leisure.
Created attachment 361327 [details] [review] g_child_watch_source_new: Document restrictions for POSIX platforms The warnings issued when dealing with waitpid() raising ECHILD are somewhat misleading: there are lots of reasons why waitpid() might fail in this way, and we can't tell which one has happened. In particular, passing a non-child or a non-pid, waiting for the same pid elsewhere, or creating a duplicate watch for the same pid would all fail in the same way. Consolidate the restrictions into one place, and change all the other places they were (or should have been!) mentioned to point to that one place.
Review of attachment 361327 [details] [review]: ++
Comment on attachment 361327 [details] [review] g_child_watch_source_new: Document restrictions for POSIX platforms 6e480634c
Fixed in git for 2.55.0, thanks for reviewing