Bug 687061 – g_unix_catchall_child_source_new: New API to do waitpid(-1, ...)

After an evaluation, GNOME has moved from Bugzilla to GitLab. Learn more about GitLab.
No new issues can be reported in GNOME Bugzilla anymore.
To report an issue in a GNOME project, go to GNOME GitLab.
Do not go to GNOME Gitlab for: Bluefish, Doxygen, GnuCash, GStreamer, java-gnome, LDTP, NetworkManager, Tomboy.

Bug 687061 - g_unix_catchall_child_source_new: New API to do waitpid(-1, ...)


Summary:	g_unix_catchall_child_source_new: New API to do waitpid(-1, ...)


Status:	RESOLVED WONTFIX

Product:	glib
Classification:	Platform
Component:	general
Version:	unspecified
Hardware:	Other All

Importance:	Normal normal
Target Milestone:	---
Assigned To:	gtkdev
QA Contact:	gtkdev

URL:
Whiteboard:

Depends on:
Blocks:

Reported:	2012-10-28 20:18 UTC by Colin Walters
Modified:	2013-01-19 17:33 UTC

See Also:
GNOME target:	---
GNOME version:	---

Attachments
g_unix_catchall_child_source_new: New API to do waitpid(-1, ...) (14.39 KB, patch) 2012-10-28 20:18 UTC, Colin Walters	none	Details \| Review
demo/test program (3.55 KB, text/plain) 2012-10-28 20:20 UTC, Colin Walters		Details
g_unix_catchall_child_source_new: New API to do waitpid(-1, ...) (18.21 KB, patch) 2012-10-28 20:44 UTC, Colin Walters	reviewed	Details \| Review
g_unix_catchall_child_source_new: New API to do waitpid(-1, ...) (17.87 KB, patch) 2012-10-29 15:31 UTC, Colin Walters	none	Details \| Review
gmain: Handle EINTR from waitpid() (1.18 KB, patch) 2012-10-29 20:26 UTC, Colin Walters	none	Details \| Review
gspawn: Defer waitpid() to GLib worker thread (3.83 KB, patch) 2012-10-29 20:26 UTC, Colin Walters	needs-work	Details \| Review
GChildWatchSource: Allow passing -1 for pid (17.77 KB, patch) 2012-10-29 20:26 UTC, Colin Walters	none	Details \| Review
gmain: Merge warning about incompatible SIGCHLD from gspawn (2.20 KB, patch) 2012-10-29 20:45 UTC, Colin Walters	none	Details \| Review
gmain: Handle EINTR from waitpid() (1.18 KB, patch) 2012-10-30 20:36 UTC, Colin Walters	committed	Details \| Review
gspawn: Defer waitpid() to GLib worker thread (5.30 KB, patch) 2012-10-30 20:36 UTC, Colin Walters	committed	Details \| Review
GChildWatchSource: Allow passing -1 for pid (22.36 KB, patch) 2012-10-30 20:37 UTC, Colin Walters	needs-work	Details \| Review
GChildWatchSource: Allow passing -1 for pid (24.29 KB, patch) 2012-11-05 14:04 UTC, Colin Walters	none	Details \| Review
GChildWatchSource: Allow passing -1 for pid (24.87 KB, patch) 2012-11-05 14:30 UTC, Colin Walters	none	Details \| Review

Description Colin Walters 2012-10-28 20:18:14 UTC

With the goal of modifying gnome-session to use the new Linux
PR_SET_CHILD_SUBREAPER, it needs to be able to reap arbitary children
that may be reparented to it.

But if gnome-session is going to continue to work with GLib and GIO,
even if we converted all child watches in gnome-session to use this
new API, we'd still have to contend with the possibility of it having
a process spawned indirectly (stuff like the GDBus dbus-launch
invocation).

There is the separate but related issue of continuing to support
spawning processes in multiple threads.  gnome-session should be able
to call g_spawn_sync() from a helper thread, without its waitpid()
invocation racing with the waitpid(-1, ...) from the GLib worker
thread.

In order to make that case work, explicitly block the waitpid(-1, )
while a g_spawn_sync() is in progress.  This isn't ideal, but it's
better than being racy.

Comment 1 Colin Walters 2012-10-28 20:18:17 UTC

Created attachment 227474 [details] [review]
g_unix_catchall_child_source_new: New API to do waitpid(-1, ...)

Comment 2 Colin Walters 2012-10-28 20:20:01 UTC

Created attachment 227475 [details]
demo/test program

With PR_SET_SUBCHILD_REAPER, you correctly see the "sleep" process as a child of "gsubinit".

Comment 3 Colin Walters 2012-10-28 20:44:43 UTC

Created attachment 227476 [details] [review]
g_unix_catchall_child_source_new: New API to do waitpid(-1, ...)

Actually, I realized the previous patch had a race condition - if you
had both a regular and catchall source, there was a window in which
waitpid(pid, ) wouldn't catch it, but waitpid(-1) would.
    
Fix this by explicitly separating the "no catchall source" and "have
catchall source" cases.  If we have a catchall, just use waitpid(-1,
), and look at the pid, dispatching it to the correct sources.
    
Added a test case for this too.

Comment 4 Paul Eggert 2012-10-29 00:32:36 UTC

Also please see GNOME bug 687075, which attempts to fix
a documentation problem in this area.  It's not clear to
me how the two proposed fixes interact.

Comment 5 Allison Karlitskaya (desrt) 2012-10-29 08:32:14 UTC

I see 'gnome-session' and I think 'violation of Colin's two-modules rule'.

I also wonder if we can't just use the existing API with a pid of 0 or 1...

Comment 6 Dan Winship 2012-10-29 12:48:36 UTC

Comment on attachment 227476 [details] [review]
g_unix_catchall_child_source_new: New API to do waitpid(-1, ...)

>+          /* We only expect there to be one or two at most, so
>+           * it's not worth refcounting the UnixWaitStatus.
>+           */

It would not make sense to create a catchall source from a library (for the same reason it doesn't make sense to call waitpid(-1)), so the only code creating a catchall source should be the app itself, and in that case it ought to be able to put all of its catchall-child-handling code in one place. So I think you ought to say there can be only one catchall source.

>+  for (iter = catchall_source->pending_waits; iter; iter = iter->next)
>+    {
>+      UnixWaitStatus *status = iter->data;
>+
>+      (child_callback) (status->pid, status->estatus, user_data);
>+      
>+      g_slice_free (UnixWaitStatus, status);
>+    }
>+
>+  g_clear_pointer (&catchall_source->pending_waits, (GDestroyNotify) g_slist_free);

There's a race condition here; the worker thread may prepend additional UnixWaitStatuses to catchall_source->pending_waits while the main thread is iterating it, and then you'll free them without having dispatched them. You need a G_LOCK(unix_signal_lock) and a local pending_waits variable, etc.

>+ * g_unix_catchall_child_source_new:
>+ *
>+ * Normally, g_child_watch_add() is used to monitor child processes.
>+ * This function exists to support very unusual circumstances, such as

You should say what it does before getting into the rationale.

>+ * In other words, this function provides the equivalent of
>+ * <literal>waitpid(-1, ...)</literal>.  However, the source is only
>+ * triggered for child PIDs which do not already have an active child
>+ * watch.

Why? That doesn't seem like the semantic you'd want; if there is something that you want to do for "all child processes", then why would you want to *not* do it just because you (or someone else) wanted to do something else for one particular child process? Likewise for not getting notified of the SIGCHLD inside g_spawn_sync().

Comment 7 Colin Walters 2012-10-29 15:18:24 UTC

(In reply to comment #6)
> So I think you ought to say there can be only one catchall source.

Ok.  I guess we can relax the restriction later if anyone complains.

> There's a race condition here; 

Fixed.

> You should say what it does before getting into the rationale.

Fixed.
 
> >+ * In other words, this function provides the equivalent of
> >+ * <literal>waitpid(-1, ...)</literal>.  However, the source is only
> >+ * triggered for child PIDs which do not already have an active child
> >+ * watch.
> 
> Why? That doesn't seem like the semantic you'd want; if there is something that
> you want to do for "all child processes", then why would you want to *not* do
> it just because you (or someone else) wanted to do something else for one
> particular child process?

Basically for gnome-session, I want to warn if a random child that got reparented (think the a11y bus daemon) exits with an error.  Ideally, we don't "double error" if GLib launches a helper process which exits unsuccessfully.

> Likewise for not getting notified of the SIGCHLD
> inside g_spawn_sync().

We do get notified of SIGCHLD - we just fall back to the "no catchall source" case.  In other words, while you have a g_spawn_sync() active, just the catchall source won't catch things.

But...I could probably make g_spawn_sync() defer to the GLib worker thread for its waitpid() call.  That would avoid the race condition.

Comment 8 Colin Walters 2012-10-29 15:31:02 UTC

Created attachment 227545 [details] [review]
g_unix_catchall_child_source_new: New API to do waitpid(-1, ...)

Some comments addressed, but doesn't yet unify g_spawn_sync()'s waitpid() call

Comment 9 Colin Walters 2012-10-29 15:39:11 UTC

(In reply to comment #5)
> I see 'gnome-session' and I think 'violation of Colin's two-modules rule'.

True...but one thing that makes this case different from some others is there's just no clean and reasonable way to do it outside of GLib. 

> I also wonder if we can't just use the existing API with a pid of 0 or 1...

Mmm...that's a perennial debate in kernel land.  Using -1 would be clearly analogous to waitpid(), which would be nice.  But on the other hand, sometimes major new functionality should be new API, even if you could in theory wedge it into a flag value for an existing one.

I'm a bit on the fence; do you think -1 is better?

Comment 10 Allison Karlitskaya (desrt) 2012-10-29 16:20:10 UTC

(In reply to comment #9)
> I'm a bit on the fence; do you think -1 is better?

Mostly because I don't like the new name and because waitpid(-1) is a very well-established idea.

On the other hand, we should probably consider what this means to Windows, where a GPid is not an int....

Comment 11 Dan Winship 2012-10-29 18:11:03 UTC

(In reply to comment #7)
> Basically for gnome-session, I want to warn if a random child that got
> reparented (think the a11y bus daemon) exits with an error.  Ideally, we don't
> "double error" if GLib launches a helper process which exits unsuccessfully.

Ah, so basically you're *only* interested in SIGCHLDs that no one else is interested in. (Making this API even more single-purpose...)

But wait... in order for this to be useful, you have to be able to map the returned pid to the right process, so you know what service to restart, right? Which implies you have to already know the pids of all of the child/grandchild/etc processes that you're watching. So can't you just create ordinary GChildWatchSources for them?

Comment 12 Colin Walters 2012-10-29 18:32:10 UTC

(In reply to comment #11)
> (In reply to comment #7)
> > Basically for gnome-session, I want to warn if a random child that got
> > reparented (think the a11y bus daemon) exits with an error.  Ideally, we don't
> > "double error" if GLib launches a helper process which exits unsuccessfully.
> 
> Ah, so basically you're *only* interested in SIGCHLDs that no one else is
> interested in. (Making this API even more single-purpose...)

Yes.  Rather than "SIGCHLDs", let's say "zombie children".  It's more correct, and also *way* more fun to use in a technical conversation.

> But wait... in order for this to be useful, you have to be able to map the
> returned pid to the right process, so you know what service to restart, right?

I'm not interested in service restarting - yet.  It's basically just about a pretty "ps auxwf", which is actually fairly useful for debugging and such.

In theory we could parse /proc/<pid>/exe and use heuristics on that to determine what service it is.  But long term it will be better to just fix things to not double fork.  In the meantime though, this helps.

Comment 13 Dan Winship 2012-10-29 18:49:39 UTC

OK. So, random non-glib-modifying idea:

waitid() (no 'p') has a WNOWAIT option to make it return information but not actually reap the child. (No, I had never heard of this before today :). So you could have your catchall SIGCHLD handler call waitid(P_ALL, 0, &info, WNOWAIT|WNOHANG) to see if there is at least one pending zombie, and if so, save info.si_pid, and schedule a timeout for 1 second later. In the timeout callback, call waitpid(saved_pid, &status, WNOHANG), which will either reap the child or return ECHILD if someone else already reaped it for you. Then call the SIGCHLD handler again to do another waitid() to see if there's another zombie waiting.

The "catchall" may end up never seeing some zombies, but if your goal is just to ensure that every zombie gets reaped, then that doesn't matter. The "1 second" timeout can be adjusted to find the right mix between "doesn't accidentally reap other people's zombies" and "doesn't leave unsightly zombies cluttering up your ps output for too long"

(And this works if some library you're using uses non-glib APIs to spawn/wait too.)

Comment 14 Colin Walters 2012-10-29 20:26:39 UTC

Created attachment 227581 [details] [review]
gmain: Handle EINTR from waitpid()

Just on general principle.

Comment 15 Colin Walters 2012-10-29 20:26:44 UTC

Created attachment 227582 [details] [review]
gspawn: Defer waitpid() to GLib worker thread

This is preparatory work for a future commit which will add a
"catchall" waitpid API.  If we don't synchronize here with the worker
thread, race conditions are possible.

It's also just a nice code cleanup - gmain.c has all the gunk to
handle EINTR for waitpid(), so let's just reuse that.

Comment 16 Colin Walters 2012-10-29 20:26:51 UTC

Created attachment 227583 [details] [review]
GChildWatchSource: Allow passing -1 for pid

With the goal of modifying gnome-session to use the new Linux
PR_SET_CHILD_SUBREAPER, it needs to be able to reap arbitary children
that may be reparented to it.

But if gnome-session is going to continue to work with GLib and GIO,
even if we converted all child watches in gnome-session to use this
new API, we'd still have to contend with the possibility of it having
a process spawned indirectly (stuff like the GDBus dbus-launch
invocation).  Thus, we don't give the -1 child watch *every* waitpid,
only the ones which don't have a specific watcher.

Comment 17 Colin Walters 2012-10-29 20:32:22 UTC

(In reply to comment #13)
>
> waitid() (no 'p') has a WNOWAIT option to make it return information but not
> actually reap the child. (No, I had never heard of this before today :). 

You probably had actually in https://bugzilla.gnome.org/show_bug.cgi?id=672102#c27

> So you
> could have your catchall SIGCHLD handler

How can I install a handler for SIGCHLD in gnome-session without stomping on glib's?

> call waitid(P_ALL, 0, &info,
> WNOWAIT|WNOHANG) to see if there is at least one pending zombie, and if so,
> save info.si_pid, and schedule a timeout for 1 second later.

These sorts of arbitrary delays are lame. 

> The "1
> second" timeout can be adjusted to find the right mix between "doesn't
> accidentally reap other people's zombies" 

"How much race condition do you want?"

Comment 18 Colin Walters 2012-10-29 20:45:10 UTC

Created attachment 227587 [details] [review]
gmain: Merge warning about incompatible SIGCHLD from gspawn

This misconfiguration warning got lost when shuffling the code over to
gmain.

Comment 19 Dan Winship 2012-10-29 20:53:00 UTC

(In reply to comment #17)
> > actually reap the child. (No, I had never heard of this before today :). 
> 
> You probably had actually in
> https://bugzilla.gnome.org/show_bug.cgi?id=672102#c27

Huh.

> > So you
> > could have your catchall SIGCHLD handler
> 
> How can I install a handler for SIGCHLD in gnome-session without stomping on
> glib's?

When I wrote that I was thinking you could both just register your own SIGCHLD handlers... apparently I've managed to avoid dealing with UNIX signals for long enough that my brain is now confusing them with gobject signals...

But anyway, sigaction returns the previous handler, so you *could* chain handlers, although no one actually does...

> These sorts of arbitrary delays are lame. 

Yeah well, we're building on top of some pretty lame APIs. There's a limit to how awesome we can be without more help from the kernel.

Comment 20 Colin Walters 2012-10-29 21:17:06 UTC

Review of attachment 227582 [details] [review]:

Blah, not quite good enough - when spawning a child process in a thread, there's still a race condition where the worker's waitpid(-1) might eat it before we install a child watch.  This patch gets us to a better place though - we just need to lock the worker thread out until we get a child watch source set up.

Comment 21 Allison Karlitskaya (desrt) 2012-10-29 23:23:50 UTC

A couple of random notes:

1: glib's SIGCHLD handler only gets registered if you actually start watching for child exits.  If you don't do that, you're safe to deal with it yourself.  Since gnome-session is a program (and not a library) you are capable of doing that.

2: for the same reason, it's possible to use signalfd() from gnome-session sanely.  Just set the signal mask for ignoring SIGCHLD before the first thread is spawned and signalfd() is now reliable.


#1 is your solution, I think.

#2 could simplify the implementation of #1, but only until you realise that gnome-session should probably continue to work on non-Linux, so you'll have to write the code "the hard way" anyway...

Comment 22 Colin Walters 2012-10-30 12:19:59 UTC

(In reply to comment #21)
> A couple of random notes:
> 
> 1: glib's SIGCHLD handler only gets registered if you actually start watching
> for child exits.  If you don't do that, you're safe to deal with it yourself. 
> Since gnome-session is a program (and not a library) you are capable of doing
> that.

Except GLib can create child watches internally for helper programs.

Comment 23 Allison Karlitskaya (desrt) 2012-10-30 16:40:38 UTC

(In reply to comment #22)
> Except GLib can create child watches internally for helper programs.

This is an interesting point that I didn't consider.  Do you have a list?

Comment 24 Colin Walters 2012-10-30 17:13:53 UTC

(In reply to comment #23)
> (In reply to comment #22)
> > Except GLib can create child watches internally for helper programs.
> 
> This is an interesting point that I didn't consider.  Do you have a list?

I already gave one example in the commit message of the patch in https://bugzilla.gnome.org/show_bug.cgi?id=687061#c16

Comment 25 Colin Walters 2012-10-30 20:36:34 UTC

Created attachment 227681 [details] [review]
gmain: Handle EINTR from waitpid()

Rebased.

Comment 26 Colin Walters 2012-10-30 20:36:57 UTC

Created attachment 227682 [details] [review]
gspawn: Defer waitpid() to GLib worker thread

Now merged with gmain: Merge warning about incompatible SIGCHLD from gspawn

Comment 27 Colin Walters 2012-10-30 20:37:29 UTC

Created attachment 227683 [details] [review]
GChildWatchSource: Allow passing -1 for pid

Fixed g_spawn_sync() race more cleanly.

Comment 28 Dan Winship 2012-11-02 12:40:54 UTC

Comment on attachment 227681 [details] [review]
gmain: Handle EINTR from waitpid()

sure

Comment 29 Dan Winship 2012-11-02 12:48:30 UTC

Comment on attachment 227682 [details] [review]
gspawn: Defer waitpid() to GLib worker thread

>-              int res;
>+              pid_t pid;

hm... parts of this patch should really be squashed back into the previous one

other than that, this seems to be right, though given the number of tries it took, i'm wary that there's still some problem we're not thinking of (so if anyone else wants to double-check...)

Comment 30 Dan Winship 2012-11-02 13:40:11 UTC

Comment on attachment 227683 [details] [review]
GChildWatchSource: Allow passing -1 for pid

>+      /* Normal EINTR loop */
>+      do
>+        pid = waitpid (-1, &estatus, WNOHANG);
>+      while (pid == -1 && errno == EINTR);

slightly annoying that in the catchall case you do

  do
    waitpid
  while EINTR
  normal processing

and in the invidividually case you do

  do
    waitpid
    normal processing
  while EINTR

>+GSource *
>+g_unix_catchall_child_source_new (void)

This should have gone away, and the test program should use g_child_watch_source_new(-1)

>+ * Since GLib 2.36, -1 may be given as @pid.  In this mode, the source

Since GLib 2.36, on UNIX, ...

(or else add FIXMEs for a Windows implementation)

>+ * GLib-using process to perform a function equivalent to calling
>+ * <literal>waitpid(-1, ...)</literal>.  There can at present be at

"similar to", not "equivalent to", since you don't see children that have their own watches.

>+    /* Assumes we have the unix signal lock */
>+    source = _g_main_child_watch_source_new_unix_unlocked (pid);
>     g_source_set_callback (source, (GSourceFunc)on_sync_waitpid, &waitpid_data, NULL);
>     g_source_attach (source, context);
>     g_source_unref (source);
>+
>+    /* Now we safely have our waitpid() queued up for the GLib worker
>+     * thread to process.  Unlock the mutex so it can continue.
>+     */
>+    g_assert (have_unix_signal_lock);
>+    _g_main_unix_signal_unlock ();
>+    have_unix_signal_lock = FALSE;

This is pretty tricky...

And actually, g_spawn_async() has the same problem anyway; if you pass G_SPAWN_DO_NOT_REAP_CHILD, then you need to arrange for the catchall reaper to not reap that child, even if no child watch has been created for it yet when the child exits. (I guess it's not really "not reap", but "remember the exit status and return it immediately when the child watch for that pid does get created".)

And if the API was something like "GPid g_main_unix_fork_and_do_not_reap_child (void);", then you wouldn't need to also expose lock/unlock ops.

Comment 31 Colin Walters 2012-11-02 20:29:58 UTC

(In reply to comment #30)

[everything not replied to is fixed]

> And actually, g_spawn_async() has the same problem anyway;

Right.  I need to think about this.  I had been setting it aside temporarily some under the (kind of invalid) assumption that glib itself wouldn't be using g_spawn_async() *internally*, but that's not really valid.

And even if it was, we still want a way for gnome-session to use g_spawn_async() and use individual watches.

> if you pass
> G_SPAWN_DO_NOT_REAP_CHILD, then you need to arrange for the catchall reaper to
> not reap that child, even if no child watch has been created for it yet when
> the child exits. (I guess it's not really "not reap", but "remember the exit
> status and return it immediately when the child watch for that pid does get
> created".)

The problem is that we have no way to atomically fork() *and* ensure a child watch is created.  Even discarding threads in the app, we still have to contend with the fact that the GLib helper thread and the main thread will race.

> And if the API was something like "GPid g_main_unix_fork_and_do_not_reap_child
> (void);", then you wouldn't need to also expose lock/unlock ops.

Super ugly to implement, as far as I can see - I'd have to move the first 1/3 of fork_exec_with_pipes() into gmain.c.  Maybe that's right, but let me consider some more alternatives.

Comment 32 Dan Winship 2012-11-02 21:28:26 UTC

(In reply to comment #31)
> > And if the API was something like "GPid g_main_unix_fork_and_do_not_reap_child
> > (void);", then you wouldn't need to also expose lock/unlock ops.
> 
> Super ugly to implement, as far as I can see - I'd have to move the first 1/3
> of fork_exec_with_pipes() into gmain.c.  Maybe that's right, but let me
> consider some more alternatives.

what I was imagining:

GPid
g_main_unix_fork_and_do_not_reap_child (void)
{
  pid_t pid;

  G_LOCK (unix_signal_lock);

  pid = fork ();
  if (pid > 0)
    {
      UnixWaitStatus *waitstatus = g_slice_new (UnixWaitStatus);

      waitstatus->pid = pid;
      waitstatus->status = FIXME something that's not a valid wait() result;
      /* or add a gboolean to UnixWaitStatus */

      do_not_reap_pids = g_slist_prepend (do_not_reap_pids, waitstatus);
    }

  G_UNLOCK (unix_signal_lock);

  return pid;
}

(oh, probably want to preserve errno around the G_UNLOCK if pid < 0)

fork_and_exec_with_pipes() would just replace its call to fork() with a call to g_main_unix_fork_and_do_not_reap_child() (if DO_NOT_REAP).

dispatch_sigchld_with_catchall() would have a loop over do_not_reap_pids in addition to its loop over unix_child_watches, which would fill it in and unset give_to_catchall if it found a match.

g_child_watch_source_new() would check if its pid was in do_not_reap_pids, and if so, remove it, and if it already had a valid status, then it would set child_watch_source->child_status from that and set ->child_exited so the source would trigger right away.

Comment 33 Colin Walters 2012-11-05 14:04:16 UTC

Created attachment 228110 [details] [review]
GChildWatchSource: Allow passing -1 for pid

With the goal of modifying gnome-session to use the new Linux
PR_SET_CHILD_SUBREAPER, it needs to be able to reap arbitary children
that may be reparented to it.

But if gnome-session is going to continue to work with GLib and GIO,
even if we converted all child watches in gnome-session to use this
new API, we'd still have to contend with the possibility of it having
a process spawned indirectly (stuff like the GDBus dbus-launch
invocation).  Thus, we don't give the -1 child watch *every* waitpid,
only the ones which don't have a specific watcher.

Also, fix up g_spawn_sync() to hold the unix signal lock when forking.
This way we ensure that GLib itself gets the waitpid() result for the
synchronously spawned child, instead of racing with the worker thread.

Comment 34 Colin Walters 2012-11-05 14:08:33 UTC

That actually worked out quite well, Dan!  Thanks a lot for the concept.  The minor tweak I did here to avoid adding a gboolean to UnixWaitStatus is to set pid = -1 until it's exited.

Also, there's now a test case that actually uses PR_SET_SUBCHILD_REAPER (if available).

One thing occurs to me - I don't actually have test coverage of the g_spawn_async() + catchall case.  I'll add one.

Comment 35 Allison Karlitskaya (desrt) 2012-11-05 14:13:54 UTC

I love the mutex that gets locked once and unlocked twice....

Some scary code there, Dan. :)

Comment 36 Allison Karlitskaya (desrt) 2012-11-05 14:29:56 UTC

On a more serious, note though: the case that this code tries to catch sort of illustrates the conceptual dirtiness of what we're trying to do here...

We have this hash table that contains a list of PIDs not to reap.  If I read correctly, it never gets cleaned out except when doing a child watch.

Problem 1:  I do a lot of this:

  setup_catchall();

  for (big number)
    {
       pid = g_spawn();
       waitpid(pid);
    }

and the table gets bigger and bigger.....


Problem 2: Imagine this situation (in gnome-session):

  setup_catchall();

  pid = g_spawn();
  waitpid(pid);

  ... time passes, 'pid' gets recycled ...

  ... 'pid' quits and gets punted back up to gnome-session ...

  ... it is never reaped, because it's still in the don't-reap table ...



Another problem (or a different angle on the same problem?) is that if glib ever calls waitpid(-1) and sees something that it isn't supposed to see then no matter how hard we try, we can never put that child's information back into the kernel, do-not-reap-list or otherwise.  In that sense, it's not so much do-not-reap as it is reap-then-ignore.  This means that it's completely unsafe to call any variant of wait() from outside of GLib in any process that is using the catch-all.

We might have an idea that we are able to impose this restriction for something like gnome-session but the argument falls apart when it comes to libraries (along much the same lines as goes your argument for why we need this patch in the first place).

Comment 37 Colin Walters 2012-11-05 14:30:19 UTC

Created attachment 228118 [details] [review]
GChildWatchSource: Allow passing -1 for pid

Now with more tests, and also an updated commit message

Comment 38 Colin Walters 2012-11-05 14:57:05 UTC

(In reply to comment #36)

>   setup_catchall();
> 
>   for (big number)
>     {
>        pid = g_spawn();
>        waitpid(pid);
>     }

Such an application can not exist currently because it can't be using the catchall watch yet - and the hash table is only used in the case where there is one.

> Problem 2: Imagine this situation (in gnome-session):
> 
>   setup_catchall();
> 
>   pid = g_spawn();
>   waitpid(pid);

Likewise here.  I control gnome-session's codebase, it doesn't invoke waitpid() directly.

> Another problem (or a different angle on the same problem?) This means that it's completely unsafe
> to call any variant of wait() from outside of GLib in any process that is using
> the catch-all.

Right, so by "using the catch-all" we mean "gnome-session".

> We might have an idea that we are able to impose this restriction for something
> like gnome-session but the argument falls apart when it comes to libraries
> (along much the same lines as goes your argument for why we need this patch in
> the first place).

You do have an interesting point here - what about libraries outside of GNOME like say libcanberra?   It is entirely possible it tries to auto-activate pulseaudio and that involves finding some helper binary.  I don't have a great answer for that.  We'd have to have non-GLib libraries export an API where we can "plug in" process helper APIs, just like we expect these libraries to export mainloop integration. 

Actually libdbus-1 is probably the best example here.  Which in fact does spawn dbus-launch.  And gnome-session does link to it =/

systemd also links to libdbus-1 too of course, and as far as I can see they simply don't handle this at all.  In practice, it's not really a problem since the autolaunching just doesn't happen for systemd, and that's really the only case where we launch helper binaries.

So we do have a general path for solving the "non-GLib helper library" case:

1) Change library to export hooks for integrating external launching
2) Drop dependency on library, reimplement with GLib-using library (e.g. port from dbus-glib to GDBus)
3) Split library into separate helper process

I can iterate on this for gnome-session, and come back to this patch when it's ready.   But do you agree that after having done so, this patch is mergable and would make sense to use in gnome-session?

Comment 39 Dan Winship 2012-11-05 15:31:16 UTC

(In reply to comment #38)
> > We might have an idea that we are able to impose this restriction for something
> > like gnome-session but the argument falls apart when it comes to libraries
> > (along much the same lines as goes your argument for why we need this patch in
> > the first place).
> 
> You do have an interesting point here - what about libraries outside of GNOME
> like say libcanberra?

Actually, it is not just libraries outside of GNOME; the g_spawn_async_with_pipes() docs allow you to use waitpid() directly (on unix).

> So we do have a general path for solving the "non-GLib helper library" case:
> 
> 1) Change library to export hooks for integrating external launching

Ugh

> 2) Drop dependency on library, reimplement with GLib-using library (e.g. port
> from dbus-glib to GDBus)

Sure, but glib-using libraries can still cause problems, and even after you fix them they could easily get broken again since their maintainers aren't going to be considering this use case which only matters to one program.

> 3) Split library into separate helper process

might make sense in some cases, probably doesn't in other cases

> I can iterate on this for gnome-session, and come back to this patch when it's
> ready.   But do you agree that after having done so, this patch is mergable and
> would make sense to use in gnome-session?

Every time I think we've figured this out another problem comes up. Given that PR_SET_CHILD_SUBREAPER is linux-specific anyway, wouldn't having a linux-specific sane new kernel API really be a much better solution?

Comment 40 Colin Walters 2012-11-05 16:20:22 UTC

(In reply to comment #39)

> Actually, it is not just libraries outside of GNOME; the
> g_spawn_async_with_pipes() docs allow you to use waitpid() directly (on unix).

Yes, but I have the source code to every dependency of gnome-session, and the ability to fix it, and:

> Sure, but glib-using libraries can still cause problems, and even after you fix
> them they could easily get broken again since their maintainers aren't going to
> be considering this use case which only matters to one program.

But would they really go and change code back from using g_child_watch_add() to raw waitpid()?  I can't imagine that happening in practice.

> Every time I think we've figured this out another problem comes up. Given that
> PR_SET_CHILD_SUBREAPER is linux-specific anyway, wouldn't having a
> linux-specific sane new kernel API really be a much better solution?

Like some new waitpid (-1, &estatus, WNOHANG | LINUX_ONLY_REPARENTED) ?

The cost for the kernel would be tracking processes which were reparented - probably not too hard.

But I do think this problem is a bit overblown...we haven't yet identified any *in actual practice* case where gnome-session will call into a library that will spawn a process *and* the library calls waitpid() directly.

Comment 41 Dan Winship 2012-11-05 17:47:41 UTC

(In reply to comment #40)
> But would they really go and change code back from using g_child_watch_add() to
> raw waitpid()?  I can't imagine that happening in practice.

No, but they may add new code that needs to spawn something, and if they want to do a blocking wait, using a child watch would be annoying and seemingly-unnecessary.

> Like some new waitpid (-1, &estatus, WNOHANG | LINUX_ONLY_REPARENTED) ?

I meant the more general problem of "SIGCHLD/wait sucks as an API".

Comment 42 Allison Karlitskaya (desrt) 2013-01-19 17:33:19 UTC

(In reply to comment #39)
> Every time I think we've figured this out another problem comes up. Given that
> PR_SET_CHILD_SUBREAPER is linux-specific anyway, wouldn't having a
> linux-specific sane new kernel API really be a much better solution?

This pretty much sums up my opinion of this bug.  Add to that the fact that this would only be for one process in the entire session (even though I understand that GLib has the unique position here in terms of being able to implement it in a way that stays 'friendly' with other things that may get spawned in that same process).

I think that, as is, this is a WONTFIX.