Bug 725660 – [review] refactor killing and reaping of child process [th/kill_child]

After an evaluation, GNOME has moved from Bugzilla to GitLab. Learn more about GitLab.
No new issues can be reported in GNOME Bugzilla anymore.
To report an issue in a GNOME project, go to GNOME GitLab.
Do not go to GNOME Gitlab for: Bluefish, Doxygen, GnuCash, GStreamer, java-gnome, LDTP, NetworkManager, Tomboy.

Bug 725660 - [review] refactor killing and reaping of child process [th/kill_child]


Summary:	[review] refactor killing and reaping of child process [th/kill_child]


Status:	RESOLVED FIXED

Product:	NetworkManager
Classification:	Platform
Component:	general
Version:	git master
Hardware:	Other Linux

Importance:	Normal normal
Target Milestone:	---
Assigned To:	NetworkManager maintainer(s)
QA Contact:	NetworkManager maintainer(s)

URL:
Whiteboard:

Depends on:
Blocks:

Reported:	2014-03-04 15:31 UTC by Thomas Haller
Modified:	2014-07-10 19:12 UTC

See Also:
GNOME target:	---
GNOME version:	---

Attachments
kill-child-log-fmt.patch (17.88 KB, patch) 2014-05-02 20:46 UTC, Dan Williams	none	Details \| Review
dcbw-suggestions.patch (30.38 KB, patch) 2014-05-07 16:18 UTC, Dan Williams	none	Details \| Review

Description Thomas Haller 2014-03-04 15:31:03 UTC

There are several places in NM where we kill child processes. This code is sometimes synchronous (although it doesn't have to) and always different.

I wanted to add one function that does it correctly and adds proper (standardized) logging of what is going on.

Could you please review the branch "th/kill_child"?





Related, I keep seeing the error log:

   <warn> (em1): DHCP client pid 21759 didn't exit, will kill it.

This is easily reproducible for me by `ip set em1 down` the interface.

Apparently DHCP client refused to get killed, meaning we block the NM mainloop for 3 seconds(!)

Comment 1 Thomas Haller 2014-03-04 15:41:00 UTC

(In reply to comment #0)

> Related, I keep seeing the error log:
> 
>    <warn> (em1): DHCP client pid 21759 didn't exit, will kill it.
> 
> This is easily reproducible for me by `ip set em1 down` the interface.
> 
> Apparently DHCP client refused to get killed, meaning we block the NM mainloop
> for 3 seconds(!)

I'd say, we should always prefer to kill child process asynchronously. And if that is difficult, we need to take give it a shorter timeout.

Comment 2 Dan Winship 2014-03-20 16:31:12 UTC

> core: add nm_utils_kill_child_async() and nm_utils_kill_child_sync() function

>+nm_utils_kill_child_sync (int pid, guint64 log_domain, const char *log_name, int *out_pid, int *child_status,

"int pid" should be "pid_t pid" everywhere

The out_pid arg is not useful; either the function returns TRUE and out_pid==pid, or it returns FALSE and out_pid==-1.

Well, or you passed a negative pid, but you shouldn't do that.

>+	int waiting_logged = 0;

This should be gboolean/TRUE/FALSE, not int/1/0.

>+	errno = 0;

Setting errno to 0 is a code smell unless you're calling one of the handful of functions with ambiguous return values, which you aren't here.

>+		errno2 = errno;
>+		if ((errno2 = errno) != ECHILD) {

setting errno2 twice. Also, FWIW, in glib the convention is to use "errsv" for the name of variable you store errno in to keep it from getting clobbered.

>+			if (!name)
>+				name = g_strdup_printf ("(%d)", pid);

The way this is handled is weird; you set name NULL originally but then set it to "($pid)" in several different places later. Just set it to that originally and use "if (log_domain != LOGD_NONE)" instead of "if (name)" the nm_log_dbg() cases.

>+		if (errno2 != ESRCH) {

I had to look that one up. Maybe put a comment?

>+		ret = waitpid (pid, &status, WNOHANG);

You handle EINTR after the KILL waitpid(), but not after the TERM waitpid.

>+nm_utils_kill_child_async (int pid, guint64 log_domain, const char *log_name, guint grace_period_msec,

It's weird that the arguments don't quite match the sync version.

Also, there is a lot of duplicated code here with the sync version that should be split out into a helper function.

>+ * returns: returns %TRUE, if the callback was already invoked synchronously, at return
>+ * time. %FALSE means, that the callback will invoked later (asynchronously). The same
>+ * value will be passed to the@callback as is_sync.

It's preferred to have async functions always call the callback later, so that callers don't have to handle two different possibilities.


Skimming the second patch, it looks great, though I didn't analyze it in detail to make sure no functionality was lost.

Comment 3 Thomas Haller 2014-03-21 14:59:10 UTC

(In reply to comment #2)

Repushed two fixup! commits addressing your remarks.


> > core: add nm_utils_kill_child_async() and nm_utils_kill_child_sync() function
> 
> >+nm_utils_kill_child_sync (int pid, guint64 log_domain, const char *log_name, int *out_pid, int *child_status,
> 
> "int pid" should be "pid_t pid" everywhere
> 
> The out_pid arg is not useful; either the function returns TRUE and
> out_pid==pid, or it returns FALSE and out_pid==-1.

Done

> Well, or you passed a negative pid, but you shouldn't do that.
> 
> >+	int waiting_logged = 0;
> 
> This should be gboolean/TRUE/FALSE, not int/1/0.
> 
> >+	errno = 0;
> 
> Setting errno to 0 is a code smell unless you're calling one of the handful of
> functions with ambiguous return values, which you aren't here.
> 
> >+		errno2 = errno;
> >+		if ((errno2 = errno) != ECHILD) {
> 
> setting errno2 twice. Also, FWIW, in glib the convention is to use "errsv" for
> the name of variable you store errno in to keep it from getting clobbered.

Done


> >+			if (!name)
> >+				name = g_strdup_printf ("(%d)", pid);
> 
> The way this is handled is weird; you set name NULL originally but then set it
> to "($pid)" in several different places later. Just set it to that originally
> and use "if (log_domain != LOGD_NONE)" instead of "if (name)" the nm_log_dbg()
> cases.

Done something. I think it's much better now(?)



> >+		if (errno2 != ESRCH) {
> 
> I had to look that one up. Maybe put a comment?

Not sure what to say there. RTM :)



> >+		ret = waitpid (pid, &status, WNOHANG);
> 
> You handle EINTR after the KILL waitpid(), but not after the TERM waitpid.

With WNOHANG EINTR cannot happend (according to my docs).


> >+nm_utils_kill_child_async (int pid, guint64 log_domain, const char *log_name, guint grace_period_msec,
> 
> It's weird that the arguments don't quite match the sync version.

Changed something. Is it better now?


> Also, there is a lot of duplicated code here with the sync version that should
> be split out into a helper function.

Unsure how to do that. I want different log lines in each error case.


> >+ * returns: returns %TRUE, if the callback was already invoked synchronously, at return
> >+ * time. %FALSE means, that the callback will invoked later (asynchronously). The same
> >+ * value will be passed to the@callback as is_sync.
> 
> It's preferred to have async functions always call the callback later, so that
> callers don't have to handle two different possibilities.

Indeed it simplifies things, at the expense of scheduling something async that could be handled right away. Other opinions?



> Skimming the second patch, it looks great, though I didn't analyze it in detail
> to make sure no functionality was lost.

Comment 4 Dan Winship 2014-03-26 19:57:36 UTC

(In reply to comment #3)
> > >+			if (!name)
> > >+				name = g_strdup_printf ("(%d)", pid);
> > 
> > The way this is handled is weird; you set name NULL originally but then set it
> > to "($pid)" in several different places later. Just set it to that originally
> > and use "if (log_domain != LOGD_NONE)" instead of "if (name)" the nm_log_dbg()
> > cases.
> 
> Done something. I think it's much better now(?)

Ugh. There's no reason to make the code so messy just to avoid a few mallocs. (Especially since even with all the overcomplication, you still have to do "g_free(name_free)" everywhere!) Just have stick with always using an allocated string like you did before. It takes half as many lines of code, and doesn't require comments explaining what the code is doing.

> > >+		if (errno2 != ESRCH) {
> > 
> > I had to look that one up. Maybe put a comment?
> 
> Not sure what to say there. RTM :)

That error code is both relatively obscure and completely non-mnemonically named. Just something like:

  /* ESRCH means the process is already a zombie. */

> > >+nm_utils_kill_child_async (int pid, guint64 log_domain, const char *log_name, guint grace_period_msec,
> > 
> > It's weird that the arguments don't quite match the sync version.
> 
> Changed something. Is it better now?

Better, but not completely. You still have "guint32 grace_period_msec" in the async version, but "gint64 max_wait_time_usec" in the sync version; they both control how long you wait between SIGTERM and SIGKILL, but they have different names, and one is milliseconds and the other microseconds.

> > It's preferred to have async functions always call the callback later, so that
> > callers don't have to handle two different possibilities.
> 
> Indeed it simplifies things, at the expense of scheduling something async that
> could be handled right away. Other opinions?

The difference between "right away" and "asynchronously" is small enough to not be important here.

Comment 5 Thomas Haller 2014-04-01 16:22:47 UTC

Pushed fixup! commits, also extending the test cases.

Now you can specify the first signal that will be sent (including no-signal).
You can also suppress sending of SIGKILL (by setting wait_before_kill_msec==0).
And I extended the logging to log how long it took until the process was reaped.




(In reply to comment #4)

three things that you pointed out are not yet done.

(1) macro _LOG_NAME() is still here
(2) async callback can be called right away
(3) remove code duplication (how to do that best? The code looks similar, but it's quite different to put it in a function).


(1) and (2) is a matter of taste. I am not refusing it entirely, but are there any other opinions? In general I would say that in a library function there is more incentive to optimize (at the expense of simplicity).

(note, that I still want to change the logging macros to evaluate the arguments only if logging is enabled).

Comment 6 Dan Williams 2014-04-22 04:00:58 UTC

Could we rename struct kill_child_data -> typedef struct KillChildData to keep style with the rest of NM?

---

Also, the strsignal() call exists, so we could use that instead of doing the mapping ourselves.

---

For _kill_child_exit_to_string & _kill_child_waited_to_string: couldn't we use static buffers in these functions and just return them?  Since they won't ever be called from separate threads, and it'll save memory and should be simpler.

---

nm_utils_kill_child_sync(), nm_utils_kill_child_async() - put the function arguments on multiple lines

---

Child process name handling: I would simply make the name a pointer in the struct and just g_strdup_printf() what you want to call it, and release that when freeing the KillChildData struct.  I think it's simpler and less error-prone then figuring out when you need to free the buffer and when not.  The existing optimizations save 8 bytes of pointer size for less clear code IMHO.

Also, could the logging functions be mostly reduced to something like this?  Using log_name() as an example...

static const char *
_kill_child_log_name (pid_t pid, const char *log_name)
{
	static char buf[100];

	g_snprintf (buf, sizeof (buf), "'%s' (%ld)", log_name, (long) pid);
	return buf;
}
#define _LOG_NAME() _kill_child_log_name(pid, log_name)

---

source_child_watch_id - not used anywhere?  would it be used later when the
callers of kill child stuff start passing callbacks?

---

_kill_child_*_to_string() -> maybe just *_to_string() to make the debug logging
calls shorter and clearer?

---

nm_utils_kill_child_async()/nm_utils_kill_child_sync() should simply require
the log name argument since nothing calls them without one.

---

vpn-service: service_quit() can use priv->name or priv->program instead of the hardcoded "VPN service"

---

	if (sig != SIGKILL && wait_before_kill_msec > 0)
		data->source_timeout_kill_id = g_timeout_add (wait_before_kill_msec, _kill_child_cb_timeout_grace_period, data);
	else
		data->source_timeout_kill_id = 0;

	if (data->source_timeout_kill_id) {
		nm_log_dbg (log_domain, "kill child process %s: wait for process to terminate after sending %s (send SIGKILL in %ld milliseconds)...",
		            data->name,  _kill_child_signal_to_string (buf_sig, sig), (long) wait_before_kill_msec);
	} else {
		nm_log_dbg (log_domain, "kill child process %s: wait for process to terminate after sending %s...", data->name,
		            _kill_child_signal_to_string (buf_sig, sig));
	}

combine the two if statements to save some LOC, since the second if() will be true if the first if() is true.

Comment 7 Thomas Haller 2014-04-22 19:54:09 UTC

(In reply to comment #6)
> Could we rename struct kill_child_data -> typedef struct KillChildData to keep
> style with the rest of NM?

added typedef KillChildAsyncData


> Also, the strsignal() call exists, so we could use that instead of doing the
> mapping ourselves.

strsignal() returns the signal name as an adjective ("Killed" or "Terminated").
That sounds unexpected in the log messages:

  waiting up to 500 milliseconds for process to terminate normally
  after sending Terminated...

otherwise it will say:

  waiting up to 500 milliseconds for process to terminate normally
  after sending SIGTERM (15)...


> For _kill_child_exit_to_string & _kill_child_waited_to_string: couldn't we use
> static buffers in these functions and just return them?  Since they won't ever
> be called from separate threads, and it'll save memory and should be simpler.

The reason for passing the buffer as argument is to save the static memory. Otherwise the mere presence of these functions always consumes additional memory.

I think using a static buffer is good to make it convenient for the caller (such as for nm_platform_ip4_address_to_string()). But as implementation detail for one function, I think it's a waste. 



> nm_utils_kill_child_sync(), nm_utils_kill_child_async() - put the function
> arguments on multiple lines

done



> Child process name handling: I would simply make the name a pointer in the
> struct and just g_strdup_printf() what you want to call it, and release that
> when freeing the KillChildData struct.  I think it's simpler and less
> error-prone then figuring out when you need to free the buffer and when not. 
> The existing optimizations save 8 bytes of pointer size for less clear code
> IMHO.
> 
> Also, could the logging functions be mostly reduced to something like this? 
> Using log_name() as an example...
> 
> static const char *
> _kill_child_log_name (pid_t pid, const char *log_name)
> {
>     static char buf[100];
> 
>     g_snprintf (buf, sizeof (buf), "'%s' (%ld)", log_name, (long) pid);
>     return buf;
> }
> #define _LOG_NAME() _kill_child_log_name(pid, log_name)

This version uses also a static buffer (which I avoided not to waste memory)...
Also, your version does not do special handling of log_name==NULL and it will truncate the output if log_name is larger then the fixed size buffer.

Sure, could be all done too, and it's even less lines of code (because it treats special cases differently).

Regarding the name in KillChildAsyncData. As it is now, you don't have to free it at all because it is part of the KillChildAsyncData itself. I think this is super neat :)
Allocating the buffer dynamically would add two more lines of code. In order to simplify it, I would also have to refactor _kill_child_log_name().
I don't want to directly write:

  data->name = g_strdup_printf ("%s (%d)", log_name, pid);

because I want to have the formating for the name in one place: _kill_child_log_name().


Yeah. Could be done differently.


This branch adds two new functions with the aim to replace 7 places where we kill child processes. These library functions only make sense if they are better then the versions that are to be replaced. "Better" in this case conflicts with "simpler" and "shorter"... so I am more willing to accept a complex implementation in the library function if it is "better" by other metrics.


> source_child_watch_id - not used anywhere?  would it be used later when the
> callers of kill child stuff start passing callbacks?

Unused, I removed it. It would only make sense to have this id, if we want to cancel the killing... but I doubt that this ever makes sense because we already sent the kill signal.


> _kill_child_*_to_string() -> maybe just *_to_string() to make the debug logging
> calls shorter and clearer?

Changed "_kill_child_*" to "_kc_*" (to keep a common prefix).


> nm_utils_kill_child_async()/nm_utils_kill_child_sync() should simply require
> the log name argument since nothing calls them without one.

Yes, possible. Is in line with your suggested simplification of _kill_child_log_name() above...

I would prefer it for the function to be more forgiving in this regard.


 
> vpn-service: service_quit() can use priv->name or priv->program instead of the
> hardcoded "VPN service"

Ok


>     if (sig != SIGKILL && wait_before_kill_msec > 0)
>         data->source_timeout_kill_id = g_timeout_add (wait_before_kill_msec,
> _kill_child_cb_timeout_grace_period, data);
>     else
>         data->source_timeout_kill_id = 0;
> 
>     if (data->source_timeout_kill_id) {
>         nm_log_dbg (log_domain, "kill child process %s: wait for process to
> terminate after sending %s (send SIGKILL in %ld milliseconds)...",
>                     data->name,  _kill_child_signal_to_string (buf_sig, sig),
> (long) wait_before_kill_msec);
>     } else {
>         nm_log_dbg (log_domain, "kill child process %s: wait for process to
> terminate after sending %s...", data->name,
>                     _kill_child_signal_to_string (buf_sig, sig));
>     }
> 
> combine the two if statements to save some LOC, since the second if() will be
> true if the first if() is true.

Done


Repushed two fixup! commits

Comment 8 Dan Williams 2014-05-02 20:45:29 UTC

(In reply to comment #7)
> > Also, the strsignal() call exists, so we could use that instead of doing the
> > mapping ourselves.
> 
> strsignal() returns the signal name as an adjective ("Killed" or "Terminated").
> That sounds unexpected in the log messages:

There is sys_sigabbrev[] apparently, which can be indexed up to NSIG:

http://www.redhat.com/archives/anaconda-devel-list/2009-November/msg00076.html

which would save some code; but OK, that's not technically part of glibc ABI, and it's kinda ugly to rely on an external define that's not even in the headers.  So sure, the function is fine.

> > For _kill_child_exit_to_string & _kill_child_waited_to_string: couldn't we use
> > static buffers in these functions and just return them?  Since they won't ever
> > be called from separate threads, and it'll save memory and should be simpler.
> 
> The reason for passing the buffer as argument is to save the static memory.
> Otherwise the mere presence of these functions always consumes additional
> memory.

OK.

> > Child process name handling: I would simply make the name a pointer in the
> > struct and just g_strdup_printf() what you want to call it, and release that
> > when freeing the KillChildData struct.  I think it's simpler and less
> > error-prone then figuring out when you need to free the buffer and when not. 
> > The existing optimizations save 8 bytes of pointer size for less clear code
> > IMHO.
>
> Also, your version does not do special handling of log_name==NULL and it will
> truncate the output if log_name is larger then the fixed size buffer.

It doesn't do anything special for log_name == NULL, because that special-ness isn't needed, because nothing calls these functions with log_name==NULL at this point.  If we ever have something that would call it with log_name == NULL, then we could discuss adding it back.  But I don't really thing the added complexity is worth it, especially since nothing uses log_name == NULL right now.

Better I think to start simple and get complex later, if you need it.
 
> Regarding the name in KillChildAsyncData. As it is now, you don't have to free
> it at all because it is part of the KillChildAsyncData itself. I think this is
> super neat :)

Here's a thought.  Since you want to preserve the memory benefits, how about the following, which uses *less* memory than what you've currently got here, and also has less code and fewer instructions.  It also keeps the formatting in one place, like you want:

#define LOG_NAME_FMT "kill child process '%s' (%ld)"
#define LOG_NAME_ARGS log_name,(long)pid

nm_log_dbg (log_domain, LOG_NAME_FMT ": %ld already terminated %s",
            LOG_NAME_ARGS, (long) ret, _kc_exit_to_string (buf_exit, status));
...
g_snprintf (data->name, name_len, LOG_NAME_FMT, LOG_NAME_ARGS);

If you really want to be memory and code efficient, this is the way to go.  No buffer, no jumps, no buffer-size checking required.  The kernel does this quite often, and we also do it in NM for things like MAC_FMT/MAC_ARGS.

If you go this way, I'd reflow the log statements so that they don't overflow 80-cols so much.  I don't think 80-cols is a hard limit, but if we get to > 100, we probably want to newline.

See the attached patch which implements this, what do you think?

> This branch adds two new functions with the aim to replace 7 places where we
> kill child processes. These library functions only make sense if they are
> better then the versions that are to be replaced. "Better" in this case
> conflicts with "simpler" and "shorter"... so I am more willing to accept a
> complex implementation in the library function if it is "better" by other
> metrics.

I agree, and definitely agree that consolidating complexity is good.  I think we only differ on what level of complexity to use.  But we'll work that out.

> > nm_utils_kill_child_async()/nm_utils_kill_child_sync() should simply require
> > the log name argument since nothing calls them without one.
> 
> Yes, possible. Is in line with your suggested simplification of
> _kill_child_log_name() above...
> 
> I would prefer it for the function to be more forgiving in this regard.

I might agree if something actually passed log_name == NULL.  But I can't think of a case where something wouldn't pass that...  I hope we always know the name of the process we're trying to kill.

Comment 9 Dan Williams 2014-05-02 20:46:05 UTC

Created attachment 275701 [details] [review]
kill-child-log-fmt.patch

Comment 10 Thomas Haller 2014-05-02 22:29:40 UTC

(In reply to comment #9)
> Created an attachment (id=275701) [details] [review]
> kill-child-log-fmt.patch

This patch is really great.


I pushed it on top of the branch (+ 3 minor fixup commits)


ready for another review :)

Comment 11 Dan Williams 2014-05-07 16:17:42 UTC

One more thing I think we should consider is the 'callback' argument for nm_utils_kill_child_async(). I know that's only used in testcases for now, but there could be a problem if we start using it in the real code with a timeout. If the thing that calls nm_utils_kill_child_async() and passes a callback then gets disposed, and the callback triggers, that would be a bug. Currently there is no way to easily remove a callback so that this situation doesn't happen.

I think the best solution to that would be a GCancellable passed into nm_utils_kill_child_async() along with the callback, so that the callback could check g_cancellable_is_cancelled() and just return. Then the caller of nm_utils_kill_child_async() could cancel the cancellable when it was disposed and thus have its callback aware of the disposal.

---

Next up, do we need the is_sync variable in the the NMUtilsKillChildAsyncCb arguments? I think in general we should either make everything a scheduled callback, or require the callback not to care whether it's sync or not. That way we have very defined semantics for the callback function. For now I think we should just require the callback not to care; at the moment I don't think anything uses that argument.

---

Also, could we collapse the callback arguments for 'success' and 'child_status' and just pass -1 for child status? Currently, 'success' is always FALSE when 'child_status' is -1, and it is never TRUE when child_status is -1. So the callback would just look at child_status for success/failure perhaps?

---

Also, any particular reason that test-general.c passes the address of a the expected_signal_XXX into test_nm_utils_kill_child_async_do()? To get around the "assignment of read-only member" error that you probably encountered, you can do this:

struct test_nm_utils_kill_child_async_data data = {
.pid = pid,
.expected_success = expected_success,
.expected_child_status = expected_child_status };

---

Another thought; for _kc_signal_to_string(), it seems quite unlikely that we'll ever use signals that are outside the range given here. So we could skip the buffer (saving memory) and require that signals are in the constant list and just return the constant string. If we're worried about code size, we could always remove some of the less-used signals from the constant list too.

Comment 12 Dan Williams 2014-05-07 16:18:34 UTC

Created attachment 276084 [details] [review]
dcbw-suggestions.patch

Possible implementation of my suggestions from comment 11

Comment 13 Thomas Haller 2014-05-07 17:27:17 UTC

(In reply to comment #12)
> Created an attachment (id=276084) [details] [review]
> dcbw-suggestions.patch
> 
> Possible implementation of my suggestions from comment 11

The patch did not apply cleanly for my. Anyway, I pushed it on top of the current branch for the moment

Comment 14 Thomas Haller 2014-05-07 17:48:30 UTC

(In reply to comment #11)
> One more thing I think we should consider is the 'callback' argument for
> nm_utils_kill_child_async().  I know that's only used in testcases for now, but
> there could be a problem if we start using it in the real code with a timeout. 
> If the thing that calls nm_utils_kill_child_async() and passes a callback then
> gets disposed, and the callback triggers, that would be a bug.  Currently there
> is no way to easily remove a callback so that this situation doesn't happen.
> 
> I think the best solution to that would be a GCancellable passed into
> nm_utils_kill_child_async() along with the callback, so that the callback could
> check g_cancellable_is_cancelled() and just return.  Then the caller of
> nm_utils_kill_child_async() could cancel the cancellable when it was disposed
> and thus have its callback aware of the disposal.

I am not convinced, because

- in most cases, the callback won't be needed at all. Actually, I over-designed it and it's almost not used for now -- except testing :)

- in many (of the remaining) cases, this won't be a problem, because the caller either doesn't pass on anything that needs cleanup or he knows that the user_data will never be freed. I.e. the caller would even bother to use a GCancellable.

- in the other cases, the caller can
  - g_object_ref() himself
  - install a weakref to the user_data
  - pass on a GCancellable as part of the userdata himself.


Note that your GCancellable also does not "remove a callback so that this situation doesn't happen". It is really not more then passing a flag.







> Next up, do we need the is_sync variable in the the NMUtilsKillChildAsyncCb
> arguments?  I think in general we should either make everything a scheduled
> callback, or require the callback not to care whether it's sync or not.  That
> way we have very defined semantics for the callback function.  For now I think
> we should just require the callback not to care; at the moment I don't think
> anything uses that argument.

Danw opted for calling async always. I wouldn't mind this additional argument. If the caller really doesn't care, he can just ignore the @is_sync argument. I consider this a well understood pattern.




> Also, could we collapse the callback arguments for 'success' and 'child_status'
> and just pass -1 for child status?  Currently, 'success' is always FALSE when
> 'child_status' is -1, and it is never TRUE when child_status is -1.  So the
> callback would just look at child_status for success/failure perhaps?

Can the exit status as returned by wait never be -1? I think it can...





(respond to other pointer later, and rework branch later too...)

Comment 15 Dan Winship 2014-05-07 17:56:34 UTC

(In reply to comment #14)
> > Next up, do we need the is_sync variable in the the NMUtilsKillChildAsyncCb
> > arguments?  I think in general we should either make everything a scheduled
> > callback, or require the callback not to care whether it's sync or not.  That
> > way we have very defined semantics for the callback function.  For now I think
> > we should just require the callback not to care; at the moment I don't think
> > anything uses that argument.
> 
> Danw opted for calling async always.

which I still think we should do. Both because that's the standard in GLib-based programs, and because sometimes-async callbacks tend to lead to bugs (which is why the GNOME stack got rid of them).

Comment 16 Dan Williams 2014-05-07 22:07:23 UTC

(In reply to comment #15)
> (In reply to comment #14)
> > > Next up, do we need the is_sync variable in the the NMUtilsKillChildAsyncCb
> > > arguments?  I think in general we should either make everything a scheduled
> > > callback, or require the callback not to care whether it's sync or not.  That
> > > way we have very defined semantics for the callback function.  For now I think
> > > we should just require the callback not to care; at the moment I don't think
> > > anything uses that argument.
> > 
> > Danw opted for calling async always.
> 
> which I still think we should do. Both because that's the standard in
> GLib-based programs, and because sometimes-async callbacks tend to lead to bugs
> (which is why the GNOME stack got rid of them).

Yeah, that's why I was at least suggesting a cancellable.  But I'm perfectly happy to have async-always too.

Comment 17 Thomas Haller 2014-05-13 12:05:58 UTC

(In reply to comment #16)
> (In reply to comment #15)
> > (In reply to comment #14)
> > > > Next up, do we need the is_sync variable in the the NMUtilsKillChildAsyncCb
> > > > arguments?  I think in general we should either make everything a scheduled
> > > > callback, or require the callback not to care whether it's sync or not.  That
> > > > way we have very defined semantics for the callback function.  For now I think
> > > > we should just require the callback not to care; at the moment I don't think
> > > > anything uses that argument.
> > > 
> > > Danw opted for calling async always.
> > 
> > which I still think we should do. Both because that's the standard in
> > GLib-based programs, and because sometimes-async callbacks tend to lead to bugs
> > (which is why the GNOME stack got rid of them).
> 
> Yeah, that's why I was at least suggesting a cancellable.  But I'm perfectly
> happy to have async-always too.


I think, the GCancellable is independent of whether we call always async, or possibly synchronously.

I argue for omitting the GCancellable [1], because the few users who might use this functionality, can implement it themselves by setting a flag in the @user_data argument.

If we decide to always call async [2a], the @is_sync argument of course becomes useless.

If we continue to call sync if possible [2b], the @is_sync argument can be omitted with the same argument as GCancellable: users who care, can implement it themselves by setting a flag in the @user_data.


I still vote for [2b], because it saves the effort to schedule the callback async, when it's not need. danw favors [2a], and dcbw is "perfectly
happy to have async-always too".

So, I will change that... new version follows

Comment 18 Thomas Haller 2014-05-13 13:55:30 UTC

Pushed fixups.

Comment 19 Dan Williams 2014-06-04 04:09:00 UTC

Any chance you could rebase onto master?  There's a couple conflicts and I think some of the fixups don't quite squash correctly...

But anyway, in nm_utils_kill_child_async() is there a race between sending the signals to the child process, and adding the child watch handler?  If the child quits before NM gets to adding the child watch, would that cause the SIGCHLD signal to be lost?  I'm not sure there is a race, but it was just a question I had.

----

When terminating a VPN service daemon (which is async) I get:

NetworkManager[26519]: <debug> [1401854377.455213] [NetworkManagerUtils.c:381] nm_utils_kill_child_async(): kill child process 'vpnc' (26703): wait for process to terminate after sending SIGTERM (15) (send SIGKILL in 2000 milliseconds)...

(NetworkManager:26519): GLib-WARNING **: GChildWatchSource: Exit status of a child process was requested but ECHILD was received by waitpid(). Most likely the process is ignoring SIGCHLD, or some other thread is invoking waitpid() with a nonpositive first argument; either behavior can break applications that use g_child_watch_add()/g_spawn_sync() either directly or indirectly.
NetworkManager[26519]: <debug> [1401854377.458167] [NetworkManagerUtils.c:238] _kc_cb_watch_child(): kill child process 'vpnc' (26703): terminated normally with status 0 (2989 usec elapsed)
NetworkManager[26519]: <info> VPN service 'vpnc' disappeared

Any idea what that glib error is?

Comment 20 Jiri Klimes 2014-06-04 12:19:10 UTC

_kc_signal_to_string()
 - printing signal number of unhandled signals would be useful.
 - consider using SIGKILL and SIGTERM instead of numbers.


> * @sig: send first signal immediatly. Set to 0, not to sent any signal. If set the %SIGKILL,
> * we don't send a second %SIGKILL signal after %wait_before_kill_msec milliseconds.
> * Set to zero, not to send any signal.
* @sig: signal to sent initially. If 0 no signal is sent. If %SIGKILL, the second %SIGKILL signal is not
    sent after @wait_before_kill_msec milliseconds.

The branch won't compile for me (probably some issues on my side). But it would be useful to squash and rebase onto master anyway.

Comment 21 Thomas Haller 2014-06-04 16:04:35 UTC

(In reply to comment #19)
> Any chance you could rebase onto master?  There's a couple conflicts and I
> think some of the fixups don't quite squash correctly...

Right it doesn't rebase cleanly, I would like to figure out first what we agree on before squashing it all together.


> But anyway, in nm_utils_kill_child_async() is there a race between sending the
> signals to the child process, and adding the child watch handler?  If the child
> quits before NM gets to adding the child watch, would that cause the SIGCHLD
> signal to be lost?  I'm not sure there is a race, but it was just a question I
> had.

Hm, I don't think there is an error here.

First we waitpid() and exit early if the child already terminated there.
Otherwise, we g_child_watch_add() registers the handler. Even if the child terminates in between (and we receive SIGCHLD), glib simply *must* function correctly in such a case, otherwise it would not be possible to do this race-free.

I *think* glib solves this by just remembering a flag in case of SIGCHLD. Afterwards in the main loop, it will iterate over all watched childs and call waitpid on each of them. So, this should work correctly.



> When terminating a VPN service daemon (which is async) I get:
> 
> NetworkManager[26519]: <debug> [1401854377.455213] [NetworkManagerUtils.c:381]
> nm_utils_kill_child_async(): kill child process 'vpnc' (26703): wait for
> process to terminate after sending SIGTERM (15) (send SIGKILL in 2000
> milliseconds)...
> 
> (NetworkManager:26519): GLib-WARNING **: GChildWatchSource: Exit status of a
> child process was requested but ECHILD was received by waitpid(). Most likely
> the process is ignoring SIGCHLD, or some other thread is invoking waitpid()
> with a nonpositive first argument; either behavior can break applications that
> use g_child_watch_add()/g_spawn_sync() either directly or indirectly.
> NetworkManager[26519]: <debug> [1401854377.458167] [NetworkManagerUtils.c:238]
> _kc_cb_watch_child(): kill child process 'vpnc' (26703): terminated normally
> with status 0 (2989 usec elapsed)
> NetworkManager[26519]: <info> VPN service 'vpnc' disappeared
> 
> Any idea what that glib error is?

Interesting. Will investigate. It would look like, somebody already reaped the child (supposing we did not block SIGCHLD and nobody calls waitpid(-1)).





(In reply to comment #20)
> _kc_signal_to_string()
>  - printing signal number of unhandled signals would be useful.

That was removed in a previous fixup. It saves LOC and strings in the binary. And really, nobody is ever going to kill a child with something else then 0, SIGTERM or SIGKILL. And if we ever need to, just extend the function (which is NM-internal only).

>  - consider using SIGKILL and SIGTERM instead of numbers.

Done.


> > * @sig: send first signal immediatly. Set to 0, not to sent any signal. If set the %SIGKILL,
> > * we don't send a second %SIGKILL signal after %wait_before_kill_msec milliseconds.
> > * Set to zero, not to send any signal.
> * @sig: signal to sent initially. If 0 no signal is sent. If %SIGKILL, the
> second %SIGKILL signal is not
>     sent after @wait_before_kill_msec milliseconds.

Done.



> The branch won't compile for me (probably some issues on my side). But it would
> be useful to squash and rebase onto master anyway.

Will do

Comment 22 Dan Williams 2014-06-04 19:14:23 UTC

(In reply to comment #21)
> (In reply to comment #19)
> > Any chance you could rebase onto master?  There's a couple conflicts and I
> > think some of the fixups don't quite squash correctly...
> 
> Right it doesn't rebase cleanly, I would like to figure out first what we agree
> on before squashing it all together.

Sure.

> 
> > But anyway, in nm_utils_kill_child_async() is there a race between sending the
> > signals to the child process, and adding the child watch handler?  If the child
> > quits before NM gets to adding the child watch, would that cause the SIGCHLD
> > signal to be lost?  I'm not sure there is a race, but it was just a question I
> > had.
> 
> Hm, I don't think there is an error here.
> 
> First we waitpid() and exit early if the child already terminated there.
> Otherwise, we g_child_watch_add() registers the handler. Even if the child
> terminates in between (and we receive SIGCHLD), glib simply *must* function
> correctly in such a case, otherwise it would not be possible to do this
> race-free.
> 
> I *think* glib solves this by just remembering a flag in case of SIGCHLD.
> Afterwards in the main loop, it will iterate over all watched childs and call
> waitpid on each of them. So, this should work correctly.

Yes, you're right.  unix_signal_pending in gmain.c is the magic array.

> > When terminating a VPN service daemon (which is async) I get:
> > 
> > NetworkManager[26519]: <debug> [1401854377.455213] [NetworkManagerUtils.c:381]
> > nm_utils_kill_child_async(): kill child process 'vpnc' (26703): wait for
> > process to terminate after sending SIGTERM (15) (send SIGKILL in 2000
> > milliseconds)...
> > 
> > (NetworkManager:26519): GLib-WARNING **: GChildWatchSource: Exit status of a
> > child process was requested but ECHILD was received by waitpid(). Most likely
> > the process is ignoring SIGCHLD, or some other thread is invoking waitpid()
> > with a nonpositive first argument; either behavior can break applications that
> > use g_child_watch_add()/g_spawn_sync() either directly or indirectly.
> > NetworkManager[26519]: <debug> [1401854377.458167] [NetworkManagerUtils.c:238]
> > _kc_cb_watch_child(): kill child process 'vpnc' (26703): terminated normally
> > with status 0 (2989 usec elapsed)
> > NetworkManager[26519]: <info> VPN service 'vpnc' disappeared
> > 
> > Any idea what that glib error is?
> 
> Interesting. Will investigate. It would look like, somebody already reaped the
> child (supposing we did not block SIGCHLD and nobody calls waitpid(-1)).

I didn't have time to track it down last night, but would be good to figure out what's going on here.  This only happened with asynchronous child kill, the sync tests I did were fine. So maybe some logic issue in the _async() bits?

---

So after the VPN SIGCHLD error gets fixed, I'm happy with the state of the branch.

Comment 23 Thomas Haller 2014-06-05 13:20:57 UTC

Rebased an master, all commits squashed together.

remaining issues:

- still investigate the VPN SIGCHLD error (e.g. do more testing).
- merge after th/bgo706293_platform_tests and move test-general over to NMTST, so that we can control the logging output.
- find a way to ensure that all child process are killed if the test fails.

Comment 24 Thomas Haller 2014-06-05 16:59:11 UTC

(In reply to comment #23)
> Rebased an master, all commits squashed together.
> 
> remaining issues:
> 
> - still investigate the VPN SIGCHLD error (e.g. do more testing).

still TODO

> - merge after th/bgo706293_platform_tests and move test-general over to NMTST,
> so that we can control the logging output.

Done.

> - find a way to ensure that all child process are killed if the test fails.

still TODO, but now all child processes have "nm_test_kill_child_process" in their command line. So at least you can grep for the remaining processes.

Comment 25 Dan Williams 2014-06-12 21:11:48 UTC

Any luck on testing the VPN thing?

Comment 26 Thomas Haller 2014-06-14 00:48:26 UTC

(In reply to comment #24)
> (In reply to comment #23)
> > Rebased an master, all commits squashed together.
> > 
> > remaining issues:
> > 
> > - still investigate the VPN SIGCHLD error (e.g. do more testing).
> 

The problem was that we called nm_utils_kill_child_async() for VPN without
g_source_remove() the child watcher. As glib only supports one watcher per child (at a time), this was a bug.

Since then, killing and watching VPN processes was removed from master.


I also added a code comment to nm_utils_kill_child_async().


> > - find a way to ensure that all child process are killed if the test fails.
> 
> still TODO, but now all child processes have "nm_test_kill_child_process" in
> their command line. So at least you can grep for the remaining processes.

Done. Test now places all child processes in a separate process group and starts a watchdog process to kill the entire group after 4 seconds as fallback.


Rebased to master. Please re-review.

Comment 27 Dan Williams 2014-07-01 21:45:11 UTC

Looks good to me.  Appears to work for DHCP and PPP, I did not test dnsmasq for sharing or DNS plugins though.  I'll assume you have...

Comment 28 Thomas Haller 2014-07-10 14:40:26 UTC

Merged as http://cgit.freedesktop.org/NetworkManager/NetworkManager/commit/?id=397762c2c7211dc804d8ba7e2f1c7be8247aa0d0

I did test dnsmasq. All seems good, no issues seen.


Before merging I did another change however. Originally the tests for kill-child were in src/tests/test-general.c , which I modified to use nm-test-utils.h.

However, I wanted to assert also against the logged messages (which has the additional advantage, that a normal test run is silent). Mixing tests with expect-message and without does not work well because we setup nmtst_init() differently depending on the mode.

So instead I added the tests to a new file src/tests/test-general-with-expect.c . This file contains general-purpose tests that use g_test_expect_message() and g_test_assert_expected_messages().


So, that's it :)