GNOME Bugzilla – Bug 737731
[review] fix killing extenal dhcp process by not waitpid for it (th/bgo737731-kill-process)
Last modified: 2014-10-10 10:37:06 UTC
When assuming an existing connection, NM later tries to kill the dhcp process. It does so by calling nm_utils_kill_child_sync(). But the process is not a child process of NM. Therefore we cannot reap the process and waitpid will fail. This causes NM to block for 500 ms and to log an error such as: <debug> [1412167360.400201] [NetworkManagerUtils.c:534] nm_utils_kill_child_sync(): kill child process 'dhcp-client' (7109): waiting up to 500 milliseconds for process to terminate normally afte <debug> [1412167360.900298] [NetworkManagerUtils.c:549] nm_utils_kill_child_sync(): kill child process 'dhcp-client' (7109): sending SIGKILL... <error> [1412167360.900369] [NetworkManagerUtils.c:576] nm_utils_kill_child_sync(): kill child process 'dhcp-client' (7109): after sending SIGTERM (15) and SIGKILL, waitpid failed with No child Please review
Should we really be calling the get_start_time_for_pid() in the wait loop? Seems like a lot of filesystem operation when we can just rely on ESRCH from kill() just below? Also, why do we care about the start time of the previous DHCP client process when killing it?
(In reply to comment #1) > Should we really be calling the get_start_time_for_pid() in the wait loop? > Seems like a lot of filesystem operation when we can just rely on ESRCH from > kill() just below? I think that nm_utils_get_start_time_for_pid() is still very cheap -- especially as NM is sleeping for milliseconds(!) while waiting for the process to terminate... so, it doesn't hurt, does it? > Also, why do we care about the start time of the previous DHCP client process > when killing it? The idea is that PID+start_time identifies a process (more) uniquely. It should help to avoid killing the wrong process -- in case the $PID gets reused. (yes, there is still a potential race).
(In reply to comment #2) > (In reply to comment #1) > > Should we really be calling the get_start_time_for_pid() in the wait loop? > > Seems like a lot of filesystem operation when we can just rely on ESRCH from > > kill() just below? > > I think that nm_utils_get_start_time_for_pid() is still very cheap -- > especially as NM is sleeping for milliseconds(!) while waiting for the process > to terminate... so, it doesn't hurt, does it? It's still a bunch of cycles, but I suppose there's a small race if the child process just quit and something is spawning a lot of processes and PIDs have wrapped or something. > > Also, why do we care about the start time of the previous DHCP client process > > when killing it? > > The idea is that PID+start_time identifies a process (more) uniquely. It should > help to avoid killing the wrong process -- in case the $PID gets reused. > (yes, there is still a potential race). Ok. While you're in the nm-dhcp-client.c code anyway, the method that it uses to check the process command line is faulty. It's checking strrchr on cmdline, but for example: $ cat /proc/2835/cmdline tail-f/var/log/messages is obviously going to return the wrong result. So instead, we can use "/proc/XXX/comm" and just strcmp the binary name. I'd say lets fix this in a separate commit, or in "core/dhcp: kill external dhcp process using nm_utils_kill_process_sync()" if you want.
apparently, comm is truncated to 16 characters and is not always available (depending on kernel version/build-options). But I think, current code is correct because the individual parts of cmdline are separated by '\0'. g_file_get_contents() reads all all of $0$@, but as we interpret @proc_contents as NULL-terminated string it is still correct. ... I rebased the branch to master, added another fixup! and one new commit: >> dhcp: relax check valid PID for dhcp process
> core: add nm_utils_kill_process_sync() function + sleep_duration_usec = (sleep_duration_msec <= 0) ? (G_USEC_PER_SEC / 20) : MIN (G_MAXULONG, ((gint64) sleep_duration_msec) * 1000L); sleep_duration_msec is guint32 so the "<= 0" will warn in coverity. Also, since sleep_duration_usec is 'gulong' should the cast in the MIN() be 'glong' instead of gint64? Or maybe just make sleep_time/sleep_duration_usec gint64 instead of glong/gulong? The rest looks good to me.
merged as: http://cgit.freedesktop.org/NetworkManager/NetworkManager/commit/?id=a6b4d544996ad565cb0e175e78f0670618243d0c