GNOME Bugzilla – Bug 728342
[review] dcbw/dns-plugin-children-bgo728342: avoid child restart race, respawn child on exit
Last modified: 2015-03-27 21:27:28 UTC
Version: NetworkManager.i686 1:0.9.9.1-4.git20140319.fc20 rhughes-f20-gnome-3-12-i386 I have the option dns=dnsmasq in networkManager.conf What happens: if I do kill dnsmasq, then dnsmasq is gone, and /run/NetworkManager/dnsmasq.conf disappear What should happen: if I do kill dnsmasq, then dnsmasq is respawned by netWorkManager and /run/NetworkManager/dnsmasq.conf stay
Related: There seems to be a race condition when NetworkManager restarts dnsmasq due to the establishment of a VPN connection. aug 27 20:35:03 e6520 NetworkManager[895]: <info> VPN connection 'XXX' (IP Config Get) complete. aug 27 20:35:03 e6520 NetworkManager[895]: <info> DNS: starting dnsmasq... aug 27 20:35:03 e6520 dnsmasq[1127]: exiting on receipt of SIGTERM aug 27 20:35:03 e6520 NetworkManager[895]: dnsmasq: failed to create listening socket for 127.0.0.1: Address already in use aug 27 20:35:03 e6520 dnsmasq[2843]: failed to create listening socket for 127.0.0.1: Address already in use aug 27 20:35:03 e6520 dnsmasq[2843]: FAILED to start up When this happens resolv.conf reverts to using the upstream DNS servers, and therefore fails to properly direct queries to different DNS servers as dnsmasq is capable of.
confirming a problem with a race condition, dnsmasq fails to start, because previous dnsmasq instance didn't free the port: Oct 25 21:17:55 archi dnsmasq[8949]: exiting on receipt of SIGTERM Oct 25 21:17:55 archi NetworkManager[375]: <info> DNS: starting dnsmasq... Oct 25 21:17:55 archi dnsmasq[9094]: failed to create listening socket for 127.0.0.1: Address already in use Oct 25 21:17:55 archi dnsmasq[9094]: FAILED to start up Oct 25 21:17:55 archi NetworkManager[375]: <info> Writing DNS information to /usr/bin/resolvconf Oct 25 21:17:55 archi NetworkManager[375]: dnsmasq: failed to create listening socket for 127.0.0.1: Address already in use Oct 25 21:17:56 archi NetworkManager[375]: <info> Activation (wlp3s0) successful, device activated. Oct 25 21:17:56 archi NetworkManager[375]: <warn> dnsmasq exited with error: Network access problem (address in use; permissions; etc) (2) Oct 25 21:17:56 archi NetworkManager[375]: <info> Writing DNS information to /usr/bin/resolvconf networkmanager version is 0.9.10.0
looks like a race condition problem commit is http://cgit.freedesktop.org/NetworkManager/NetworkManager/diff/src/dnsmasq-manager/nm-dnsmasq-manager.c?id=397762c2c7211dc804d8ba7e2f1c7be8247aa0d0 the networkmanager does not wait dnsmasq to exist, should waitpid be used here?
(In reply to comment #3) > looks like a race condition problem commit is > http://cgit.freedesktop.org/NetworkManager/NetworkManager/diff/src/dnsmasq-manager/nm-dnsmasq-manager.c?id=397762c2c7211dc804d8ba7e2f1c7be8247aa0d0 > > the networkmanager does not wait dnsmasq to exist, > > should waitpid be used here? Note that nm-0-9-10 does not contain nm_utils_kill_child_async() or the mentioned patch. But that doesn't matter, because commit [1] did _not_ change behaviour. Already before NM would not wait for the process to be killed. Hence, any race was already present before. [1] http://cgit.freedesktop.org/NetworkManager/NetworkManager/commit/?id=ff3b7538578454bd8cd9a0800588631b3f6a609e Yes, NM kills the child process without waiting for it. And when restarting, it does not ensure that the child process is really gone. But do you really see the issue because of a child-process that did not terminate in time [A]? That would be the case if you re-activate a connection. Or is it that you restart NM (which keeps dnsmasq running), and ~then~ kill_existing_for_iface() fails to kill the running (non-child) process? [B] I pushed th/bgo728342_kill_dnsmask, but that will not help with [A]. And is still ugly, even if it probably works well to avoid [B]. A debug log would be helpful.
I'm seeing the problem when I active a VPN connection (that was not previously active).
(In reply to comment #4) > A debug log would be helpful. Yes. the problem was before that commit. debug log: http://pastebin.com/6M8yqTWh
Also a point if no network connection, dnsmasq is not started by NetworkManager, but I think it should work, because a user can use it for resolving local addresses. currently if no network connections my resolve.conf looks like: ᐅ cat /etc/resolv.conf # Generated by resolvconf this mean my ".dev" domain names (I am using them for development purpose) resolve doesn't work.
I'm seeing this problem when my Wi-fi network connects on startup: dnsmasq is started, killed, then fails to come back up. I've reproduced this 4 out of 4 times when restarting NetworkManager. I'm on Fedora 21 with NetworkMManager 0.9.10.0-13.git20140704.
Created attachment 290576 [details] dnsmasq failure log
The original bug here was that manually killing dnsmasq doesn't result in NM respawning it. The rest of the comments here are about a race condition where NM tries to restart dnsmasq too quickly under normal operation, when updating DNS information. I'm going to pick the second interpretation since that's an actual bug, and the original report was originally intended (though perhaps no longer wanted) behavior.
Created attachment 294555 [details] [review] 0001-dns-kill-plugin-child-synchronously-to-avoid-restart.patch NM was killing the dnsmasq local caching nameserver process and immediately starting a new one, and new process couldn't bind to 127.0.0.1 because the old one hadn't quit yet. Thus the new process quit, and the user was left with no split DNS at all. While this does introduce more synchronous waiting into the connection process, it's not that much time and NM will kill dnsmasq if it hasn't quit after 1 second. The longer-term fix is to use dnsmasq's D-Bus interface to update DNS without respawning it.
Created attachment 294559 [details] [review] 0002-dns-refresh-DNS-if-plugin-child-quits-unexpectedly-b.patch While we're at it, respawn the child if it quits unexpectedly.
(In reply to comment #12) > Created an attachment (id=294559) [details] [review] > 0002-dns-refresh-DNS-if-plugin-child-quits-unexpectedly-b.patch maybe do the refactoring with build_plugin_config_lists() in a separate commit? otherwise, both patches LGTM
(In reply to comment #13) > (In reply to comment #12) > > Created an attachment (id=294559) [details] [review] [details] [review] > > 0002-dns-refresh-DNS-if-plugin-child-quits-unexpectedly-b.patch > > maybe do the refactoring with build_plugin_config_lists() in a separate commit? > > otherwise, both patches LGTM Done, pushed as dcbw/dns-plugin-children-bgo728342
With this change, we kill the child process synchronously and respawn dnsmasq. Although not an optimal solution, I'd say it qualifies as "RESOLVED", doesn't it?
Maybe I don't understand NetworkManager git conventions, but I don't see this fix on master? e.g. http://cgit.freedesktop.org/NetworkManager/NetworkManager/log/src/dns-manager/nm-dns-manager.c - that file is touched by the patches but there's no trace of this fix in its history.
(In reply to Adam Williamson from comment #16) > Maybe I don't understand NetworkManager git conventions, but I don't see > this fix on master? Yeah, it hasn't been committed yet. (I'm not sure what Thomas meant when he said 'it qualifies as "RESOLVED"'. I guess there was an implied "if we commit this".)
> dns: refresh DNS if plugin child quits unexpectedly (bgo #728342) >+ nm_log_warn (LOGD_DNS, "could not commit DNS changes: (%d) %s", >+ error ? error->code : -1, >+ error && error->message ? error->message : "(unknown)"); nm_log_warn (LOGD_DNS, "could not commit DNS changes: %s", error->message); everything else looks good
(In reply to Dan Winship from comment #17) > (In reply to Adam Williamson from comment #16) > > Maybe I don't understand NetworkManager git conventions, but I don't see > > this fix on master? > > Yeah, it hasn't been committed yet. (I'm not sure what Thomas meant when he > said 'it qualifies as "RESOLVED"'. I guess there was an implied "if we > commit this".) I don't remember ... but I agree with your interpretation :)
I rebased and repushed the branch. To handle danw's requested error ? XXX cleanup, I had to add a new commit that ensures that when update_dns() returns FALSE, 'error' is always set, which wasn't the case before. Please review!
>> dns: ensure that update_dns() always returns a GError on failure maybe nm_spawn_process() should use a @local error variable, and continue to log a warning in case of error -- at least if @error==NULL. Having a logging about failure seems useful. And then g_propagate_error(error, local); - retval &= (pclose (f) == 0); + (void) pclose (f); Why not signal an error if pclose() fails? Or maybe a comment why we ignore it? otherwise LGTM
Fixed those up and repushed.
> fixup! dns: ensure that update_dns() always returns a GError on failure >+ /* Ignore pclose() failure; can't do anything about it */ >+ err = pclose (f); >+ if (err < 0) { >+ errnosv = errno; >+ g_set_error (error, G_IO_ERROR, g_io_error_from_errno (errnosv), >+ "Failed to close pipe to resolvconf: %d", errnosv); >+ retval = FALSE; >+ } else if (ret > 0) { >+ nm_log_warn (LOGD_DNS, "resolvconf failed with status %d", ret); The comment appears to be lying, and "ret" in the last two lines is probably supposed to be "err"?
Fixed that up and pushed to git master and backported to nm-1-0.