After an evaluation, GNOME has moved from Bugzilla to GitLab. Learn more about GitLab.
No new issues can be reported in GNOME Bugzilla anymore.
To report an issue in a GNOME project, go to GNOME GitLab.
Do not go to GNOME Gitlab for: Bluefish, Doxygen, GnuCash, GStreamer, java-gnome, LDTP, NetworkManager, Tomboy.
Bug 728342 - [review] dcbw/dns-plugin-children-bgo728342: avoid child restart race, respawn child on exit
[review] dcbw/dns-plugin-children-bgo728342: avoid child restart race, respaw...
Status: RESOLVED FIXED
Product: NetworkManager
Classification: Platform
Component: general
unspecified
Other Linux
: Normal normal
: ---
Assigned To: NetworkManager maintainer(s)
NetworkManager maintainer(s)
Depends on:
Blocks: nm-review
 
 
Reported: 2014-04-16 14:00 UTC by CircleCode
Modified: 2015-03-27 21:27 UTC
See Also:
GNOME target: ---
GNOME version: ---


Attachments
dnsmasq failure log (2.99 KB, text/plain)
2014-11-13 04:27 UTC, Daniel Miranda
  Details
0001-dns-kill-plugin-child-synchronously-to-avoid-restart.patch (1.74 KB, patch)
2015-01-14 22:39 UTC, Dan Williams
none Details | Review
0002-dns-refresh-DNS-if-plugin-child-quits-unexpectedly-b.patch (6.56 KB, patch)
2015-01-14 23:04 UTC, Dan Williams
none Details | Review

Description CircleCode 2014-04-16 14:00:37 UTC
Version:
NetworkManager.i686   1:0.9.9.1-4.git20140319.fc20 rhughes-f20-gnome-3-12-i386

I have the option dns=dnsmasq in networkManager.conf

What happens:
if I do kill dnsmasq, then dnsmasq is gone, and /run/NetworkManager/dnsmasq.conf disappear

What should happen:
if I do kill dnsmasq, then dnsmasq is respawned by netWorkManager and /run/NetworkManager/dnsmasq.conf stay
Comment 1 Oskar Berggren 2014-08-27 18:50:47 UTC
Related:

There seems to be a race condition when NetworkManager restarts dnsmasq due to the establishment of a VPN connection.

aug 27 20:35:03 e6520 NetworkManager[895]: <info> VPN connection 'XXX' (IP Config Get) complete.
aug 27 20:35:03 e6520 NetworkManager[895]: <info> DNS: starting dnsmasq...
aug 27 20:35:03 e6520 dnsmasq[1127]: exiting on receipt of SIGTERM
aug 27 20:35:03 e6520 NetworkManager[895]: dnsmasq: failed to create listening socket for 127.0.0.1: Address already in use
aug 27 20:35:03 e6520 dnsmasq[2843]: failed to create listening socket for 127.0.0.1: Address already in use
aug 27 20:35:03 e6520 dnsmasq[2843]: FAILED to start up


When this happens resolv.conf reverts to using the upstream DNS servers, and therefore fails to properly direct queries to different DNS servers as dnsmasq is capable of.
Comment 2 Maksim Kraev 2014-10-25 17:35:24 UTC
confirming a problem with a race condition, dnsmasq fails to start, because previous dnsmasq instance didn't free the port:

Oct 25 21:17:55 archi dnsmasq[8949]: exiting on receipt of SIGTERM
Oct 25 21:17:55 archi NetworkManager[375]: <info> DNS: starting dnsmasq...
Oct 25 21:17:55 archi dnsmasq[9094]: failed to create listening socket for 127.0.0.1: Address already in use
Oct 25 21:17:55 archi dnsmasq[9094]: FAILED to start up
Oct 25 21:17:55 archi NetworkManager[375]: <info> Writing DNS information to /usr/bin/resolvconf
Oct 25 21:17:55 archi NetworkManager[375]: dnsmasq: failed to create listening socket for 127.0.0.1: Address already in use
Oct 25 21:17:56 archi NetworkManager[375]: <info> Activation (wlp3s0) successful, device activated.
Oct 25 21:17:56 archi NetworkManager[375]: <warn> dnsmasq exited with error: Network access problem (address in use; permissions; etc) (2)
Oct 25 21:17:56 archi NetworkManager[375]: <info> Writing DNS information to /usr/bin/resolvconf

networkmanager version is 0.9.10.0
Comment 3 Maksim Kraev 2014-10-29 04:25:39 UTC
looks like a race condition problem commit is http://cgit.freedesktop.org/NetworkManager/NetworkManager/diff/src/dnsmasq-manager/nm-dnsmasq-manager.c?id=397762c2c7211dc804d8ba7e2f1c7be8247aa0d0

the networkmanager does not wait dnsmasq to exist,

should waitpid be used here?
Comment 4 Thomas Haller 2014-10-29 14:13:48 UTC
(In reply to comment #3)
> looks like a race condition problem commit is
> http://cgit.freedesktop.org/NetworkManager/NetworkManager/diff/src/dnsmasq-manager/nm-dnsmasq-manager.c?id=397762c2c7211dc804d8ba7e2f1c7be8247aa0d0
> 
> the networkmanager does not wait dnsmasq to exist,
> 
> should waitpid be used here?

Note that nm-0-9-10 does not contain nm_utils_kill_child_async() or the mentioned patch. But that doesn't matter, because commit [1] did _not_ change behaviour. Already before NM would not wait for the process to be killed.

Hence, any race was already present before.

[1] http://cgit.freedesktop.org/NetworkManager/NetworkManager/commit/?id=ff3b7538578454bd8cd9a0800588631b3f6a609e 



Yes, NM kills the child process without waiting for it. And when restarting, it does not ensure that the child process is really gone.


But do you really see the issue because of a child-process that did not terminate in time [A]? That would be the case if you re-activate a connection.

Or is it that you restart NM (which keeps dnsmasq running), and ~then~ kill_existing_for_iface() fails to kill the running (non-child) process? [B]



I pushed th/bgo728342_kill_dnsmask, but that will not help with [A].
And is still ugly, even if it probably works well to avoid [B].



A debug log would be helpful.
Comment 5 Oskar Berggren 2014-10-29 14:56:54 UTC
I'm seeing the problem when I active a VPN connection (that was not previously active).
Comment 6 Maksim Kraev 2014-10-30 03:46:27 UTC
(In reply to comment #4)

> A debug log would be helpful.
Yes. the problem was before that commit.

debug log: http://pastebin.com/6M8yqTWh
Comment 7 Maksim Kraev 2014-10-30 03:57:22 UTC
Also a point if no network connection, dnsmasq is not started by NetworkManager, but I think it should work, because a user can use it for resolving local addresses.

currently if no network connections my resolve.conf looks like:
ᐅ cat /etc/resolv.conf
# Generated by resolvconf

this mean my ".dev" domain names (I am using them for development purpose) resolve doesn't work.
Comment 8 Daniel Miranda 2014-11-13 04:25:12 UTC
I'm seeing this problem when my Wi-fi network connects on startup: dnsmasq is started, killed, then fails to come back up. I've reproduced this 4 out of 4 times when restarting NetworkManager. I'm on Fedora 21 with NetworkMManager 0.9.10.0-13.git20140704.
Comment 9 Daniel Miranda 2014-11-13 04:27:00 UTC
Created attachment 290576 [details]
dnsmasq failure log
Comment 10 Dan Williams 2015-01-14 22:33:40 UTC
The original bug here was that manually killing dnsmasq doesn't result in NM respawning it.

The rest of the comments here are about a race condition where NM tries to restart dnsmasq too quickly under normal operation, when updating DNS information.

I'm going to pick the second interpretation since that's an actual bug, and the original report was originally intended (though perhaps no longer wanted) behavior.
Comment 11 Dan Williams 2015-01-14 22:39:05 UTC
Created attachment 294555 [details] [review]
0001-dns-kill-plugin-child-synchronously-to-avoid-restart.patch

NM was killing the dnsmasq local caching nameserver process and immediately
starting a new one, and new process couldn't bind to 127.0.0.1 because the
old one hadn't quit yet.  Thus the new process quit, and the user was
left with no split DNS at all.

While this does introduce more synchronous waiting into the connection
process, it's not that much time and NM will kill dnsmasq if it hasn't
quit after 1 second.  The longer-term fix is to use dnsmasq's D-Bus
interface to update DNS without respawning it.
Comment 12 Dan Williams 2015-01-14 23:04:20 UTC
Created attachment 294559 [details] [review]
0002-dns-refresh-DNS-if-plugin-child-quits-unexpectedly-b.patch

While we're at it, respawn the child if it quits unexpectedly.
Comment 13 Thomas Haller 2015-01-15 15:07:15 UTC
(In reply to comment #12)
> Created an attachment (id=294559) [details] [review]
> 0002-dns-refresh-DNS-if-plugin-child-quits-unexpectedly-b.patch

maybe do the refactoring with build_plugin_config_lists() in a separate commit?

otherwise, both patches LGTM
Comment 14 Dan Williams 2015-01-15 17:40:31 UTC
(In reply to comment #13)
> (In reply to comment #12)
> > Created an attachment (id=294559) [details] [review] [details] [review]
> > 0002-dns-refresh-DNS-if-plugin-child-quits-unexpectedly-b.patch
> 
> maybe do the refactoring with build_plugin_config_lists() in a separate commit?
> 
> otherwise, both patches LGTM

Done, pushed as dcbw/dns-plugin-children-bgo728342
Comment 15 Thomas Haller 2015-01-18 20:07:05 UTC
With this change, we kill the child process synchronously and respawn dnsmasq. Although not an optimal solution, I'd say it qualifies as "RESOLVED", doesn't it?
Comment 16 Adam Williamson 2015-02-18 19:51:15 UTC
Maybe I don't understand NetworkManager git conventions, but I don't see this fix on master? e.g. http://cgit.freedesktop.org/NetworkManager/NetworkManager/log/src/dns-manager/nm-dns-manager.c - that file is touched by the patches but there's no trace of this fix in its history.
Comment 17 Dan Winship 2015-02-25 15:53:57 UTC
(In reply to Adam Williamson from comment #16)
> Maybe I don't understand NetworkManager git conventions, but I don't see
> this fix on master?

Yeah, it hasn't been committed yet. (I'm not sure what Thomas meant when he said 'it qualifies as "RESOLVED"'. I guess there was an implied "if we commit this".)
Comment 18 Dan Winship 2015-02-25 16:00:34 UTC
> dns: refresh DNS if plugin child quits unexpectedly (bgo #728342)

>+		nm_log_warn (LOGD_DNS, "could not commit DNS changes: (%d) %s",
>+		             error ? error->code : -1,
>+		             error && error->message ? error->message : "(unknown)");

    nm_log_warn (LOGD_DNS, "could not commit DNS changes: %s", error->message);

everything else looks good
Comment 19 Thomas Haller 2015-02-25 17:51:21 UTC
(In reply to Dan Winship from comment #17)
> (In reply to Adam Williamson from comment #16)
> > Maybe I don't understand NetworkManager git conventions, but I don't see
> > this fix on master?
> 
> Yeah, it hasn't been committed yet. (I'm not sure what Thomas meant when he
> said 'it qualifies as "RESOLVED"'. I guess there was an implied "if we
> commit this".)

I don't remember ... but I agree with your interpretation :)
Comment 20 Dan Williams 2015-02-26 21:14:01 UTC
I rebased and repushed the branch.

To handle danw's requested error ? XXX cleanup, I had to add a new commit that ensures that when update_dns() returns FALSE, 'error' is always set, which wasn't the case before.

Please review!
Comment 21 Thomas Haller 2015-03-02 11:37:10 UTC
>> dns: ensure that update_dns() always returns a GError on failure
    
maybe nm_spawn_process() should use a @local error variable, and continue to log a warning in case of error -- at least if @error==NULL.

Having a logging about failure seems useful.

And then g_propagate_error(error, local);





-              retval &= (pclose (f) == 0);
+              (void) pclose (f);

Why not signal an error if pclose() fails? Or maybe a comment why we ignore it?



otherwise LGTM
Comment 22 Dan Williams 2015-03-09 16:03:03 UTC
Fixed those up and repushed.
Comment 23 Dan Winship 2015-03-20 20:52:27 UTC
> fixup! dns: ensure that update_dns() always returns a GError on failure

>+			/* Ignore pclose() failure; can't do anything about it */
>+			err = pclose (f);
>+			if (err < 0) {
>+				errnosv = errno;
>+				g_set_error (error, G_IO_ERROR, g_io_error_from_errno (errnosv),
>+				             "Failed to close pipe to resolvconf: %d", errnosv);
>+				retval = FALSE;
>+			} else if (ret > 0) {
>+				nm_log_warn (LOGD_DNS, "resolvconf failed with status %d", ret);

The comment appears to be lying, and "ret" in the last two lines is probably supposed to be "err"?
Comment 24 Dan Williams 2015-03-27 21:27:20 UTC
Fixed that up and pushed to git master and backported to nm-1-0.