After an evaluation, GNOME has moved from Bugzilla to GitLab. Learn more about GitLab.
No new issues can be reported in GNOME Bugzilla anymore.
To report an issue in a GNOME project, go to GNOME GitLab.
Do not go to GNOME Gitlab for: Bluefish, Doxygen, GnuCash, GStreamer, java-gnome, LDTP, NetworkManager, Tomboy.
Bug 765153 - dns=dnsmasq doesn't play well with multiple connections
dns=dnsmasq doesn't play well with multiple connections
Status: RESOLVED FIXED
Product: NetworkManager
Classification: Platform
Component: IP and DNS config
git master
Other Linux
: Normal normal
: ---
Assigned To: NetworkManager maintainer(s)
NetworkManager maintainer(s)
Depends on:
Blocks: nm-review
 
 
Reported: 2016-04-16 18:33 UTC by Tore Anderson
Modified: 2016-05-18 09:22 UTC
See Also:
GNOME target: ---
GNOME version: ---


Attachments
dns: specify egress interface for each dnsmasq upstream server (4.85 KB, patch)
2016-04-19 14:49 UTC, Beniamino Galvani
none Details | Review

Description Tore Anderson 2016-04-16 18:33:11 UTC
NM will configure dnsmasq with all available upstream servers from all active connections, however it will do nothing to ensure that connection X is being used to transport the DNS requests sent to DNS server X. That is, if connection Y is more preferred, DNS server X will be contacted through connection Y. In that case, DNS server X might refuse the request, because it only allows requests arriving from a certain whitelisted networks.

Dnsmasq forwards an incoming request to all configured DNS servers simultaneously, and will respond back with the first answer that comes back. Thus, if upstream server's refusal is the first response, the stub resolver will also receive a refusal, and the user will see a "hostname not found" error or something similar. In the event of cold caches, it is very likely that the refusal is the first answer that comes back, because it does not require any further processing. A server that's willing to actually process the request will by comparsion need to contact the authoritative servers for the domain in order to do so, which takes time.

Below you'll find a concrete example of this happening. I have two active connections, wifi and wwan.

/var/run/NetworkManager/dnsmasq.conf contains the following:

  server=10.0.79.14
  server=fe80::120d:7fff:fe6d:a593%wlo1
  server=2001:4600:4:fff::52

The first two servers are learned from the wifi, while the third is learned from the wwan. The connection metrics are at their defaults, so the wifi is preferred for outgoing packets. The two connections are served by different ISPs, and crucially, 2001:4600:4:fff::52 is not usable for customers of the wwan ISP.

Reproducing the problem:

$ host -t CNAME foobar.github.io.
Host foobar.github.io not found: 5(REFUSED)

If you look at a packet capture, this is what happens:

111  49.897601    127.0.0.1 -> 127.0.0.1    DNS 78 Standard query 0xc81a  CNAME foobar.github.io

The above packet is /usr/bin/host's request to the locally running dnsmasq process.

112  49.897745 2a02:fe0:c420:1c97:95e4:8bd5:c947:9f40 -> 2001:4600:4:fff::52 DNS 98 Standard query 0x340b  CNAME foobar.github.io
113  49.897794 fe80::eed8:dfa6:862e:3e32 -> fe80::120d:7fff:fe6d:a593 DNS 98 Standard query 0x340b  CNAME foobar.github.io
114  49.897841  10.0.79.155 -> 10.0.79.14   DNS 78 Standard query 0x340b  CNAME foobar.github.io

The dnsmasq process forwards the request to all its configured upstream servers more-or-less simultaneously. It is important to note that all of these packets went out the wifi interface; the wwan one was not used at all.

115  49.906245 2001:4600:4:fff::52 -> 2a02:fe0:c420:1c97:95e4:8bd5:c947:9f40 DNS 98 Standard query response 0x340b Refused

The response from the wwan ISP is the first one, and it's a refusal - because it originated from a source IP address belonging to the wifi ISP.

116  49.906421    127.0.0.1 -> 127.0.0.1    DNS 78 Standard query response 0xc81a Refused

Since dnsmsaq now has an answer, it returns that answer to the "host" command, and the failure was returned back to the user.

117  49.931072   10.0.79.14 -> 10.0.79.155  DNS 113 Standard query response 0x340b  CNAME github.map.fastly.net
118  49.934423 fe80::120d:7fff:fe6d:a593 -> fe80::eed8:dfa6:862e:3e32 DNS 133 Standard query response 0x340b  CNAME github.map.fastly.net

Not long after, the wifi ISP's DNS servers finished resolving the request (actually it was handled by forwarders in the wireless home gateway, but that's not really relevant here). However it is too late to "fix" the failure that was already returned to the caller.

If however I retry the request a few seconds later, it succeeds:

$ host -t CNAME foobar.github.io.
foobar.github.io is an alias for github.map.fastly.net.

119  52.911891    127.0.0.1 -> 127.0.0.1    DNS 78 Standard query 0x7c68  CNAME foobar.github.io

Request from /usr/bin/host to the local dnsmasq above.

120  52.912013 2a02:fe0:c420:1c97:95e4:8bd5:c947:9f40 -> 2001:4600:4:fff::52 DNS 98 Standard query 0x6e72  CNAME foobar.github.io
121  52.912063 fe80::eed8:dfa6:862e:3e32 -> fe80::120d:7fff:fe6d:a593 DNS 98 Standard query 0x6e72  CNAME foobar.github.io
122  52.912105  10.0.79.155 -> 10.0.79.14   DNS 78 Standard query 0x6e72  CNAME foobar.github.io

dnsmasq again forwards the request to all three upstream servers simultaneously:

123  52.921235   10.0.79.14 -> 10.0.79.155  DNS 113 Standard query response 0x6e72  CNAME github.map.fastly.net

This time, the first reponse to come back is from the wifi ISP. It's marginally faster than the wwan ISP, presumably because the previous request seeded its cache.

124  52.921532 2001:4600:4:fff::52 -> 2a02:fe0:c420:1c97:95e4:8bd5:c947:9f40 DNS 98 Standard query response 0x6e72 Refused

Here's the refusal from the wwan ISP, just 297 µsec too late to cause any problems.

125  52.921532    127.0.0.1 -> 127.0.0.1    DNS 113 Standard query response 0x7c68  CNAME github.map.fastly.net

And here we can see dnsmasq returning the expected response to /usr/bin/host.

126  52.923477 fe80::120d:7fff:fe6d:a593 -> fe80::eed8:dfa6:862e:3e32 DNS 133 Standard query response 0x6e72  CNAME github.map.fastly.net

And this is the last response we're expecting (not that it matters at this point).

I've done some testing, and one way to fix the problem is to suffix "@<interface>" to the dnsmasq.conf server=foo statements. In my case, this means changing /var/run/NetworkManager/dnsmasq.conf to contain the following:

  server=10.0.79.14@wlo1 # not really necessary since wlo1 is already preferred, but doesn't hurt and seems more consistent
  server=fe80::120d:7fff:fe6d:a593%wlo1 # ipv6 link-locals already provide interface so these need no change
  server=2001:4600:4:fff::52@wwp0s26u1u5i6 # ensures wwan interface, and crucially wwan source ip address, is used

Other approaches include setting up explicit host routes to the DNS servers out their appropriate interface, or to only configure dnsmasq with the DNS servers belonging to the primary selected connection (for each address family).
Comment 1 Thomas Haller 2016-04-19 10:01:55 UTC
(In reply to Tore Anderson from comment #0)
>   server=10.0.79.14@wlo1 # not really necessary since wlo1 is already
> preferred, but doesn't hurt and seems more consistent
>   server=fe80::120d:7fff:fe6d:a593%wlo1 # ipv6 link-locals already provide
> interface so these need no change
>   server=2001:4600:4:fff::52@wwp0s26u1u5i6 # ensures wwan interface, and
> crucially wwan source ip address, is used

How does this work in the face of https://cgit.freedesktop.org/NetworkManager/NetworkManager/commit/?id=c328cf52f2b219feb5ae03fd40175f6b6ca73975 ?

We should do this, but it only fixes the problem for dnsmasq.


> Other approaches include setting up explicit host routes to the DNS servers
> out their appropriate interface, or to only configure dnsmasq with the DNS
> servers belonging to the primary selected connection (for each address
> family).

We probably should do this too.
Comment 2 Tore Anderson 2016-04-19 10:33:05 UTC
IPv6 link-locals need to be specified with the '%' delimiter to work with current versions of dnsmasq (as explained in c328cf5). This issue impacts non-link-locals only. So you'll want something that works like this:

dnsmasq_conf = sprintf("server=%s%s%s", addr, IN6_IS_ADDR_LINKLOCAL(addr) ? "%" : "@", iface)

The issue doesn't really affect the glibc stub resolver because it checks with each server in resolv.conf in the order given. NM sorts that list according to the associated connection preference, so most likely the server tried first is a usable server.
Comment 3 Beniamino Galvani 2016-04-19 12:49:17 UTC
(In reply to Tore Anderson from comment #0)
> I've done some testing, and one way to fix the problem is to suffix
> "@<interface>" to the dnsmasq.conf server=foo statements.

The main problem I see with this approach is that multiple connections
can have the same DNS server, but we can only put a single
"server=IP@iface" entry in dnsmasq configuration for that address.

In case of duplicate entries probably we should look at routes to
decide which interface is the best for a given name server (however
this is not trivial to do right), or simply omit the "@interface"
suffix (but then the original issue would be still present).
Comment 4 Thomas Haller 2016-04-19 12:58:23 UTC
(In reply to Beniamino Galvani from comment #3)
> (In reply to Tore Anderson from comment #0)
> > I've done some testing, and one way to fix the problem is to suffix
> > "@<interface>" to the dnsmasq.conf server=foo statements.
> 
> The main problem I see with this approach is that multiple connections
> can have the same DNS server, but we can only put a single
> "server=IP@iface" entry in dnsmasq configuration for that address.
> 


> In case of duplicate entries probably we should look at routes to
> decide which interface is the best for a given name server (however
> this is not trivial to do right), or simply omit the "@interface"
> suffix (but then the original issue would be still present).

Reenacting what kernel routing does would be much too complicated. There is RTM_GETROUTE netlink request for that though...

But that still seems complicated to do, because routing can change, and then it's unclear what do to about that (and how to detect that).


Adding a direct route to the DNS server might be a good idea. But not trival either: what would be the gateway for that route? Guess for that we could use RTM_GETROUTE...
Comment 5 Tore Anderson 2016-04-19 13:30:39 UTC
(In reply to Beniamino Galvani from comment #3)

> The main problem I see with this approach is that multiple connections
> can have the same DNS server, but we can only put a single
> "server=IP@iface" entry in dnsmasq configuration for that address.

Actually it seems to work. Well, sort of. For example:

$ dnsmasq -d -R -S 84.208.20.110@wlo1 -S 84.208.20.110@enp0s25

...will send out forwarded requests to the server via both interfaces simultaneously. It only seems to process responses coming in via the interface belonging to the primary connection for some reason (maybe rp_filter?), but at least that's no worse than what we have today.
Comment 6 Beniamino Galvani 2016-04-19 14:49:56 UTC
Created attachment 326327 [details] [review]
dns: specify egress interface for each dnsmasq upstream server

(In reply to Tore Anderson from comment #5)
> ...will send out forwarded requests to the server via both interfaces
> simultaneously. It only seems to process responses coming in via the
> interface belonging to the primary connection for some reason (maybe
> rp_filter?), but at least that's no worse than what we have today.

This sounds fine. I think we should always specify an interface to
avoid the issues you reported. How about this patch? I did some basic
tests, but probably more are needed.
Comment 7 Tore Anderson 2016-04-19 18:11:03 UTC
I've tested that patch on top of git master and functionality-wise it looks good to me, the REFUSED lookups are completely gone. (I have no opinion of the code itself though.)
Comment 8 Thomas Haller 2016-04-20 07:32:07 UTC
Looks good. Just some minor nits:



	iface = g_object_get_data (G_OBJECT (ip4), IP_CONFIG_IFACE_TAG);
	g_assert (iface);

The assert is quite harsh. I'd drop the assert (also from the v6 case).
Or alternatively: keep the assert but add a g_return_if_fail(iface) to nm_dns_manager_add_ip6_config(), to catch the missing interface name when it is set, not later when it is used. There should be the assert early when the wrong thing happens (not when we later want to use it).






Btw, there is also nm_sprintf_buf(), but g_snprintf is fine too.









  buf = nm_utils_inet4_ntop (addr, NULL);
  /*use buf later*/

IMO using the static-buffer is error prone when the result is not used right away.

preferable would be

  char buf[NM_UTILS_INET_ADDRSTRLEN];
  nm_utils_inet4_ntop (addr, buf);
  /*use buf later*/

or

+              snprintf (buf, sizeof (buf), "%s%s%s",
+                        nm_utils_inet4_ntop (addr, NULL);
+                        iface[0] ? "@" : "",
+                        iface[0] ? iface : "");

(as it applies).



overall, lgtm
Comment 9 Beniamino Galvani 2016-04-20 16:52:13 UTC
(In reply to Thomas Haller from comment #8)
> The assert is quite harsh. I'd drop the assert (also from the v6 case).
> Or alternatively: keep the assert but add a g_return_if_fail(iface) to
> nm_dns_manager_add_ip6_config(), to catch the missing interface name when it
> is set, not later when it is used. There should be the assert early when the
> wrong thing happens (not when we later want to use it).
 
> Btw, there is also nm_sprintf_buf(), but g_snprintf is fine too.

> IMO using the static-buffer is error prone when the result is not used right
> away.

Fixed and re-pushed (with additional commit) to bg/dnsmasq-interfaces-bgo765153.
Comment 10 Thomas Haller 2016-04-20 17:02:19 UTC
lgtm
Comment 11 Lubomir Rintel 2016-05-18 08:54:33 UTC
+1