GNOME Bugzilla – Bug 783391
NM kills dhclient on DHCPNAK and prevents new offers.
Last modified: 2018-03-13 14:17:17 UTC
Created attachment 353126 [details] verbose (INFO) log of the failures In the attached log, you can see that when dhclient receives DHCPNAK from the server, it is killed (the pid changes on next log message, probably killed by NM itself) and restarted three times. Afterwards, the connection remains in a broken state for about 7 hours, or until I restart it from nm-applet. I lack internal knowledge, but it seems that NM moves to the expire state and then cancels the DHCP transaction (why ?). Even stranger, sometimes NM does not fail on DHCPNAK, an the log looks like this: > May 30 22:48:10 ankh-morpork dhclient[1252]: DHCPNAK from 192.168.1.1 > May 30 22:48:10 ankh-morpork NetworkManager[31596]: <info> [1496177290.0942] dhcp4 (wlp2s0): state changed unknown -> expire > May 30 22:48:10 ankh-morpork NetworkManager[31596]: <info> [1496177290.1003] dhcp4 (wlp2s0): state changed expire -> unknown > May 30 22:48:10 ankh-morpork dhclient[1252]: DHCPDISCOVER on wlp2s0 to 255.255.255.255 port 67 interval 8 > May 30 22:48:12 ankh-morpork dhclient[1252]: DHCPREQUEST on wlp2s0 to 255.255.255.255 port 67 > May 30 22:48:12 ankh-morpork dhclient[1252]: DHCPOFFER from 192.168.1.1 > May 30 22:48:12 ankh-morpork dhclient[1252]: DHCPACK from 192.168.1.1 So, "expire -> unknown" seems to be the best case... weird :-).
Looks like there's a partial workaround for this from https://bugzilla.gnome.org/show_bug.cgi?id=739482. But that only handles the case where the client isn't configured yet. This can also happen when a client gets an address then later tries to renew it. If the DHCP server needs to give the client a different address (e.g. server config changed or client network changed), the server will send a DHCPNAK, and dhclient will send an EXPIRE event. Then NetworkManager kills dhclient then schedules a restart. When dhclient starts, it goes into INIT-REBOOT state, where it first tries the (now invalid) last-known address. Which results in a DHCPNAK. And it goes around it circles. Maybe dhcp4_cleanup should run "dhclient -r" so dhclient doesn't keep trying to use the last-known address? Or maybe it should remove the leases file? Or maybe NetworkManager could let dhclient keep running, since it would sort things out itself? (https://bugzilla.gnome.org/show_bug.cgi?id=748268 also seems related.)
I can see two different fixes for this issue... and honestly, both should probably be implemented. Fix 1: change dhclient.c in isc-dhcp-client to destroy its lease *before* calling out to its script file. This ensures that by the time dhclient would be killed by NetworkManager, the lease has been invalidated. Fix 2: change nm-dhcp-dhclient.c to delete the lease file in stop(). The core assumption NetworkManager is making seems to be that stop() destroys all DHCP client state. The fact that the leases file hangs around violates that assumption. I have patches for both of these; I will attach both shortly.
Created attachment 369074 [details] [review] dhclient-lease-kill.diff Patch to delete the lease file when killing dhclient.
Created attachment 369075 [details] [review] dhclient-destroy-lease-before-script.diff
(Note that this also affects NM 1.8.x, and current HEAD.)
Both patches look good to me. Thanks !
(In reply to Justin King-Lacroix from comment #3) > Created attachment 369074 [details] [review] [review] > dhclient-lease-kill.diff > > Patch to delete the lease file when killing dhclient. I don't think this is correct, if the lease file is deleted every time the client is stopped, the interface will get a different IP address at the next connection attempt. In my opinion the correct way to fix this is to change dhcp4_fail() to keep dhclient running if the lease expires, instead of killing it and restarting multiple times. Patch coming...
Created attachment 369462 [details] [review] [PATCH] dhcp: handle expiry by letting the client continue for some time I think this is the right approach to fix the problem. The patch is a draft, I'm still doing tests to see if it breaks anything (also, there are some FIXMEs).
That also makes a whole lot of sense. Letting the dhcp client continue in order to allow the lease to be purged would fix the problem.
Created attachment 369604 [details] [review] [PATCH v2] dhcp: handle expiry by letting the client continue for some time I slightly modified the patch, now CI tests are all green.
(In reply to Beniamino Galvani from comment #10) > Created attachment 369604 [details] [review] [review] > [PATCH v2] dhcp: handle expiry by letting the client continue for some time > > I slightly modified the patch, now CI tests are all green. lgtm
Applied to master: https://cgit.freedesktop.org/NetworkManager/NetworkManager/commit/?id=17009ed91da8b3e0b10ee7e94d220be9bd3fa84c