GNOME Bugzilla – Bug 784636
Because of a private DBUS socket issue, NetworkManager sometimes does not update IPv4 lease information
Last modified: 2020-11-12 14:28:33 UTC
Reported originally on Launchpad, but asked to upstream here. Link: https://bugs.launchpad.net/ubuntu/+source/network-manager/+bug/1696415 We're having an issue on our office machines, that randomly drop their lease for about an hour every few days. After this, the lease is renewed and the machine comes back online. I've investigated the issue and closed down on the root cause being in communication between dhclient/nm-dhcp-helper and NetworkManager. In our network, we have a 7200 second lease time. What I normally see is that NetworkManager runs 'dhclient', which renews its IPv4 lease about every hour; this causes a call to nm-dhcp-helper which sends a message to NetworkManager over a private DBUS socket. From the logs: jun 13 19:26:21 cuba dhclient[13154]: DHCPREQUEST of 192.168.0.55 on eth1 to 192.168.0.3 port 67 (xid=0xfd7483b) jun 13 19:26:21 cuba dhclient[13154]: DHCPACK of 192.168.0.55 from 192.168.0.3 jun 13 19:26:21 cuba NetworkManager[28642]: <debug> [1497374781.0748] bus-manager: (dhcp) accepted connection 0x7f9ae000fc60 on private socket jun 13 19:26:21 cuba NetworkManager[28642]: <debug> [1497374781.0759] dhcp4 (eth1): DHCP reason 'RENEW' -> state 'bound' [...] NetworkManager receives lease information including validity, and configures the IP address lifetime on the interface according to the lease validity. However occasionally, when nm-dhcp-helper is invoked, it opens a connection to NetworkManager but this does not result in NetworkManager updating the lease information: jun 13 20:20:11 cuba dhclient[13154]: DHCPREQUEST of 192.168.0.55 on eth1 to 192.168.0.3 port 67 (xid=0xfd7483b) jun 13 20:20:11 cuba dhclient[13154]: DHCPACK of 192.168.0.55 from 192.168.0.3 jun 13 20:20:11 cuba NetworkManager[28642]: <debug> [1497378011.6527] bus-manager: (dhcp) accepted connection 0x7f9ae0019060 on private socket jun 13 20:20:11 cuba NetworkManager[28642]: <debug> [1497378011.6527] bus-manager: (dhcp) closed connection 0x7f9ae0019060 on private socket jun 13 20:20:11 cuba dhclient[13154]: bound to 192.168.0.55 -- renewal in 2731 seconds. When this happens, the IP address lifetime is not updated on the interface. Luckily, dhclient updates the lease twice within the 7200 seconds lease time, so oftentimes the second lease update does come through and the lifetime is updated. However, once every few days, both dhclient messages don't go through, causing the lifetime to expire and the IP to be dropped. Annoyingly, dhclient does not notice this, and keeps sending requests for renewal with a source IP that is not on the interface anymore. I guess this packet is dropped continuously, until about 1 hour later dhclient decides to send a DHCPREQUEST with source IP 255.255.255.255, which is picked up immediately by the DHCP server and the lease is reinstated. I've zoomed into this issue by replacing nm-dhcp-helper with a shell script that dumps the environment, parameters and output of the actual nm-dhcp-helper as it sends its messages. However, there is no observable difference in any of those between succeeding and failing calls. As a workaround, I've currently replaced nm-dhcp-helper with a script that calls the actual nm-dhcp-helper, then checks the journal to see if the update got through. If not, it tries again, up to 10 times, then it reports to a file in /tmp how many attempts were necessary to update NetworkManager. On one of our office machines, this file contains: Thu Jun 29 10:16:24 CEST 2017: needed 2 to update NetworkManager. Thu Jun 29 11:58:58 CEST 2017: needed 5 to update NetworkManager. Sat Jul 1 21:08:57 CEST 2017: needed 2 to update NetworkManager. Sat Jul 1 22:47:32 CEST 2017: needed 2 to update NetworkManager. Sun Jul 2 21:36:42 CEST 2017: needed 2 to update NetworkManager. Sun Jul 2 22:30:08 CEST 2017: needed 2 to update NetworkManager. Mon Jul 3 21:45:09 CEST 2017: needed 3 attempts to update NetworkManager. Tue Jul 4 19:48:18 CEST 2017: needed 3 attempts to update NetworkManager (RENEW). On another: Mon Jul 3 18:40:53 CEST 2017: needed 4 attempts to update NetworkManager. Mon Jul 3 19:29:39 CEST 2017: needed 5 attempts to update NetworkManager. Tue Jul 4 19:17:13 CEST 2017: needed 2 attempts to update NetworkManager (RENEW). Wed Jul 5 19:56:44 CEST 2017: needed 2 attempts to update NetworkManager (RENEW). Another: Mon Jul 3 20:37:33 CEST 2017: needed 2 attempts to update NetworkManager. Mon Jul 3 21:24:37 CEST 2017: needed 2 attempts to update NetworkManager. Wed Jul 5 20:12:22 CEST 2017: needed 2 attempts to update NetworkManager (RENEW). Thu Jul 6 20:02:04 CEST 2017: needed 2 attempts to update NetworkManager (RENEW). (As you see I've made some changes to the reporting over time, but the idea is the same. Also: most of the times are quite similar every time. I wonder why? The only thing different between 18:30-21:00 and the rest of the day is that most people will have just left the office, and we have a 'nightly' heavy processing job running everywhere around that time...) The script can be found in the Launchpad bug report. The benefit of this script is that it fixes the issue, while still reporting whether it still exists. The output of nm-dhcp-helper also goes to this file. I don't know how to continue debugging from here -- if you have any ideas, let me know and I can try it without annoying my colleagues. This issue is on a pretty old version of NetworkManager (1.2.x), because we're using Ubuntu 16.04. If possible I'd like not to stray from this version too far; I'd be more comfortable with backporting relevant fixes from newer versions than switching to newer versions completely.
could be related to https://cgit.freedesktop.org/NetworkManager/NetworkManager/commit/?id=2856a658b38df88494187c811415a198a424e1f2 , which is backported to nm-1-4 branch as e678bd29a4e5fdd83b7d99392d54ecb5fcabc287 Are you able to test test with 1.4.2+ or 1.6+ ? Or try to backport this patch to 1.2, but that is probably more effort...
(btw, very helpful bug report. Well done!)
Hi Thomas, thanks for your replies! That indeed looks very much like a relevant commit. I've backported it to the 1.2 branch -- I haven't tested it yet, I only know it compiles and looks OK on first sight. Next Monday, I'll test if it works to begin with, then over the next few days, we'll see if the script still has to retry its transmissions. If not, we've found the cause and hopefully Ubuntu will include the patch in their repositories (or update to 1.4.2+ altogether).
Created attachment 355121 [details] [review] 1.4 race fix backported to 1.2 branch (untested)
Created attachment 355242 [details] [review] 1.4 race fix backported to 1.2 branch (untested) v2 Logging fixes (g_printerr() doesn't include \n automatically, while LOGE and LOGI did). It looks like nm-dhcp-helper still works after this patch in the normal case, without having updated NM. If I call it without the correct environment, I see it retrying a bunch of times before giving up; those messages should start appearing in the logs instead of the current "needed N attempts to update NetworkManager" messages. I'll roll it out on some more machines to see if it helps. I'll also test it after restarting NetworkManager, so it uses the new signal. That should properly fix it. I'll keep you up-to-date.
We've been running with the new tool for a while, but I'm afraid we're still seeing retry messages in the logs: failure to call notify: Timeout was reached (try signal via Event) Fri Jul 28 04:12:48 CEST 2017: needed 2 attempts to update NetworkManager (RENEW). Sat Jul 29 07:07:30 CEST 2017: needed 2 attempts to update NetworkManager (RENEW). Sun Jul 30 03:48:07 CEST 2017: needed 4 attempts to update NetworkManager (RENEW). I have yet to figure out why this still fails.
bugzilla.gnome.org is being shut down in favor of a GitLab instance. We are closing all old bug reports and feature requests in GNOME Bugzilla which have not seen updates for a long time. If you still use NetworkManager and if you still see this bug / want this feature in a recent and supported version of NetworkManager, then please feel free to report it at https://gitlab.freedesktop.org/NetworkManager/NetworkManager/-/issues/ Thank you for creating this report and we are sorry it could not be implemented (workforce and time is unfortunately limited).