GNOME Bugzilla – Bug 779768
VPN – interface stays unconfigured (cached interface information not updated?)
Last modified: 2017-03-23 20:02:45 UTC
Created attachment 347501
log of a broken attempt

Frequently, after activating an OpenVPN connection, the corresponding tun device remains down and unconfigured. Based on NM logs, it seems it's trying to configure it through the ifindex of a *previous* instance.

For example, when I connected to the VPN yesterday, tun0 had ifindex 48 and NM successfully configured it. Today it's a new connection and a new interface with ifindex 51 ... but the logs still have messages about trying to configure interface #48.

Oddly, I'm pretty sure it only started happening after the kernel 4.10 upgrade... but it does seem like NM *is* receiving events about the new tun device, just doesn't process them.

linux 4.10.1
networkmanager 1.6.2 and 1.7.1dev.r284.g468127ca6
It looks like the platform cache is out of sync and has two links with the name "tun0".

https://cgit.freedesktop.org/NetworkManager/NetworkManager/tree/src/vpn/nm-vpn-connection.c?id=831286df3001e6b76b7baeb10a7723841ab8b35e#n1280

Could you attach the entire logfile, with debug logging enabled from program start? (That means enabling it via /etc/NetworkManager/NetworkManager.conf.)

    [logging]
    level=TRACE
    domains=ALL

This is probably related to https://cgit.freedesktop.org/NetworkManager/NetworkManager/commit/?id=f0e295d3d746eb1350e0af263263e683a7bb7746 , but that is not the only problem. It seems you don't get the signal from UDEV that the device was removed. A removal should look like this. First the netlink signal:

    platform-linux: event-notification: DELLINK, seq 0: 11
    platform-linux: update-cache-link: UPDATE: [link,0x5637e3e1a1c0,2,+cac...
    platform: signal: link removed: 11: tun0 <NOARP,DOWN;poin....

and later the UDEV signal:

    platform-linux: UDEV event: action 'remove' subsys 'net' device 'tun0'....
    platform-linux: udev-remove: IFINDEX=11
    platform-linux: update-cache-link: REMOVE: [link,0x5637e3e1a1c0,2,+c....
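To check a long TRACE log for this pattern, a small script can pair each netlink "link removed" line with a later "udev-remove" line for the same ifindex. This is only a rough sketch: the two regular expressions are derived from the sample lines quoted in this comment, and the function name `links_missing_udev_remove` is hypothetical. Real log lines carry extra prefixes (timestamps, log level), which is why the patterns use `search` rather than anchoring at the start of the line.

```python
import re

# Patterns modelled on the sample log lines quoted in this comment (assumption:
# the message text in a real log matches these fragments).
NETLINK_REMOVED = re.compile(r"platform: signal: link removed: (\d+): (\S+)")
UDEV_REMOVED = re.compile(r"platform-linux: udev-remove: IFINDEX=(\d+)")


def links_missing_udev_remove(lines):
    """Return {ifindex: ifname} for links removed in netlink but never in udev."""
    pending = {}
    for line in lines:
        match = NETLINK_REMOVED.search(line)
        if match:
            # Netlink reported the link as gone; wait for the udev remove.
            pending[int(match.group(1))] = match.group(2)
            continue
        match = UDEV_REMOVED.search(line)
        if match:
            # The udev remove arrived, so this link is not a zombie.
            pending.pop(int(match.group(1)), None)
    return pending
```

Feeding it the lines of the logfile (for example `links_missing_udev_remove(open(path))`, where `path` is wherever the log was saved) would list the interfaces that are candidates for zombie cache entries.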
Hmm, you're right, `udevadm monitor` doesn't show any remove uevents (although `ip monitor link` does notice the removal). So it's definitely caused by a kernel bug then?
(In reply to Mantas Mikulėnas (grawity) from comment #2)
> Hmm, you're right, `udevadm monitor` doesn't show any remove uevents
> (although `ip monitor link` does notice the removal). So it's definitely
> caused by a kernel bug then?

NM certainly expects to get a signal from UDEV that the device is gone; otherwise the zombie entry lingers in the cache. This seems to be a bug in udev or the kernel. Perhaps NM should become more resilient to such errors and clean up the zombie after a timeout? Tricky...
Sigh, it seems there are no more 'remove' uevents for network interfaces *at all* – I just tried VLANs and a physical USB-Ethernet adapter, they don't report a removal either. (It's definitely a Linux bug, as `udevadm monitor` shows events from the kernel directly as well, not just from udev.) I'll try to figure out where to report this in the LKML maze... Meanwhile, if there's a new interface with an already-known name, NM probably should assume it was removed/re-added and update its cache regardless?
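The "known name reappears, so assume the old one is gone" idea could be sketched roughly as below. This is a toy model, not NetworkManager code: `Link`, `LinkCache` and `add` are hypothetical names, and the cache is just a dictionary keyed by ifindex.

```python
from dataclasses import dataclass


@dataclass
class Link:
    ifindex: int
    ifname: str


class LinkCache:
    """Toy link cache keyed by ifindex (hypothetical, for illustration only)."""

    def __init__(self):
        self.links = {}

    def add(self, link):
        """Add a link, evicting older entries that claim the same name.

        Returns the list of evicted ifindexes.
        """
        stale = [
            ifindex
            for ifindex, cached in self.links.items()
            if cached.ifname == link.ifname and ifindex != link.ifindex
        ]
        for ifindex in stale:
            # An interface name is unique at any moment, so an older entry
            # with the same name must describe a link that no longer exists.
            del self.links[ifindex]
        self.links[link.ifindex] = link
        return stale
```

With the numbers from this report, adding `Link(48, "tun0")` and later `Link(51, "tun0")` would evict the entry for ifindex 48 even though no remove uevent was ever delivered.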
(In reply to Mantas Mikulėnas (grawity) from comment #4)
> I'll try to figure out where to report this in the LKML maze... Meanwhile,
> if there's a new interface with an already-known name, NM probably should
> assume it was removed/re-added and update its cache regardless?

The ID (primary key) of a link object is the ifindex. Here we have two links with different ifindex values but the same ifname:

  ("tun0", ifindex 48, not-in-netlink, in-udev, invisible)
  ("tun0", ifindex 51, in-netlink, in-udev, visible)

The invisible instance is wrong, and we could probably use some heuristics to guess which one is invalid. Maybe, instead of looking at two such links and guessing which one is wrong, we should prune links that have not been seen in netlink for a short time (300ms?). Tracking timeouts to evict entries from the cache is a bit complicated, though.

https://cgit.freedesktop.org/NetworkManager/NetworkManager/commit/?id=f0e295d3d746eb1350e0af263263e683a7bb7746 ensures that, outside of the cache, only the visible link can be found. So that actually avoids your issue, but it still leaks the invisible instance.
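The two ideas in this comment (a name lookup that only returns the visible link, and time-based pruning of links that netlink no longer reports) can be illustrated with a small model. Everything here is an assumption made for the sketch: the names `CachedLink`, `find_by_name` and `prune`, the `netlink_gone_at` timestamp, and treating "visible" as simply "currently in netlink". It is not how NetworkManager's platform cache is implemented.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CachedLink:
    ifindex: int
    ifname: str
    in_netlink: bool
    in_udev: bool
    # Time at which netlink reported the link as gone (hypothetical field).
    netlink_gone_at: Optional[float] = None


def find_by_name(cache, ifname):
    """Return the visible link with this name, ignoring zombie entries."""
    for link in cache.values():
        # Assumption: a link counts as visible only while netlink reports it.
        if link.ifname == ifname and link.in_netlink:
            return link
    return None


def prune(cache, now, grace=0.3):
    """Drop links that have been absent from netlink for longer than `grace` seconds."""
    stale = [
        ifindex
        for ifindex, link in cache.items()
        if not link.in_netlink
        and link.netlink_gone_at is not None
        and now - link.netlink_gone_at > grace
    ]
    for ifindex in stale:
        del cache[ifindex]
    return stale
```

The lookup alone is enough to make callers see ifindex 51 for "tun0"; the pruning step is what would stop the ifindex 48 entry from staying in the cache indefinitely, at the cost of having to schedule `prune` from a timer.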
Tracked it down to https://git.kernel.org/linus/002d8a1a6c11b9b2a8ac615095589111dd52749b, sent a report to netdev@
Looks like the patch will reach 4.10.x sometime soon. Meanwhile, I've tested https://cgit.freedesktop.org/NetworkManager/NetworkManager/commit/?id=f0e295d3d746eb1350e0af263263e683a7bb7746 with unpatched 4.10.1 and at least VPN connections are working fine as well.
OK, then I am closing this as fixed. I don't think we want to add extra complexity to the cache management just to handle such zombie link entries. Thanks grawity!
*** Bug 780387 has been marked as a duplicate of this bug. ***
The patch is now in mainline and 4.10.5 stable:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=91864f5852f9996210fad400cf70fb85af091243
https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable.git/commit/?id=7ebf301d8476d3563611f72e68ad7138f29bee56