Bug 779768 - VPN – interface stays unconfigured (cached interface information not updated?)
Status: RESOLVED FIXED
Product: NetworkManager
Classification: Platform
Component: general
Version: 1.6.x
Hardware: Other Linux
Importance: Normal normal
Target Milestone: ---
Assigned To: NetworkManager maintainer(s)
QA Contact: NetworkManager maintainer(s)
Duplicates: 780387
Depends on:
Blocks:
 
 
Reported: 2017-03-08 20:18 UTC by Mantas Mikulėnas (grawity)
Modified: 2017-03-23 20:02 UTC
See Also:
GNOME target: ---
GNOME version: ---


Attachments
log of a broken attempt (28.00 KB, text/plain)
2017-03-08 20:18 UTC, Mantas Mikulėnas (grawity)

Description Mantas Mikulėnas (grawity) 2017-03-08 20:18:58 UTC
Created attachment 347501
log of a broken attempt

Frequently, after activating an OpenVPN connection, the corresponding tun device remains down and unconfigured. Based on the NM logs, NM seems to be trying to configure it through the ifindex of a *previous* instance.

For example, when I connected to the VPN yesterday, tun0 had ifindex 48 and NM successfully configured it. Today it's a new connection and a new interface with ifindex 51 ... but the logs still have messages about trying to configure interface #48.

Oddly, I'm pretty sure it only started happening after the kernel 4.10 upgrade... but it does seem like NM *is* receiving events about the new tun device; it just doesn't process them.

linux 4.10.1
networkmanager 1.6.2 and 1.7.1dev.r284.g468127ca6
Comment 1 Thomas Haller 2017-03-09 14:21:13 UTC
It looks like the platform cache is out of sync and has two links named "tun0".

https://cgit.freedesktop.org/NetworkManager/NetworkManager/tree/src/vpn/nm-vpn-connection.c?id=831286df3001e6b76b7baeb10a7723841ab8b35e#n1280

Could you attach the entire logfile, with debugging enabled from program start? (That is, enable it via /etc/NetworkManager/NetworkManager.conf.)

[logging]
level=TRACE
domains=ALL




Probably related to https://cgit.freedesktop.org/NetworkManager/NetworkManager/commit/?id=f0e295d3d746eb1350e0af263263e683a7bb7746 , but this is not the only problem. It seems you don't get the signal from UDEV that the device was removed.
That should look like this:
  
first netlink signal:
  platform-linux: event-notification: DELLINK, seq 0: 11
  platform-linux: update-cache-link: UPDATE: [link,0x5637e3e1a1c0,2,+cac...
  platform: signal: link removed: 11: tun0 <NOARP,DOWN;poin....

and later UDEV signal:
  platform-linux: UDEV event: action 'remove' subsys 'net' device 'tun0'....
  platform-linux: udev-remove: IFINDEX=11
  platform-linux: update-cache-link: REMOVE: [link,0x5637e3e1a1c0,2,+c....
Comment 2 Mantas Mikulėnas (grawity) 2017-03-09 14:40:29 UTC
Hmm, you're right, `udevadm monitor` doesn't show any remove uevents (although `ip monitor link` does notice the removal). So it's definitely caused by a kernel bug then?
Comment 3 Thomas Haller 2017-03-09 14:54:47 UTC
(In reply to Mantas Mikulėnas (grawity) from comment #2)
> Hmm, you're right, `udevadm monitor` doesn't show any remove uevents
> (although `ip monitor link` does notice the removal). So it's definitely
> caused by a kernel bug then?

NM certainly expects to get a signal from UDEV that the device is gone; otherwise the zombie entry lingers in the cache. That looks like a bug in udev or the kernel.

Probably NM should become more resilient to such errors and clean up the zombie after a timeout? Tricky...
Comment 4 Mantas Mikulėnas (grawity) 2017-03-10 07:08:47 UTC
Sigh, it seems there are no more 'remove' uevents for network interfaces *at all* – I just tried VLANs and a physical USB-Ethernet adapter, they don't report a removal either.

(It's definitely a Linux bug, as `udevadm monitor` shows events from the kernel directly as well, not just from udev.)

I'll try to figure out where to report this in the LKML maze... Meanwhile, if there's a new interface with an already-known name, NM probably should assume it was removed/re-added and update its cache regardless?
Comment 5 Thomas Haller 2017-03-10 10:35:07 UTC
(In reply to Mantas Mikulėnas (grawity) from comment #4)
> I'll try to figure out where to report this in the LKML maze... Meanwhile,
> if there's a new interface with an already-known name, NM probably should
> assume it was removed/re-added and update its cache regardless?

The ID (primary key) of a link object is the ifindex.
Here we have a situation where two links have different ifindexes but the same ifname:
  ("tun0", ifindex 48, not-in-netlink, in-udev, invisible)
  ("tun0", ifindex 51, in-netlink,     in-udev, visible)
The invisible instance is wrong, and we could probably use some heuristics to guess which one is invalid.
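
One such heuristic could be sketched roughly as follows. This is an illustrative Python model only, not NetworkManager code: the `Link` class, its field names, and `find_zombie_links` are hypothetical, and "visible" is assumed here to mean that the link is known to both netlink and udev.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Link:
    # Hypothetical model of a cached link; field names are illustrative only.
    ifindex: int      # primary key of the link object
    ifname: str
    in_netlink: bool  # the kernel still reports the link via netlink
    in_udev: bool     # udev announced the link and has not removed it

    @property
    def visible(self) -> bool:
        # Assumption: a link is visible only if both sources know about it.
        return self.in_netlink and self.in_udev


def find_zombie_links(links):
    """Return ifindexes of links that are gone from netlink but share
    their name with a visible link, i.e. the likely stale entries."""
    visible_names = {link.ifname for link in links if link.visible}
    return sorted(
        link.ifindex
        for link in links
        if not link.in_netlink and link.ifname in visible_names
    )


# The two entries from this report.
cache = [
    Link(48, "tun0", in_netlink=False, in_udev=True),
    Link(51, "tun0", in_netlink=True, in_udev=True),
]
zombies = find_zombie_links(cache)
```

With the two entries above, only ifindex 48 is flagged; a link that has merely lost netlink visibility without a same-named replacement is left alone.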

Maybe, instead of looking at two such links and guessing which one is wrong, we should prune links that have not been seen in netlink for a short time (300ms?). Tracking timeouts to evict cache entries is a bit complicated, though.
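
A minimal sketch of that timeout idea, in Python for illustration. The `LinkCache` class and all of its method names are hypothetical and do not reflect NetworkManager's platform cache; the grace period is the 300ms guess from above, and the caller passes the current time explicitly so no real timer is needed.

```python
GRACE_SECONDS = 0.3  # the "300ms?" floated above; the value is only a guess


class LinkCache:
    """Hypothetical cache keyed by ifindex; not NetworkManager's data structure."""

    def __init__(self):
        self._names = {}       # ifindex -> ifname
        self._gone_since = {}  # ifindex -> time at which netlink reported removal

    def add(self, ifindex, ifname):
        self._names[ifindex] = ifname
        self._gone_since.pop(ifindex, None)

    def netlink_removed(self, ifindex, now):
        # Keep the entry for now; normally a udev 'remove' would follow.
        if ifindex in self._names:
            self._gone_since[ifindex] = now

    def udev_removed(self, ifindex):
        self._names.pop(ifindex, None)
        self._gone_since.pop(ifindex, None)

    def prune(self, now):
        """Evict entries whose udev 'remove' did not arrive within the grace period."""
        expired = [
            ifindex
            for ifindex, gone_at in self._gone_since.items()
            if now - gone_at >= GRACE_SECONDS
        ]
        for ifindex in expired:
            self.udev_removed(ifindex)
        return expired

    def ifindexes(self):
        return sorted(self._names)


cache = LinkCache()
cache.add(48, "tun0")
cache.netlink_removed(48, now=0.0)  # DELLINK seen, udev stays silent
cache.add(51, "tun0")
early = cache.prune(now=0.1)        # still within the grace period
late = cache.prune(now=0.5)         # grace period has elapsed
```

Something would still have to schedule the calls to `prune`, which is the timeout tracking described above as complicated.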

https://cgit.freedesktop.org/NetworkManager/NetworkManager/commit/?id=f0e295d3d746eb1350e0af263263e683a7bb7746 ensures that, outside of the cache, only the visible link can be found. So that actually avoids your issue, but it still leaks the invisible instance.
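
To illustrate the behaviour described for that commit (this is not the commit's code; `Link` and `get_link_by_name` are hypothetical names in a Python model), a by-name lookup that skips invisible entries could look like this:

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class Link:
    # Hypothetical, simplified link entry.
    ifindex: int
    ifname: str
    visible: bool


def get_link_by_name(cache: Dict[int, Link], ifname: str) -> Optional[Link]:
    """Return the visible link with the given name, ignoring invisible ones."""
    for link in cache.values():
        if link.ifname == ifname and link.visible:
            return link
    return None


# The cache is keyed by ifindex and still holds the stale entry.
cache = {
    48: Link(48, "tun0", visible=False),
    51: Link(51, "tun0", visible=True),
}
found = get_link_by_name(cache, "tun0")
```

The lookup returns the link with ifindex 51, while the entry for ifindex 48 remains in the cache, which is the leak mentioned above.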
Comment 6 Mantas Mikulėnas (grawity) 2017-03-11 15:12:57 UTC
Tracked it down to https://git.kernel.org/linus/002d8a1a6c11b9b2a8ac615095589111dd52749b and sent a report to netdev@.
Comment 7 Mantas Mikulėnas (grawity) 2017-03-14 18:59:17 UTC
Looks like the patch will reach 4.10.x sometime soon.

Meanwhile, I've tested https://cgit.freedesktop.org/NetworkManager/NetworkManager/commit/?id=f0e295d3d746eb1350e0af263263e683a7bb7746 with unpatched 4.10.1, and at least VPN connections are working fine.
Comment 8 Thomas Haller 2017-03-14 22:05:34 UTC
Ok, then I am closing this as fixed.


I don't think we want to add extra complexity to the cache management just to handle such zombie link entries.


Thanks grawity!
Comment 9 Beniamino Galvani 2017-03-23 17:17:01 UTC
*** Bug 780387 has been marked as a duplicate of this bug. ***