GNOME Bugzilla – Bug 761151
NM sometimes does not detect wifi device removal
Last modified: 2020-11-12 14:29:26 UTC
Due to kernel bug [1], NM sometimes fails to detect the removal of a wireless device: # nmcli d DEVICE TYPE STATE CONNECTION wlp4s0 wifi connected guests # modprobe -r iwlmvm iwlwifi # ip l show wlp4s0 Device "wlp4s0" does not exist. # nmcli d DEVICE TYPE STATE CONNECTION wlp4s0 wifi connected guests I suspect this might be the root cause of other reported bugs as: https://bugzilla.gnome.org/show_bug.cgi?id=759162 https://bugzilla.redhat.com/show_bug.cgi?id=1298008 which probably happen when NM believes there are two different interfaces with the same name and removes one of them in nm-manager.c:device_ip_iface_changed(), unreferencing it and causing its destruction later in the middle of _set_state_full(). [1] https://bugzilla.redhat.com/show_bug.cgi?id=1302037
Created attachment 319797 [details] [review] [PATCH] platform: always try to refetch new ethernet links
(In reply to Beniamino Galvani from comment #1) > Created attachment 319797 [details] [review] [review] > [PATCH] platform: always try to refetch new ethernet links why does it check for NM_LINK_TYPE_ETHERNET, when the example is about Wi-Fi? Can you add a unit-test that reproduces the wrong cache entry? If a cache inconsistency can cause a crash related to NMDevices, the crash should be fixed too (or at least it warrants a new BZ).
(In reply to Thomas Haller from comment #2) > (In reply to Beniamino Galvani from comment #1) > > Created attachment 319797 [details] [review] [review] [review] > > [PATCH] platform: always try to refetch new ethernet links > > why does it check for NM_LINK_TYPE_ETHERNET, when the example is about Wi-Fi? Because the NEWLINK message does not contain any link-type information and, since the device does not exist, we can't determine the type reading /sys; so it defaults to ethernet. > Can you add a unit-test that reproduces the wrong cache entry? It happens only with wifi links when a connected interface is removed, so I guess it can't be reproduced by tests. > If a cache inconsistency can cause a crash related to NMDevices, the crash should be fixed too (or at least it warrants a new BZ). I'm not sure, the core is crashing because the platform is telling that there are two different interfaces with the same name, which is clearly impossible. Ok, probably we should fix it anyway.
(In reply to Beniamino Galvani from comment #3) > (In reply to Thomas Haller from comment #2) > > (In reply to Beniamino Galvani from comment #1) > > If a cache inconsistency can cause a crash related to NMDevices, the crash should be fixed too (or at least it warrants a new BZ). > > I'm not sure, the core is crashing because the platform is telling that > there are two different interfaces with the same name, which is clearly > impossible. Ok, probably we should fix it anyway. Hm, really? Why does the phantom-reappearance of a deleted interface cause "different interfaces with the same name"? If "ifindex:31,ifname:wlan0" gets deleted and reappears, there is only one interface/ifindex/ifname that disappeared and reappeared. In your case, the Wi-Fi device didn't really exist afterwards and platform cache contained a phantom link. But if you move your wlan0 interface to another netns and back, to the platform cache it also looks like a disappearing and reappearing device. Is that really different then the phantom case? And note that the phantom case isn't special either. A device can be deleted anytime and platform might not yet have processed the RTM_DELLINK message. In that sense, it is always unknown whether any of the links actually still exist and core must still work correctly. Maybe the crash is related to the link-type. At first, the linktype was wifi, then it was detected as ethernet.
(In reply to Thomas Haller from comment #4) > > Hm, really? Why does the phantom-reappearance of a deleted interface cause > "different interfaces with the same name"? > If "ifindex:31,ifname:wlan0" gets deleted and reappears, there is only one > interface/ifindex/ifname that disappeared and reappeared. Sorry, I forgot to explain the necessary next step: later the module is inserted again and the real "ifindex:32,ifname:wlan0" appears; which clashes with the fake "ifindex:31,ifname:wlan0" that NM already has (caused by spurious NEWLINK). > In your case, the Wi-Fi device didn't really exist afterwards and platform > cache contained a phantom link. But if you move your wlan0 interface to > another netns and back, to the platform cache it also looks like a > disappearing and reappearing device. Is that really different then the > phantom case? When you move the interface to a different namespace and back, the kernel sends DELLINK + NEWLINK events, and at the end the interface is there. In my case, NM still gets DELLINK + NEWLINK events, but the interface is missing. > And note that the phantom case isn't special either. A device can be deleted > anytime and platform might not yet have processed the RTM_DELLINK message. > In that sense, it is always unknown whether any of the links actually still > exist and core must still work correctly. Yeah, but in that case the wrong information is temporary and will be eventually fixed. And also, the consistency of the view NM has of the interfaces is preserved (apart from delays in updating it). In the case I described the fake entry is permanent (until the fake ifindex is reused). Btw, patches were sent to netdev to fix the kernel bug (but of course we need a fix in NM too for kernels that don't have them): http://marc.info/?l=linux-netdev&m=145389852004994
(In reply to Beniamino Galvani from comment #5) > (In reply to Thomas Haller from comment #4) > > > > Hm, really? Why does the phantom-reappearance of a deleted interface cause > > "different interfaces with the same name"? > > If "ifindex:31,ifname:wlan0" gets deleted and reappears, there is only one > > interface/ifindex/ifname that disappeared and reappeared. > > Sorry, I forgot to explain the necessary next step: later the module > is inserted again and the real "ifindex:32,ifname:wlan0" appears; > which clashes with the fake "ifindex:31,ifname:wlan0" that NM already > has (caused by spurious NEWLINK). I see. Indeed platform cache does not enforce unique ifnames. It does not, because there would be an additional performance penalty as every RTM_NEWLINK event would have to check whether a conflicting ifname already exists. Maybe we should detect duplicate ifnames and trigger a full-resync. After all, it's one additional cache lookup as there is also an index-by-ifname. Yeay :) > > In your case, the Wi-Fi device didn't really exist afterwards and platform > > cache contained a phantom link. But if you move your wlan0 interface to > > another netns and back, to the platform cache it also looks like a > > disappearing and reappearing device. Is that really different then the > > phantom case? > > When you move the interface to a different namespace and back, the > kernel sends DELLINK + NEWLINK events, and at the end the interface is > there. Netlink events can be lost (buffer overrun, kernel bug). Even if the events are seen by platform, NMDevice/NMManager processes platform events on an idle handler. It can miss the removed-added signals and think the device was just always there. Fixing that would be very difficult because state changes are not reentrant, so you cannot process the platform event right away. Also, if you do that, then you see the platform cache in an inconsistent state -- which you never do on an idle handler (except this bug at hand). Even worse, kernel sometimes sends wrong RTM_DELLINK messages, making the event unreliable. So, you are right, we shouldn't see duplicate ifnames. But a interface (be it identified via ifindex or ifname) could change completely from one moment to the next. > In my case, NM still gets DELLINK + NEWLINK events, but the interface > is missing. > > > And note that the phantom case isn't special either. A device can be deleted > > anytime and platform might not yet have processed the RTM_DELLINK message. > > In that sense, it is always unknown whether any of the links actually still > > exist and core must still work correctly. > > Yeah, but in that case the wrong information is temporary and will be > eventually fixed. And also, the consistency of the view NM has of the > interfaces is preserved (apart from delays in updating it). In the > case I described the fake entry is permanent (until the fake ifindex > is reused). indeed.
(In reply to Thomas Haller from comment #6) > I see. Indeed platform cache does not enforce unique ifnames. It does not, > because there would be an additional performance penalty as every > RTM_NEWLINK event would have to check whether a conflicting ifname already > exists. > > Maybe we should detect duplicate ifnames and trigger a full-resync. After > all, it's one additional cache lookup as there is also an index-by-ifname. > Yeay :) We could do this, but it wouldn't solve the root of problem: after removing the module (or unplugging the USB adapter) NM will still think that a interface is present. The patch I posted probably has some negative impact on performances because every new link with type=ETHERNET is fetched again, but it should be enough to fix entirely this problem. Maybe an additional check on duplicate ifnames will not hurt, anyway.
(In reply to Beniamino Galvani from comment #7) > (In reply to Thomas Haller from comment #6) > > I see. Indeed platform cache does not enforce unique ifnames. It does not, > > because there would be an additional performance penalty as every > > RTM_NEWLINK event would have to check whether a conflicting ifname already > > exists. > > > > Maybe we should detect duplicate ifnames and trigger a full-resync. After > > all, it's one additional cache lookup as there is also an index-by-ifname. > > Yeay :) > > We could do this, but it wouldn't solve the root of problem: after > removing the module (or unplugging the USB adapter) NM will still > think that a interface is present. The patch I posted probably has > some negative impact on performances because every new link with > type=ETHERNET is fetched again, but it should be enough to fix > entirely this problem. > > Maybe an additional check on duplicate ifnames will not hurt, anyway. Agree. I did not mean, that it would solve the phantom link, as we would only trigger the resync later when the module gets re-added. I meant, as additional check. Downside: extra work + some performance impact.
Created attachment 320884 [details] [review] [PATCH v2] platform: always try to refetch new ethernet links v2: minimized performance impact. We can avoid the refetch if the link contains an address because the spurious messages don't contain other information beside the interface name.
(In reply to Beniamino Galvani from comment #9) > Created attachment 320884 [details] [review] [review] > [PATCH v2] platform: always try to refetch new ethernet links > > v2: minimized performance impact. We can avoid the refetch if the link > contains an address because the spurious messages don't contain other > information beside the interface name. maybe update commit message and include URI to this BZ. otherwise, v2 LGTM
Merged the platform fix to master: https://cgit.freedesktop.org/NetworkManager/NetworkManager/commit/?id=97be12b6625e738856814403195b60f7ebc13bfe and nm-1-0: https://cgit.freedesktop.org/NetworkManager/NetworkManager/commit/?h=nm-1-0&id=afb244778952b7cfff515963daa516aa29a779c5 Keeping this bz open since we probably should implement the additional check suggested in comment 6: "Maybe we should detect duplicate ifnames and trigger a full-resync. After all, it's one additional cache lookup as there is also an index-by-ifname."
bugzilla.gnome.org is being shut down in favor of a GitLab instance. We are closing all old bug reports and feature requests in GNOME Bugzilla which have not seen updates for a long time. If you still use NetworkManager and if you still see this bug / want this feature in a recent and supported version of NetworkManager, then please feel free to report it at https://gitlab.freedesktop.org/NetworkManager/NetworkManager/-/issues/ Thank you for creating this report and we are sorry it could not be implemented (workforce and time is unfortunately limited).