After an evaluation, GNOME has moved from Bugzilla to GitLab. Learn more about GitLab.
No new issues can be reported in GNOME Bugzilla anymore.
To report an issue in a GNOME project, go to GNOME GitLab.
Do not go to GNOME Gitlab for: Bluefish, Doxygen, GnuCash, GStreamer, java-gnome, LDTP, NetworkManager, Tomboy.
Bug 741694 - Crash removing externally added bridge
Crash removing externally added bridge
Status: RESOLVED OBSOLETE
Product: NetworkManager
Classification: Platform
Component: general
unspecified
Other Linux
: Normal normal
: ---
Assigned To: NetworkManager maintainer(s)
NetworkManager maintainer(s)
Depends on:
Blocks:
 
 
Reported: 2014-12-18 05:07 UTC by Dan Williams
Modified: 2020-11-12 14:29 UTC
See Also:
GNOME target: ---
GNOME version: ---



Description Dan Williams 2014-12-18 05:07:38 UTC
With th/nm-1-0-integration:

ip link add name BBB type bridge
ip addr add 192.168.201.3/24 dev BBB
<wait for activated assumed connection>
ip link del BBB

This sequence results in one of two behaviors, but both are caused by the NMActiveConnection of an assumed+external device not being cleaned up before the device is removed.

1) NetworkManager crashes due to a queued state change that is executed after the device is disposed

In this case, seen on qe-dell-ovs5-vm-3:

Dec 16 18:04:55 qe-dell-ovs5-vm-3.dqe.lab.eng.bos.redhat.com NetworkManager[23069]: <debug> [1418771095.449211] [platform/nm-platform.c:2712] log_link(): signal: link removed: 18: BBB <DOWN> mtu 1500 bridge
<no relevant log messages about device removal>
Dec 16 18:04:56 qe-dell-ovs5-vm-3 NetworkManager[23069]: <info>  (BBB): device state change: activated -> deactivating (reason 'connection-removed') [100 110 38]
Dec 16 18:04:56 qe-dell-ovs5-vm-3 NetworkManager[23069]: <info>  (BBB): device state change: deactivating -> disconnected (reason 'connection-removed') [110 30 38]
Dec 16 18:04:57 qe-dell-ovs5-vm-3 NetworkManager[23069]: <debug> [1418771097.005705] [devices/nm-device.c:8203] finalize(): [0x7f646c258c30] (BBB): finalize(): NMDeviceBridge

nm-manager.c::remove_device() is called, NMManager drops its reference.  There's only one reference left, in the NMActiveConnection.  Somehow that connection gets removed due to the kernel interface removal, but not sure how.  Then we have this flow.

∘ connection_removed()
  ‣ _deactivate_if_active()
    • nm_manager_deactivate_connection()
      ∘ nm_device_state_changed(DEACTIVATING, CONNECTION_REMOVED)
        ‣ nm_dispatcher_call (DISCONNECTED, CONNECTION_REMOVED)
          • delete_on_deactivate_unschedule()
          • dispatcher_complete_proceed_state()
            ∘ nm_device_queue_state(DISCONNECTED, CONNECTION_REMOVED)

<queued state idle handler run>
• queued_set_state(DISCONNECTED)
  ∘ _set_state_full(DISCONNECTED)
    ‣ device_state_changed() (nm-activation-request.c)
      • nm_active_connection_set_state(DEACTIVATED)
        ∘ _device_cleanup (nm-active-connection.c)
          ‣ last reference to the device is dropped
          ‣ NMDevice::dispose() + finalize() called
    ‣ but we're still halfway through _set_state_full(), with a lot of priv-> access ahead of us
    ‣ crash

2) The device is removed from the manager's list but is not disposed, and its active connection still exists in 'nmcli con show --active'

In this case, the device is removed by the manager via platform_link_cb() -> remove_device(), but its state/management is not changed because the device unmanaged due to EXTERNAL_DOWN, because nothing set the device IFF_UP.  The device is simply dropped from priv->devices, exported, and unrefed.

Unfortunately, the NMActiveConnection still holds a reference to the device, so the device is still alive, and the NMActiveConnection is still exported.

--------

This is only a problem for unmanaged interfaces, and the reason the bridge is unmanaged at this time is because EXTERNAL_DOWN is set, because the the bridge is !IFF_UP yet, but has an assumed connection because it has IP config.

There's a couple options that I can see, I'm sure everyone can think of something more clever though:

a) don't assume connections for EXTERNAL_DOWN interfaces, even if they have IP config, since they are not actually usable.  Instead assume the configuration when EXTERNAL_DOWN is cleared and the device is set IFF_UP.  This might be the best approach since having an "activated" device that's not IFF_UP is a bit odd, I think?

b) make sure that when a connection is assumed, EXTERNAL_DOWN is set to FALSE even if the device is not IFF_UP.  This would ensure that when remove_device() is called the device is managed and will be deactivated fully when removed externally.

Thoughts?
Comment 1 Dan Williams 2014-12-18 05:11:35 UTC
This might be the fix for (a):

nm-manager.c::
static gboolean
recheck_assume_connection (NMDevice *device, gpointer user_data)
{
	NMManager *self = NM_MANAGER (user_data);
	NMConnection *connection;
	gboolean was_unmanaged = FALSE, success, generated;
	NMDeviceState state;

	if (manager_sleeping (self))
		return FALSE;
	if (nm_device_get_unmanaged_flag (device, NM_UNMANAGED_USER) ||
	    nm_device_get_unmanaged_flag (device, NM_UNMANAGED_INTERNAL) ||
+	    nm_device_get_unmanaged_flag (device, NM_UNMANAGED_EXTERNAL_DOWN) ||
	    nm_device_get_unmanaged_flag (device, NM_UNMANAGED_PARENT))
		return FALSE;

But on top of git master/nm-1-0 we also need https://bugzilla.gnome.org/show_bug.cgi?id=740702#c36 to make EXTERNAL_DOWN actually work.
Comment 2 Jiri Klimes 2014-12-18 10:31:44 UTC
(In reply to comment #1)
> This might be the fix for (a):
> 
> nm-manager.c::
> static gboolean
> recheck_assume_connection (NMDevice *device, gpointer user_data)
> {
>     NMManager *self = NM_MANAGER (user_data);
>     NMConnection *connection;
>     gboolean was_unmanaged = FALSE, success, generated;
>     NMDeviceState state;
> 
>     if (manager_sleeping (self))
>         return FALSE;
>     if (nm_device_get_unmanaged_flag (device, NM_UNMANAGED_USER) ||
>         nm_device_get_unmanaged_flag (device, NM_UNMANAGED_INTERNAL) ||
> +        nm_device_get_unmanaged_flag (device, NM_UNMANAGED_EXTERNAL_DOWN) ||
>         nm_device_get_unmanaged_flag (device, NM_UNMANAGED_PARENT))
>         return FALSE;
> 
> But on top of git master/nm-1-0 we also need
> https://bugzilla.gnome.org/show_bug.cgi?id=740702#c36 to make EXTERNAL_DOWN
> actually work.

Pushed the code to jk/nm-1-0-int.
Tested locally. It seems to be working fine.

# ink add name BBB type bridge
# ip addr add 192.168.201.3/24 dev BBB
# ip link set BBB up
# sudo ip link del BBB

Up-ing the device makes NM creating the connection and after deleting link, I see that BBB is correctly removed from "nmcli dev" and "nmcli c s -a".

With plain nm-1-0 I saw the duplicated active connections as well (no crash).
Comment 3 Dan Winship 2014-12-18 11:06:40 UTC
(In reply to comment #0)
> a) don't assume connections for EXTERNAL_DOWN interfaces, even if they have IP
> config, since they are not actually usable.  Instead assume the configuration
> when EXTERNAL_DOWN is cleared and the device is set IFF_UP.  This might be the
> best approach since having an "activated" device that's not IFF_UP is a bit
> odd, I think?

And also, this protects us more against further "NM is manipulating my device when it shouldn't be" bugs.

(In reply to comment #1)
>     if (nm_device_get_unmanaged_flag (device, NM_UNMANAGED_USER) ||
>         nm_device_get_unmanaged_flag (device, NM_UNMANAGED_INTERNAL) ||
> +        nm_device_get_unmanaged_flag (device, NM_UNMANAGED_EXTERNAL_DOWN) ||
>         nm_device_get_unmanaged_flag (device, NM_UNMANAGED_PARENT))
>         return FALSE;

But really then, what this means is that it's not those flags that are the special case, it's NM_UNMANAGED_DEFAULT. We shouldn't have to add a new check here each time we add a new unmanaged flag, we should just be making a single call that tells us "is any unmanaged-flag other than DEFAULT set?"
Comment 4 Dan Williams 2014-12-18 16:53:20 UTC
(In reply to comment #3)
> (In reply to comment #0)
> > a) don't assume connections for EXTERNAL_DOWN interfaces, even if they have IP
> > config, since they are not actually usable.  Instead assume the configuration
> > when EXTERNAL_DOWN is cleared and the device is set IFF_UP.  This might be the
> > best approach since having an "activated" device that's not IFF_UP is a bit
> > odd, I think?
> 
> And also, this protects us more against further "NM is manipulating my device
> when it shouldn't be" bugs.
> 
> (In reply to comment #1)
> >     if (nm_device_get_unmanaged_flag (device, NM_UNMANAGED_USER) ||
> >         nm_device_get_unmanaged_flag (device, NM_UNMANAGED_INTERNAL) ||
> > +        nm_device_get_unmanaged_flag (device, NM_UNMANAGED_EXTERNAL_DOWN) ||
> >         nm_device_get_unmanaged_flag (device, NM_UNMANAGED_PARENT))
> >         return FALSE;
> 
> But really then, what this means is that it's not those flags that are the
> special case, it's NM_UNMANAGED_DEFAULT. We shouldn't have to add a new check
> here each time we add a new unmanaged flag, we should just be making a single
> call that tells us "is any unmanaged-flag other than DEFAULT set?"

I agree with suggestion, but I'd like to merge the existing fix for now if that's OK?
Comment 5 Dan Williams 2014-12-18 17:12:49 UTC
Pushed as 12247a7271872ca2f746b73dc9dbf1cab6c2223b without the further fixup.  Cherry-picked to nm-1-0 as well.

Lets leave this bug open for the real fix.
Comment 6 André Klapper 2020-11-12 14:29:27 UTC
bugzilla.gnome.org is being shut down in favor of a GitLab instance. 
We are closing all old bug reports and feature requests in GNOME Bugzilla which have not seen updates for a long time.

If you still use NetworkManager and if you still see this bug / want this feature in a recent and supported version of NetworkManager, then please feel free to report it at https://gitlab.freedesktop.org/NetworkManager/NetworkManager/-/issues/

Thank you for creating this report and we are sorry it could not be implemented (workforce and time is unfortunately limited).