Bug 695705 – bridge/bond: remove interface when connection is disconnected/deleted

After an evaluation, GNOME has moved from Bugzilla to GitLab. Learn more about GitLab.
No new issues can be reported in GNOME Bugzilla anymore.
To report an issue in a GNOME project, go to GNOME GitLab.
Do not go to GNOME Gitlab for: Bluefish, Doxygen, GnuCash, GStreamer, java-gnome, LDTP, NetworkManager, Tomboy.

Bug 695705 - bridge/bond: remove interface when connection is disconnected/deleted


Summary:	bridge/bond: remove interface when connection is disconnected/deleted


Status:	RESOLVED FIXED

Product:	NetworkManager
Classification:	Platform
Component:	general
Version:	0.9.8
Hardware:	Other Linux

Importance:	Normal normal
Target Milestone:	---
Assigned To:	NetworkManager maintainer(s)
QA Contact:	NetworkManager maintainer(s)

URL:
Whiteboard:

Depends on:
Blocks:

Reported:	2013-03-12 14:10 UTC by Radek Novacek
Modified:	2013-09-04 16:30 UTC

See Also:
GNOME target:	---
GNOME version:	---

Attachments
Test script for this issue (1.03 KB, text/x-python) 2013-03-12 14:10 UTC, Radek Novacek	Details

Description Radek Novacek 2013-03-12 14:10:39 UTC

Created attachment 238691 [details]
Test script for this issue

NetworkManager doesn't delete bond device when connection that created that bond is deleted. This way there are increasing number of bonding devices present in the system and they're never removed.

Run testing script in attachment to see that the bond device is not removed.

Comment 1 Dan Winship 2013-03-12 18:44:22 UTC

the code has the comment

	/*
	 * Do not delete existing virtual devices to keep connectivity up.
	 * Virtual devices are reused when NetworkManager is restarted.
	 */

I'm not sure what that's supposed to mean... if we created the interface, then we ought to delete it too.

Comment 2 Radek Novacek 2013-03-13 07:01:25 UTC

I agree, when one deletes the connection, he obviously doesn't want to keep the connectivity any more and it can't be reused when the connection no longer exists.

Comment 3 Dan Williams 2013-03-13 13:23:33 UTC

danw: that might be the case *now*, but when we handle all virtual interface types and create temporary in-memory configuration for them, do we still want that behavior?

If NM created the interface based on configuration, then theoretically NM "owns" it if nothing else touches it.  But then if some other tool touches the interface (adds IP addresses, routes, whatever) then we probably shouldn't just delete it, because now the interface's configuration is not backed by on-disk permanent config.

So that would imply, in the near-future cooperative temporary connection model, that we'd only delete the interface if its active connection was saved to disk.  But if the active connection was "dirty" then we wouldn't.  That seems somewhat confusing, actually.

Comment 4 Dan Winship 2013-03-13 13:38:43 UTC

(In reply to comment #3)
> If NM created the interface based on configuration, then theoretically NM
> "owns" it if nothing else touches it.  But then if some other tool touches the
> interface (adds IP addresses, routes, whatever) then we probably shouldn't just
> delete it, because now the interface's configuration is not backed by on-disk
> permanent config.

Why? If NM created the device itself to back an NMConnection, and you then delete that NMConnection, why would you have any reason to expect (or even want) the device to keep existing? Or alternatively, if you want it to keep existing, then why not just NOT delete the connection?

Comment 5 Pavel Simerda 2013-08-13 09:52:24 UTC

Tomas Haller's thaller/bug-rh-953300-remove_virtual_interfaces branch contains a patchset for this bug report...

> core: add flag delete_on_deactivate to NMDevice

I usually tend to prefer descriptive names over behavioral names. In this case the flag marks a *software* device *owned* by NetworkManager and those attributes results in the *delete on deactivate* behavior.

Whether NetworkManager *owns* the software device, will have lots of other consequences in the future than just device deletion. For example we don't want to autoactivate connections on a device we don't own.

I don't know which bug report it was in but I think Dan Winship expressed he also needs to mark devices which are foreign to NetworkManager for some reasons. We could just use a single flag called 'foreign' to mark the devices that are *not* activated by NetworkManager whether it is a software device or not. Or 'ours', if you insist on the positive meaning.

Please see the lines:

        case NM_LINK_TYPE_BOND:
            device = nm_device_bond_new (link->name);
            break;
        case NM_LINK_TYPE_BRIDGE:
            /* FIXME: always create device when we handle bridges non-destructively */
            if (bridge_created_by_nm (self, link->name))
                device = nm_device_bridge_new (link->name);
            else
                nm_log_info (LOGD_BRIDGE, "(%s): ignoring bridge not created by NetworkManager", link->name);
            break;

Here, I think, the behavior should be the same for all software devices including bridging and bonding. We should create NMDevice instances for *all* devices and mark them *foreign* unless we created them either in this NetworkManager instance or in a previously running one.

Then we just enable *activation* as well as *deletion* only for NetworkManager-owned devices. But then I guess the 'foreign' is just another name for *unmanaged* and we should merge those.

> core: delete virtual devices created by NM when they are deactivated

I'm just curious why you are using delayed deletion. I smell race conditions there for no [known] benefit.

> core: fix check for valid ifindex in nm-device
>
> Invalid ifindex are expected as -1 (or negative). Zero should
> be treated as a valid value, although probably zero will never
> be used. That means, that old versions without this patch should
> work just fine.

This doesn't seem right to me. The ifindex values start at one and even nm-platform is using zero for invalid values. That also lets you do 'if (ifindex) ...'. I think we should switch to zero everywhere, check using 'if (ifindex)' and add 'assert/return_if_fail (ifindex > 0)' where valid ifindex is expected and  'assert/return_if_fail (ifindex >= 0)' where *no ifindex* is a valid input.

Comment 6 Thomas Haller 2013-08-13 11:39:28 UTC

(In reply to comment #5)

> > core: delete virtual devices created by NM when they are deactivated
> 
> I'm just curious why you are using delayed deletion. I smell race conditions
> there for no [known] benefit.

I tried to link_delete at the end of nm_device_deactivate (i.e. while being called from nm_device_state_changed). But this emits the signal NM_PLATFORM_LINK_REMOVED which eventually calls nm_device_state_changed. nm_device_state_changed does not support reentrancy, so it crashed. How do you suggest to link_delete?



> > core: fix check for valid ifindex in nm-device
> >
> > Invalid ifindex are expected as -1 (or negative). Zero should
> > be treated as a valid value, although probably zero will never
> > be used. That means, that old versions without this patch should
> > work just fine.
> 
> This doesn't seem right to me. The ifindex values start at one and even
> nm-platform is using zero for invalid values. That also lets you do 'if
> (ifindex) ...'. I think we should switch to zero everywhere, check using 'if
> (ifindex)' and add 'assert/return_if_fail (ifindex > 0)' where valid ifindex is
> expected and  'assert/return_if_fail (ifindex >= 0)' where *no ifindex* is a
> valid input.

You are right, zero means invalid. I drop the last patch and fix the rest. It should be clearified whether negative values are possible (and what the mean) because of the different meaning of "if(ifindex)" and "if(ifindex>0)". Then all checks should be consistent.

Comment 7 Thomas Haller 2013-08-13 12:07:16 UTC

(In reply to comment #5)
> Tomas Haller's thaller/bug-rh-953300-remove_virtual_interfaces branch contains
> a patchset for this bug report...
> 
> > core: add flag delete_on_deactivate to NMDevice
> 
> I usually tend to prefer descriptive names over behavioral names. In this case
> the flag marks a *software* device *owned* by NetworkManager and those
> attributes results in the *delete on deactivate* behavior.
> 
> Whether NetworkManager *owns* the software device, will have lots of other
> consequences in the future than just device deletion. For example we don't want
> to autoactivate connections on a device we don't own.
> 
> I don't know which bug report it was in but I think Dan Winship expressed he
> also needs to mark devices which are foreign to NetworkManager for some
> reasons. We could just use a single flag called 'foreign' to mark the devices
> that are *not* activated by NetworkManager whether it is a software device or
> not. Or 'ours', if you insist on the positive meaning.
> 
> Please see the lines:
> 
>         case NM_LINK_TYPE_BOND:
>             device = nm_device_bond_new (link->name);
>             break;
>         case NM_LINK_TYPE_BRIDGE:
>             /* FIXME: always create device when we handle bridges
> non-destructively */
>             if (bridge_created_by_nm (self, link->name))
>                 device = nm_device_bridge_new (link->name);
>             else
>                 nm_log_info (LOGD_BRIDGE, "(%s): ignoring bridge not created by
> NetworkManager", link->name);
>             break;
> 
> Here, I think, the behavior should be the same for all software devices
> including bridging and bonding. We should create NMDevice instances for *all*
> devices and mark them *foreign* unless we created them either in this
> NetworkManager instance or in a previously running one.
> 
> Then we just enable *activation* as well as *deletion* only for
> NetworkManager-owned devices. But then I guess the 'foreign' is just another
> name for *unmanaged* and we should merge those.
> 

I agree that the name is not the best.

But right now a device does not know whether it has virtual devices. That is decided by the manager in system_create_virtual_device (where he also sets the flag delete_on_deactivation). So currently this it's not up to the device to decide, it's up to the manager.

If you want to merge foreign, unmanaged and delete_on_deactivate, then a managed device must know on its own that it is responsible for its virtual device (which currently is not the case).

I am not opposing this idea, I just don't know what to do in detail. This needs some more thinking :)

What flags do we have right now and which flags are on the wish list? And what is their exact use?

Comment 8 Pavel Simerda 2013-08-13 13:11:43 UTC

(In reply to comment #6)
> (In reply to comment #5)
> 
> > > core: delete virtual devices created by NM when they are deactivated
> > 
> > I'm just curious why you are using delayed deletion. I smell race conditions
> > there for no [known] benefit.
> 
> I tried to link_delete at the end of nm_device_deactivate (i.e. while being
> called from nm_device_state_changed). But this emits the signal
> NM_PLATFORM_LINK_REMOVED which eventually calls nm_device_state_changed.
> nm_device_state_changed does not support reentrancy, so it crashed. How do you
> suggest to link_delete?

Good point. Please always document stuff like that in the source code so that this information is not lost. In my opinion, NMDevice should know when it's being destroyed and the platform signals should be removed at that time.

One way to achieve that is to somehow mark the device as being deleted. Another way would be to avoid acting upon internal link deletion, like:

--- a/src/nm-manager.c
+++ b/src/nm-manager.c
@@ -2374,6 +2374,10 @@ platform_link_removed_cb (NMPlatform *platform,
        NMManagerPrivate *priv = NM_MANAGER_GET_PRIVATE (self);
        NMDevice *device;
 
+       /* Only process external events. */
+       if (reason != NM_PLATFORM_REASON_EXTERNAL)
+               return;
+
        device = find_device_by_ifindex (self, ifindex);
        if (device)
                priv->devices = remove_one_device (self, priv->devices, device, FALSE);

> > > core: fix check for valid ifindex in nm-device
> > >
> > > Invalid ifindex are expected as -1 (or negative). Zero should
> > > be treated as a valid value, although probably zero will never
> > > be used. That means, that old versions without this patch should
> > > work just fine.
> > 
> > This doesn't seem right to me. The ifindex values start at one and even
> > nm-platform is using zero for invalid values. That also lets you do 'if
> > (ifindex) ...'. I think we should switch to zero everywhere, check using 'if
> > (ifindex)' and add 'assert/return_if_fail (ifindex > 0)' where valid ifindex is
> > expected and  'assert/return_if_fail (ifindex >= 0)' where *no ifindex* is a
> > valid input.
> 
> You are right, zero means invalid. I drop the last patch and fix the rest. It
> should be clearified whether negative values are possible

Negative values should never be used. In fact, NetworkManager uses guint (unsigned int) for non-negative numbers but kernel uses int for ifindex for some reason.

Maybe the assertions are just useless and can all be removed.

> (and what the mean)
> because of the different meaning of "if(ifindex)" and "if(ifindex>0)". Then all
> checks should be consistent.

I'm personally for 'if (ifindex)'. Other developers often use more explicit comparison (would be 'if (ifindex != 0)' in this case) which I consider less readable.

The 'if (ifindex > 0)' variant is a bit weird from the program logic point of view but it's an exact translation of 'ifindex is positive [and therefore valid]'. I tend to avoid it wherever negative numbers are not considered.

(In reply to comment #7)
> (In reply to comment #5)
> > Tomas Haller's thaller/bug-rh-953300-remove_virtual_interfaces branch contains
> > a patchset for this bug report...
> > 
> > > core: add flag delete_on_deactivate to NMDevice
> > 
> > I usually tend to prefer descriptive names over behavioral names. In this case
> > the flag marks a *software* device *owned* by NetworkManager and those
> > attributes results in the *delete on deactivate* behavior.
> > 
> > Whether NetworkManager *owns* the software device, will have lots of other
> > consequences in the future than just device deletion. For example we don't want
> > to autoactivate connections on a device we don't own.
> > 
> > I don't know which bug report it was in but I think Dan Winship expressed he
> > also needs to mark devices which are foreign to NetworkManager for some
> > reasons. We could just use a single flag called 'foreign' to mark the devices
> > that are *not* activated by NetworkManager whether it is a software device or
> > not. Or 'ours', if you insist on the positive meaning.
> > 
> > Please see the lines:
> > 
> >         case NM_LINK_TYPE_BOND:
> >             device = nm_device_bond_new (link->name);
> >             break;
> >         case NM_LINK_TYPE_BRIDGE:
> >             /* FIXME: always create device when we handle bridges
> > non-destructively */
> >             if (bridge_created_by_nm (self, link->name))
> >                 device = nm_device_bridge_new (link->name);
> >             else
> >                 nm_log_info (LOGD_BRIDGE, "(%s): ignoring bridge not created by
> > NetworkManager", link->name);
> >             break;
> > 
> > Here, I think, the behavior should be the same for all software devices
> > including bridging and bonding. We should create NMDevice instances for *all*
> > devices and mark them *foreign* unless we created them either in this
> > NetworkManager instance or in a previously running one.
> > 
> > Then we just enable *activation* as well as *deletion* only for
> > NetworkManager-owned devices. But then I guess the 'foreign' is just another
> > name for *unmanaged* and we should merge those.
> > 
> 
> I agree that the name is not the best.
> 
> But right now a device does not know whether it has virtual devices.

You can use nm_platform_link_is_software() and cache it as priv->is_software for example.

> That is
> decided by the manager in system_create_virtual_device

To be precise, system_create_virtual_device() should be *only* called for virtual devices, so it can't be decided there. If you look at the code, it's actually decided by NMConnection subclasses.

> (where he also sets the
> flag delete_on_deactivation). So currently this it's not up to the device to
> decide, it's up to the manager.

While it's up to the nm-connection to decide whether it needs a software device,
we also need to track foreign software devices as well as software devices created by the previous NetworkManager instance. For those we create the NMDevice object in platform_link_added_cb() instead of system_create_virtual_device().

While a device created by system_create_virtual_device() is *never* foreign, a device created by platform_link_added_cb() is foreign unless recorded in /var/run (/run might be better for that).

> If you want to merge foreign, unmanaged and delete_on_deactivate, then a
> managed device must know on its own that it is responsible for its virtual
> device (which currently is not the case).

Do we need anything more than checking for 'managed' instead of 'delete_on_deactivate'?

> I am not opposing this idea, I just don't know what to do in detail. This needs
> some more thinking :)

Indeed.

> What flags do we have right now and which flags are on the wish list? And what
> is their exact use?

So far the development was chaotic. Feel free to start documenting it somewhere, depending on the livecycle of the information, bugzilla or src/devices/README might be the right place.

Comment 9 Thomas Haller 2013-08-16 15:13:35 UTC

About calling delete_link asynchronously...

Actually I think doing that is neat.

I don't think there is a race-problem when doing link_delete in the background. A link might be removed at any given time and NM must be prepared to handle it properly anyway. So all this works already and all the clean-up work can be done as usual.

Doing it synchronously needs suppressing the link_removed_cb and than actively doing something to clean-up the NMDevice instance.

Of course, you can convince me otherwise :)

Comment 10 Pavel Simerda 2013-08-16 15:54:41 UTC

(In reply to comment #9)
> About calling delete_link asynchronously...
> 
> Actually I think doing that is neat.
> 
> I don't think there is a race-problem when doing link_delete in the background.
> A link might be removed at any given time and NM must be prepared to handle it
> properly anyway. So all this works already and all the clean-up work can be
> done as usual.

If you think you considered the potential races carefully enough, then yes, we can plan the device deletion instead of doing it directly.

> Doing it synchronously needs suppressing the link_removed_cb and than actively
> doing something to clean-up the NMDevice instance.

I think we should suppress it anyway, as reacting on deletion requested by NM itself doesn't make much sense. Actually, I guess Dan Winship will sooner or later want to change the nm-platform REASON from a signal argument to signal detail ;).

> Of course, you can convince me otherwise :)

If we only listen to external changes, immediate link_delete() is as easy as planned link_delete().

Comment 11 Thomas Haller 2013-08-20 18:03:44 UTC

I pushed a new version of the patches. Please see http://cgit.freedesktop.org/NetworkManager/NetworkManager/log/?h=thaller/bug-rh-953300-remove_virtual_interfaces

I renamed the flag "delete_on_deactivate" to "is_owned_virtual_device".



It does not yet address the following issues:

(1) delete_link is called by the nm-device itself. This is contrary to the creation, where the nm-manager creates the device instead. Maybe the deletion should therefore also be done by the manager. OTOH the current patch is quite concise and clean.

(2) because delete_link has the responsibility for its cleanup it needs to call nm_device_get_is_owned_virtual_device. Maybe this could be replaced by some combination of nm_device_get_manager_managed and nm_device_get_is_virtual_device (where the latter does not exist).

(3) delete_link is still invoked asynchronously.


I know, we discussed these points already. But I still don't see the consensus. Should I change something, and *how* in detail?

Comment 12 Pavel Simerda 2013-08-21 10:57:28 UTC

(In reply to comment #11)
> I pushed a new version of the patches. Please see
> http://cgit.freedesktop.org/NetworkManager/NetworkManager/log/?h=thaller/bug-rh-953300-remove_virtual_interfaces
> 
> I renamed the flag "delete_on_deactivate" to "is_owned_virtual_device".

Better. Except that we tend to prefer s/virtual/software and of course except that I still think we could just use state=UNMANAGED for that but that can be fixed up later.

> It does not yet address the following issues:
> 
> (1) delete_link is called by the nm-device itself. This is contrary to the
> creation, where the nm-manager creates the device instead. Maybe the deletion
> should therefore also be done by the manager. OTOH the current patch is quite
> concise and clean.

Or both could be possibly done by nm-device (sounds better to me).

> (2) because delete_link has the responsibility for its cleanup it needs to call
> nm_device_get_is_owned_virtual_device. Maybe this could be replaced by some
> combination of nm_device_get_manager_managed and
> nm_device_get_is_virtual_device (where the latter does not exist).

Yes. The latter could be just a cached nm_platform_link_is_software() or hardcoded into software device types or even we could create a new NMSoftwareDevice or NMDeviceSoftware as a superclass for all software NMDevice subclasses.

> (3) delete_link is still invoked asynchronously.

It could be either way. Synchronous just sounds much simpler to me if there's no specific reason for async.

> I know, we discussed these points already. But I still don't see the consensus.
> Should I change something, and *how* in detail?

From my part, the patch is good enough. Whether you include the points I brought up is up to you, all of them can be fixed up later.

Comment 13 Dan Williams 2013-08-28 20:17:42 UTC

(In reply to comment #12)
> (In reply to comment #11)
> > I pushed a new version of the patches. Please see
> > http://cgit.freedesktop.org/NetworkManager/NetworkManager/log/?h=thaller/bug-rh-953300-remove_virtual_interfaces
> > 
> > I renamed the flag "delete_on_deactivate" to "is_owned_virtual_device".
> 
> Better. Except that we tend to prefer s/virtual/software and of course except
> that I still think we could just use state=UNMANAGED for that but that can be
> fixed up later.

Since the flag now has a different meaning (that the device is a software device that's created by NM) how about we change it to "nm_created" and use that flag in addition to priv->is_software to determine whether to delete on deactivate?

> > It does not yet address the following issues:
> > 
> > (1) delete_link is called by the nm-device itself. This is contrary to the
> > creation, where the nm-manager creates the device instead. Maybe the deletion
> > should therefore also be done by the manager. OTOH the current patch is quite
> > concise and clean.
> 
> Or both could be possibly done by nm-device (sounds better to me).

I looked into this, but it's not a clear win, because sometimes the software device is created by NM, and other times it's created in response to a Platform event.  Trying to combine those two cases such that if nm_device_bond_new() is called without an NMPlatformLink it creates the interface but if it's called with one it doesn't is somewhat weird, plus there are now side-effects from calling that function.  I think for now the current approach is suitable.

(it would also imply that system_create_virtual_device() needs to know that the nm_device_XXX_new() functions can trigger the link-added signal and thus block the handler, like it does now, but it's less clear this needs to happen when the nm_device_XXX_new() functions have side-effects)

> > (2) because delete_link has the responsibility for its cleanup it needs to call
> > nm_device_get_is_owned_virtual_device. Maybe this could be replaced by some
> > combination of nm_device_get_manager_managed and
> > nm_device_get_is_virtual_device (where the latter does not exist).
> 
> Yes. The latter could be just a cached nm_platform_link_is_software() or
> hardcoded into software device types or even we could create a new
> NMSoftwareDevice or NMDeviceSoftware as a superclass for all software NMDevice
> subclasses.

NM now sets is_software very early, so it should be very simple to add nm_device_get_is_software() if required.

> > (3) delete_link is still invoked asynchronously.
> 
> It could be either way. Synchronous just sounds much simpler to me if there's
> no specific reason for async.

It's async due to signal emission issues, because:

1) the manager listens to link-removed and calls remove_device() in response to it.  From a first look, this is actually exactly what we want because then the NMDevice goes  away when its link is deleted.  But looking further we see that this would happen *inside nm_device_deactivate()* which has other side-effects we really don't want triggered:

nm_device_deactivate()
  nm_platform_link_delete()
    <various platform functions>
    <platform emits link-removed>
    remove_device()
      nm_device_set_manager_managed()
        nm_device_set_managed_internal()
          nm_device_state_changed()
            nm_device_deactivate()
              OOPS

There's just way too many cases where attempting to do synchronous handling of the Platform's link-removed signal is going to cause problems like this, or other long callchains with unexpected effects.  One way to get around that is async.

2) There are places that call nm_device_deactivate() (like internal_activate_device() when starting another connection) where immediately deleting the device would be counter-productive, because the user just told us to start another connection on it.  Although maybe we *do* want to delete it here and then recreate it so that we get a clean interface state, since the kernel doesn't have any sort of "reset all bridge/bond attributes to the default".

So while I agree that synchronous handling would be a lot clearer, this all probably points to the fact that *if* we tried synchronous handling of device deletion, it shouldn't be done from nm_device_deactivate(), but probably instead from nm-manager.c::manager_device_state_changed().  Unfortunately, that's called from NMDevice's state handler, which could trigger the above callchain and end up changing state from within the state-changed handler, which is forbidden.  So the manager would just have to do it async from an idle handler anyway.

This turned out to be a lot harder than I thought it would be...

> > I know, we discussed these points already. But I still don't see the consensus.
> > Should I change something, and *how* in detail?
> 
> From my part, the patch is good enough. Whether you include the points I
> brought up is up to you, all of them can be fixed up later.

I think the approach is fine, see if the first suggestion about nm_owned + is_software seems worthwhile.

Comment 14 Thomas Haller 2013-08-30 18:28:01 UTC

I updated the branch and renamed the flag to is_nm_owned.

It is still here:
http://cgit.freedesktop.org/NetworkManager/NetworkManager/log/?h=thaller/bug-rh-953300-remove_virtual_interfaces

I tested it again, and it works correctly for my test cases.

Comment 15 Pavel Simerda 2013-08-30 19:08:47 UTC

Looks good to me.

Comment 16 Dan Williams 2013-09-04 16:14:26 UTC

Looks good to me too, please merge.

Comment 17 Thomas Haller 2013-09-04 16:30:29 UTC

Pushed to upstream master.

See

http://cgit.freedesktop.org/NetworkManager/NetworkManager/commit/?id=c6458cab3670c4361767c2dadd888b00161032ad
http://cgit.freedesktop.org/NetworkManager/NetworkManager/commit/?id=958ec36b96d995bedb1a237bf3317dfa8838c60d