Bug 757858 – [RFE] [review] EmergencyRollback for Network Configuration: bg/checkpoint-bgo757858-v2 [cockpit]

After an evaluation, GNOME has moved from Bugzilla to GitLab. Learn more about GitLab.
No new issues can be reported in GNOME Bugzilla anymore.
To report an issue in a GNOME project, go to GNOME GitLab.
Do not go to GNOME Gitlab for: Bluefish, Doxygen, GnuCash, GStreamer, java-gnome, LDTP, NetworkManager, Tomboy.

Bug 757858 - [RFE] [review] EmergencyRollback for Network Configuration: bg/checkpoint-bgo757858-v2 [cockpit]


Summary:	[RFE] [review] EmergencyRollback for Network Configuration: bg/checkpoint-bgo...


Status:	RESOLVED FIXED

Product:	NetworkManager
Classification:	Platform
Component:	general
Version:	git master
Hardware:	Other Linux

Importance:	Normal normal
Target Milestone:	---
Assigned To:	NetworkManager maintainer(s)
QA Contact:	NetworkManager maintainer(s)

URL:
Whiteboard:

Depends on:
Blocks:	nm-review

Reported:	2015-11-09 21:56 UTC by Thomas Haller
Modified:	2016-08-24 07:22 UTC

See Also:
GNOME target:	---
GNOME version:	---

Description Thomas Haller 2015-11-09 21:56:36 UTC

For Cockpit (and possibly other users) let's implement the following feature:

Cockpit modifies the network over the network. So, it wants to possibly rollback an invalid configuration. As Stef Walter explained, it's conceptually similar to when you change the resolution of your screen: if you don't accept the new resolution by clicking on a button withing a timeout, the change will be reverted to avoid breaking the display.

The idea is to add a D-Bus command EmergencyRollbackSchedule(). It receives a list of D-Bus paths for Devices and a timeout. You need elevated privileges to invoke this command (only root? Does Cockpit's bridge run as root?).
If the list of devices is empty, it is the same as providing *all* currently existing devices. "Scheduling" means, going over all specified devices and create an internal clone of the currently applied-connection. An "applied-connection" is the configuration which is currently subjected onto the activated device, i.e. the setting which was lastly activated on the device -- either after autoconnect or an explicit `nmcli connection up`.
This cloned connection is entirely internal state of NetworkManager, i.e. it is not visible via D-Bus nor stored to disk and it has no D-Bus path.
If the device is currently not activated, we remember instead that the device was down.
Then the function returns a Rollback-Handle to the caller.

The caller can cancel the rollback via EmergencyCallbackCancel(handle). In this case, we just forget the previously cloned data and nothing else happens.

If the timeout expires without being canceled, we will walk over the list of previously cloned connections. Of course, originally this connection was cloned based on a user-visible connection (e.g. /org/freedesktop/NetworkManager/Settings/42). If that connection still exists, we will update the content to restore its previous value (in case the settings connection changed in the meantime). If the connection was deleted in the meantime, we will add a new connection (with the same UUID, but different D-Bus Path).
Then we start activating the thus restored connections. If the device was deactivated during the snapshot, we also deactiate any possibly active connection on the device.

A privileged user could schedule multiple rollbacks at the same time.
If a new rollback request specified the same devices (non-empty intersection set), this becomes confusing.
From NM's point of view, there would be no problem to support that. But for the user it becomes confusing as it is unclear what the user wants to achieve.
Probably a EmergencyRollbackSchedule() should fail, if there is already another request for the same device(s) pending. If Cockpit would use EmergencyRollbackSchedule() before every NM command, it effectively could use it to reserve the instance and notice when another user tries to configure the network at the same time. So this could allow to synchronize concurrent access.

If the list of devices don't overlap, there is no real problem and we can have multiple rollbacks pending.

There should also be a EmergencyRollbackList() method to return all currently pending rollbacks (with their handles and device-lists).

Comments?

Comment 1 Beniamino Galvani 2015-11-10 14:43:04 UTC

(In reply to Thomas Haller from comment #0)
> The idea is to add a D-Bus command EmergencyRollbackSchedule(). It receives
> a list of D-Bus paths for Devices and a timeout. You need elevated
> privileges to invoke this command (only root? Does Cockpit's bridge run as
> root?).

Apparently it runs with the privileges of logged user, but is able to
escalate privileges if the system is configured to allow it:

http://cockpit-project.org/guide/latest/privileges.html

Maybe we should allow the API to be used by users having a given polkit
permission (org.freedesktop.NetworkManager.network-control ?).

> If the timeout expires without being canceled, we will walk over the list of
> previously cloned connections. Of course, originally this connection was
> cloned based on a user-visible connection (e.g.
> /org/freedesktop/NetworkManager/Settings/42). If that connection still
> exists, we will update the content to restore its previous value (in case
> the settings connection changed in the meantime). If the connection was
> deleted in the meantime, we will add a new connection (with the same UUID,
> but different D-Bus Path).
> Then we start activating the thus restored connections. If the device was
> deactivated during the snapshot, we also deactiate any possibly active
> connection on the device.

Makes sense to me. The only doubt I have is: at a given time for a
device we have a runtime state (the applied connection) and a
persistent state (the setting connection) which can be, in principle,
different. When we restore the device state after a timeout we reset
its persistent (and runtime) state to be the original runtime one.

I wonder if we should restore the original applied or setting
connection. The applied one probably is better because it was known to
be working.

> A privileged user could schedule multiple rollbacks at the same time.
> If a new rollback request specified the same devices (non-empty intersection
> set), this becomes confusing.
> From NM's point of view, there would be no problem to support that. But for
> the user it becomes confusing as it is unclear what the user wants to
> achieve.

I would forbid a new Schedule() for devices that already have a
pending one, because like you said it is confusing when multiple users
access the functionality at the same time (one user could snapshot the
temporary state of the other and the result depends only on the start
times and timeouts).

Comment 2 Dan Williams 2015-12-11 23:16:10 UTC

(In reply to Thomas Haller from comment #0)
> The idea is to add a D-Bus command EmergencyRollbackSchedule(). It receives
> a list of D-Bus paths for Devices and a timeout. You need elevated
> privileges to invoke this command (only root? Does Cockpit's bridge run as
> root?).

I'd drop "Emergency" from the D-Bus methods actually.  IMHO no need for that, it's not always an emergency when the config fails...

Anyway, in general it sounds like an OK approach, though I'd echo some of Beniamino's comments.  It does sound complicated to allow handle multiple rollbacks for the same device, so I would simply return an error there.  Independent rollbacks should be fine; we may even have 10s or 100s of interfaces on the system and they could be controlled independently.

> If the list of devices is empty, it is the same as providing *all* currently
> existing devices. "Scheduling" means, going over all specified devices and
> create an internal clone of the currently applied-connection. An
> "applied-connection" is the configuration which is currently subjected onto
> the activated device, i.e. the setting which was lastly activated on the
> device -- either after autoconnect or an explicit `nmcli connection up`.

When we get live reconfig, the applied connection will also be whatever was last reconfigured on the device too.  Might be worth thinking about live reconfig here, though I don't think it'll be much of an issue since NM already has the connection info.

> This cloned connection is entirely internal state of NetworkManager, i.e. it
> is not visible via D-Bus nor stored to disk and it has no D-Bus path.

Yes, I don't think it should be exposed anywhere.

> If the device is currently not activated, we remember instead that the
> device was down.
> Then the function returns a Rollback-Handle to the caller.
> 
> The caller can cancel the rollback via EmergencyCallbackCancel(handle). In
> this case, we just forget the previously cloned data and nothing else
> happens.
> 
> If the timeout expires without being canceled, we will walk over the list of
> previously cloned connections. Of course, originally this connection was
> cloned based on a user-visible connection (e.g.
> /org/freedesktop/NetworkManager/Settings/42). If that connection still
> exists, we will update the content to restore its previous value (in case
> the settings connection changed in the meantime). If the connection was
> deleted in the meantime, we will add a new connection (with the same UUID,
> but different D-Bus Path).

We could actually resurrect the old connection at the same path it used to be on, since at runtime NM never re-uses connection paths.

> Then we start activating the thus restored connections. If the device was
> deactivated during the snapshot, we also deactiate any possibly active
> connection on the device.

Another question is what happens if the restore activation fails...  I guess NM just leaves the device disconnected?  Or should it just put it back into the autoconnect pool?

> Probably a EmergencyRollbackSchedule() should fail, if there is already
> another request for the same device(s) pending. If Cockpit would use

Yes, I think it should fail.

> There should also be a EmergencyRollbackList() method to return all
> currently pending rollbacks (with their handles and device-lists).

Hmm, I'm not sure about this one.  I guess the use-case would be that if the thing that scheduled the rollback isn't a long-running service it could cache the handle and cancel it later.  Not sure if that's too useful though.  In any case, what would RollbackList() get used for?  You could cancel somebody else's rollback, which would be odd.

In general, the proposal sounds solid though.

Comment 3 Beniamino Galvani 2016-07-17 10:34:04 UTC

Please review branch bg/checkpoint-bgo757858.

Comment 4 Lubomir Rintel 2016-07-22 13:46:44 UTC

Looks fairly well. I've added some fixups.

I'm wondering -- why to checkpoints live in their own name space instead of being objects on the bus? That way the names would look more meaningful and it would be possible to list existing checkpoints easily.

Comment 5 Beniamino Galvani 2016-08-01 20:29:32 UTC

(In reply to Lubomir Rintel from comment #4)
> Looks fairly well. I've added some fixups.
> 
> I'm wondering -- why to checkpoints live in their own name space instead of
> being objects on the bus? That way the names would look more meaningful and
> it would be possible to list existing checkpoints easily.

Good point. Branch bg/checkpoint-bgo757858-v2 implements checkpoints as D-Bus objects. I've omitted the libnm part for the moment until we decide how to proceed.

Comment 6 Dan Williams 2016-08-05 22:23:21 UTC

> checkpoint: add create, rollback and destroy D-Bus API

Maybe Rollback() should return a list of successful devices and failed devices?  If the rollback does fail for some devices, I feel like the thing doing the checkpoint/rollback would probably want to know about the failed ones and why they failed.

In nm_checkpoint_rollback() the device may be <= UNMANAGED because it doesn't exist anymore, like if it was removed before rollback.  Maybe we should update the comment to account for that.  Also, has anyone tested that scenario?

Also for nm-checkpoint.c::get_property(), can we use nm_utils_g_value_set_object_path_array() instead of open-coding it?  Unfortunately we can't just pass the result of g_hash_table_get_values() to that function since it returns a GList not a GSList, but maybe it's a code savings.

Rest LGTM.

Comment 7 Beniamino Galvani 2016-08-08 15:01:19 UTC

(In reply to Dan Williams from comment #6)
> > checkpoint: add create, rollback and destroy D-Bus API
> 
> Maybe Rollback() should return a list of successful devices and failed
> devices?  If the rollback does fail for some devices, I feel like the thing
> doing the checkpoint/rollback would probably want to know about the failed
> ones and why they failed.

I've added a "a{su}" output argument to CheckpointRollback() which returns a result code for each device.

> 
> In nm_checkpoint_rollback() the device may be <= UNMANAGED because it
> doesn't exist anymore, like if it was removed before rollback.  Maybe we
> should update the comment to account for that.  Also, has anyone tested that
> scenario?

Yes, I've tested it. The device still exists (because we hold a reference to it), but is unrealized and so it's skipped, returning an error in the results dictionary.

Updated branch bg/checkpoint-bgo757858-v2.

Comment 8 Thomas Haller 2016-08-12 15:17:30 UTC

+    <!--
+        Created:
+
+        Timestamp of checkpoint creation expressed as the number of
+        seconds since the Unix Epoch.
+    -->

initialized via time(NULL), this value is affected by resetting the clock. I think it's not suitable to expose a created timestamp.

Also, the value is affected by the clock-speed. Thus, you cannot calculate the expected rollback-time reliably as "Created + RollbackTimeout".

I would do like NMAccessPoint:LastSeen, see https://cgit.freedesktop.org/NetworkManager/NetworkManager/tree/src/devices/wifi/nm-wifi-ap.c?id=6a8e46700f393ff8c9522dbd90bb30058b9d6d4b#n959








this whole create/rollback thing is of course complicated, and subtle issues can probably not avoided.
But I think some things are not correct:

for example:


1.) nm_checkpoint_rollback() does:

  if (!nm_device_is_real (device)) {
      continue;
  }

Scenario: you have a (say) vlan eth0.5 on top of eth0 device active and that is your working configuration. cockpit creates a snapshot of that.
then, the cockpit user, activates "Wired Connection 1" on top of eth0 (tearing down eth0.5). Acidentally, he borked his network, because he has only connectivity using vlan 5.
Rollback should fix it, but the is_real() check would just bail out.


2.) say, the user has wlan0 and eth0 connections. He is in an environment, where wlan0 works, but eth0 has an invalid gateway (so although locally eth0 seems fine, it won't work and steal the default route). At the beginning only wlan0 is active, all good.
Cockpit user now activates eth0 (in addition to wlan0). That makes the host unreachable. Rollback kicks in and it must deactivate eth0, which was previously active so that the sole working device wlan0 is again the device with the default route.
Ok, the code handls this scenario mostly, unless: if eth0 was initially unmanaged, then device_checkpoint_create() would not track eth0.

I think we must track when a device was unmanaged, and if it was unmanaged, whether it is unmanaged as NM_UNMANAGED_USER_EXPLICIT.

so, on rollback:

  if (device-was-unmanaged) {
      if (!device-is-real) {
          /* nothing to do */
      } else if (device-was-unamanged-by-user-explict) {
          /* ensure device is again unmanaged by-user-explicit */
          device-set-unmanaged-by-user-explict()
      } else if (device-is-managed-now) {
          /* device is managed now. We must unmanage it, do it via the
             by-user-explict-flag. */
          device-set-unmanaged-by-user-explict()
      } else {
          /* device was unmanaged (for other reasons then by-user-explict)
             and it is still unmanaged (for whatever reason).
             Nothing to do. */
      }
  }





in Rollback, you care about the connection by path, I think you must consider the UUID.
Scenario:
  - connection UUID1, PATH1 is active and we snapshot it.
  - user deletes connection with PATH1
  - user re-creates connection with UUID1 (he cannot determine the path, so it 
    gets a new path PATH2).
now, the Rollback code will try to add another connection with UUID1, which will fail!
I think connection D-Bus paths shall not matter. UUIDs matter.





+              if (!nm_manager_activate_connection (priv->manager,
+                                                   connection,
+                                                   NULL,
+                                                   device,
+                                                   subject,
+                                                   &local_error)) {
seems wrong, because it tries to re-authenticate the activation using subject. The rollback action is an internal privileged action. At that point, it must succeed.





I would actually snapshot both the applied connection and the setting connection, and restore them both. That requires a new way to activate a connection, one where you can provide the applied-connection separately.

Comment 9 Beniamino Galvani 2016-08-24 07:18:51 UTC

(In reply to Thomas Haller from comment #8)
> +    <!--
> +        Created:
> +
> +        Timestamp of checkpoint creation expressed as the number of
> +        seconds since the Unix Epoch.
> +    -->
> 
> initialized via time(NULL), this value is affected by resetting the clock. I
> think it's not suitable to expose a created timestamp.
> 
> Also, the value is affected by the clock-speed. Thus, you cannot calculate
> the expected rollback-time reliably as "Created + RollbackTimeout".
> 
> I would do like NMAccessPoint:LastSeen, see
> https://cgit.freedesktop.org/NetworkManager/NetworkManager/tree/src/devices/
> wifi/nm-wifi-ap.c?id=6a8e46700f393ff8c9522dbd90bb30058b9d6d4b#n959

I've update the D-Bus API and merged the branch. Let's open a new bug for the
other improvement you suggested.

https://cgit.freedesktop.org/NetworkManager/NetworkManager/commit/?id=6308c90b5efdb4d30508234a85864102f1e116a2

Comment 10 Beniamino Galvani 2016-08-24 07:22:08 UTC

Follow-up bug:

https://bugzilla.gnome.org/show_bug.cgi?id=770315