Bug 721971 - NetworkManager runs dispatcher scripts out of order
Status: RESOLVED WONTFIX
Product: NetworkManager
Classification: Platform
Component: general
Version: 0.9.x
Hardware: Other
OS: All
Priority: Normal
Severity: normal
Target Milestone: ---
Assigned To: NetworkManager maintainer(s)
QA Contact: NetworkManager maintainer(s)
Depends on:
Blocks:
Reported: 2014-01-11 00:51 UTC by Scott Shambarger
Modified: 2016-09-01 23:06 UTC
See Also:
GNOME target: ---
GNOME version: ---


Attachments:
    Initial patch with a prototype implementation (11.73 KB, patch) - 2014-01-12 10:02 UTC, Scott Shambarger
    Updated patch that tracks source of cancellation (14.89 KB, patch) - 2014-01-12 20:56 UTC, Scott Shambarger
    Cleaned up code a bit, better logging (14.44 KB, patch) - 2014-01-14 09:01 UTC, Scott Shambarger

Description Scott Shambarger 2014-01-11 00:51:12 UTC
I’ve been trying to get my dispatcher scripts to work around delays introduced by other scripts.  These delays sometimes lead to (for example) “down” scripts running after “up” scripts even though the most recent dispatch was for “up”.

I have some scripts that depend on others (which are package managed), and any number of variables may lead to delays.  I’ve tried several hacks in my scripts to work around the problem, but because a “down” script can be started after an “up” script even when “up” was the last action received, I’ve finally come to the conclusion that the dispatcher system, as currently exported, cannot be worked around.

However, there appears to be a rather simple fix.

nm-dispatcher-action spawns scripts for an action in a nice serial fashion.  The dispatcher stays running for at least as long as needed to finish the scripts (or time them out), so as long as existing scripts are running, it’s also accepting new dispatch actions.

A simple per-interface state held in the dispatcher could solve all the ordering problems I can think of.  Basically, any new action can cancel any scripts that shouldn’t be run given the new action (i.e. don’t run stale scripts).  I wouldn’t think it necessary to kill any single running script, just to prevent new scripts from spawning potentially out of order.  Having “down” scripts running when the interface is “up” again is just a Bad Thing(tm), so it seems beneficial to have these scripts never run (and certainly better than having them run after a more recent, valid, “up” script).

The logic I see is that any new action should cancel scripts that:
	- run the same action (to reset to the first script with updated state)
	- should have completed before the new action
	- would logically come after the new action

Looking at the NM dispatcher calls, actions that cancel more than themselves are (up/down works within the vpn space here too):
	UP: comes before DHCP changes, previous (PRE-)DOWN should have completed
	(PRE-)DOWN: previous UP/DHCP should have completed
	PRE-DOWN: previous DOWN should have completed
	DHCP: previous (PRE-)DOWN should have completed

The logic above leaves running any scripts that logically follow, or are independent of, the new action.  For example:
	DOWN: leaves existing PRE-DOWN
	DHCP: leaves existing UP
	DHCP4 and DHCP6 are independent

Issues I’m not sure of are (I could use some insight here):
	(PRE-)DOWN and VPN-(PRE-)DOWN are independent (true?)
	VPN and DHCP are independent (true?)

A basic table of actions canceled would be (HOSTNAME doesn’t have an interface, so not including it):

New                             Existing Actions Canceled (X)
Action          UP      VPN-UP  VPN-PD  VPN-DWN DHCP4   DHCP6   PRE-DWN DOWN
----------------------------------------------------------------------------
UP              X       -       -       -       X       X       X       X
VPN-UP          -       X       X       X       -       -       X       X
VPN-PRE-DOWN    -       X       X       X       -       -       -       -
VPN-DOWN        -       X       -       X       -       -       -       -
DHCP4-CHANGE    -       -       -       -       X       -       X       X
DHCP6-CHANGE    -       -       -       -       -       X       X       X
PRE-DOWN        X       X       -       -       X       X       X       X
DOWN            X       X       -       -       X       X       -       X

I know PRE-DOWN actions aren’t called yet, but I’m assuming they will be called before DOWN actions and that it’s OK for PRE-DOWN scripts to complete out of order with DOWN scripts.

My proposed implementation would be for the dispatcher process to keep a global hash keyed on interface.  The hash value would be an array of Instantiation counters (uint), one per action.  Whenever a new action starts, it bumps the Instantiation for itself and for any action it wants to cancel (obviously only for that interface).

Before the spawn callback spawns a script, it checks the Instantiation the action was started with against the action's current Instantiation.  If they differ, it skips running the script (leading to the script list completing and the dispatcher sending the results to dbus).
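As a rough model of this bookkeeping (a POSIX-shell sketch of the logic only; the real implementation would keep this state in C inside the dispatcher, and the key names here are made up):

```shell
#!/bin/sh
# Sketch of the proposed per-interface, per-action Instantiation counters.
# Keys are IFACE_ACTION strings, e.g. eth0_down.

inst_get() { eval "echo \${inst_$1:-0}"; }
inst_bump() { eval "inst_$1=\$(( \$(inst_get "$1") + 1 ))"; }

# A new action bumps its own counter plus the counter of each action
# it wants to cancel.
start_action() {
    for key in "$@"; do inst_bump "$key"; done
}

# Before spawning a script, compare the Instantiation the action was
# queued with against the current one; skip the script if they differ.
should_run() {  # should_run KEY SAVED_INST
    [ "$(inst_get "$1")" -eq "$2" ]
}

start_action eth0_down                      # "down" queued; counter -> 1
saved=$(inst_get eth0_down)
start_action eth0_up eth0_down eth0_pre_down eth0_dhcp4 eth0_dhcp6
                                            # "up" arrives, cancelling "down"
should_run eth0_down "$saved" || echo "skipping stale down script"
```

Here the late-arriving "up" invalidates the queued "down" scripts without killing any script that is already running, which is exactly the behavior described above.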

I'd love to hear any suggestions on a better way of doing this, but as far as I can tell (I haven't coded it yet :), this is the simplest solution to the problem.

Thanks.
Comment 1 Scott Shambarger 2014-01-11 00:59:20 UTC
And, of course, tab sizes in the post form are different than the presented size (wish there was a preview-post)... here's a spaced version of the table:

New                             Existing Actions Canceled (X)
Action          UP      VPN-UP  VPN-PD  VPN-DWN DHCP4   DHCP6   PRE-DWN DOWN
----------------------------------------------------------------------------
UP              X       -       -       -       X       X       X       X 
VPN-UP          -       X       X       X       -       -       X       X
VPN-PRE-DOWN    -       X       X       X       -       -       -       -
VPN-DOWN        -       X       -       X       -       -       -       -
DHCP4-CHANGE    -       -       -       -       X       -       X       X
DHCP6-CHANGE    -       -       -       -       -       X       X       X
PRE-DOWN        X       X       -       -       X       X       X       X
DOWN            X       X       -       -       X       X       -       X
Comment 3 Scott Shambarger 2014-01-11 01:28:15 UTC
Sorry for the double table (timed out) -- they are the same.

To clarify... the scripts I'm talking about are fast, but "up" can be triggered less than a second after "down", so it doesn't take much of a delay to cause havoc :).  As long as a stale (e.g. "down") script isn't _started_ after a newly triggered (e.g. "up") script, I think any other races are probably acceptable.
Comment 4 Scott Shambarger 2014-01-12 10:02:05 UTC
Created attachment 266054 [details] [review]
Initial patch with a prototype implementation

I've created a patch with a proposed implementation of ordered action cancellation.  I've tested this on a couple of my test servers, and it works quite well.  I also wrote a small program to test the new code which I can provide if anyone's interested (it just adds and queries 100s of random entries).

Please let me know about any changes you think might improve the code, or about any alternative solutions.

Thanks.
Comment 5 Scott Shambarger 2014-01-12 20:56:11 UTC
Created attachment 266081 [details] [review]
Updated patch that tracks source of cancellation

Reworked the cancel tracking to report which action cancelled the script.
Comment 6 Scott Shambarger 2014-01-14 09:01:24 UTC
Created attachment 266236 [details] [review]
Cleaned up code a bit, better logging
Comment 7 Dan Winship 2014-04-18 15:47:36 UTC
Comment on attachment 266236 [details] [review]
Cleaned up code a bit, better logging

obsoleted by just-committed dispatcher changes
Comment 8 Dan Winship 2014-04-18 15:51:01 UTC
with the changes from bug 727439, scripts are now at least fully serialized, though there is still the issue that scripts may be run at a point when the underlying state is wrong.

from that bug:

danw:
> So actually, we need to decide which invariant we want:
> 
>   1. all scripts are always dispatched, even if the device is no longer
>      in the right state for the script when it is dispatched (current
>      behavior)
> 
>   2. scripts are only dispatched if the associated state is still correct
>      at the time the script would be started (and it's possible your
>      DOWN script will be dispatched even if your UP script wasn't)
>      (bug 721971)
> 
>   3. scripts are only dispatched if the associated state is still correct
>      at the time the script would be started, but if your UP script gets
>      skipped then your DOWN script will be skipped too.
> 
> 3 seems more consistent than 2, but perhaps there are situations where you want
> to guarantee that you always get *some* notification when the network state
> changes, even if you missed it?

thaller:
> I strongly vote for 1.
> 
> Otherwise it's hard to do anything reliable in the scripts because some might
> get skipped.
> 
> 
> It is quite expected that the state can change at any time, so when you are in an
> UP script, the state could already have changed to something else. This is not
> avoidable. Even if we check that the state is still current before calling
> g_spawn, it could change a moment later.
> 
> A script, if it really cares about the current state, has to check it anew
> anyway. And with 1., it can see that the state changed and behave as with 2. or
> 3.
> 
> But with 2. or 3. it cannot behave like 1.

dcbw:
> I think #1 is my choice too, since it seems the most consistent.  Bad scripts
> will be bad no matter what we do, so if some script mishandles states then we
> should make sure we point the blame finger at the script and get that fixed in
> the right place.
> 
> We could also have the dispatcher listen for new state signals on the device
> it's dispatching for, and put that new state into an environment variable so
> that scripts would at least be aware that the state had changed.  Possibly a
> future enhancement, but talking of state changes also assumes that we don't
> block on script execution, right?
> 
> My thought for a long time was that since some scripts do take a while to
> execute (NFS shutdown comes to mind) that we should just make NM block certain
> state changes like SECONDARIES->ACTIVATED and DEACTIVATING->DISCONNECTED until
> all the scripts were run, and if that made things really slow, well, again we
> point the blame finger where it's due and make script authors fix their stuff. 
> A large timeout is effectively the same thing as blocking on scripts until
> they're done too.  Thoughts?
Comment 9 Dan Winship 2014-04-18 16:49:56 UTC
looking at the dispatcher scripts I have on my system:

    - 00-netreport: re-dispatches to an older Fedora mechanism.
      Possibly unused these days?

    - 01-dnssec-trigger-hook, 10-sendmail, and 20-squid all take
      the same action on both "up" and "down", which is essentially
      "synchronize to the current state of the network". So in the
      case where the network goes up and back down again before
      the dispatcher can run the script with "up", all 3 options
      would have the same effect (since the network state after the
      interface goes down would be the same as it was before the
      interface went up, so running the script would have no effect,
      regardless of whether we ran it twice, once, or not at all).

    - 04-iscsi and 12-dhcpcd: reload their services on "up" and do
      nothing on "down". Again, assuming that the reload action
      essentially does "synchronize to the current state of the
      network" (and they apparently don't mind being out of sync
      when the network is down), then all 3 options should behave
      the same, for the same reason as above.

    - 11-dhclient: re-dispatches to dhclient-specific scripts, to
      notify services that config data from a DHCP server either
      has been received, or else is no longer relevant. Again, all
      3 options should have the same effect here; if an interface goes
      up and down before we get a chance to run the "up" scripts, then
      we don't need to tell the services about the config data, because
      it's already no-longer-relevant. And running "down" without "up"
      should also be ok, assuming the service can deal with being
      told "don't talk to ntp.example.com any more" even when it
      was never told to talk to ntp.example.com in the first place.

    - 20-chrony: tells chronyd that the system is online when a
      device goes up with the default route, and tells it the system
      is offline when a device goes down and there is no longer any
      default route.

      It seems to me like this might have race conditions in case
      2 or 3 (due to the asymmetry between the up and down cases with
      respect to the default route), but every time I try to write one
      out, it turns out that it actually does work correctly...

    - 30-winbind: runs "smbcontrol winbind online" on "up" and
      "smbcontrol winbind offline" on "down". Doesn't appear to deal
      correctly with multiple interfaces at all. For the single-
      interface case, this is the same as 01-dnssec-trigger-hook, etc.
          

I think one take-home lesson from this analysis is that we ought to be providing some environment variables to the dispatcher scripts indicating "network has just gone up", "network is up", "network has just gone down", "network is down", so they can use that rather than trying to figure out the online/offline state themselves.

Beyond that, other than the buggy 30-winbind, and possibly 20-chrony, it appears all the scripts on my system handle any of the 3 cases correctly, although #1 (the current system) is the least efficient (sometimes running the scripts even when they will have no effect on anything).
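The environment-variable idea above might look like this from a script's perspective; NM_NETWORK_STATE is an invented name used purely for illustration, not a variable NetworkManager actually sets:

```shell
#!/bin/sh
# Hypothetical sketch only: NM_NETWORK_STATE stands in for the suggested
# "network has just gone up / is up / just gone down / is down" signal.

decide() {
    case "${NM_NETWORK_STATE:-unknown}" in
        just-up|up)     echo "synchronize to online state" ;;
        just-down|down) echo "synchronize to offline state" ;;
        *)              echo "no state information, probe manually" ;;
    esac
}

NM_NETWORK_STATE=just-up
decide
```

A script written this way never has to derive the online/offline state from interfaces and routes itself, which is the failure mode seen in 30-winbind.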
Comment 10 Thomas Haller 2014-04-18 16:53:50 UTC
Helpful analysis!!

Btw, I did not vote against adding new environment variables and more information for the scripts (if that helps them).

I voted against skipping some scripts.

And I pointed out that the environment variables might already be out of date when the script looks at them (which is inevitable).
Comment 11 Dan Winship 2014-04-18 17:13:54 UTC
(In reply to comment #10)
> Btw. I did not vote against adding any new environment variables and more
> information to the scripts (if it helps them).

Right, I was just pointing that out as something extra I realized while doing the analysis.
Comment 12 Thomas Haller 2016-09-01 23:06:08 UTC
As it is now, scripts are serialized and called in order; no parallel execution happens.
Of course, that means an event might be obsolete by the time the script is called. But such a race is unavoidable, and the script must be robust against it.

The only exception to strict ordering is scripts that are marked as "no-wait" by being a symlink into /etc/NetworkManager/dispatcher.d/no-wait.d.
For example, /etc/NetworkManager/dispatcher.d/pre-up.d/10-ifcfg-rh-routes.sh is a symlink to ../no-wait.d/10-ifcfg-rh-routes.sh.
These scripts are always called right away for each event. As such, a no-wait script runs in parallel with other scripts and even with itself. Of course, the race against networking state changes exists here as well, and moreover these scripts race against other scripts and themselves.
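A sketch of how such a no-wait script could be wired up, shown in a scratch directory standing in for /etc/NetworkManager/dispatcher.d (the script name 90-log-event.sh is made up):

```shell
#!/bin/sh
# Demonstrate the no-wait convention: the real script lives in no-wait.d,
# and a relative symlink from an event directory (here pre-up.d) marks it
# as no-wait, i.e. dispatched immediately and outside the serialized order.

root=$(mktemp -d)                       # stands in for /etc/NetworkManager/dispatcher.d
mkdir -p "$root/no-wait.d" "$root/pre-up.d"

cat > "$root/no-wait.d/90-log-event.sh" <<'EOF'
#!/bin/sh
# Hypothetical no-wait script: just log the event ($1=device, $2=action).
logger "NM event: device=$1 action=$2"
EOF
chmod +x "$root/no-wait.d/90-log-event.sh"

# The relative symlink is what opts the script out of strict ordering:
ln -s ../no-wait.d/90-log-event.sh "$root/pre-up.d/90-log-event.sh"
ls -l "$root/pre-up.d/"
```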



I am not convinced by the solution in comment 1 of canceling other scripts. First of all, it's hard to understand and explain. But even then, it doesn't solve any races, because the very moment the dispatcher spawns a script, the state could change again. In the end, a script must be robust against that -- which may be complicated. The only way to avoid it would be to literally block NetworkManager until the script completes, but that sounds very bad.


The current behavior seems sufficient. At least, it's not clear what to improve in detail.

I am thus closing this as WONTFIX. Please re-open and restate the requested behavior if there is something to add. Thanks.