Bug 387832 - implement additional 0.9 device states for pre-down/DEACTIVATING
Status: RESOLVED FIXED
Product: NetworkManager
Classification: Platform
Component: general
Version: unspecified
Hardware: Other
OS: All
Importance: Normal enhancement
Target Milestone: ---
Assigned To: Dan Williams
QA Contact: NetworkManager maintainer(s)
Duplicates: 561778 600167 605579 650925 672979 690565 729833
Depends on:
Blocks: 681301 nm-0.9.10
 
 
Reported: 2006-12-20 09:20 UTC by Michel Dänzer
Modified: 2016-08-02 15:17 UTC
See Also:
GNOME target: ---
GNOME version: ---


Attachments
[1/6] Add a method to call dispatcher scripts synchronously (5.84 KB, patch)
2009-10-13 09:28 UTC, Christian Becke
[2/6] Add pre-up hook support (2.70 KB, patch)
2009-10-13 09:30 UTC, Christian Becke
Add pre-down hook support (2.58 KB, patch)
2009-10-13 09:30 UTC, Christian Becke
[3/6] Add support for return codes from dispatcher scripts (6.26 KB, patch)
2009-10-13 09:36 UTC, Christian Becke
[5/6] pre-up hook: use return code from dispatcher scripts (998 bytes, patch)
2009-10-13 09:39 UTC, Christian Becke
[6/6] pre-down hook: use return code from dispatcher script (1.15 KB, patch)
2009-10-13 09:43 UTC, Christian Becke

Description Michel Dänzer 2006-12-20 09:20:26 UTC
An increasing number of applications use NetworkManager to detect network status, so they can e.g. (re-)connect to remote services whenever a network connection is (re-)established. That's very nice.

However, it would be even nicer if there was a mechanism whereby NetworkManager would warn applications that the network connection is about to go away, so they can prepare for offline operation. E.g. XChat-GNOME could disconnect cleanly from IRC servers, and Evolution could switch to offline mode. As the latter can take a long time, there should also be a way for apps to acknowledge that they are ready for offline operation. Only once all interested apps have acknowledged this should NetworkManager disable the network connection.
Comment 1 Dan Williams 2008-11-21 13:45:21 UTC
*** Bug 561778 has been marked as a duplicate of this bug. ***
Comment 2 Dan Williams 2009-09-03 23:59:49 UTC
For reference:

http://mail.gnome.org/archives/networkmanager-list/2009-August/msg00012.html

Body of the message for posterity:
===================================

There are two reasons I've not yet added pre-up and pre-down.  They are:

1) latency - do we expect NM to block the connect process on these
scripts?  If so, for how long?  Should NM have a timeout for each script
before it kills the script and proceeds?  Should NM run them in parallel
or in series?

Each script's execution time makes the connection attempt longer.  Badly
written scripts can make the connection attempt never succeed, hence the
need for a timer and killing of scripts that exceed that timer, if only
to make it clear what is making the connection take longer than usual.

2) appropriateness
    a) many of the things people used to do in pre-up or pre-down
scripts (munging routing tables, other stuff) are better done by
*modifying the connection config itself*
    b) by the time any pre-down script will run, often the connection
has already gone away (the AP is out of range, the cable has been
unplugged already, etc) so any operation a pre-down script does *cannot*
depend on the interface being up; it must gracefully fail.  Common
things people wanted to do here were unmount network shares; but since
the script must always handle unexpected disconnects (which not all
network file systems do well), you might as well just run this from
post-down anyway.

With 0.7 and multiple active devices, I'm not against *pre-down* scripts
per se, with the following caveats:

1) any pre-down operation is *advisory*, and the network interface that
the script is called for may no longer be "connected" (associated, cable
plugged in, etc).  The scripts have to understand and account for that.

2) NM wouldn't wait more than 2 or 3 seconds for *all* pre-down
operations to complete, before removing the IP addresses from the
interface or routes from the routing table, because the scripts could be
blocking a re-activation of the device to a new network.  There is room
for optimization here, ie if there is no other connection that the
device can possibly activate at this time (the cable is still unplugged,
or the rfkill switch is flipped, or the device was manually disconnected
at the user's request) then the timeout can be extended somewhat, but
certainly not to infinity.

3) This implies that pre-down scripts would have to run in parallel, so
as to complete within the time given.

The implementation for pre-down would re-factor the dispatcher to create
a "context" for each operation, a UUID for which would be returned to NM
in the response to the dispatch dbus method call.  In NM, the NMDevice
object would then start a timer of 2 or 3 seconds, and cancel that timer
if the dispatcher signaled that it was done running the scripts.
Otherwise, NM would instruct the dispatcher to cancel any running
scripts in that "context" (via the UUID) if the timer fired before the
dispatcher "context" was complete.  NM would then proceed to remove the
IP configuration information from the interface.

Basically, allowing arbitrary "pre-up" and "pre-down" scripts opens the
door to more bug reports and requires more diagnostics when stuff goes
wrong.  Thus the requirement to *do it right*: ensure that when
somebody writes these scripts incorrectly, the user does not suffer
the consequences, and that the guilty party (the script) is correctly
identified.

And, as always happens with timeouts, somebody will inevitably ask for
the timeout to be extended because "my use-case just takes a second
longer" without thinking about the actual impact of their request for
everyone else (e.g. DHCP timeouts), and without fixing the actual root
cause why they need a longer timeout.  That means yet more time spent
writing mails and replying to bug reports.
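
For reference, the timer-and-cancel flow described above maps naturally onto GLib's main loop. A minimal sketch, with every name (DispatchContext, dispatcher_cancel_context, and so on) hypothetical rather than NetworkManager's actual code:

#include <glib.h>

typedef struct {
    char  *uuid;        /* context UUID returned by the dispatcher */
    guint  timeout_id;  /* GLib timeout source */
} DispatchContext;

static void
dispatcher_cancel_context (const char *uuid)
{
    /* would ask the dispatcher to kill scripts still running in 'uuid' */
    g_message ("canceling dispatcher context %s", uuid);
}

/* Fires if the scripts exceed the 2-3 second budget */
static gboolean
pre_down_timeout_cb (gpointer user_data)
{
    DispatchContext *ctx = user_data;

    ctx->timeout_id = 0;
    dispatcher_cancel_context (ctx->uuid);
    /* ... proceed to remove IP configuration from the interface ... */
    return FALSE;  /* one-shot: remove the timeout source */
}

/* Called if the dispatcher signals completion before the timer fires */
static void
pre_down_done_cb (DispatchContext *ctx)
{
    if (ctx->timeout_id) {
        g_source_remove (ctx->timeout_id);
        ctx->timeout_id = 0;
    }
    /* ... proceed to remove IP configuration from the interface ... */
}

static void
start_pre_down (DispatchContext *ctx)
{
    ctx->timeout_id = g_timeout_add_seconds (3, pre_down_timeout_cb, ctx);
}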
Comment 3 Michel Dänzer 2009-09-08 16:33:41 UTC
Thanks, but I wasn't thinking of pre/post-up/down scripts. Maybe it helps if I re-phrase the use case I'm thinking of:

The user knows he's going to lose the network connection soon, e.g. because he's taking the notebook along on a trip or just suspending the machine for the night. He would like his apps to cleanly shut down established network connections. Instead of switching a number of apps to offline mode individually, he would like to trigger a single 'go offline' switch[0]. That switch would prompt NM to tell interested apps they should go offline. E.g. in my GNOME session, Evolution would synchronize local copies of mail from IMAP servers, XChat-GNOME and Empathy would cleanly disconnect from chat servers, and Galeon would just switch to offline mode. The interested apps would let NM know when they're done[1], and when all of them are, NM would disable the network connection.

[0] Maybe this could simply be unchecking the nm-applet 'Enable Networking' checkbox, or maybe something more explicit would be needed.

[1] Evolution's synchronization of IMAP accounts for offline use can take several seconds to minutes here, so an arbitrary timeout might defeat the purpose of the whole scheme. However a way for the user to abort the protocol and go offline immediately might be useful.
Comment 4 Christian Becke 2009-10-13 09:28:44 UTC
Created attachment 145332 [details] [review]
[1/6] Add a method to call dispatcher scripts synchronously

I am attaching a series of patches implementing a method to call dispatcher scripts synchronously and making pre-up and pre-down events available to dispatcher scripts.
See http://mail.gnome.org/archives/networkmanager-list/2009-October/msg00101.html for reference.

This first patch adds a method to call dispatcher scripts synchronously:

typedef void (*NMUtilsDispatcherDoneCallback) (gpointer userdata);

void nm_utils_call_dispatcher_sync (const char *action,
                                    NMConnection *connection,
                                    NMDevice *device,
                                    const char *vpn_iface,
                                    NMUtilsDispatcherDoneCallback callback,
                                    gpointer userdata);

When the dispatcher scripts return, the callback is called with userdata as argument. The callback is also invoked if there is a DBus error or the scripts take too long to run. The timeout for the scripts is currently set to 10s.
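
A hypothetical caller of this proposed API might look like the sketch below; the callback body (and the state-change call inside it) is illustrative, not part of the patches:

static void
pre_down_done_cb (gpointer userdata)
{
    NMDevice *device = NM_DEVICE (userdata);

    /* Reached when the scripts return, on a D-Bus error, or after the
     * 10 s timeout, so it is safe to continue deactivating here. */
    nm_device_state_changed (device,
                             NM_DEVICE_STATE_DISCONNECTED,
                             NM_DEVICE_STATE_REASON_NONE);
}

static void
begin_pre_down (NMDevice *device, NMConnection *connection)
{
    nm_utils_call_dispatcher_sync ("pre-down", connection, device,
                                   NULL /* vpn_iface */,
                                   pre_down_done_cb, device);
}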
Comment 5 Christian Becke 2009-10-13 09:30:16 UTC
Created attachment 145333 [details] [review]
[2/6] Add pre-up hook support

This adds support for a pre-up hook.
Comment 6 Christian Becke 2009-10-13 09:30:59 UTC
Created attachment 145335 [details] [review]
Add pre-down hook support

This adds support for a pre-down hook.
Comment 7 Christian Becke 2009-10-13 09:36:38 UTC
Created attachment 145337 [details] [review]
[3/6] Add support for return codes from dispatcher scripts

This adds a gboolean argument to the NMUtilsDispatcherDoneCallback, indicating whether the dispatcher scripts ran successfully. If at least one of the dispatcher scripts returned a non-zero exit code, success will be FALSE; otherwise TRUE.

New signature:
typedef void (*NMUtilsDispatcherDoneCallback) (gboolean success, gpointer userdata);
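
An illustrative callback matching the new signature:

static void
dispatcher_done_cb (gboolean success, gpointer userdata)
{
    if (!success) {
        /* at least one dispatcher script exited non-zero */
        g_warning ("dispatcher scripts failed");
    }
}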
Comment 8 Christian Becke 2009-10-13 09:39:44 UTC
Created attachment 145338 [details] [review]
[5/6] pre-up hook: use return code from dispatcher scripts

If the dispatcher scripts called on pre-up return FALSE, device state is set to NM_DEVICE_STATE_FAILED, with reason NM_DEVICE_STATE_REASON_PRE_UP_DISPATCHER_FAILED.
Comment 9 Christian Becke 2009-10-13 09:43:21 UTC
Created attachment 145340 [details] [review]
[6/6] pre-down hook: use return code from dispatcher script

This makes the pre-down hook use the return code from the dispatcher scripts. Currently, the only action taken if the scripts return != 0 is to log a warning.
Comment 10 Christian Becke 2009-10-13 09:45:17 UTC
Sorry, I messed up the numberings. Should be:

Add pre-down hook support          => [3/6] Add pre-down hook support
[3/6] Add support for return codes => [4/6] Add support for return codes
Comment 11 Michel Dänzer 2009-11-05 14:40:39 UTC
Thanks, but again I'm not sure this kind of script can really address the use cases I'm thinking of.
Comment 12 Dan Williams 2009-11-10 06:09:48 UTC
(In reply to comment #11)
> Thanks, but again I'm not sure this kind of script can really address the use
> cases I'm thinking of.

What kind of use-cases were those?
Comment 13 Michel Dänzer 2009-11-13 01:40:49 UTC
(In reply to comment #12)
> What kind of use-cases were those?

The ones I described here before...
Comment 14 Dan Williams 2010-01-20 22:39:05 UTC
*** Bug 605579 has been marked as a duplicate of this bug. ***
Comment 15 Dan Williams 2010-01-20 22:39:18 UTC
*** Bug 600167 has been marked as a duplicate of this bug. ***
Comment 16 hedges 2010-01-20 22:46:53 UTC
Why not run if-pre-up.d, if-up.d, if-down.d and if-post-down.d in phases pre-up, up, down, and post-down simply by calling `run-parts` at those stages as they are in ifupdown?  If scripts fail, they fail; if they take too long, they take too long - then users should report bugs against those packages.  There's no need to re-invent the wheel.  Thank you, in either case, for pursuing a fix that conforms to ifupdown behavior and provides the interface control framework that many Debian and Ubuntu packages have come to expect.
Comment 17 Dan Williams 2010-01-20 23:09:07 UTC
(In reply to comment #16)
> If scripts fail, they fail; if they take too long,
> they take too long - then users should report bugs against those packages. 

The problem is that quite a few users would *not* see it as the script taking too long, and would not report the bug against that package.  It's the same problem with Firefox crashes, for example.  Half the crashes they have are because Flash or some other plugin did something wrong and crashed the browser.  But the users just see that Firefox crashed; they don't know that Flash did it.

Same thing here.  The user tries to connect to a network, and the connection stalls because some script underneath is blocking.  The user has no idea what scripts are running underneath (nor really should they), but all they see is that they are not getting connected to the network.  NM should obviously log long-running scripts to syslog, but how many users are going to immediately go look at syslog output if their network is hanging?  Not very many I'd suspect.

So it's not simply a question of coding it, it's also a question of how users will behave when it's done, and what we can do to minimize user confusion when those scripts simply don't work correctly.

Because the users *will* blame NetworkManager.  Not the scripts.
Comment 18 hedges 2010-01-21 00:24:26 UTC
(In reply to comment #17)
> So it's not simply a question of coding it, it's also a question of how users
> will behave when it's done, and what we can do to minimize user confusion when
> those scripts simply don't work correctly.
> 
> Because the users *will* blame NetworkManager.  Not the scripts.

NetworkManager used to run-parts the if-*.d directories like ifupdown, with all four phases.  It might be improved with threads, timeouts and so forth, but taking away the old behavior has not improved the user experience for working packages in Debian and Ubuntu that install those scripts.  Maybe it makes sense to start with run-parts, the way it was, or one of the above contributed patches, and then work out the problems, to avoid disrupting working packages that expected the phases to work like ifupdown.

Here's some ideas to minimize user confusion:

- Include a tooltip window above DHCP sync animation in notification area, updated with `watch -n 3 grep NetworkManager $syslog_file | tail -n 10`

- Triage the bug reports... yeah well the network is a core service, so that's what happens.

- Bug button and a hook for the vendor's bug report system, then for instance the Debian maintainer can add a script that feeds `tail -n 300 $syslog_file` into `reportbug`.

- Bug button provides on-screen instructions on how to look at the syslog for problems in other scripts.

Thank you for being open to options.  It's expected stuff like this will get tossed around using a testing distribution but a lot of packages depend on ifupdown-like behavior of those directories.

--mark--
Comment 19 Dan Williams 2011-05-25 17:00:07 UTC
*** Bug 650925 has been marked as a duplicate of this bug. ***
Comment 20 Dan Williams 2012-04-28 16:35:29 UTC
*** Bug 672979 has been marked as a duplicate of this bug. ***
Comment 21 Orion Poplawski 2012-05-16 20:40:38 UTC
I *think* what the original poster was thinking of would be a dbus signal that would go out warning all dbus listeners (applications) that the network was going down.  Could be done I suppose, but you still need to decide how long to wait before actually terminating the network.  I doubt there is any way to know in advance who cares, so as to be able to wait for responses (if desired).

I find myself wishing for this again, but the cause is of course problematic behavior of other applications/services when the network goes away.  This would really just be a band-aid for that instead of making those apps/services more robust.  But sometimes a bandaid is nice when you're bleeding... :)
Comment 22 Dan Williams 2012-06-12 20:02:33 UTC
I started looking into this again and reworking some of the internal code and dispatcher stuff along all these lines.  The idea with NM 0.9 obviously is to use the DEACTIVATING state for this purpose.  When IRC clients or network shares or whatever see that state via the D-Bus signal, they can perform those pre-down operations, and after a short delay NM will take the connection down fully.

Obviously this signal will only be sent in the case that the user initiates a disconnect themselves (either by choosing another connection manually from the menu, or suspend/resume, or shutting down, or choosing "Disconnect").  If the connection is terminated for external reasons (loss of wifi signal, no carrier after a timeout, PPP died, etc) then the connection is already gone, there's no possibility to disconnect cleanly, and the device will move to the FAILED state.
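
From the client side, watching for that state with the 0.9-era libnm-glib API might look roughly like this sketch (it ignores devices added later and all error handling, and the header names assume pkg-config include paths):

#include <glib.h>
#include <nm-client.h>
#include <nm-device.h>

static void
state_changed_cb (NMDevice *device,
                  guint new_state, guint old_state, guint reason,
                  gpointer user_data)
{
    if (new_state == NM_DEVICE_STATE_DEACTIVATING) {
        /* user-initiated disconnect: send IRC QUIT, sync mail,
         * unmount shares, ... before NM finishes taking it down */
    }
}

int
main (void)
{
    NMClient *client;
    const GPtrArray *devices;
    guint i;

    g_type_init ();  /* required on GLib of that era */
    client = nm_client_new ();
    devices = nm_client_get_devices (client);
    for (i = 0; devices && i < devices->len; i++)
        g_signal_connect (g_ptr_array_index (devices, i), "state-changed",
                          G_CALLBACK (state_changed_cb), NULL);
    g_main_loop_run (g_main_loop_new (NULL, FALSE));
    return 0;
}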
Comment 23 Glenn Washburn 2012-08-06 11:52:08 UTC
I would definitely like to have a pre-up action from the dispatcher.  I would like to make changes to the interface which can't be done while it's up.  Bringing the interface up and down in an "up" action seems like a bad idea and hard to do correctly.
Comment 25 Pavel Simerda 2012-08-24 14:18:35 UTC
I second this request for both the pre-up way (actually there's already a tracker bug 681301 for issues like this) and also for the D-Bus way.

One thing we definitely need for D-Bus is to have some configurable
timeout. It might be nice to also have a timeout for the scripts. They
can be run in parallel or sequentially.

But this is important to get NetworkManager to users that actually require this.

One use case is that when I close my notebook, I want to gracefully shutdown
application connections (e.g. IM tools) so that I don't ever lose any messages
that arrive in the following seconds or minutes.
Comment 26 Pavel Simerda 2012-11-04 14:27:37 UTC
What's the current status? I remember some discussion about this and thought
it was already implemented. And Gentoo is still using a patch for 0.9.6.4.

http://sources.gentoo.org/cgi-bin/viewvc.cgi/gentoo-x86/net-misc/networkmanager/files/networkmanager-0.9.2.0-pre-sleep.patch?view=log

http://sources.gentoo.org/cgi-bin/viewvc.cgi/gentoo-x86/net-misc/networkmanager/networkmanager-0.9.6.4.ebuild?revision=1.3&view=markup
Comment 27 Orion Poplawski 2012-12-19 21:20:43 UTC
Ping?  I could really use this to unmount sec=krb5 nfs4 mounts before the network disappears as they cannot be unmounted afterwards.
Comment 28 Dan Williams 2013-01-11 21:24:14 UTC
Both of these are in-progress; the 'pending2' branch is somewhat related.  The larger problem here is that we have a behavior of deactivating the current connection if you start another connection on the interface, and internally, that's synchronous, which precludes any kind of 'pre-down' stage.  'pending2' is an in-progress branch that is working towards removing that restriction.
Comment 29 hedges 2013-01-12 06:39:52 UTC
Is the interface being brought up or down available to scripts in an environment variable?  If you added all the other network-manager state information, like which interfaces were managed and what their state is, you could make it the script's responsibility to decide what to do if multiple interfaces are up or down.  That could be useful for setting up complex dynamic networking.

Super glad this is in the pipeline.  You rock!
Comment 30 Pavel Simerda 2013-01-14 13:38:46 UTC
(In reply to comment #29)
> Is the interface being brought up or down available to scripts in an
> environment variable?  If you added all the other network-manager state
> information, like which interfaces were managed and what their state is, you
> could make it the script's responsibility to decide what to do if multiple
> interfaces are up or down.

Possibly yes, but this is not that simple and needs quite a bit of analysis and work.

Btw, I didn't test whether you can interact with NetworkManager properly from the script. Is it possible, or is NM blocked at that time?

> That could be useful for setting up complex dynamic networking.

+1
Comment 31 Peter V. Toth 2013-07-26 07:51:47 UTC
The current workaround in the /etc/NetworkManager/dispatcher.d/01ifupdown script is incorrect, since only the pre-up/pre-down cases are commented out while their 'post' counterparts are still in effect. So the [ifupdown] managed=false/true option of /etc/NetworkManager/NetworkManager.conf acts as managed=false/partially. Disabling only one side of any up/down functionality is worse than doing nothing.
Comment 32 Pavel Simerda 2013-07-26 09:31:02 UTC
(In reply to comment #27)
> I could really use this to unmount sec=krb5 nfs4 mounts before the
> network disappears as they cannot be unmounted afterwards.

Just be careful about excessive assumptions, as this only applies to situations where you (the administrator) requested the disconnect. Most other situations 
are somehow hardware-related like cable unplugged, lost signal, network equipment turned off, etc. Tools like NFS should expect to be disconnected at any time and behave gracefully in that situation.
Comment 33 Orion Poplawski 2013-07-26 17:16:26 UTC
(In reply to comment #32)
> (In reply to comment #27)
> > I could really use this to unmount sec=krb5 nfs4 mounts before the
> > network disappears as they cannot be unmounted afterwards.
> 
> Just be careful about excessive assumptions, as this only applies to situations
> where you (the administrator) requested the disconnect. Most other situations 
> are somehow hardware-related like cable unplugged, lost signal, network
> equipment turned off, etc. Tools like NFS should expect to be disconnected at
> any time and behave gracefully in that situation.

Oh, absolutely.  But the chances of NFS handling disconnected networks entirely gracefully in our lifetime are slim.
Comment 34 Tony Groves 2013-10-26 00:42:45 UTC
I'd like to vote for this bug. On my Debian laptop, I have to remember to unmount all CIFS (Samba) connections before shutting down, otherwise the shutdown takes forever (OK, five minutes) as it tries to unmount them after the wireless connection is gone.
Comment 35 Tom Hanks 2013-11-24 05:15:42 UTC
I'm voting for this bug too. Please add at least a pre-down hook.
Comment 36 Romano Giannetti 2014-03-12 18:19:29 UTC
+1, same reason as Tony Groves.

I expected to be able to unmount network shares before suspend in pm hooks 00-49 (see for example https://wiki.archlinux.org/index.php/pm-utils --- network should be available there), but I discovered that this is not true (I suppose NM disconnects asynchronously when the suspend signal goes out).
Comment 37 Dan Williams 2014-04-22 14:50:53 UTC
We recently implemented the DEACTIVATING state internally for 0.9.9+.  The next step is to create a dispatcher event for that, and block the DEACTIVATING -> DISCONNECTED transition on the dispatcher event completing.
Comment 38 Dan Winship 2014-04-24 17:42:09 UTC
*** Bug 690565 has been marked as a duplicate of this bug. ***
Comment 39 Dan Williams 2014-05-22 14:08:47 UTC
Work for this feature is happening in the dcbw/dispatcher-blocking branch.
Comment 40 Dan Williams 2014-05-23 23:52:42 UTC
Posted for review.

All the commits up to the first 'dispatcher' commit are cleanups and fixes that I think are worthwhile.  The real commits for this bug start with the first 'dispatcher' commit.  I know there's a lot of commits (sorry!) but most of them are trivial and don't change operation in meaningful ways.
Comment 41 Jiri Klimes 2014-05-26 11:31:40 UTC
Compilation errors:
* it doesn't compile with GLib 2.32 due to G_SPAWN_DEFAULT:
  vpn-manager/nm-vpn-service.c: In function 'nm_vpn_service_daemon_exec':
  vpn-manager/nm-vpn-service.c:212:63: error: 'G_SPAWN_DEFAULT' undeclared (first use in this function)
https://mail.gnome.org/archives/commits-list/2013-August/msg05131.html
Use 0 instead, or define G_SPAWN_DEFAULT in include/nm-glib-compat.h (a possible compat define is sketched at the end of this comment)

* > dispatcher: add synchronous dispatcher calls
needs
--- a/src/nm-dispatcher.c
+++ b/src/nm-dispatcher.c
@@ -324,7 +324,7 @@ _dispatcher_call (DispatcherAction action,
                nm_log_dbg (LOGD_DISPATCH, "dispatching action '%s'",
                            action_to_string (action));
        } else {
-               g_return_val_if_fail (NM_IS_DEVICE (device), NULL);
+               g_return_val_if_fail (NM_IS_DEVICE (device), FALSE);
 
                nm_log_dbg (LOGD_DISPATCH, "(%s) dispatching action '%s'",
                            nm_device_get_iface (device), action_to_string (action));
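
One possible shape for the compat define suggested above, assuming G_SPAWN_DEFAULT first appeared in GLib 2.38:

/* in include/nm-glib-compat.h, after <glib.h> has been included */
#if !GLIB_CHECK_VERSION (2, 38, 0)
#define G_SPAWN_DEFAULT ((GSpawnFlags) 0)
#endif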
Comment 42 Jiri Klimes 2014-05-26 11:47:01 UTC
Pushed a commit renaming 'nm-dispatcher.action' to 'nm-dispatcher' in  contrib/fedora/rpm/NetworkManager.spec. We should remember to adjust specs when updates in Fedora/RHEL are made.
Comment 43 Thomas Haller 2014-05-26 12:07:51 UTC
(In reply to comment #42)
> Pushed a commit renaming 'nm-dispatcher.action' to 'nm-dispatcher' in 
> contrib/fedora/rpm/NetworkManager.spec. We should remember to adjust specs when
> updates in Fedora/RHEL are made.

Thanks Jirka, but could we squash this commit into "dispatcher: rename executable to 'nm-dispatcher'"? Otherwise the spec file is broken for all commits in between.
Comment 44 Thomas Haller 2014-05-26 14:29:48 UTC
>> dispatcher: only dispatch if scripts exist

move the lines:
     /* VPN actions require at least an IPv4 config (for now) */
     if (action == DISPATCHER_ACTION_VPN_UP)
          g_return_val_if_fail (vpn_ip4_config != NULL, NULL);

before
     if (do_dispatch == FALSE) {




>> dispatcher: add synchronous dispatcher calls

     if (do_dispatch == FALSE) {
-         if (blocking == FALSE) {
+         if (blocking == FALSE && (out_call_id || callback)) {
               info = g_malloc0 (sizeof (*info));

no need to schedule anything, if nobody cares about the result.


-nm_dispatcher_call_cancel (gconstpointer call)
+nm_dispatcher_call_cancel (gconstpointer call_id)
 {
-    /* 'call' is really a DispatchInfo pointer, just opaque to callers.
+    /* 'call_id' is really a DispatchInfo pointer, just opaque to callers.


Btw. this approach has a chance of cancelling the wrong call, because the original call_id might have been freed and malloc could return the same pointer again in a later callout.



>> vpn: queue additional VPN connections

-         g_assert (start_next_vpn (service, NULL));
+         started = start_next_vpn (service, NULL);
+         g_assert (started);

g_assert expressions should be side-effect free.



>> vpn: remove pointless child watch on VPN service daemons

If the plugin dies early and never gets to the point where it acquires the dbus name, this commit changes behaviour in that we always wait 5 seconds for _daemon_exec_timeout().



>> vpn: implement PRE_DOWN dispatcher actions (bgo #387832)

I have doubts about this _sync call. When NM blocks, dispatcher scripts cannot call to NM (e.g. via nmcli). I think, using nmcli to talk to NM from the dispatcher scripts sounds useful. Do the scripts know, when NM is blocked? I think they don't. If they would know it, they could at least workaround it -- although it requires more complex documentation.




>> core: re-order NMDevice functions to match current idoms

is this a trivial moving around of code? It should be. Could you mention "trivial" in the commit message?

This and later cleanups will be painful when diffing/backporting patches, but well...



>> core: let NMDevice export itself

I would prefer the name nm_device_dbus_export() -- Even if existing classes already use simple *_export().



>> core: implement PRE_UP dispatcher actions

shouldn't this also somewhere handle the case that something downs the connection while we callout for PRE_UP? I would expect to cancel the callout and set dispatcher_id to NULL.



>> core: fix deactivation of assumed connections on device removal (bgo #729833)

why did you put this patch here? Maybe close bug 729833 as duplicate of this bz?

commit message:
  "...which causes errors in libnm-glib clients when they cannot create teh"
"teh"



Also pushed 3 fixup commits. The rest looks fine for now.
Comment 45 Dan Williams 2014-05-28 20:24:55 UTC
*** Bug 729833 has been marked as a duplicate of this bug. ***
Comment 46 Dan Williams 2014-05-28 20:31:44 UTC
(In reply to comment #41)
> Compilation errors:
> * it doesn't compile with GLib 2.32 due to G_SPAWN_DEFAULT:
>   vpn-manager/nm-vpn-service.c: In function 'nm_vpn_service_daemon_exec':
>   vpn-manager/nm-vpn-service.c:212:63: error: 'G_SPAWN_DEFAULT' undeclared
> (first use in this function)
> https://mail.gnome.org/archives/commits-list/2013-August/msg05131.html
> Use 0 instead, or define G_SPAWN_DEFAULT in include/nm-glib-compat.h
> 
> * > dispatcher: add synchronous dispatcher calls
> needs
> --- a/src/nm-dispatcher.c
> +++ b/src/nm-dispatcher.c
> @@ -324,7 +324,7 @@ _dispatcher_call (DispatcherAction action,
>                 nm_log_dbg (LOGD_DISPATCH, "dispatching action '%s'",
>                             action_to_string (action));
>         } else {
> -               g_return_val_if_fail (NM_IS_DEVICE (device), NULL);
> +               g_return_val_if_fail (NM_IS_DEVICE (device), FALSE);
> 
>                 nm_log_dbg (LOGD_DISPATCH, "(%s) dispatching action '%s'",
>                             nm_device_get_iface (device), action_to_string
> (action));

Fixed.

(In reply to comment #42)
> Pushed a commit renaming 'nm-dispatcher.action' to 'nm-dispatcher' in 
> contrib/fedora/rpm/NetworkManager.spec. We should remember to adjust specs when
> updates in Fedora/RHEL are made.

Yes, my fault, I should have done that.  Thanks for the fixup!
Comment 47 Dan Williams 2014-05-28 22:14:58 UTC
(In reply to comment #44)
> >> dispatcher: only dispatch if scripts exist
> 
> move the lines:
>      /* VPN actions require at least an IPv4 config (for now) */
>      if (action == DISPATCHER_ACTION_VPN_UP)
>           g_return_val_if_fail (vpn_ip4_config != NULL, NULL);
> 
> before
>      if (do_dispatch == FALSE) {

Done.

> >> dispatcher: add synchronous dispatcher calls
> 
>      if (do_dispatch == FALSE) {
> -         if (blocking == FALSE) {
> +         if (blocking == FALSE && (out_call_id || callback)) {
>                info = g_malloc0 (sizeof (*info));
> 
> no need to schedule anything, if nobody cares about the result.

Done.

> -nm_dispatcher_call_cancel (gconstpointer call)
> +nm_dispatcher_call_cancel (gconstpointer call_id)
>  {
> -    /* 'call' is really a DispatchInfo pointer, just opaque to callers.
> +    /* 'call_id' is really a DispatchInfo pointer, just opaque to callers.
> 
> 
> Btw. this approach has a chance to cancelling the wrong call, because the
> original call_id might have been freed and malloc could return the same pointer
> again in a later callout.

Excellent point.  See the new commit "dispatcher: fix potential problems canceling dispatcher calls".

> >> vpn: queue additional VPN connections
> 
> -         g_assert (start_next_vpn (service, NULL));
> +         started = start_next_vpn (service, NULL);
> +         g_assert (started);
> 
> g_assert expressions should be side-effect free.

Changed to a g_warn_if_fail(), since we expect it to always be TRUE, but it's OK if it's not I guess (since the service should already be started).

> >> vpn: remove pointless child watch on VPN service daemons
> 
> If the plugin dies early and never gets to the point where it acquires the dbus
> name, this commit changes behaviour in that we always wait 5 seconds for
> _daemon_exec_timeout().
> 
> 
> 
> >> vpn: implement PRE_DOWN dispatcher actions (bgo #387832)
> 
> I have doubts about this _sync call. When NM blocks, dispatcher scripts cannot
> call to NM (e.g. via nmcli). I think, using nmcli to talk to NM from the
> dispatcher scripts sounds useful. Do the scripts know, when NM is blocked? I
> think they don't. If they would know it, they could at least workaround it --
> although it requires more complex documentation.

NM *only* makes blocking dispatcher calls on exit, when the mainloop is already gone.  So at that point there's no way a script could call NM, since NM will just quit soon anyway.  If we were to do this quitting stuff while still in the mainloop, that would allow any kind of calls (even ActivateConnection or Update) that could depend on callbacks to complete, and we can't easily guarantee that all those callbacks would complete without a lot more code.  So I think blocking on shutdown is the best thing to do here.

(you could imagine a pending actions thing for quitting too, where every operation that scheduled an idle/callback/etc, like Update() or ActivateConnection or something also added a pending action, and NM only quit when all the pending actions were removed.  This would allow dispatcher scripts to call back into NM, but I think it's tricky to get right and I'm not sure it's that useful?)

The idea of doing blocking on shutdown is to ensure that we have correct balance, and that we can always guarantee that whenever NetworkManager deconfigures an interface (like when quitting) that dispatcher scripts like NFS stuff can cleanly handle the disconnection, before the network connection goes away.  

We could also indicate to dispatcher scripts through an environment variable that NM is quitting, or we could add another NM_STATE_QUITTING and push the NM state to dispatcher scripts, so that scripts could avoid calling nmcli on shutdown.  Happy to do that patch if you'd like.

> >> core: re-order NMDevice functions to match current idoms
> 
> is this a trivial moving around of code? It should be. Could you mention
> "trivial" in the commit message?

Done.

> This and later cleanups will be painful when diffing/backporting patches, but
> well...

It turns out git actually does pretty well with this...  I don't think it minds applying hunks out-of-order as long as the context matches.  Yeah, it's a bit of a pain, but it's something I've wanted to do for a long time so I just went ahead and did it...  nm-device is one of the oldest files in NetworkManager, and it didn't really keep pace with the times.

> >> core: let NMDevice export itself
> 
> I would prefer the name nm_device_dbus_export() -- Even if existing classes
> already use simple *_export().

Done.

> >> core: implement PRE_UP dispatcher actions
> 
> shouldn't this also somewhere handle the case that something downs the
> connection while we callout for PRE_UP? I would expect to cancel the callout
> and set dispatcher_id to NULL.

nm_device_state_changed() (later changed to _set_state_full()) calls dispatcher_cleanup() whenever a state change happens, which will cancel any outstanding dispatcher calls.  So that should handle the case where the connection goes down while still in pre-up.

> >> core: fix deactivation of assumed connections on device removal (bgo #729833)
> 
> why did you put this patch here? Maybe close bug 729833 as duplicate of this
> bz?

Duped.

> commit message:
>   "...which causes errors in libnm-glib clients when they cannot create teh"
> "teh"

Fixed.

> Also pushed 3 fixup commits. The rest looks fine for now.

I squashed all the fixups, they were good.  Thanks!
Comment 48 Dan Williams 2014-05-29 21:28:27 UTC
I didn't change the VPN service child watch removal patch in the end.  I tried to add back the child watch just to catch the race condition there, but that has problems: if you ever pass G_SPAWN_DO_NOT_REAP_CHILD, which is required for a child watch, you can't ever remove the child watch until you exit, or the child will become a zombie.  So that was a bit complicated.

However, if the child *does* exit before it claims the D-Bus service, then something is clearly wrong with the process or configuration and it's very likely to just die again the next time it's started.  I think it's acceptable to impose a 5-second penalty, which has the nice side-effect that nothing tries to continuously respawn the service just to have it exit again.  Does that seem OK to you?
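
For context, the GLib pattern being weighed here, as a sketch with illustrative function names: a child watch only works if the child was spawned with G_SPAWN_DO_NOT_REAP_CHILD, and once that flag is used, the watch has to stay installed so GLib can reap the child:

#include <glib.h>

static void
child_exited_cb (GPid pid, gint status, gpointer user_data)
{
    /* GLib has already reaped the child when this runs */
    g_spawn_close_pid (pid);
}

static gboolean
spawn_vpn_service (char **argv, GError **error)
{
    GPid pid;

    if (!g_spawn_async (NULL, argv, NULL,
                        G_SPAWN_DO_NOT_REAP_CHILD,  /* required for a watch */
                        NULL, NULL, &pid, error))
        return FALSE;

    /* Removing this watch before the child exits would leave a zombie,
     * which is the complication described above. */
    g_child_watch_add (pid, child_exited_cb, NULL);
    return TRUE;
}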
Comment 49 Dan Winship 2014-05-30 18:15:28 UTC
> vpn: add DEACTIVATING state

Having "Private" or "Internal" in the name of the private state enum would be good

>+	case NM_VPN_STATE_DEACTIVATING: {
>+		/* Map DEACTIVATING to ACTIVATED to allow handling the state asynchronously */
>+		return NM_VPN_CONNECTION_STATE_ACTIVATED;
>+	}

It's not at all clear what mapping DEACTIVATING to ACTIVATED has to do with handling the state asynchronously. (Also, you don't need the {}s.)

>+		g_signal_emit (connection, signals[INTERNAL_STATE_CHANGED], 0,
>+			           new_external_state,
>+			           old_external_state,
>+			           reason);

indent

>+	/* FIXME: remove when VPNs can be queued */

g_queue_push_tail (queue, vpn);

:-)

(when VPN *activation* can be queued?)



> vpn: implement DEACTIVATING state

"implement" is a strong word... it just adds an extra no-op state in there...



> vpn: queue additional VPN connections

>+start_next_vpn (NMVPNService *self, GError **error)
>+{
>+	NMVPNServicePrivate *priv = NM_VPN_SERVICE_GET_PRIVATE (self);
>+
>+	if (!priv->active && priv->pending) {

If priv->active is non-NULL, then start_next_vpn() will try to start it a second time. So really, I think that should be:

    g_return_if_fail (priv->active == NULL);

    if (priv->pending) {
        ...



> vpn: stop all connections of a service outside of dispose

Confusing commit message... how about

    vpn: don't stop connections from NMVPNService:dispose
    
    Future patches will allow blocking dispatcher calls, which we
    don't want to happen when deactivating a VPN connection during
    normal operation.  So move code that stops VPN connections outside
    of the VPNService object's dispose() function and require the object
    that owns the VPNService (the VPNManager) to stop connections at the
    right times.

    When quitting, blocking calls are acceptable (because NetworkManager's
    D-Bus interface is no longer useful, plus we can't easily schedule
    callbacks because no mainloop is running), so it's ok to stop
    connections from NMVPNManager:dispose.



> core: re-order NMDevice functions to match current idoms

I'm going to just assume the commits that look like trivial reorganizations actually are trivial reorganizations...



> core: let NMDevice export itself

>+	nm_log_info (LOGD_CORE, "(%s): exported as %s", priv->iface, priv->path);

LOGD_DEVICE?



> core: rearrange and remove some NMDevice getters

seems a little weird to remove the public getters for functions where we still have public setters, even if no one is (currently) calling them...



> core: fix deactivation of assumed connections on device removal (bgo #729833)

Heh, I was running into this the other day while testing danw/slavemonitor

>    3) if the device is not activated, to ensure that it's IPv6 parameters

"its"



> core: consolidate generic device cleanup tasks

>+	if (deconfigure) {
>+		/* Clear legacy IPv4 address property */
>+		if (priv->ip4_address) {

That doesn't seem like it should depend on deconfigure. We've already cleared priv->ip4_config; priv->ip4_address must need to be cleared as well.



> dispatcher: only dispatch if scripts exist

I think a bigger problem is, what happens if you have 10 dispatcher scripts, but none of them cares about pre-up or pre-down? Looking through the journal, a single dispatcher run seems to take about 2 seconds on my machine. Can we afford to slow down every activation by that much?

We may need some way to make the blocking states be opt-in.



> dispatcher: add synchronous dispatcher calls

>+ * nm_dispatcher_call():

no "()" (on all of the new function docs)

>-nm_dispatcher_call_cancel (gconstpointer call)
>+nm_dispatcher_call_cancel (gconstpointer call_id)
> {
> 	/* 'call' is really a DispatchInfo pointer, just opaque to callers.

'call_id'



> dispatcher: fix potential problems canceling dispatcher calls
>    
> Thomas pointed out that using the address of the DispatcherInfo
> structure as the dispatcher call ID could cause a mis-cancelation
> if malloc re-used the same block in the future.  Fix that by
> changing to a uint call ID instead.

That could only happen if the code was already confused/buggy. (It's passing a freed pointer to nm_dispatcher_call_cancel()... if malloc *didn't* reuse the block, then NM would just crash.)

If the code is keeping track of things properly, and making sure that it clears its references to call_ids after the callback runs or after it cancels them, then there's no problem.



> dispatcher: convert action_to_string to a table

>+	if (action >= 0 && action < G_N_ELEMENTS (action_table))
>+		return action_table[action];
> 	g_assert_not_reached ();

It would be more idiomatic to rewrite that to:

    g_assert (action >= 0 && action < G_N_ELEMENTS (action_table));
    return action_table[action];



> vpn: implement PRE_DOWN dispatcher actions (bgo #387832)

This patch also makes DOWN be synchronous on quit, in addition to adding PRE_DOWN support.

>+		/* Just proceed on errors */
>+		dispatcher_pre_down_done (0, connection);

no problems with state-changing-reentrancy here?
Comment 50 Dan Williams 2014-05-30 20:44:24 UTC
(In reply to comment #49)
> > vpn: add DEACTIVATING state
> 
> Having "Private" or "Internal" in the name of the private state enum would be
> good

Done.

> >+	case NM_VPN_STATE_DEACTIVATING: {
> >+		/* Map DEACTIVATING to ACTIVATED to allow handling the state asynchronously */
> >+		return NM_VPN_CONNECTION_STATE_ACTIVATED;
> >+	}
> 
> It's not at all clear what mapping DEACTIVATING to ACTIVATED has to do with
> handling the state asynchronously. (Also, you don't need the {}s.)

Expanded comment.

> >+		g_signal_emit (connection, signals[INTERNAL_STATE_CHANGED], 0,
> >+			           new_external_state,
> >+			           old_external_state,
> >+			           reason);
> 
> indent

Fixed.

> >+	/* FIXME: remove when VPNs can be queued */
> 
> g_queue_push_tail (queue, vpn);
> 
> :-)
> 
> (when VPN *activation* can be queued?)

Updated the comment.

> > vpn: implement DEACTIVATING state
> 
> "implement" is a strong word... it just adds an extra no-op state in there...

Changed to "implement placeholder"

> > vpn: queue additional VPN connections
> 
> >+start_next_vpn (NMVPNService *self, GError **error)
> >+{
> >+	NMVPNServicePrivate *priv = NM_VPN_SERVICE_GET_PRIVATE (self);
> >+
> >+	if (!priv->active && priv->pending) {
> 
> If priv->active is non-NULL, then start_next_vpn() will try to start it a
> second time. So really, I think that should be:
> 
>     g_return_if_fail (priv->active == NULL);
> 
>     if (priv->pending) {
>         ...

How about now?  I split this into start_pending_vpn() and start_active_vpn() so that the uses are explicitly clear.  start_pending_vpn() asserts that priv->active == NULL.

> > vpn: stop all connections of a service outside of dispose
> 
> Confusing commit message... how about
> 
>     vpn: don't stop connections from NMVPNService:dispose
> 
>     Future patches will allow blocking dispatcher calls, which we
>     don't want to happen when deactivating a VPN connection during
>     normal operation.  So move code that stops VPN connections outside
>     of the VPNService object's dispose() function and require the object
>     that owns the VPNService (the VPNManager) to stop connections at the
>     right times.
> 
>     When quitting, blocking calls are acceptable (because NetworkManager's
>     D-Bus interface is no longer useful, plus we can't easily schedule
>     callbacks because no mainloop is running), so it's ok to stop
>     connections from NMVPNManager:dispose.

Done

> > core: re-order NMDevice functions to match current idoms
> 
> I'm going to just assume the commits that look like trivial reorganizations
> actually are trivial reorganizations...

Yeah, I tried hard to make sure I didn't change anything except remove prototypes from the top.

> > core: let NMDevice export itself
> 
> >+	nm_log_info (LOGD_CORE, "(%s): exported as %s", priv->iface, priv->path);
> 
> LOGD_DEVICE?

Done.

> > core: rearrange and remove some NMDevice getters
> 
> seems a little weird to remove the public getters for functions where we still
> have public setters, even if no one is (currently) calling them...

I added nm_device_get_is_nm_owned() back; I don't think the get_unmanaged_flag() one is really necessary since we already have nm_device_get_managed().

> > core: fix deactivation of assumed connections on device removal (bgo #729833)
> 
> Heh, I was running into this the other day while testing danw/slavemonitor

This patch *has* been on 'nm-review' for 2 weeks, so... :)  But I was going to modify that part of the code anyway, so I thought I'd suck it in.

> >    3) if the device is not activated, to ensure that it's IPv6 parameters
> 
> "its"

Done.

> > core: consolidate generic device cleanup tasks
> 
> >+	if (deconfigure) {
> >+		/* Clear legacy IPv4 address property */
> >+		if (priv->ip4_address) {
> 
> That doesn't seem like it should depend on deconfigure. We've already cleared
> priv->ip4_config; priv->ip4_address must be cleared as well.

Done.

> > dispatcher: only dispatch if scripts exist
> 
> I think a bigger problem is, what happens if you have 10 dispatcher scripts,
> but none of them cares about pre-up or pre-down? Looking through the journal, a
> single dispatcher run seems to take about 2 seconds on my machine. Can we
> afford to slow down every activation by that much?
> 
> We may need some way to make the blocking states be opt-in.

The short answer here is, "don't use dispatcher scripts" or "uninstall stuff you don't actually need".  Part of the purpose is to push back on dispatcher scripts to make sure they don't do stupid things.

However, I'm happy to modify this patchset to add new /etc/NetworkManager/dispatcher.d/pre-up.d/ and /etc/NetworkManager/dispatcher.d/pre-down.d/ directories, where the new events only execute stuff in those directories.  That would allow opting into the event, without existing scripts dragging everything down.  Does that sound OK?

> > dispatcher: add synchronous dispatcher calls
> 
> >+ * nm_dispatcher_call():
> 
> no "()" (on all of the new function docs)

Fixed.

> >-nm_dispatcher_call_cancel (gconstpointer call)
> >+nm_dispatcher_call_cancel (gconstpointer call_id)
> > {
> > 	/* 'call' is really a DispatchInfo pointer, just opaque to callers.
> 
> 'call_id'

Fixed.

> > dispatcher: fix potential problems canceling dispatcher calls
> >    
> > Thomas pointed out that using the address of the DispatcherInfo
> > structure as the dispatcher call ID could cause a mis-cancelation
> > if malloc re-used the same block in the future.  Fix that by
> > changing to a uint call ID instead.
> 
> That could only happen if the code was already confused/buggy. (It's passing a
> freed pointer to nm_dispatcher_call_cancel()... if malloc *didn't* reuse the
> block, then NM would just crash.)
> 
> If the code is keeping track of things properly, and making sure that it clears
> its references to call_ids after the callback runs or after it cancels them,
> then there's no problem.

Clarified the comment; I think it's useful because (a) it's more like glib sources to use guint, and (b) it shouldn't crash now, at least.

> > dispatcher: convert action_to_string to a table
> 
> >+	if (action >= 0 && action < G_N_ELEMENTS (action_table))
> >+		return action_table[action];
> > 	g_assert_not_reached ();
> 
> It would be more idiomatic to rewrite that to:
> 
>     g_assert (action >= 0 && action < G_N_ELEMENTS (action_table));
>     return action_table[action];

Changed.

> > vpn: implement PRE_DOWN dispatcher actions (bgo #387832)
> 
> This patch also makes DOWN be synchronous on quit, in addition to adding
> PRE_DOWN support.

Good catch, updated the commit string.  Or did you want two separate commits here?

> >+		/* Just proceed on errors */
> >+		dispatcher_pre_down_done (0, connection);
> 
> no problems with state-changing-reentrancy here?

Not with VPNs, mainly because not as much is done in the state change function and nothing responds to VPN state change signals really.  The problem with NMDevice is the signal in the middle of the function, that a *ton* of stuff listens to.  So if something that listens to the signal triggers a state change, then the code below that signal emission that should have been run for the original state change, now won't get run until after the *second* state change.  We could fix that for NMDevice by re-organizing nm_device_state_changed() if we can do that easily.

Re-pushed with the above changes.
Comment 51 Dan Williams 2014-06-02 19:19:08 UTC
(In reply to comment #50)
> (In reply to comment #49)
> > I think a bigger problem is, what happens if you have 10 dispatcher scripts,
> > but none of them cares about pre-up or pre-down? Looking through the journal, a
> > single dispatcher run seems to take about 2 seconds on my machine. Can we
> > afford to slow down every activation by that much?
> > 
> > We may need some way to make the blocking states be opt-in.
> 
> The short answer here is, "don't use dispatcher scripts" or "uninstall stuff
> you don't actually need".  Part of the purpose is to push back on dispatcher
> scripts to make sure they don't do stupid things.
> 
> However, I'm happy to modify this patchset to add new
> /etc/NetworkManager/dispatcher.d/pre-up.d/ and
> /etc/NetworkManager/dispatcher.d/pre-down.d/ directories, where the new events
> only execute stuff in those directories.  That would allow opting into the
> event, without existing scripts dragging everything down.  Does that sound OK?

Implemented this approach in "dispatcher: use separate directories for pre-up/pre-down events".
Comment 52 Dan Williams 2014-06-02 21:35:57 UTC
Also a few new patches that push the script elapsed time back to NetworkManager for debug logging; though if people want, I can drop those patches and just leave the logging in the dispatcher itself.
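
A sketch of what the per-script timing could look like inside the dispatcher (the helper name is illustrative; the real patches report the value back over D-Bus):

#include <glib.h>

static guint
run_and_time_script (const char *path)
{
    GTimer *timer = g_timer_new ();  /* starts timing immediately */
    guint elapsed_ms;

    /* ... spawn 'path' and wait for it to exit ... */

    elapsed_ms = (guint) (g_timer_elapsed (timer, NULL) * 1000.0);
    g_message ("script %s took %u ms", path, elapsed_ms);
    g_timer_destroy (timer);
    return elapsed_ms;
}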
Comment 53 Thomas Haller 2014-06-03 08:22:37 UTC
>> dispatcher: robustify canceling dispatcher calls

(In reply to comment #49)
> > dispatcher: fix potential problems canceling dispatcher calls
> >    
> > Thomas pointed out that using the address of the DispatcherInfo
> > structure as the dispatcher call ID could cause a mis-cancelation
> > if malloc re-used the same block in the future.  Fix that by
> > changing to a uint call ID instead.
> 
> That could only happen if the code was already confused/buggy. (It's passing a
> freed pointer to nm_dispatcher_call_cancel()... if malloc *didn't* reuse the
> block, then NM would just crash.)

if malloc didn't reuse the block, then nothing would happen, because of:
    if (g_slist_find (requests, call_id))
         ((DispatchInfo *) call_id)->callback = NULL;
so, when malloc actually reused the block, we might cancel the wrong call. Which is the only error.

Already this might be improved to:
    if (g_slist_find (requests, call_id))
         ((DispatchInfo *) call_id)->callback = NULL;
    else
        g_return_if_reached ();
so that we might find the bug.


> If the code is keeping track of things properly, and making sure that it clears
> its references to call_ids after the callback runs or after it cancels them,
> then there's no problem.

I now agree mostly with danw, and was wrong before.

The potential benefit of this code is that it is much less likely to cancel the wrong call in the presence of a bug in the caller, because first we have to overflow guint and then actually hit a duplicate call_id -- which is a lot more unlikely than having a reused malloc block.

Another benefit when logging the call id, is that we really log some unique id (ignoring integer overflow).

Anyway. I pushed two fixup commits.
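
The guint call-id scheme under discussion, sketched with illustrative names (not the actual nm-dispatcher.c code):

#include <glib.h>

typedef struct {
    guint call_id;
    /* callback, user data, ... */
} DispatchInfo;

static GSList *requests;          /* outstanding DispatchInfo pointers */
static guint   next_call_id = 1;  /* 0 is reserved for "no call" */

static guint
dispatcher_call_start (void)
{
    DispatchInfo *info = g_malloc0 (sizeof (*info));

    if (next_call_id == 0)  /* guint overflow: skip the reserved 0 */
        next_call_id = 1;
    info->call_id = next_call_id++;
    requests = g_slist_prepend (requests, info);
    return info->call_id;
}

static void
dispatcher_call_cancel (guint call_id)
{
    GSList *iter;

    for (iter = requests; iter; iter = iter->next) {
        DispatchInfo *info = iter->data;

        if (info->call_id == call_id) {
            /* info->callback = NULL; ... */
            return;
        }
    }
    /* a stale guint can never alias a freed-and-reused malloc block;
     * an unknown id is a caller bug, so make it visible */
    g_return_if_reached ();
}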
Comment 54 Thomas Haller 2014-06-03 09:48:59 UTC
>> dispatcher: use separate directories for pre-up/pre-down events


     if (dir) {
          item->has_scripts = !!g_dir_read_name (dir);
          g_dir_close (dir);
-    }
+    } else
+         item->has_scripts = TRUE;

either set it to TRUE or FALSE -- TRUE, just to be extra safe that we don't accidentally miss an invocation; FALSE, if we assume that !dir means there is no directory.



>> dispatcher: pass script elapsed time back to NetworkManager

-                          "dispatcher result (got %d, expectd 3)",
+                          "dispatcher result (got %d, expectd at least 3)",

Typo: "expectd"



Is this extension backward compatible? -- which isn't that strongly required, because the dispatcher service ships with NM together. Or why the check:

+         /* Elapsed time in milliseconds */
+         if (item->n_values >= 4) {


In general, I think NM should time the duration itself instead of relying on NM-dispatcher to report it. That seems less complicated, and also more accurately counting the time we actually care about (e.g. including the startup time of the dispatcher service).


>> dispatcher: enhance debug logging

I like this patch.




I didn't re-review the rest (assume the older commits didn't change much).
Comment 55 Dan Williams 2014-06-03 15:06:51 UTC
(In reply to comment #54)
> >> dispatcher: use separate directories for pre-up/pre-down events
> 
> 
>      if (dir) {
>           item->has_scripts = !!g_dir_read_name (dir);
>           g_dir_close (dir);
> -    }
> +    } else
> +         item->has_scripts = TRUE;
> 
> either set it to TRUE or FALSE -- TRUE, just to be extra safe that we don't
> accidentally miss an invocation; FALSE, if we assume that !dir means there is
> no directory.
> 
> 
> 
> >> dispatcher: pass script elapsed time back to NetworkManager
> 
> -                          "dispatcher result (got %d, expectd 3)",
> +                          "dispatcher result (got %d, expectd at least 3)",
> 
> Typo: "expectd"
> 
> 
> 
> Is this extension backward compatible? -- which isn't that strongly required,
> because the dispatcher service ships with NM together. Or why the check:
> 
> +         /* Elapsed time in milliseconds */
> +         if (item->n_values >= 4) {
> 
> 
> In general, I think NM should time the duration itself instead of relying on
> NM-dispatcher to report it. That seems less complicated, and also more
> accurately counting the time we actually care about (e.g. including the startup
> time of the dispatcher service).

NM can only time the entire duration of the dispatcher call, which means you don't know which scripts took a long time to complete.  It's interesting that a dispatcher call would take a long time, but unless we have per-script times, we don't know which script is at fault.

But as I said above, the dispatcher can already log this information... so what I think I'll do is drop the timing NM <-> dispatcher patch, and instead just have NM log when the dispatcher takes a *really* long time, and tell the user to look at dispatcher logs for why it took so long.
Comment 56 Thomas Haller 2014-06-03 16:05:14 UTC
(In reply to comment #55)
> (In reply to comment #54)
> > In general, I think NM should time the duration itself instead of relying on
> > NM-dispatcher to report it. That seems less complicated, and also more
> > accurately counting the time we actually care about (e.g. including the startup
> > time of the dispatcher service).
> 
> NM can only time the entire duration of the dispatcher call, which means you
> don't know which scripts took a long time to complete.  It's interesting that a
> dispatcher call would take a long time, but unless we have per-script times, we
> don't know which script is at fault.

Oh, I missed that point, that it times the duration individually per script. Makes a lot of sense.


> But as I said above, the dispatcher can already log this information... so what
> I think I'll do is drop the timing NM <-> dispatcher patch, and instead just
> have NM log when the dispatcher takes a *really* long time, and tell the user
> to look at dispatcher logs for why it took so long.

I wouldn't drop it. I think it's a worthy improvement (to the logging).
Comment 57 Dan Williams 2014-06-03 16:08:26 UTC
Ok, I'll keep it.
Comment 58 Dan Williams 2014-06-03 19:55:12 UTC
(In reply to comment #54)
> Is this extension backward compatible? -- which isn't that strongly required,
> because the dispatcher service ships with NM together. Or why the check:
> 
> +         /* Elapsed time in milliseconds */
> +         if (item->n_values >= 4) {

Hmm, good point.  I was thinking about upgrades, where the old dispatcher is installed but the new NM is running.  However, I don't think that'll ever happen.  We *would* see the new dispatcher + old NM at the same time, if a package upgrade was done but NM hadn't been restarted, and the old NM will complain about the API mismatch.  But the scripts will still run just fine.

So I guess we don't need the backwards compatibility, because I can't think of a case where you'd be running the old dispatcher but new NM.

Alternative: I could just convert the return API into an array of dicts, and then we can just put whatever we want in the dict without breaking API later.
Comment 59 Dan Williams 2014-06-03 21:09:16 UTC
See "fixup! dispatcher: pass script elapsed time back to NetworkManager" for the conversion from struct -> dict for the results.

I also added "dispatcher: don't use NULL error domains" to fix some glib warnings.

Re-pushed!
Comment 60 Thomas Haller 2014-06-03 21:15:18 UTC
(In reply to comment #58)
> (In reply to comment #54)
> > Is this extension backward compatible? -- which isn't that strongly required,
> > because the dispatcher service ships together with NM. Otherwise, why the check:
> > 
> > +         /* Elapsed time in milliseconds */
> > +         if (item->n_values >= 4) {
> 
> Hmm, good point.  I was thinking about upgrades, where the old dispatcher is
> installed but the new NM is running.  However, I don't think that'll ever
> happen.  We *would* see the new dispatcher + old NM at the same time, if a
> package upgrade was done but NM hadn't been restarted, and the old NM will
> complain about the API mismatch.  But the scripts will still run just fine.
> 
> So I guess we don't need the backwards compatibility, because I can't think of
> a case where you'd be running the old dispatcher but new NM.
> 
> Alternative: I could just convert the return API into an array of dicts, and
> then we can just put whatever we want in the dict without breaking API later.

I was just asking if this is actually backward compatible. Even if we probably don't need it, if we can get it this cheap, it's a nice feature and we should keep it.
Comment 61 Dan Williams 2014-06-04 14:30:24 UTC
(In reply to comment #60)
> (In reply to comment #58)
> > (In reply to comment #54)
> > > Is this extension backward compatible? -- which isn't that strongly required,
> > > because the dispatcher service ships together with NM. Otherwise, why the check:
> > > 
> > > +         /* Elapsed time in milliseconds */
> > > +         if (item->n_values >= 4) {
> > 
> > Hmm, good point.  I was thinking about upgrades, where the old dispatcher is
> > installed but the new NM is running.  However, I don't think that'll ever
> > happen.  We *would* see the new dispatcher + old NM at the same time, if a
> > package upgrade was done but NM hadn't been restarted, and the old NM will
> > complain about the API mismatch.  But the scripts will still run just fine.
> > 
> > So I guess we don't need the backwards compatibility, because I can't think of
> > a case where you'd be running the old dispatcher but new NM.
> > 
> > Alternative: I could just convert the return API into an array of dicts, and
> > then we can just put whatever we want in the dict without breaking API later.
> 
> I was just asking if this is actually backward compatible. Even if we probably
> don't need it, if we can get it this cheap, it's a nice feature and we should
> keep it.

Either way, if we change the interface between NM and the dispatcher, it's not backwards compatible.  But I don't think we'd ever run into a case where the old dispatcher was installed but the new NM was running.  We *would* run into the case where the new dispatcher was installed but the old NM was running, but that won't prevent dispatcher scripts from running at all, since only the return value is changing and the old NM doesn't care about that.

I'm happy with the patch I pushed for converting the return value to a dict; it will be easier to add stuff in the future if we need it.
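
(For illustration only -- a rough sketch, not NM's code, of how such an array-of-dicts return shape can be built with GVariant; the key names here are hypothetical, and the rework itself was dropped later in this thread. The point is that "aa{sv}" lets new keys be added later without breaking the D-Bus API:)

#include <glib.h>

static GVariant *
build_script_result (const char *path, guint32 status, guint32 elapsed_ms)
{
    GVariantBuilder b;

    /* one a{sv} dict per script; callers collect these into aa{sv} */
    g_variant_builder_init (&b, G_VARIANT_TYPE ("a{sv}"));
    g_variant_builder_add (&b, "{sv}", "path", g_variant_new_string (path));
    g_variant_builder_add (&b, "{sv}", "result", g_variant_new_uint32 (status));
    g_variant_builder_add (&b, "{sv}", "elapsed-ms", g_variant_new_uint32 (elapsed_ms));
    return g_variant_builder_end (&b);
}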
Comment 62 Dan Winship 2014-06-04 18:07:21 UTC
(In reply to comment #50)
> How about now?  I split this into start_pending_vpn() and start_active_vpn() so
> that the uses are explicitly clear.  start_pending_vpn() asserts that
> priv->active == NULL.

good, except:

>+	/* Expect success because the VPN service has already appeared */
>+	g_warn_if_fail (start_active_vpn (service, NULL) == TRUE);

Don't put non-debugging code inside assertions. Call start_active_vpn() and then separately assert the result.
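
(A minimal sketch of the separated form, using the names from the quoted patch: the call happens unconditionally, and only the check lives inside the macro, so nothing is lost if assertion macros are compiled out in some builds:)

gboolean started;

/* do the real work outside the assertion macro */
started = start_active_vpn (service, NULL);

/* Expect success because the VPN service has already appeared */
g_warn_if_fail (started == TRUE);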

> I added nm_device_get_is_nm_owned() back; I don't think the
> get_unmanaged_flag() on is really necessary since we already have
> nm_device_get_managed().

sure

> > We may need some way to make the blocking states be opt-in.
> 
> The short answer here is, "don't use dispatcher scripts" or "uninstall stuff
> you don't actually need".  Part of the purpose is to push back on dispatcher
> scripts to make sure they don't do stupid things.

Right, but unless you split out the pre-up and pre-down scripts, you get penalized even by dispatcher scripts that *aren't* doing stupid things.

> > > vpn: implement PRE_DOWN dispatcher actions (bgo #387832)
> > 
> > This patch also makes DOWN be synchronous on quit, in addition to adding
> > PRE_DOWN support.
> 
> Good catch, updated the commit string.  Or did you want two separate commits
> here?

They seem unrelated...

(In reply to comment #51)
> Implemented this approach in "dispatcher: use separate directories for
> pre-up/pre-down events".

looks good



> dispatcher: enhance debug logging

>+	/* Wrapping protection */
>+	if (G_UNLIKELY (!reqid))
>+		reqid = ++request_counter;

does anything actually care if reqid is 0?



>  fixup! dispatcher: pass script elapsed time back to NetworkManager
>    
>  Just make the return value an array of dicts so it's easier to
>  extend later.

you would have to do that *after* I already ported the dispatcher to GDBus. :-)
Comment 63 Dan Williams 2014-06-04 20:00:51 UTC
(In reply to comment #62)
> (In reply to comment #50)
> > How about now?  I split this into start_pending_vpn() and start_active_vpn() so
> > that the uses are explicitly clear.  start_pending_vpn() asserts that
> > priv->active == NULL.
> 
> good, except:
> 
> >+	/* Expect success because the VPN service has already appeared */
> >+	g_warn_if_fail (start_active_vpn (service, NULL) == TRUE);
> 
> Don't put non-debugging code inside assertions. Call start_active_vpn() and
> then separately assert the result.

Fixed.

> > I added nm_device_get_is_nm_owned() back; I don't think the
> > get_unmanaged_flag() on is really necessary since we already have
> > nm_device_get_managed().
> 
> sure

I un-staticed nm_device_get_unmanaged_flag() since I proposed its use outside NM for the slavemonitor stuff.

> > > We may need some way to make the blocking states be opt-in.
> > 
> > The short answer here is, "don't use dispatcher scripts" or "uninstall stuff
> > you don't actually need".  Part of the purpose is to push back on dispatcher
> > scripts to make sure they don't do stupid things.
> 
> Right, but unless you split out the pre-up and pre-down scripts, you get
> penalized even by dispatcher scripts that *aren't* doing stupid things.

Yeah, this is true, hence the patch for new directories that you suggested.

> > > > vpn: implement PRE_DOWN dispatcher actions (bgo #387832)
> > > 
> > > This patch also makes DOWN be synchronous on quit, in addition to adding
> > > PRE_DOWN support.
> > 
> > Good catch, updated the commit string.  Or did you want two separate commits
> > here?
> 
> They seem unrelated...

I split the commits.

> > dispatcher: enhance debug logging
> 
> >+	/* Wrapping protection */
> >+	if (G_UNLIKELY (!reqid))
> >+		reqid = ++request_counter;
> 
> does anything actually care if reqid is 0?

Yeah, the code does:

if (priv->dispatcher_call_id) {
    nm_dispatcher_cancel_call (priv->dispatcher_call_id);
    priv->dispatcher_call_id = 0;
}

in a few places, so I elected to keep that bit.  Only 2 LoC.
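
(Illustrative sketch of the pattern being kept here, not the literal NM code: request ID 0 is the "no outstanding call" sentinel, so the counter must never hand out 0:)

#include <glib.h>

static guint request_counter;

static guint
new_request_id (void)
{
    guint reqid = ++request_counter;

    /* wrapping protection: if the counter wrapped around to 0, bump it
     * again so callers can keep using 0 to mean "no call in flight" */
    if (G_UNLIKELY (reqid == 0))
        reqid = ++request_counter;
    return reqid;
}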

> >  fixup! dispatcher: pass script elapsed time back to NetworkManager
> >    
> >  Just make the return value an array of dicts so it's easier to
> >  extend later.
> 
> you would have to do that *after* I already ported the dispatcher to GDBus. :-)

I'm happy to drop this commit for now, since it's a nice-to-have but isn't crucial.  We could revisit it later.
Comment 64 Thomas Haller 2014-06-05 14:18:47 UTC
from comment 54:

     if (dir) {
          item->has_scripts = !!g_dir_read_name (dir);
          g_dir_close (dir);
-    }
+    } else
+         item->has_scripts = TRUE;

either set it to TRUE or FALSE -- TRUE, just to be extra safe that we don't
accidentally miss an invocation. FALSE, if we assume that !dir means there is
no directory.
Comment 65 Dan Williams 2014-06-05 19:26:10 UTC
(In reply to comment #64)
> from comment 54:
> 
>      if (dir) {
>           item->has_scripts = !!g_dir_read_name (dir);
>           g_dir_close (dir);
> -    }
> +    } else
> +         item->has_scripts = TRUE;
> 
> either set it to TRUE or FALSE -- TRUE, just to be extra safe that we don't
> accidentally miss an invocation. FALSE, if we assume that !dir means there is
> no directory.

Fixed.
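
(A sketch of the check under the "safe" option Thomas suggested -- TRUE when the directory can't be opened; the actual commit may have picked FALSE instead. Only the pattern from the quoted diff is shown:)

#include <glib.h>

static gboolean
dir_has_scripts (const char *dirname)
{
    GDir *dir = g_dir_open (dirname, 0, NULL);
    gboolean has_scripts;

    if (dir) {
        /* a first readable entry means at least one script exists */
        has_scripts = !!g_dir_read_name (dir);
        g_dir_close (dir);
    } else {
        /* be safe: never accidentally skip an invocation */
        has_scripts = TRUE;
    }
    return has_scripts;
}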

I also dropped the dispatcher return value rework for now.  Latest and hopefully last version re-pushed.
Comment 66 Dan Williams 2014-06-05 19:28:02 UTC
Changed commits since last time:

dispatcher: use separate directories for pre-up/pre-down events
- thaller's suggestion for item->has_scripts

dispatcher: enhance debug logging
- rebase after dropping dispatcher return value changes; please re-check!

dispatcher: add synchronous dispatcher calls
- squashed thaller's fixup
Comment 67 Dan Winship 2014-06-05 19:59:05 UTC
all looks good to me
Comment 68 Jiri Klimes 2014-06-06 12:01:49 UTC
The branch looks good to me too.
Comment 69 Dan Williams 2014-06-06 19:17:22 UTC
Merged to master.