Bug 745903 – [review] dcbw/rh1145988-team-respawn: respawn teamd instead of failing the device

After an evaluation, GNOME has moved from Bugzilla to GitLab. Learn more about GitLab.
No new issues can be reported in GNOME Bugzilla anymore.
To report an issue in a GNOME project, go to GNOME GitLab.
Do not go to GNOME Gitlab for: Bluefish, Doxygen, GnuCash, GStreamer, java-gnome, LDTP, NetworkManager, Tomboy.

Bug 745903 - [review] dcbw/rh1145988-team-respawn: respawn teamd instead of failing the device


Summary:	[review] dcbw/rh1145988-team-respawn: respawn teamd instead of failing the de...


Status:	RESOLVED FIXED

Product:	NetworkManager
Classification:	Platform
Component:	general
Version:	unspecified
Hardware:	Other Linux

Importance:	Normal normal
Target Milestone:	---
Assigned To:	NetworkManager maintainer(s)
QA Contact:	NetworkManager maintainer(s)

URL:
Whiteboard:

Depends on:
Blocks:	nm-review

Reported:	2015-03-09 15:00 UTC by Dan Williams
Modified:	2015-04-02 20:03 UTC

See Also:
GNOME target:	---
GNOME version:	---

Description Dan Williams 2015-03-09 15:00:44 UTC

teamd can recover interface state on its own, so if it died unexpectedly we don't need to fail the device.  Also, if for some reason a teamd is already up and running when activating the interface, we can ask for its configuration and if it has the same configuration we are about to use, just talk to the existing copy instead of killing it.

Also attempt to ratelimit respawning.

(we can't use systemd to spawn teamd as suggested in rh#1085813 for two reasons, mainly that teamd uses the interface name as part of it's dbus name, and thus each interface name must have a predefined systemd unit and dbus permissions, and we cannot change the configuration for a teamd instance after it has been started so cannot use an externally configured teamd)

Comment 1 Dan Winship 2015-03-19 17:29:54 UTC

> asdfasdf

Needs a better commit message...

> team: ratelimit teamd spawning

> 	_LOGI (LOGD_TEAM, "teamd appeared on D-Bus");
>-	teamd_timeout_remove (device);

why do you leave the timeout around now?

Comment 2 Dan Williams 2015-03-19 20:37:18 UTC

Sure you don't have an older version of the branch?  I did just re-push what I currently have though, which doesn't use that commit message.

Comment 3 Dan Williams 2015-03-19 20:43:22 UTC

(In reply to Dan Winship from comment #1)
> > asdfasdf
> 
> Needs a better commit message...
> 
> > team: ratelimit teamd spawning
> 
> > 	_LOGI (LOGD_TEAM, "teamd appeared on D-Bus");
> >-	teamd_timeout_remove (device);
> 
> why do you leave the timeout around now?

The timeout is left around for the ratelimiting.  If teamd gets respawned 5 times in 5 seconds, then we fail the device.  So after we first start teamd, we have to let the timeout run to completion to reset the spawn count and spawn start.

Comment 4 Dan Winship 2015-03-20 20:44:50 UTC

OK... I guess I find it confusing that the same timeout is used for both "teamd never appeared" and "teamd is probably not going to crash again"... It at least needs comments.

Although also, it seems like if teamd crashes within the first 5 seconds after starting, then it's a pretty safe bet that it is completely hosed somehow, and it's just going to crash again if we restart it again. So I don't think we need spawn_count. Just have the logic be "if teamd exits after running longer than 5 seconds, then respawn it".

Also, hm, can this code result in a loop of NM restarting and immediately killing teamd? It starts the original teamd[1], then time passes, and then the admin does "teamd -o ..."[2], which kills our teamd[1], causing teamd_dbus_vanished()[1] to run and run teamd_start() again[3]. Then we get the teamd_dbus_appeared() for teamd[2]. Then teamd[3] finishes starting up and grabs the D-Bus interface, causing NM to get teamd_dbus_vanished()[2], which will run teamd_start()[4]. Then we get teamd_dbus_appeared()[3]. Then teamd[4] grabs the D-Bus interface and we get teamd_dbus_vanished()[3]...

Comment 5 Dan Williams 2015-03-27 20:08:15 UTC

Repushed with fixups to not do the complicated respawn handling, but instead what you suggest with a 5-second window.

The respawn loop you propose is possible but pretty unlikely, on my machine 'sudo teamd -k -o -D -t team0' NM wins the race.  But I added some GetConnectionUnixProcessId magic to teamd_dbus_appeared() and if the newly appeared teamd isn't NMs, then NM just ignores the child watch on it's own process.

Comment 6 Thomas Haller 2015-03-30 13:17:15 UTC

+    _LOGI (LOGD_TEAM, "Activation: (team) started teamd [pid %u]...",
            (guint32) priv->teamd_pid);
            ^^^^^^^^^ (guint, not that it actually matters)



+    if (priv->teamd_process_watch) {
+         gs_unref_variant GVariant *ret = NULL;
+         guint pid;
          ^^^^^ guint32


+    /* If another teamd grabbed the bus name while our teamd was starting,
+     * just ignore the death of our teamd and run with the existing one.
+     */
+    if (priv->teamd_process_watch) {
...
+         if (pid != priv->teamd_pid) {
+              g_source_remove (priv->teamd_process_watch);
+              priv->teamd_process_watch = 0;
+         }

if you unwatch the child, will NM still reap the child process? I think this leaves a zombie. How about calling nm_utils_kill_child_async() without callback?



+         _LOGW (LOGD_TEAM, "teamd quit too quickly; failing activation");

this is a confusing message for the users. I would say: "teamd process died unexpectedly; failing activation"

Comment 7 Dan Williams 2015-04-01 14:20:12 UTC

Fixed those up and repushed.

Comment 8 Thomas Haller 2015-04-02 08:42:20 UTC

LGTM

Comment 9 Dan Williams 2015-04-02 20:03:21 UTC

Merged to master and nm-1-0