GNOME Bugzilla – Bug 745903
[review] dcbw/rh1145988-team-respawn: respawn teamd instead of failing the device
Last modified: 2015-04-02 20:03:29 UTC
teamd can recover interface state on its own, so if it died unexpectedly we don't need to fail the device. Also, if for some reason a teamd is already up and running when activating the interface, we can ask for its configuration and if it has the same configuration we are about to use, just talk to the existing copy instead of killing it. Also attempt to ratelimit respawning. (we can't use systemd to spawn teamd as suggested in rh#1085813 for two reasons, mainly that teamd uses the interface name as part of it's dbus name, and thus each interface name must have a predefined systemd unit and dbus permissions, and we cannot change the configuration for a teamd instance after it has been started so cannot use an externally configured teamd)
> asdfasdf Needs a better commit message... > team: ratelimit teamd spawning > _LOGI (LOGD_TEAM, "teamd appeared on D-Bus"); >- teamd_timeout_remove (device); why do you leave the timeout around now?
Sure you don't have an older version of the branch? I did just re-push what I currently have though, which doesn't use that commit message.
(In reply to Dan Winship from comment #1) > > asdfasdf > > Needs a better commit message... > > > team: ratelimit teamd spawning > > > _LOGI (LOGD_TEAM, "teamd appeared on D-Bus"); > >- teamd_timeout_remove (device); > > why do you leave the timeout around now? The timeout is left around for the ratelimiting. If teamd gets respawned 5 times in 5 seconds, then we fail the device. So after we first start teamd, we have to let the timeout run to completion to reset the spawn count and spawn start.
OK... I guess I find it confusing that the same timeout is used for both "teamd never appeared" and "teamd is probably not going to crash again"... It at least needs comments. Although also, it seems like if teamd crashes within the first 5 seconds after starting, then it's a pretty safe bet that it is completely hosed somehow, and it's just going to crash again if we restart it again. So I don't think we need spawn_count. Just have the logic be "if teamd exits after running longer than 5 seconds, then respawn it". Also, hm, can this code result in a loop of NM restarting and immediately killing teamd? It starts the original teamd[1], then time passes, and then the admin does "teamd -o ..."[2], which kills our teamd[1], causing teamd_dbus_vanished()[1] to run and run teamd_start() again[3]. Then we get the teamd_dbus_appeared() for teamd[2]. Then teamd[3] finishes starting up and grabs the D-Bus interface, causing NM to get teamd_dbus_vanished()[2], which will run teamd_start()[4]. Then we get teamd_dbus_appeared()[3]. Then teamd[4] grabs the D-Bus interface and we get teamd_dbus_vanished()[3]...
Repushed with fixups to not do the complicated respawn handling, but instead what you suggest with a 5-second window. The respawn loop you propose is possible but pretty unlikely, on my machine 'sudo teamd -k -o -D -t team0' NM wins the race. But I added some GetConnectionUnixProcessId magic to teamd_dbus_appeared() and if the newly appeared teamd isn't NMs, then NM just ignores the child watch on it's own process.
+ _LOGI (LOGD_TEAM, "Activation: (team) started teamd [pid %u]...", (guint32) priv->teamd_pid); ^^^^^^^^^ (guint, not that it actually matters) + if (priv->teamd_process_watch) { + gs_unref_variant GVariant *ret = NULL; + guint pid; ^^^^^ guint32 + /* If another teamd grabbed the bus name while our teamd was starting, + * just ignore the death of our teamd and run with the existing one. + */ + if (priv->teamd_process_watch) { ... + if (pid != priv->teamd_pid) { + g_source_remove (priv->teamd_process_watch); + priv->teamd_process_watch = 0; + } if you unwatch the child, will NM still reap the child process? I think this leaves a zombie. How about calling nm_utils_kill_child_async() without callback? + _LOGW (LOGD_TEAM, "teamd quit too quickly; failing activation"); this is a confusing message for the users. I would say: "teamd process died unexpectedly; failing activation"
Fixed those up and repushed.
LGTM
Merged to master and nm-1-0