GNOME Bugzilla – Bug 750595
[review] lr/master-activations: don't enqueue activation of master connection if it has pending actions
Last modified: 2016-09-07 18:29:57 UTC
manager: don't enqueue activation of master connection if it has pending actions Maybe it's being activated already and one of the activations would supersede the another causing the slave connection to loose track and get stuck forever in prepared state. This happens when a master apperars that has autoconnect=true slaves. + one small fix: manager: trivial: log the address, not the path of the master active connection
>> manager: trivial: log the address, not the path of the master active Logging the pointer value is also not very helpful. It would be useful for NMDevice instances, where we do that regularly and can correlate the pointer values... How about adding the connection UUID of the ActiveConnection? >> manager: don't enqueue activation of master connection if it has pending If it works for you, I doesn't look wrong to me :) ACK
I think this might break some guarantees that we have, where subsequent requested activations are always started. For example, if NM starts a master at startup, but then the old network scripts (or somebody's manual 'nmcli con up') happens; wouldn't this change then ignore those activations? Also, I think just checking 'has pending action' isn't fine grained enough, because we have things like 'carrier wait' and device states (see nm_device_queue_state) and autoconnect checking. I think we need to check if the device has a pending activation queued instead? I *think* the reason that slaves get stuck is that: 1) when a queued activation is superceded, nm-device.c::_clear_queued_act_request() gets called 2) that sets the master AC's state to DEACTIVATED 3) the slave's AC gets nm-active-connection.c::master_state_cb() called in response to the master state chnage 4) but master_state_cb() only acts on DEACTIVATING and so the slave gets stuck Possibly instead, we should do what nm-manager.c::_internal_activation_failed() does: nm_active_connection_set_state (active, NM_ACTIVE_CONNECTION_STATE_DEACTIVATING); nm_active_connection_set_state (active, NM_ACTIVE_CONNECTION_STATE_DEACTIVATED); to make sure that the master AC goes through all the states?
(In reply to Thomas Haller from comment #1) > >> manager: trivial: log the address, not the path of the master active > > Logging the pointer value is also not very helpful. It would be useful for > NMDevice instances, where we do that regularly and can correlate the pointer > values... > > How about adding the connection UUID of the ActiveConnection? We log pointers elsewhere. (In reply to Dan Williams from comment #2) > I think this might break some guarantees that we have, where subsequent > requested activations are always started. For example, if NM starts a > master at startup, but then the old network scripts (or somebody's manual > 'nmcli con up') happens; wouldn't this change then ignore those activations? > > Also, I think just checking 'has pending action' isn't fine grained enough, > because we have things like 'carrier wait' and device states (see > nm_device_queue_state) and autoconnect checking. I think we need to check > if the device has a pending activation queued instead? > > I *think* the reason that slaves get stuck is that: > > 1) when a queued activation is superceded, > nm-device.c::_clear_queued_act_request() gets called > 2) that sets the master AC's state to DEACTIVATED > 3) the slave's AC gets nm-active-connection.c::master_state_cb() called in > response to the master state chnage > 4) but master_state_cb() only acts on DEACTIVATING and so the slave gets > stuck > > Possibly instead, we should do what > nm-manager.c::_internal_activation_failed() does: > > nm_active_connection_set_state (active, > NM_ACTIVE_CONNECTION_STATE_DEACTIVATING); > nm_active_connection_set_state (active, > NM_ACTIVE_CONNECTION_STATE_DEACTIVATED); > > to make sure that the master AC goes through all the states? Yes. Fixing this leaves the slaves disconnected though. Reworked a bit: 65b979c manager: reuse an active connection, if the same activation is pending a169f5e active-connection: fail the activation if the master AC disconnect 2a16f34 ifcfg-rh,vlan: fall back to VLAN_ID if vlan id can't be determined from DEVICE 8a0ce5b manager: trivial: log the address, not the path of the master active connection
44c36f1 merge: branch 'lr/master-activations' 7a066a3 manager: reuse an active connection, if the same activation is pending 4078834 active-connection: fail the activation if the master AC disconnect f23a46d ifcfg-rh,vlan: fall back to VLAN_ID if vlan id can't be determined from DEVICE edbf766 manager: log the address, not the path of the master active connection Merged in, since it blocked testing. Nevertheless, keeping it open if we need to revisit this: 10:41 <thaller> hey 10:41 <thaller> can you remove the work "trivial" from manager: trivial: log the address, not the path of the master active connection? 10:43 <thaller> typo in commit message: "connections with race" 10:43 <thaller> which 10:45 <thaller> maybe rename "pending_connection" to something better? 10:46 <thaller> authorizing_connections? 10:46 <thaller> this also needs a code comment :) 10:46 <thaller> other then that, it doesn't look wrong. But I don't know 16:06 <lrintel> re-pushed lr/master-activations with suggested corrections 16:06 <lrintel> do you think it's good to merge? 16:17 <thaller> the first two certainly look right 16:17 <thaller> I don't know about the others. They fix the bug? 16:19 <lrintel> yes 16:20 <thaller> I see nothing wrong, but let's ask dcbw too. 16:20 <thaller> for now, let's merge it (we can improve/revert later) 16:20 <thaller> but let's get a dcbw opinion on this :)
Yeah, the last two patches look OK to me. The activating_connections thing is kinda icky though; we should clean that up in the future by perhaps creating a new internal property for authorizing, and then adding the AC to the priv->activating_connections, and filtering those 'authorizing' ones out of the D-Bus property. Might need some other checks in places too.
closing. If the above issues are actual issues, please reopen.