Bug 733105 – [review] dcbw/wifi-ssid-found-bgo733105: NetworkManager failed to reconnect after several AP reboots

After an evaluation, GNOME has moved from Bugzilla to GitLab. Learn more about GitLab.
No new issues can be reported in GNOME Bugzilla anymore.
To report an issue in a GNOME project, go to GNOME GitLab.
Do not go to GNOME Gitlab for: Bluefish, Doxygen, GnuCash, GStreamer, java-gnome, LDTP, NetworkManager, Tomboy.

Bug 733105 - [review] dcbw/wifi-ssid-found-bgo733105: NetworkManager failed to reconnect after several AP reboots


Summary:	[review] dcbw/wifi-ssid-found-bgo733105: NetworkManager failed to reconnect a...


Status:	RESOLVED FIXED

Product:	NetworkManager
Classification:	Platform
Component:	Wi-Fi
Version:	0.9.x
Hardware:	Other Linux

Importance:	Normal normal
Target Milestone:	---
Assigned To:	NetworkManager maintainer(s)
QA Contact:	NetworkManager maintainer(s)

URL:
Whiteboard:

Depends on:
Blocks:	nm-patch

Reported:	2014-07-12 19:31 UTC by Bernd Edlinger
Modified:	2015-11-18 15:03 UTC

See Also:
GNOME target:	---
GNOME version:	---

Attachments
Proposed Fix (1.62 KB, patch) 2014-07-12 19:31 UTC, Bernd Edlinger	none	Details \| Review
Fix for 0.9.8.8 (2.38 KB, patch) 2014-07-20 07:12 UTC, Bernd Edlinger	none	Details \| Review
0001-wifi-clean-up-removal-of-current-AP-if-it-fails-duri.patch (5.40 KB, patch) 2015-11-11 00:23 UTC, Dan Williams	none	Details \| Review

Description Bernd Edlinger 2014-07-12 19:31:23 UTC

Created attachment 280560 [details] [review]
Proposed Fix

Hi,

the NetworkManager updates a list of APs which unfortunately can
get different from what "iwlist wlan0 scanning" sees.

If that happens, the active AP is no longer in the Network Manager's
list, and thus no re-connect is initiated.

The sequence of events that leads to this, are several re-boots of the
active AP, and after some time, the WiFi network remains disconnected.

To recover from this, the network manager must be re-started,
or the AP must be shut down for about 5 minutes, and then re-started.

I've used the attached patch locally since about a year, without any problems:

What I suspect as the root cause for this is a bug in
merge_scanned_ap: If an AP is removed from the live list, the
flag WPAS_REMOVED_TAG is set, but it is possible that the AP
comes back before the deleted AP is removed from the internal list.
If that happens the WPAS_REMOVED_TAG stays set, and can cause the
AP to be removed from the list if the WiFi connection is lost later,
but the AP stays in the live list at the same time, which is possible.

The last hunk updates the time stamp of the removed AP, so that the
list entry will not removed immediately, which would be possible, when no
call-back was received before, for instance because the squelch level
stays at 100% all the time.

Comment 1 Dan Williams 2014-07-14 18:36:07 UTC

Thanks!

The first hunk clearing WPAS_REMOVED_TAG is entirely correct.

But I'm a bit concerned about the second hunk.  It looks like that will extend the time that an AP which was removed from wpa_supplicant stays in the scan list, beyond how long it was previously kept.  With this patch, if the supplicant saw the AP in a scan 120 seconds ago, and the AP was then turned off and was not seen in the last 2 scans, NM would not remove the AP from the list for 120 + 360 seconds.  Previously, NM would remove the AP much sooner, which could be a problem in some cases.

To handle that, would you be OK with:

diff --git a/src/devices/wifi/nm-device-wifi.c b/src/devices/wifi/nm-device-wifi.c
index 38fc7ca..28a039a 100644
--- a/src/devices/wifi/nm-device-wifi.c
+++ b/src/devices/wifi/nm-device-wifi.c
@@ -1956,16 +1956,30 @@ supplicant_iface_bss_removed_cb (NMSupplicantInterface *iface,
 {
        NMAccessPoint *ap;
 
        g_return_if_fail (self != NULL);
        g_return_if_fail (object_path != NULL);
 
        ap = get_ap_by_supplicant_path (self, object_path);
-       if (ap)
+       if (ap) {
+               gint32 now = nm_utils_get_monotonic_timestamp_s ();
+               guint32 last_seen = nm_ap_get_last_seen (ap);
+
+               /* We don't know when the supplicant last saw the AP's beacons,
+                * it could be two minutes or it could be 2 seconds.  Because the
+                * supplicant doesn't send property change notifications if the
+                * AP's other properties don't change, our last-seen time may be
+                * much older the supplicant's, and the AP would be immediately
+                * removed from the list on the next cleanup.  So update the
+                * last-seen time to ensure the AP sticks around for at least
+                * one more periodic scan.
+                */
+               nm_ap_set_last_seen (ap, MAX (last_seen, now - (SCAN_INTERVAL_MAX * 2) - 5));
                g_object_set_data (G_OBJECT (ap), WPAS_REMOVED_TAG, GUINT_TO_POINTER (TRUE));
+       }
 }
 
 static void
 remove_supplicant_timeouts (NMDeviceWifi *self)
 {
        NMDeviceWifiPrivate *priv = NM_DEVICE_WIFI_GET_PRIVATE (self);

Comment 2 Thomas Haller 2014-07-14 19:18:47 UTC

Btw. the original patch applies on nm-0-9-8 branch. Back then, nm_ap_last_seen() did not accept nm_utils_get_monotonic_timestamp_s() times, but instead (time(NULL)).
So, that should be solved slightly differently for nm-0-9-8.

Apart from that, I agree with comment 2. Looks good.

Comment 3 Bernd Edlinger 2014-07-14 20:04:47 UTC

Yes, thanks.

I completely overlooked that the master version uses now 
nm_utils_get_monotonic_timestamp_s() instead of g_get_current_time().

I have one question about this function:

Does nm_utils_get_monotonic_timestamp_s()
start at 1970 like time(),
or does it call CLOCK_MONOTONIC, which starts at 0
on every system boot ?

thus, can "now - (SCAN_INTERVAL_MAX * 2) - 5"
underflow, if the system up time is less than 2 minutes?


I think the supplicant definitely saw the AP when the
last request_wireless_scan() was called, which should
be no more than SCAN_INTERVAL_MAX seconds ago.

When I inspected the traces. I saw that the callback will not
come when nothing changes. The last seen time stamp
can often be a few hours in the past, but that only means,
that the AP is constantly seen on every scan, and it does
not even change the signal level. When the scan request is sent
to the supplicant, and no beacon is seen, the call back comes
immediately, because now something _has_ changed.
Scans are requested at a varying interval,
but the upper limit for the interval is SCAN_INTERVAL_MAX,
if the beacon stops, for only a second, while the scan is executed,
the removed callback came immediately.
It can come even if the AP is still connected, at the same time,
because that is a different layer.

So generally I agree with you, but would like to suggest this:

  nm_ap_set_last_seen (ap, MAX (last_seen, now - SCAN_INTERVAL_MAX));

because the last scan was at now - SCAN_INTERVAL_MAX or possibly later.
When the last_seen time is older than now - SCAN_INTERVAL_MAX it is
bogus, if it is newer, it is likely correct. Hope this makes sense for you.

Comment 4 Thomas Haller 2014-07-14 20:16:03 UTC

(In reply to comment #3)
> Yes, thanks.
> 
> I completely overlooked that the master version uses now 
> nm_utils_get_monotonic_timestamp_s() instead of g_get_current_time().
> 
> I have one question about this function:
> 
> Does nm_utils_get_monotonic_timestamp_s()
> start at 1970 like time(),
> or does it call CLOCK_MONOTONIC, which starts at 0
> on every system boot ?

it returns the seconds of clock_gettime(CLOCK_BOOTTIME), but with some offset, so that it is always >= 1. A valid timestamp is expected to be positive.


> thus, can "now - (SCAN_INTERVAL_MAX * 2) - 5"
> underflow, if the system up time is less than 2 minutes?

since @now is >= 1, the above might become negative. But that is no problem, because of MAX().

Comment 5 Thomas Haller 2014-07-14 20:19:08 UTC

(In reply to comment #4)
> (In reply to comment #3)
> > Does nm_utils_get_monotonic_timestamp_s()
> > start at 1970 like time(),
> > or does it call CLOCK_MONOTONIC, which starts at 0
> > on every system boot ?
> 
> it returns the seconds of clock_gettime(CLOCK_BOOTTIME), but with some offset,
> so that it is always >= 1. A valid timestamp is expected to be positive.

I mean, it will be close to clock_gettime(CLOCK_BOOTTIME), but with an offset so that it is always positive. The actual start point has no further meaning, but it counts seconds.

Comment 6 Bernd Edlinger 2014-07-14 20:37:12 UTC

(In reply to comment #5)
> (In reply to comment #4)
> > (In reply to comment #3)
> > > Does nm_utils_get_monotonic_timestamp_s()
> > > start at 1970 like time(),
> > > or does it call CLOCK_MONOTONIC, which starts at 0
> > > on every system boot ?
> > 
> > it returns the seconds of clock_gettime(CLOCK_BOOTTIME), but with some offset,
> > so that it is always >= 1. A valid timestamp is expected to be positive.
> 
> I mean, it will be close to clock_gettime(CLOCK_BOOTTIME), but with an offset
> so that it is always positive. The actual start point has no further meaning,
> but it counts seconds.

If your code is correct in case of an underflow depends on how MAX
is implemented.

If it is implemented straight forward, like this it will have problems:

cat test.c
#include <stdio.h>

#define MAX(a,b) (((a)>(b))?(a):(b))

int
main()
{
  unsigned int u = 100;
  int i = 100 - 200;
  unsigned int z = MAX(u,i);
  printf("%u\n", z);
  return 0;
}

gcc test.c && ./a.out
4294967196

the overflow happens, at the comparison uint vs. int.

how is MAX implemented, I did not find int in network-manager-0.9.8.8-ubuntu?

Comment 7 Bernd Edlinger 2014-07-14 22:38:50 UTC

Aehm, nm_ap_get_last_seen() returns gint32.
So the problem is just the declaration of "last_seen".
then both arguments to MAX are signed int, and an
underflow is handled correctly in that case.

So my preference for the second hunk would be like this:




diff --git a/src/devices/wifi/nm-device-wifi.c
b/src/devices/wifi/nm-device-wifi.c
index 38fc7ca..28a039a 100644
--- a/src/devices/wifi/nm-device-wifi.c
+++ b/src/devices/wifi/nm-device-wifi.c
@@ -1956,16 +1956,30 @@ supplicant_iface_bss_removed_cb (NMSupplicantInterface
*iface,
 {
        NMAccessPoint *ap;

        g_return_if_fail (self != NULL);
        g_return_if_fail (object_path != NULL);

        ap = get_ap_by_supplicant_path (self, object_path);
-       if (ap)
+       if (ap) {
+               gint32 now = nm_utils_get_monotonic_timestamp_s ();
+               gint32 last_seen = nm_ap_get_last_seen (ap);
+
+               /* We don't know when the supplicant last saw the AP's beacons,
+                * it could be two minutes or it could be 2 seconds.  Because
the
+                * supplicant doesn't send property change notifications if the
+                * AP's other properties don't change, our last-seen time may
be
+                * much older the supplicant's, and the AP would be immediately
+                * removed from the list on the next cleanup.  So update the
+                * last-seen time to ensure the AP sticks around for at least
+                * one more periodic scan.
+                */
+               nm_ap_set_last_seen (ap, MAX (last_seen, now - SCAN_INTERVAL_MAX));
                g_object_set_data (G_OBJECT (ap), WPAS_REMOVED_TAG,
GUINT_TO_POINTER (TRUE));
+       }
 }

 static void
 remove_supplicant_timeouts (NMDeviceWifi *self)
 {
        NMDeviceWifiPrivate *priv = NM_DEVICE_WIFI_GET_PRIVATE (self);

Comment 8 Thomas Haller 2014-07-15 12:43:27 UTC

Ah right. I missed that @last_seen was guint32 before. As you did, it should be gint32 and no problem with underflow can happen.

As to whether to subtract SCAN_INTERVAL_MAX or 2*SCAN_INTERVAL_MAX, I dunno.

Comment 9 Bernd Edlinger 2014-07-15 20:04:11 UTC

(In reply to comment #8)
> As to whether to subtract SCAN_INTERVAL_MAX or 2*SCAN_INTERVAL_MAX, I dunno.

IMHO the software tries to reach this goal:

When an AP is seen in each scan, it should stay at all times
in the internal list, even if one single scan fails to receive the beacon.


If everything works perfectly, we have a change notification in each
scan every 120 seconds. the removed callback did not update the last_seen
time, and in the next scan 120 seconds later, we have a beacon again,
This means max. 120*2 seconds between each update, which is less than the
time out of 3*120 seconds. This guarantees that the AP stays in the list.

If the AP is shut down for 2 consecutive scans the max. distance
between updates is 120*3 which does not guarantee that the AP stays in the
list because the time out may or may not be reached due to internal delays.


If things work not so perfectly, we do not have any change notifications
just a call back for AP deleted, and at 120 seconds later a callback
for new AP discovered.  before the patch the AP was deleted immediately
after the AP deleted callback

Comment 10 Bernd Edlinger 2014-07-15 20:16:31 UTC

.... if the AP is not the current AP for instance.

If we set last_seen to now-2*SCAN_INTERVAL_MAX-5
the time out of 3*SCAN_INTERVAL_MAX will be reached
at now+SCAN_INTERVAL_MAX-5 but the next scan will be
initiated at now+SCAN_INTERVAL_MAX, thus the AP will
be scavaged 5 seconds before the next scan. Which
is not what should happen, because exactly one single
scan failed to receive the beacon.

This is why I believe the variant now-SCAN_INTERVAL_MAX is right.

But this is of course only a cosmetic difference.

Comment 11 Bernd Edlinger 2014-07-20 07:12:20 UTC

Created attachment 281224 [details] [review]
Fix for 0.9.8.8

Hi,

I have reproduced this now on another piece of hardware,
which has a newer version of network manager (0.9.8.8-ubuntu-14.4).

I applied what was discussed here to that installation, but it turned
out that it was _not_ fixed yet. I manually re-booted the AP using
the AP's Web-Interface as quickly as I could. After a few times the
network manager was stuck as before.

After some analysis I discovered, that NM 0.9.8.8 needs a third hunk,
in link_timeout_cb. because it unconditionally removes the current AP
from the list, if the connect did not see the SSID. But it should do
so only if it is a fake AP. Normal APs should only be removed when
the supplicant removed them.

This bug was not present in 0.9.4.0.

The problem is that if the AP re-boots in about 90% of the cases, I
see a trace "link timed out." immediately followed by "New AP:".
In this case the AP is removed and added to the list again.

But in some rare cases, I see first "New AP:" immediately followed by
"link timed out", followed by nothing, connectivity lost.
The AP was removed from the list, while the supplicant did have it
already in the list, and it does not send new_bss_cb again.

Now when I look at the http:http://cgit.freedesktop.org/NetworkManager/NetworkManager/tree/src/devices/wifi/nm-device-wifi.c

I see this function has been significantly re-worked, but I think
the bug is still present:

 	 * found in the next scan.
 	 */
 	if (priv->ssid_found == FALSE && priv->current_ap)
-		set_current_ap (self, NULL, TRUE, TRUE);
+		set_current_ap (self, NULL, TRUE, FALSE);
 
 	nm_device_state_changed (dev,
 	                         NM_DEVICE_STATE_FAILED,

By passing TRUE as 4th parameter (force_remove_old_ap) to set_current_ap,
the current AP gets deleted, even if it is not a fake AP.
Setting this parameter to FALSE should fix it, because set_current_ap
deletes fake APs anyway.

Actually this parameter can be removed completely, because this was the
only place where it was TRUE.

Comment 12 Thomas Haller 2014-07-20 13:13:11 UTC

(In reply to comment #11)
> Created an attachment (id=281224) [details] [review]
> Fix for 0.9.8.8


Patch looks good to me, except the cast to gint32, should be glong:

+         glong now = time (NULL);
+         glong last_seen = nm_ap_get_last_seen (ap);

Comment 13 Bernd Edlinger 2014-07-20 14:10:37 UTC

(In reply to comment #12)
> (In reply to comment #11)
> > Created an attachment (id=281224) [details] [review] [details] [review]
> > Fix for 0.9.8.8
> 
> 
> Patch looks good to me, except the cast to gint32, should be glong:
> 
> +         glong now = time (NULL);
> +         glong last_seen = nm_ap_get_last_seen (ap);

yes, you are right.

I was unsure if it should be in sync with this code in 
supplicant_iface_bss_updated_cb:

        /* Update the AP's last-seen property */
        ap = get_ap_by_supplicant_path (self, object_path);
        if (ap)
                nm_ap_set_last_seen (ap, (guint32) time (NULL));

But now I see, that cast is wrong. nm_ap_set_last_seen works with glong,
there is no need to down cast time_t to guint32 there...

Comment 14 Bernd Edlinger 2015-02-27 18:38:39 UTC

ping...

is there anything I can do to get this fixed?

Comment 15 Dan Williams 2015-04-02 15:14:12 UTC

(In reply to Bernd Edlinger from comment #11)
> The problem is that if the AP re-boots in about 90% of the cases, I
> see a trace "link timed out." immediately followed by "New AP:".
> In this case the AP is removed and added to the list again.
> 
> But in some rare cases, I see first "New AP:" immediately followed by
> "link timed out", followed by nothing, connectivity lost.
> The AP was removed from the list, while the supplicant did have it
> already in the list, and it does not send new_bss_cb again.
> 
> I see this function has been significantly re-worked, but I think
> the bug is still present:
> 
>  	 * found in the next scan.
>  	 */
>  	if (priv->ssid_found == FALSE && priv->current_ap)
> -		set_current_ap (self, NULL, TRUE, TRUE);
> +		set_current_ap (self, NULL, TRUE, FALSE);
>  
>  	nm_device_state_changed (dev,
>  	                         NM_DEVICE_STATE_FAILED,
> 
> By passing TRUE as 4th parameter (force_remove_old_ap) to set_current_ap,
> the current AP gets deleted, even if it is not a fake AP.
> Setting this parameter to FALSE should fix it, because set_current_ap
> deletes fake APs anyway.

Passing FALSE here will actually break cases where the AP does get powered off and remains in the scan list; this change basically trades one case for another.  But I see your bug, so how about this:

	if (priv->ssid_found == FALSE && priv->current_ap) {
		gboolean force_remove = (if current_ap->last-seen > 10 seconds ago);
		set_current_ap (self, NULL, TRUE, force_remove);
	}

this way if you see "New AP" and then link_timeout_cb fires, the AP won't be removed.  But if the AP was really turned off, then it's quite likely it was last seen a while ago, and it would be removed like the existing code does.  Would that work for you?

Comment 16 Thomas Haller 2015-04-02 15:51:57 UTC

I commited the agreed upon parts:

master
http://cgit.freedesktop.org/NetworkManager/NetworkManager/commit/?id=1282b468bd88cb587b42203e18b18f818ef1fb74

nm-1-0:
http://cgit.freedesktop.org/NetworkManager/NetworkManager/commit/?id=4903c127e15922bce64121e17c69ec44c28b6784

nm-0-9-10:
http://cgit.freedesktop.org/NetworkManager/NetworkManager/commit/?id=ace55a45af4bafb991ac595e69770448283cf250

nm-0-9-8:
http://cgit.freedesktop.org/NetworkManager/NetworkManager/commit/?id=47c505523c72b6454dec4681c0edda535c3a2e91

Comment 17 Bernd Edlinger 2015-04-02 17:14:52 UTC

(In reply to Dan Williams from comment #15)
> 
> Passing FALSE here will actually break cases where the AP does get powered
> off and remains in the scan list; this change basically trades one case for
> another.  But I see your bug, so how about this:
> 

I don't understand, this:
How can an AP stay in the live list when it is powered off?

If it is powered off, the beacon will stop, and the link will break.
And as it is no fake AP, it will be removed.  Right?

In the traces I saw a race between "New AP" and "link_timeout_cb",
but I am not sure if that will always be the case...

I also saw traces, where the AP was stable, and last_seen was hours ago,
just because nothing changed, and the supplicant does constantly scan
and see the beacons, but fires no callback as long as nothing changes.

But I don't understand why you want to remove an AP from the life list
while it still sends beacons?

Just because it timed out 3 times to associate?

I would just try to connect every 5 minutes until the problem or the
AP goes away.  That is what my patch from comment #11 does.

Comment 18 Dan Williams 2015-04-02 18:33:26 UTC

(In reply to Bernd Edlinger from comment #17)
> (In reply to Dan Williams from comment #15)
> > 
> > Passing FALSE here will actually break cases where the AP does get powered
> > off and remains in the scan list; this change basically trades one case for
> > another.  But I see your bug, so how about this:
> > 
> 
> I don't understand, this:
> How can an AP stay in the live list when it is powered off?
> 
> If it is powered off, the beacon will stop, and the link will break.
> And as it is no fake AP, it will be removed.  Right?

Correct, after the link timeout.

> In the traces I saw a race between "New AP" and "link_timeout_cb",
> but I am not sure if that will always be the case...

Yes, you probably did see that race.  That happens because NM keeps APs in the scan list longer than wpa_supplicant does in some cases.  So when the AP powered off, the supplicant dropped the AP from the scan list but NM didn't (for various reasons).

So when the AP powers on again, the supplicant creates a new D-Bus object for it, and we end up in supplicant_iface_new_bss_cb() with "New AP".  merge_scanned_ap() gets called for that "new" AP.

At that point in merge_scanned_ap(), obviously get_ap_by_supplicant_path() will return NULL since it's a truly new AP.  Next, we have nm_ap_match_in_list(); where presumably some attribute (most likely frequency) doesn't match, so it does show up as a wholly new AP.  The reason frequency probably doesn't match is that many APs have a "scan to find free channel" function on them, so the channel will be different on restart.  So the "new" AP gets added to the list and the old one is still there.

Then link_timeout_cb() fires for the old AP, and NM disconnects and removes the old AP.

At least that's one theory of how this could work.

> I also saw traces, where the AP was stable, and last_seen was hours ago,
> just because nothing changed, and the supplicant does constantly scan
> and see the beacons, but fires no callback as long as nothing changes.

Right.  I think there's actually a patch from Mathieu Trudel for some last-seen stuff on the mailing list that will "fix" that by making sure that any AP that is still known to the supplicant will always be seen at least 120 seconds ago or less.

> But I don't understand why you want to remove an AP from the life list
> while it still sends beacons?

I don't.  But NM doesn't know about the beacons, it only knows that the supplicant still thinks the AP is valid or not.  And in the problem case, wpa_supplicant *doesn't* think the old AP is valid, and it created a new AP instead.  So NM is correctly disconnecting from the old AP as the code intends, if  my theory is correct.  I just don't know why it doesn't automatically reconnect with the "new" AP.

> Just because it timed out 3 times to associate?
> 
> I would just try to connect every 5 minutes until the problem or the
> AP goes away.  That is what my patch from comment #11 does.

NM should be doing that, but I'd like a bit more information here if you can get it; when "new" is seen before link_timeout_cb(), does the new AP get merged into the old one, or does it get added to the scan list on it's own?

Comment 19 Bernd Edlinger 2015-04-03 10:12:34 UTC

(In reply to Dan Williams from comment #18)
> > But I don't understand why you want to remove an AP from the life list
> > while it still sends beacons?
> 
> I don't.  But NM doesn't know about the beacons, it only knows that the
> supplicant still thinks the AP is valid or not.  And in the problem case,
> wpa_supplicant *doesn't* think the old AP is valid, and it created a new AP
> instead.  So NM is correctly disconnecting from the old AP as the code
> intends, if  my theory is correct.  I just don't know why it doesn't
> automatically reconnect with the "new" AP.
> 
> > Just because it timed out 3 times to associate?
> > 
> > I would just try to connect every 5 minutes until the problem or the
> > AP goes away.  That is what my patch from comment #11 does.
> 
> NM should be doing that, but I'd like a bit more information here if you can
> get it; when "new" is seen before link_timeout_cb(), does the new AP get
> merged into the old one, or does it get added to the scan list on it's own?

My AP is fixed at channel 1, so it can't do any frequency hopping.

When the problem happened here, the NM did initiate a scan frequently,
because the Live list did change often.  And I think from the traces,
the AP got removed in the supplicant and the WPAS_REMOVED_TAG is set,
then 2 seconds later the AP is there again, and the New AP merged the
old entry, removes the WPAS_REMOVED_TAG again.

But then the link_timeout_cb removed the entry completely.

Even if a bss_updated_cb callback arrives, it is ignored because the
BSSID is not found in the list.

To recover from this failure mode, the AP has to be switched
off again until the next scan is initiated, this causes
a AP removed callbackm which is also ignored.
and the next time the AP is switched on again, the New AP works
as expected.

Comment 20 Bernd Edlinger 2015-04-05 08:30:05 UTC

(In reply to Dan Williams from comment #18)
> At that point in merge_scanned_ap(), obviously get_ap_by_supplicant_path()
> will return NULL since it's a truly new AP.  Next, we have
> nm_ap_match_in_list(); where presumably some attribute (most likely
> frequency) doesn't match, so it does show up as a wholly new AP.  The reason
> frequency probably doesn't match is that many APs have a "scan to find free
> channel" function on them, so the channel will be different on restart.  So
> the "new" AP gets added to the list and the old one is still there.
> 
> Then link_timeout_cb() fires for the old AP, and NM disconnects and removes
> the old AP.
> 

Ok, I understand what you mean. It was not the case here however.

But If this happens, it is still unnecessary to remove the old AP by force,
because it is likely already removed by the supplicant
(supplicant_iface_bss_removed_cb), and we have the
WPAS_REMOVED_TAG set on it already, or it will happen in the next scan.
And the old AP will be removed after the usual time-out.  Right?

Comment 21 Bernd Edlinger 2015-11-08 17:39:21 UTC

Hi,

the patch in comment#11 is still badly needed IMO.

I just read at https://bugzilla.redhat.com/show_bug.cgi?id=1025371#c36
why the force_remove_old_ap was initially introduced.

If I see it right, this is what it was supposed to fix:

comment#36: Dan Williams 2013-12-05 10:24:17 EST
"Just for the record, these are the things I'm trying to test:
- start adhoc, make sure fake goes away on disconnect - FAIL (not regression)"

Well I do not know what this exactly means, maybe you could explain
it to me, so that we can find a way to fix this issue that does not cause
that regression again.

And, if this is some kind of test-suite, does it start to fail if the patch from
comment#11 gets applied?

Comment 22 Dan Williams 2015-11-11 00:12:42 UTC

Note that this isn't a problem any more with NM git master (1.2) because NM tracks the supplicant's scan list instead of keeping one of its own.  If priv->current_ap was removed by the supplicant during the connect, it will be marked 'fake' and link_timeout_cb() will remove it.  But if the AP was found again by the supplicant, NM will be notified of that new AP, which will have a new object path, and NM will consider it a completely new AP distinct from priv->current_ap.  So after priv->current_ap gets removed, NM should reconnect using the new AP.

------

But for 1.0.x, the core problem (from comment #11) is that there's a race between link_timeout_cb() and when the supplicant removes/adds the AP.  If the supplicant removes the AP and re-adds it before link_timeout_cb() runs, then the problem occurs.  Getting rid of force_remove_old_ap will fix one problem and bring back another, so we don't want to do that.  Essentially, we don't want to remove priv->current_ap if the supplicant still knows about it.

So how about this?

	/* If the access point failed, and wasn't found by the supplicant when it
	 * attempted to reconnect, then it's probably out of range or turned off.
	 * Remove it from the list and if it's actually still present, it'll be
	 * found in the next scan.
	 */
	if (   priv->ssid_found == FALSE
	    && priv->current_ap
	    && g_object_get_data (G_OBJECT (priv->current_ap), WPAS_REMOVED_TAG)) {
		set_current_ap (self, NULL, TRUE, TRUE);
	}

That will only remove the AP if the supplicant doesn't know about it.

Comment 23 Dan Williams 2015-11-11 00:23:31 UTC

Created attachment 315221 [details] [review]
0001-wifi-clean-up-removal-of-current-AP-if-it-fails-duri.patch

NM 1.0 "backport" of dcbw/wifi-ssid-found-bgo733105.

Comment 24 Bernd Edlinger 2015-11-11 08:47:47 UTC

(In reply to Dan Williams from comment #22)
> Note that this isn't a problem any more with NM git master (1.2) because NM
> tracks the supplicant's scan list instead of keeping one of its own.  If
> priv->current_ap was removed by the supplicant during the connect, it will
> be marked 'fake' and link_timeout_cb() will remove it.  But if the AP was
> found again by the supplicant, NM will be notified of that new AP, which
> will have a new object path, and NM will consider it a completely new AP
> distinct from priv->current_ap.  So after priv->current_ap gets removed, NM
> should reconnect using the new AP.
> 
> ------
> 
> But for 1.0.x, the core problem (from comment #11) is that there's a race
> between link_timeout_cb() and when the supplicant removes/adds the AP.  If
> the supplicant removes the AP and re-adds it before link_timeout_cb() runs,
> then the problem occurs.  Getting rid of force_remove_old_ap will fix one
> problem and bring back another, so we don't want to do that.  Essentially,
> we don't want to remove priv->current_ap if the supplicant still knows about
> it.
> 
> So how about this?
> 
> 	/* If the access point failed, and wasn't found by the supplicant when it
> 	 * attempted to reconnect, then it's probably out of range or turned off.
> 	 * Remove it from the list and if it's actually still present, it'll be
> 	 * found in the next scan.
> 	 */
> 	if (   priv->ssid_found == FALSE
> 	    && priv->current_ap
> 	    && g_object_get_data (G_OBJECT (priv->current_ap), WPAS_REMOVED_TAG)) {
> 		set_current_ap (self, NULL, TRUE, TRUE);
> 	}
> 
> That will only remove the AP if the supplicant doesn't know about it.

Yes, that looks good to me,
except maybe the extra braces.

Comment 25 Jiri Klimes 2015-11-11 14:13:26 UTC

(In reply to Dan Williams from comment #23)
> Created attachment 315221 [details] [review] [review]
> 0001-wifi-clean-up-removal-of-current-AP-if-it-fails-duri.patch
> 
> NM 1.0 "backport" of dcbw/wifi-ssid-found-bgo733105.

LGTM.

Comment 26 Thomas Haller 2015-11-11 14:52:55 UTC

(In reply to Dan Williams from comment #23)
> Created attachment 315221 [details] [review] [review]
> 0001-wifi-clean-up-removal-of-current-AP-if-it-fails-duri.patch
> 
> NM 1.0 "backport" of dcbw/wifi-ssid-found-bgo733105.

LGTM. (some whitespace issues though).

Comment 27 Dan Williams 2015-11-12 18:30:13 UTC

Whitespace issues fixed.

nm-1-0: 095818ac10457f77458ad865ef1777a469d9825a
master: 46b46fd1ca8ec2503658d72dcdb47c77beaa15fe

Bernd, can you please test and re-open this bug if there's still an issue?

Comment 28 Bernd Edlinger 2015-11-12 20:09:44 UTC

(In reply to Dan Williams from comment #27)
> Whitespace issues fixed.
> 
> nm-1-0: 095818ac10457f77458ad865ef1777a469d9825a
> master: 46b46fd1ca8ec2503658d72dcdb47c77beaa15fe
> 
> Bernd, can you please test and re-open this bug if there's still an issue?

Thanks,

but is the 0.9.8 branch already closed?
It also has this race condition.

I had this proposed initially for 0.9.8.8:

@@ -2174,7 +2188,7 @@ link_timeout_cb (gpointer user_data)
 		} else
 			ap = nm_device_wifi_get_activation_ap (self);
 
-		if (ap)
+		if (ap && nm_ap_get_fake (ap))
 			remove_access_point (self, ap);
 	}

Comment 29 Thomas Haller 2015-11-18 15:03:19 UTC

(In reply to Bernd Edlinger from comment #28)
> (In reply to Dan Williams from comment #27)
> > Whitespace issues fixed.
> > 
> > nm-1-0: 095818ac10457f77458ad865ef1777a469d9825a
> > master: 46b46fd1ca8ec2503658d72dcdb47c77beaa15fe
> > 
> > Bernd, can you please test and re-open this bug if there's still an issue?
> 
> Thanks,
> 
> but is the 0.9.8 branch already closed?
> It also has this race condition.
> 
> I had this proposed initially for 0.9.8.8:
> 
> @@ -2174,7 +2188,7 @@ link_timeout_cb (gpointer user_data)
>  		} else
>  			ap = nm_device_wifi_get_activation_ap (self);
>  
> -		if (ap)
> +		if (ap && nm_ap_get_fake (ap))
>  			remove_access_point (self, ap);
>  	}

nm-0-9-8 is very old. Usually we don't do any fixes there (because every change has the potential for regression)
 If you (or your distribution) uses such an old version, those patches should be applied downstream.

Also, a change would need to be backported first to nm-0-9-10 branch, and from there to nm-0-9-8.