Bug 762335 - Try alternative servers on connection failures
Status: RESOLVED FIXED
Product: polari
Classification: Applications
Component: general
Version: 3.19.x
Hardware/OS: Other Linux
Importance: Normal normal
Target Milestone: ---
Assigned To: Polari maintainers
QA Contact: Polari maintainers
Duplicates: 764435
Depends on:
Blocks:
Reported: 2016-02-19 15:53 UTC by Florian Müllner
Modified: 2016-04-29 16:54 UTC
See Also:
GNOME target: ---
GNOME version: ---


Attachments:
  application: Try alternative servers on connection failures (4.95 KB, patch) - 2016-04-02 05:01 UTC, Kunaal Jain - status: none
  application: Try alternative servers on connection failures (4.94 KB, patch) - 2016-04-02 05:02 UTC, Kunaal Jain - status: none
  application: Try alternative servers on connection failures (7.44 KB, patch) - 2016-04-29 16:06 UTC, Florian Müllner - status: committed

Description Florian Müllner 2016-02-19 15:53:57 UTC
Since bug 761859 we allow users to set up connections from a list of predefined networks. In cases where a network comprises more than one server, we can step through the list on connection failures before reporting an error to the user.
Comment 1 Kunaal Jain 2016-03-09 22:24:31 UTC
Well, recently one of GIMP's IRC servers was not reachable for Bastian, so he was not able to connect through IRC. Let's try to get this into 3.20, since there are no string breaks? I'll take a look tomorrow.
Comment 2 Florian Müllner 2016-04-01 13:34:38 UTC
*** Bug 764435 has been marked as a duplicate of this bug. ***
Comment 3 Kunaal Jain 2016-04-02 05:01:07 UTC
Created attachment 325204 [details] [review]
application: Try alternative servers on connection failures

On predefined connections we have a list of servers.
There are times when one of the servers is down, so in
case of connection failures, try alternative servers.
Comment 4 Kunaal Jain 2016-04-02 05:02:42 UTC
Created attachment 325205 [details] [review]
application: Try alternative servers on connection failures

On predefined connections we have a list of servers.
There are times when one of the servers is down, so in
case of connection failures, try alternative servers.
Comment 5 Florian Müllner 2016-04-07 14:42:12 UTC
Review of attachment 325205 [details] [review]:

::: src/application.js
@@ +427,3 @@
+
+        // Try again with a alternate server
+        let server = requestData.alternateServers[requestData.currentServerId];

I don't like passing around an array index. Instead, you can "consume" the array:
  let server = requestData.alternateServers.shift();
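
Roughly (untested sketch; the helper names are made up, not from the patch):

  // consume the list; an empty list means there are no servers left to try
  let server = requestData.alternateServers.shift();
  if (server)
      this._retryWithServer(requestData, server);    // hypothetical helper
  else
      this._reportConnectionError(requestData);      // hypothetical helper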

@@ +453,3 @@
             }
 
+            if (error && error != ConnectionError.CANCELLED) {

Grepping through telepathy-idle, I find the following errors:
 TP_ERROR_AUTHENTICATION_FAILED
 TP_ERROR_CANCELLED
 TP_ERROR_CHANNEL_BANNED
 TP_ERROR_CHANNEL_FULL
 TP_ERROR_CHANNEL_INVITE_ONLY
 TP_ERROR_DISCONNECTED
 TP_ERROR_DOES_NOT_EXIST
 TP_ERROR_INVALID_ARGUMENT
 TP_ERROR_INVALID_HANDLE
 TP_ERROR_NETWORK_ERROR
 TP_ERROR_NOT_AVAILABLE
 TP_ERROR_NOT_IMPLEMENTED
 TP_ERROR_PERMISSION_DENIED
 TP_ERROR_SERVICE_BUSY

I don't think they are all relevant, but some of them can reach this point even though retrying with a different server doesn't make sense for them (CHANNEL_BANNED, CHANNEL_FULL and CHANNEL_INVITE_ONLY at least).

Also, ALREADY_CONNECTED is only handled above as long as the number of retries doesn't exceed MAX_RETRIES.
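
So the check here could be an explicit whitelist instead (untested sketch; the constant names are assumptions, not actual Polari/telepathy identifiers):

  // only fall back to another server for errors where a different server can help
  const RETRIABLE_ERRORS = [      // assumed constants, mirroring the TP_ERROR_* list above
      ConnectionError.NETWORK_ERROR,
      ConnectionError.NOT_AVAILABLE,
      ConnectionError.DISCONNECTED
  ];
  let shouldRetry = error && RETRIABLE_ERRORS.indexOf(error) != -1;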

@@ +454,3 @@
 
+            if (error && error != ConnectionError.CANCELLED) {
+                if (++requestData.currentServerId < requestData.alternateServers.length) {

This isn't quite right - the first ID we use for retrying is 1, which assumes that the original attempt was using the first entry (i.e. ID 0). That assumption isn't necessarily correct though:

 - the server fails, we retry with the 2nd entry that works
 - at some point (weeks or months later), that server fails
 - we now start retrying with the failed server, and don't try
   the original server at all

I see two possible approaches:
 (1) make sure we always use the first server by updating the server
    parameter before the first try
 (2) filter out the currently used server from the alternate server list

(1) is easier, but if a failure is not temporary, we end up starting out with a known bad server. (2) means we'll try with a server that likely worked the last time ...
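
For (1), that's essentially (plain-data sketch, not actual code):

  let servers = ['A', 'B', 'C'];             // list from NetworksManager
  let first = servers[0];                    // always try A first
  let alternates = servers.slice(1);         // fall back to B, then C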

::: src/networksManager.js
@@ +89,3 @@
+        let sslServers = network.servers.filter(s => s.ssl);
+        return sslServers.length > 0 ? sslServers
+                                     : network.servers;

Could use that in getNetworkDetails
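
E.g. (hypothetical sketch; assumes the filter above is factored out into a helper):

  // in getNetworkDetails, pick the server from the same ssl-preferring list
  let server = this._filterServers(network)[0];   // _filterServers: made-up name for that helper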
Comment 6 Kunaal Jain 2016-04-12 07:40:16 UTC
Florian, I am aware of the review. I'm busy with college at the moment; I'll fix it this weekend. Maybe we can delay 3.20.1 till then?
Comment 7 Florian Müllner 2016-04-13 14:52:57 UTC
(In reply to Kunaal Jain from comment #6)
> Florian, I am aware of the review. I'm busy with college at the moment; I'll
> fix it this weekend. Maybe we can delay 3.20.1 till then?

Sure, thanks for the heads-up.
Comment 8 Kunaal Jain 2016-04-16 07:16:06 UTC
(In reply to Florian Müllner from comment #5)
> Review of attachment 325205 [details] [review]:
> 
> ::: src/application.js
> @@ +454,3 @@
>  
> +            if (error && error != ConnectionError.CANCELLED) {
> +                if (++requestData.currentServerId <
> requestData.alternateServers.length) {
> 
> This isn't quite right - the first ID we use for retrying is 1, which
> assumes that the original attempt was using the first entry (i.e. ID 0).
> That assumption isn't necessarily correct though:
> 
>  - the server fails, we retry with the 2nd entry that works
>  - at some point (weeks or months later), that server fails
>  - we now start retrying with the failed server, and don't try
>    the original server at all
> 
> I see two possible approaches:
>  (1) make sure we always use the first server by updating the server
>     parameter before the first try
>  (2) filter out the currently used server from the alternate server list
> 
> (1) is easier, but if a failure is not temporary, we end up starting out
> with a known bad server. (2) means we'll try with a server that likely
> worked the last time ...
> 

1.) What about when the user clicks on reconnect? Should we iterate over all servers?
2.) Are the two approaches you mentioned for the case where we have successfully connected and then fail?
3.) I don't quite get how the reconnect button works. The code is "target = new GLib.Variant('o', this._account.get_object_path());", but I have no idea what it means.

An approach I can think of: if we connect successfully, restore the alternateServers list to the original, but with the currently used server as the first entry. (Effectively, rotate the list so that this server becomes the first entry.)

> ::: src/networksManager.js
> @@ +89,3 @@
> +        let sslServers = network.servers.filter(s => s.ssl);
> +        return sslServers.length > 0 ? sslServers
> +                                     : network.servers;
> 
> Could use that in getNetworkDetails
Comment 9 Kunaal Jain 2016-04-17 04:14:51 UTC
When a server fails after we have already been connected, we don't reconnect using the methods defined in application.js. Are you suggesting we take care of it in chatroomManager.js?
Comment 10 Florian Müllner 2016-04-17 18:03:46 UTC
(In reply to Kunaal Jain from comment #8)
> 1.) What about when the user clicks on reconnect? Should we iterate over all
> servers?

Good question. Right now, 'reconnect' completely side-steps the error handling in application.js. This is something we should address eventually (likely by moving the generic bits out into a 'notify::connection-status' handler), but it's out of scope for this bug - things like appending underscores to the nick don't work with the reconnect button either, so just ignore the issue for now.


> 2.) Are the two approaches you mentioned for the case where we have
> successfully connected and then fail?

It's about correct behavior over time. Take this example:

 - we have a list of servers A, B and C
 - one day server A has an outage, so we pick B
   (and use it to set the account's server property)
 - during the next couple of months, server B works
   fine; A has come up again, but as we don't try
   it, we don't notice that
 - now B has an outage too:
   we now try B (and fail again), then C, and never 
   reconsider A (which would actually work now)

The behavior in the last point is clearly not great - the first entry is special in the sense that we will completely ignore it after it failed once, and our first reconnect attempt is made with a server that just failed. So I'm suggesting to either:

 - before the first connection attempt, set the account's server
   property to the first server and use the rest as alternativeServer
   list; this means we will always try A first, then fall back to B and C.
   That's nicely consistent and predictable, however it sucks when A has
   a permanent failure

 - keep the modified 'server' property as in your patch, but build the
   alternativeServer list so that it only contains the other servers;
   so if we first try B, we then fall back to A and C
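
For the second option, building that list would look roughly like this (untested sketch; accessor and field names are assumptions):

  let servers = this._networksManager.getNetworkServers(networkId);  // hypothetical accessor; full list A, B, C
  let current = requestData.accountServer;                           // hypothetical; e.g. B, as set by your patch
  // 'address' as the server identifier is an assumption
  requestData.alternateServers = servers.filter(s => s.address != current);  // -> A and C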


> 3.) I don't quite get how the reconnect button works. The code is "target = new
> GLib.Variant('o', this._account.get_object_path());", but I have no idea what
> it means.

The items in the menu are GtkModelButtons, which implement the GtkActionable[0] interface that allows using GActions instead of ::clicked handlers. The ::action-name property is set from the .ui file (room-list-header.ui) as it's the same for all instances, but the target (a serialized description of the account) is tied to a particular instance, thus set in the code.

[0] https://developer.gnome.org/gtk3/stable/GtkActionable.html
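
In code this boils down to something like (illustrative only; the action name is made up):

  button.set_action_name('app.reconnect-account');    // made-up name; in reality set from room-list-header.ui
  button.set_action_target_value(new GLib.Variant('o', this._account.get_object_path()));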


> An approach I can think of: if we connect successfully, restore the
> alternateServers list to the original, but with the currently used server as
> the first entry. (Effectively, rotate the list so that this server becomes
> the first entry.)

Not sure I understand the suggestion, but the list kept by NetworksManager should be treated as read-only, not least because it effectively is on restart (the data is in a gresource which isn't writable).
Comment 11 Kunaal Jain 2016-04-18 16:01:11 UTC
I'm having some issues with my machine; I'll upload a patch in a few hours.
Comment 12 Kunaal Jain 2016-04-18 17:56:16 UTC
How do I know if the server works?

My idea: in _onEnsureChannel, if req.ensure_and_observe_channel_finish(res) returns a channel, that means the server works, so we can update the alternateServers list at that point. Is this the correct behavior?
Comment 13 Kunaal Jain 2016-04-18 19:31:41 UTC
(In reply to Florian Müllner from comment #10)
> (In reply to Kunaal Jain from comment #8)
> 
> > 2.) The two approaches you mentioned are for when we have successfully
> > connected and then fail?
> 
> It's about correct behavior over time. Take this example:
> 
>  - we have a list of servers A, B and C
>  - one day server A has an outage, so we pick B
>    (and use it to set the account's server property)
>  - during the next couple of months, server B works
>    fine; A has come up again, but as we don't try
>    it, we don't notice that
>  - now B has an outage too:
>    we now try B (and fail again), then C, and never

So I spent the last 6 hours on this, and I might be wrong, but why would this happen?
A failed, so we picked B. B was working and then it fails; we won't try C, as we don't handle this in reconnect. When the user restarts Polari or tries joining a new channel (this is when _requestChannel in application.js is called), the requestData is built with alternateServers A, B, C. So B (currentServer) fails, and we try A -> B -> C. So we reconsider all servers each time a new channel request is made.
 
>    reconsider A (which would actually work now)
Comment 14 Florian Müllner 2016-04-18 20:15:25 UTC
(In reply to Kunaal Jain from comment #13)
> So I spent the last 6 hours on this, and I might be wrong, but why would this
> happen?
> A failed, so we picked B. B was working and then it fails; we won't try C, as
> we don't handle this in reconnect.

Yeah, I'm not talking about reconnect. Say, the user connects 50 times successfully through B (on startup), then it fails.


> When the user restarts Polari or tries joining a new channel, the
> requestData is built with alternateServers A, B, C. So B (currentServer)
> fails, and we try A -> B -> C. So we reconsider all servers each time a
> new channel request is made.

No. alternateServers are A,B,C, but the retry code is

+          currentServerId: 0
...
+                if (++requestData.currentServerId < requestData.alternateServers.length) {
+                    this._retryServerRequest(requestData);
+                    return;

So we'll first retry with ID 1, which is B.
Comment 15 Florian Müllner 2016-04-29 16:06:55 UTC
Created attachment 327044 [details] [review]
application: Try alternative servers on connection failures

Updated patch with review comments.
Comment 16 Florian Müllner 2016-04-29 16:54:08 UTC
Attachment 327044 [details] pushed as 88fca63 - application: Try alternative servers on connection failures