GNOME Bugzilla – Bug 762335
Try alternative servers on connection failures
Last modified: 2016-04-29 16:54:12 UTC
Since bug 761859 we allow users to set up connections from a list of predefined networks. In the cases where a network is comprised of more than one server, we can step through the list on connection failures before reporting an error to the user.
Well recently one IRC server of gimp was not reachable to Bastian, so he was not able to connect through IRC. Lets try to get in 3.20, since no string breaks? I'll take a look tomorrow.
*** Bug 764435 has been marked as a duplicate of this bug. ***
Created attachment 325204 [details] [review] application: Try alternative servers on connection failures On predefined conncetions we have list of servers. There are times when one of the server is down, hence in case of connection failures try alternative servers. failurs
Created attachment 325205 [details] [review] application: Try alternative servers on connection failures On predefined conncetions we have list of servers. There are times when one of the server is down, hence in case of connection failures try alternative servers.
Review of attachment 325205 [details] [review]: ::: src/application.js @@ +427,3 @@ + + // Try again with a alternate server + let server = requestData.alternateServers[requestData.currentServerId]; I don't like passing around an array index. Instead, you can "consume" the array: let server = requestData.alternateServers.shift(); @@ +453,3 @@ } + if (error && error != ConnectionError.CANCELLED) { Grepping through telepathy-idle, I find the following errors: TP_ERROR_AUTHENTICATION_FAILED TP_ERROR_CANCELLED TP_ERROR_CHANNEL_BANNED TP_ERROR_CHANNEL_FULL TP_ERROR_CHANNEL_INVITE_ONLY TP_ERROR_DISCONNECTED TP_ERROR_DOES_NOT_EXIST TP_ERROR_INVALID_ARGUMENT TP_ERROR_INVALID_HANDLE TP_ERROR_NETWORK_ERROR TP_ERROR_NOT_AVAILABLE TP_ERROR_NOT_IMPLEMENTED TP_ERROR_PERMISSION_DENIED TP_ERROR_SERVICE_BUSY I don't think they are all relevant, but there are some that we should be able to get here where connecting with a different server doesn't make sense (CHANNEL_BANNED, CHANNEL_FULL and CHANNEL_INVITE_ONLY at least) Also ALREADY_CONNECTED has only been handled above if the number of retries didn't exceed MAX_RETRIES. @@ +454,3 @@ + if (error && error != ConnectionError.CANCELLED) { + if (++requestData.currentServerId < requestData.alternateServers.length) { This isn't quite right - the first ID we use for retrying is 1, which assumes that the original attempt was using the first entry (e.g. ID 0). That assumption isn't necessarily correct though: - the server fails, we retry with the 2nd entry that works - at some point (weeks or months later), that server fails - we now start retrying with the failed server, and don't try the original server at all I see two possible approaches: (1) make sure we always use the first network by updating the server parameter before the first try (2) filter out the currently used server from the alternate server list (1) is easier, but if a failure is not temporary, we end up starting out with a known bad server. (2) means we'll try with a server that likely worked the last time ... ::: src/networksManager.js @@ +89,3 @@ + let sslServers = network.servers.filter(s => s.ssl); + return sslServers.length > 0 ? sslServers + : network.servers; Could use that in getNetworkDetails
Florian, I am aware of the review. Busy in college atm, I'll fix it this weekend. maybe we can delay 3.20.1 till then?
(In reply to Kunaal Jain from comment #6) > Florian, I am aware of the review. Busy in college atm, I'll fix it this > weekend. maybe we can delay 3.20.1 till then? Sure, thanks for the heads-up.
(In reply to Florian Müllner from comment #5) > Review of attachment 325205 [details] [review] [review]: > > ::: src/application.js > @@ +454,3 @@ > > + if (error && error != ConnectionError.CANCELLED) { > + if (++requestData.currentServerId < > requestData.alternateServers.length) { > > This isn't quite right - the first ID we use for retrying is 1, which > assumes that the original attempt was using the first entry (e.g. ID 0). > That assumption isn't necessarily correct though: > > - the server fails, we retry with the 2nd entry that works > - at some point (weeks or months later), that server fails > - we now start retrying with the failed server, and don't try > the original server at all > > I see two possible approaches: > (1) make sure we always use the first network by updating the server > parameter before the first try > (2) filter out the currently used server from the alternate server list > > (1) is easier, but if a failure is not temporary, we end up starting out > with a known bad server. (2) means we'll try with a server that likely > worked the last time ... > 1.) What about when users clicks on reconnect? Should we iterate over all servers? 2.) The two approaches you mentioned are for when we have successfully connected and then fail? 3.) I am not getting how reconnect button works? The code is "target = new GLib.Variant('o', this._account.get_object_path());" But no idea what it means. A approach I think can be that if we connect, than restore the alternateServers list to original but currently used server being the first entry. (Effectively rotate to make the first entry this one) > ::: src/networksManager.js > @@ +89,3 @@ > + let sslServers = network.servers.filter(s => s.ssl); > + return sslServers.length > 0 ? sslServers > + : network.servers; > > Could use that in getNetworkDetails
When server fails after we have already been connected, we don't reconnect using the methods defined in application.js. Are you suggesting we take care of it in chatroomManager.js ?
(In reply to Kunaal Jain from comment #8) > 1.) What about when users clicks on reconnect? Should we iterate over all > servers? Good question. Right now, 'reconnect' completely side-steps the error handling in application.js. This is something that we should address eventually (likely by moving out the generic bits to a 'notify::connection-status' handler), but that's out of scope of this bug - stuff like appending underscores to the nick don't work either with the reconnect button, so just ignore the issue for now. > 2.) The two approaches you mentioned are for when we have successfully > connected and then fail? It's about correct behavior over time. Take this example: - we have a list of servers A, B and C - one day server A has an outage, so we pick B (and use it to set the account's server property) - during the next couple of months, server B works fine; A has come up again, but as we don't try it, we don't notice that - now B has an outage too: we now try B (and fail again), then C, and never reconsider A (which would actually work now) The behavior in the last point is clearly not great - the first entry is special in the sense that we will completely ignore it after it failed once, and our first reconnect attempt is made with a server that just failed. So I'm suggesting to either: - before the first connection attempt, set the account's server property to the first server and use the rest as alternativeServer list; this means we will always try A first, then fall back to B and C. That's nicely consistent and predictable, however it sucks when A has a permanent failure - keep the modified 'server' property as in your patch, but build the alternativeServer list so that it only contains the other servers; so if we first try B, we fall then back to A and C > 3.) I am not getting how reconnect button works? The code is "target = new > GLib.Variant('o', this._account.get_object_path());" But no idea what it > means. The items in the menu are GtkModelButtons, which implement the GtkActionable[0] interface that allows using GActions instead of ::clicked handlers. The ::action-name property is set from the .ui file (room-list-header.ui) as it's the same for all instances, but the target (a serialized description of the account) is tied to a particular instance, thus set in the code. [0] https://developer.gnome.org/gtk3/stable/GtkActionable.html > A approach I think can be that if we connect, than restore the > alternateServers list to original but currently used server being the first > entry. (Effectively rotate to make the first entry this one) Not sure I understand the suggestion, but the list kept by NetworksManager should be treated as read-only, not least because it effectively is on restart (the data is in a gresource which isn't writable).
Having some trouble issues with my machine, I'll upload a patch in few hours.
How do I know if the server works? My idea is in _onEnsureChannel, if req.ensure_and_observe_channel_finish(res) returns channel, this means server works, so we can update the new alternativeserver list. Is this correct behavior?
(In reply to Florian Müllner from comment #10) > (In reply to Kunaal Jain from comment #8) > > > 2.) The two approaches you mentioned are for when we have successfully > > connected and then fail? > > It's about correct behavior over time. Take this example: > > - we have a list of servers A, B and C > - one day server A has an outage, so we pick B > (and use it to set the account's server property) > - during the next couple of months, server B works > fine; A has come up again, but as we don't try > it, we don't notice that > - now B has an outage too: > we now try B (and fail again), then C, and never So I spent last 6 hours on this, and I might be wrong, but why this would happen? A failed, we picked B. B was working and then it fails, we wont try C as we don't handle this in reconnect. When user restarts Polari or tries joining new channel (this is when _requestChannel in application.js is called) , the requestData is built with alternateServers A,B,C. So B (currentServer) fails, we try A -> B -> C. So we resconsider all servers each time new channel request is made. > reconsider A (which would actually work now)
(In reply to Kunaal Jain from comment #12) > So I spent last 6 hours on this, and I might be wrong, but why this would > happen? > A failed, we picked B. B was working and then it fails, we wont try C as we > don't handle this in reconnect. Yeah, I'm not talking about reconnect. Say, the user connects 50 times successfully through B (on startup), then it fails. > When user restarts Polari or tries joining new channel, the > requestData is built with alternateServers A,B,C. So B (currentServer) > fails, we try A -> B -> C. So we resconsider all servers each time new > channel request is made. No. alternateServers are A,B,C, but the retry code is + currentServerId: 0 ... + if (++requestData.currentServerId < requestData.alternateServers.length) { + this._retryServerRequest(requestData); + return; So we'll first retry with ID 1, which is B.
Created attachment 327044 [details] [review] application: Try alternative servers on connection failures Updated patch with review comments.
Attachment 327044 [details] pushed as 88fca63 - application: Try alternative servers on connection failures