GNOME Bugzilla – Bug 773509
souphttpsrc: connection loss / reconnect issues
Last modified: 2016-10-31 17:53:56 UTC
Created attachment 338475 [details] [review] souphttpsrc: reset read_position when reading fails

Since 8816764112408766889c8b680a3af51115df4bf5 ("souphttpsrc: reduce reading latency by using non-blocking read") souphttpsrc occasionally loses the connection with "Socket I/O timed out". It reconnects immediately, but the 'Range' header is not updated, so the wrong data is fetched. I've attached a patch that fixes the 'Range' issue. With this, playing the file works again: the error and reconnect still happen, but recovery works as expected. So far I have no idea why the error occurs. It's introduced by the commit mentioned above. The HTTP server is in the local network, and the error always occurs at the same position in my test video.
Comment on attachment 338475 [details] [review] souphttpsrc: reset read_position when reading fails Which code are you thinking about that would update the Range? The check for the read position in gst_soup_http_src_do_request()? Isn't the problem with resetting the read position here though that we will now *always* start from the beginning again? How would it know about the position where it should start again?
The check is in gst_soup_http_src_do_request(), that's correct. Without this patch, 'src->request_position != src->read_position' is always false, so the last range is reused and we start at the same position as the last request, i.e. with something we've already played. With this patch, gst_soup_http_src_add_range_header() uses 'src->request_position' to set the new range, and request_position is always the current position (or the requested position after a seek). It's adjusted every time data is read successfully and no seek request came in since the last request was sent out. Basically, 'request_position' is the position where we want to be and 'read_position' is the position where we actually are.
Comment on attachment 338475 [details] [review] souphttpsrc: reset read_position when reading fails That makes sense, can you put this explanation into the commit message and maybe also add a comment into the code about why? :)
Created attachment 338491 [details] [review] souphttpsrc: reset read_position when reading fails Better? As a side note: I think we might be able to remove the 'gst_soup_http_src_destroy_input_stream()' here. It will happen anyway when retrying if 'read_position != request_position'. But I'm not 100% sure if that's correct if we don't retry, so I left it there. Calling it twice is ok so it would just be a minor cleanup anyway. Maybe something for master after the release.
Comment on attachment 338491 [details] [review] souphttpsrc: reset read_position when reading fails Makes sense to me, if it gets another "ok" we should get it into 1.10.0 :)
Makes sense to me too. Since it sounds like a regression from the rebasing on top of the new API we should probably get it into 1.10.0.
Did someone figure out the root cause of the timeout?
Patch looks good to me too. What's the pipeline causing the timeout? Does it happen because queues get full and it actually times out as it can't read more data in time?
I'm still looking for the cause of the timeout. With this patch applied so that playback continues, I get a timeout + reconnect approximately every 30 seconds.
What kind of stream is it? Internet radio? HLS? DLNA? File served over http?
It's https://download.blender.org/demo/movies/ToS/tearsofsteel_4k.mov but served from a web server in the same network. They are connected via gigabit ethernet, so the download speed is much higher than the playback speed. It does not happen if I use the URL above directly.
Attachment 338491 [details] pushed as 8666b7d - souphttpsrc: reset read_position when reading fails
Still should check if there is more to fix here... that it times out at all seems problematic.
I think there is a problem in the way we handle the GMainContext. When the error occurs, gst_soup_http_src_wait_for_data() has not been called for more than 30 seconds. With the change below, the problem goes away. I think not calling g_main_context_iteration() for a long time confuses the timeout calculation.

@@ -1610,6 +1612,7 @@ gst_soup_http_src_read_buffer (GstSoupHTTPSrc * src, GstBuffer ** outbuf)

   if (src->has_pollable_interface) {
     while (1) {
+      g_main_context_iteration (src->poll_context, FALSE);
       read_bytes =
           g_pollable_input_stream_read_nonblocking ((GPollableInputStream *)
           src->input_stream, mapinfo.data, mapinfo.size, src->cancellable,
That also makes sense. The way all of this works with the main context seems a bit fragile in general. Thiago, what do you think about the above addition? Are there more places where we should briefly check the main context?
That line should make sure we call it for every buffer, but it sounds weird to time out even though we're reading data frequently. I'd check if there isn't a bug in lower layers (glib?). Of course, merging the workaround for the release seems fine to me. Is this reproducible using standard playbin with http served over localhost?
I've always used playbin, and it is reproducible with gst-play. Reproducing it with http over localhost requires some interference: it only occurs if gst_soup_http_src_wait_for_data() is not called for a long time and is then called again later. With http over localhost it is just called once at the beginning and then never again. The error is set in socket_source_dispatch() (gio/gsocket.c). If I use 'kill -STOP <pid>; sleep 3; kill -CONT <pid>' to pause the webserver for a short time, I can trigger a call to gst_soup_http_src_wait_for_data() and with that trigger the error.
I guess it's either a glib issue or incorrect glib API usage. Either way, I just upgraded to glib 2.50.1 and the issue still exists.
And even with the workaround, pausing playback triggers the problem. I'm not sure where exactly the limit is, but pausing about 15 seconds seems to be enough.
This is caused by the way socket_source_dispatch() handles timeouts. If I read the code correctly, then a timeout is detected when the time between two calls to socket_source_dispatch() exceeds timeout + 1 seconds. It does not distinguish between 'source not ready' and 'g_main_context_iteration() was not called'. I have no idea how to fix this or work around it locally in any sane way.
Can you summarize what the remaining problem here is, and when it happens and why? Is that a regression (did it work fine in 1.8?) or a new problem?
The commit that introduced the problem is 1.8.0-102-g881676411240. It worked fine in 1.8 and it works if this commit is reverted. What happens is that under certain circumstances a timeout is detected incorrectly and souphttpsrc reconnects to the server. The (already applied) patch mitigates the problem because it ensures that this reconnect works correctly, so most users won't notice it.

So far, I've not understood the problem completely. From what I can tell, the problem is this: the GSocket timeout handling assumes that the corresponding GMainContext calls the dispatch function with little delay if something happened. If this assumption is violated, the timeout calculation may go wrong. The problematic commit introduces a new GMainContext to handle the socket.

I am currently aware of two situations where the assumption above can be violated:
1. Bad timing. This seems to be pretty rare and there is a workaround for it in comment 14.
2. Paused playback. I don't know any workarounds that don't involve a new thread or disabling timeouts altogether.

I'm not sure if a user-visible regression remains. For live sources, situation 1 is next to impossible and pausing makes no sense. For non-live sources, the user will only notice something if the HTTP server does not support 'Range' headers, and I'm not sure souphttpsrc can handle that anyway.
This sounds like a problem in libsoup / GSocket then. It seems wrong to assume that GSources are immediately called when they are ready. Can you file a bug there? and then we close this one for the time being?
> I'm not sure if a user visible regression remains. For live sources, > situation 1. is next to impossible and pausing makes no sense. For non-live > sources, the user will only notice something, if the HTTP server does not > support 'Range' headers. I'm not sure if souphttpsrc can handle that anyways. People do pause radio streams and such, and if they pause for too long the server will close the connection, and we'll need to reconnect (and the server might not support range headers in that case).
So let's say we have an http radio stream and no Range header support. We pause the stream for, say, 30 seconds, and then want to continue at the same point where we left off (so with 30 seconds of extra latency). This will no longer work, even if the server does not close the connection.
Is there any regression we should fix *now* for 1.10? If so, how? And did you file a bug against glib and/or libsoup? :)
Yes there still is a regression. We reconnect in cases that worked fine before. Maybe we should for now revert to not using non-blocking reads?
FWIW, I've come across this bug. The symptom was the audio occasionally disappearing mid-stream. Flailing about in logging, I noticed that "souphttpsrc gstsouphttpsrc.c:1031:gst_soup_http_src_got_headers:<source> got headers" would appear shortly before the audio drop-out, which seems indicative of a reconnect. With attachment 338491 [details] [review] I still see the "got headers", but the audio didn't drop out. By reverting 8816764112408766889c8b680a3af51115df4bf5 ("souphttpsrc: reduce reading latency by using non-blocking read") the "got headers"/reconnect disappears.
commit 60d30db912a1aedd743e66b9dcd2e21d71fbb24f
Author: Sebastian Dröge <sebastian@centricular.com>
Date:   Mon Oct 31 18:00:07 2016 +0200

    Revert "souphttpsrc: reduce reading latency by using non-blocking read"

    This reverts commit 8816764112408766889c8b680a3af51115df4bf5.

    It causes issues with the timeouts, and causes connections to be closed
    without actual reason. Needs further investigation.

    https://bugzilla.gnome.org/show_bug.cgi?id=773509
(In reply to Sebastian Dröge (slomo) from comment #26) > And did you file a bug against glib and/or libsoup? :) Not yet. It's on my TODO list. I'm actually on vacation right now. I just kept an eye on this because of the upcoming release. A proper bug report for glib (I think) can wait until I'm back.