After an evaluation, GNOME has moved from Bugzilla to GitLab. Learn more about GitLab.
No new issues can be reported in GNOME Bugzilla anymore.
To report an issue in a GNOME project, go to GNOME GitLab.
Do not go to GNOME Gitlab for: Bluefish, Doxygen, GnuCash, GStreamer, java-gnome, LDTP, NetworkManager, Tomboy.
Bug 773509 - souphttpsrc: connection loss / reconnect issues
souphttpsrc: connection loss / reconnect issues
Status: RESOLVED FIXED
Product: GStreamer
Classification: Platform
Component: gst-plugins-good
1.9.90
Other Linux
: Normal blocker
: 1.10.0
Assigned To: GStreamer Maintainers
GStreamer Maintainers
Depends on:
Blocks:
 
 
Reported: 2016-10-26 07:27 UTC by Michael Olbrich
Modified: 2016-10-31 17:53 UTC
See Also:
GNOME target: ---
GNOME version: ---


Attachments
souphttpsrc: reset read_position when reading fails (1.04 KB, patch)
2016-10-26 07:27 UTC, Michael Olbrich
none Details | Review
souphttpsrc: reset read_position when reading fails (1.52 KB, patch)
2016-10-26 09:46 UTC, Michael Olbrich
committed Details | Review

Description Michael Olbrich 2016-10-26 07:27:16 UTC
Created attachment 338475 [details] [review]
souphttpsrc: reset read_position when reading fails

Since 8816764112408766889c8b680a3af51115df4bf5 ("souphttpsrc: reduce reading latency by using non-blocking read") souphttpsrc loses the connection with "Socket I/O timed out" occasionally. It reconnects immediately, but the 'Range' is not updated, so the wrong data is fetched.

I've attached a patch that fixes the 'Range' issue. With this, playing the file works again. The error+reconnect still happen but recovery works as expected.

Sp far I have no idea why the error occurs. It's introduced by the commit mentioned above. The HTTP server is in the local network. The error always occurs at the same position in my test video.
Comment 1 Sebastian Dröge (slomo) 2016-10-26 07:42:12 UTC
Comment on attachment 338475 [details] [review]
souphttpsrc: reset read_position when reading fails

Which code are you thinking about that would update the Range? The check for the read position in gst_soup_http_src_do_request()?


Isn't the problem with resetting the read position here though that we will now *always* start from the beginning again? How would it know about the position where it should start again?
Comment 2 Michael Olbrich 2016-10-26 08:18:07 UTC
The check is in gst_soup_http_src_do_request(), that's correct.
Without this patch 'src->request_position != src->read_position' is always false, so the last range is used, so we start at the same position as the last request, so with something we've already played.

With this patch gst_soup_http_src_add_range_header() uses 'src->request_position' to set the new range. And request_position is always the current position (or the requested position after a seek). It's adjusted every time data is read successfully and no seek request came in since the last request was sent out.

Basically, 'request_position' is the position where we want to be and 'read_position' is the position where we actually are.
Comment 3 Sebastian Dröge (slomo) 2016-10-26 08:23:42 UTC
Comment on attachment 338475 [details] [review]
souphttpsrc: reset read_position when reading fails

That makes sense, can you put this explanation into the commit message and maybe also add a comment into the code about why? :)
Comment 4 Michael Olbrich 2016-10-26 09:46:04 UTC
Created attachment 338491 [details] [review]
souphttpsrc: reset read_position when reading fails

Better?

As a side note: I think we might be able to remove the 'gst_soup_http_src_destroy_input_stream()' here. It will happen anyway when retrying if 'read_position != request_position'. But I'm not 100% sure if that's correct if we don't retry, so I left it there. Calling it twice is ok so it would just be a minor cleanup anyway. Maybe something for master after the release.
Comment 5 Sebastian Dröge (slomo) 2016-10-26 12:16:43 UTC
Comment on attachment 338491 [details] [review]
souphttpsrc: reset read_position when reading fails

Makes sense to me, if it gets another "ok" we should get it into 1.10.0 :)
Comment 6 Tim-Philipp Müller 2016-10-26 12:30:40 UTC
Makes sense to me too. Since it sounds like a regression from the rebasing on top of the new API we should probably get it into 1.10.0.
Comment 7 Thiago Sousa Santos 2016-10-26 12:40:29 UTC
Did someone figure out the root reason for the timeout?
Comment 8 Thiago Sousa Santos 2016-10-26 12:50:41 UTC
Patch looks good to me too.

What's the pipeline causing the timeout? Does it happen because queues get full and it actually times out as it can't read more data in time?
Comment 9 Michael Olbrich 2016-10-26 13:00:46 UTC
I'm still looking for the cause of the timeout. With this patch applied so that playback continues, I get a timeout + reconnect approximately every 30 seconds.
Comment 10 Tim-Philipp Müller 2016-10-26 13:04:40 UTC
What kind of stream is it? Internet radio? HLS? DLNA? File served over http?
Comment 11 Michael Olbrich 2016-10-26 13:53:52 UTC
It's https://download.blender.org/demo/movies/ToS/tearsofsteel_4k.mov but served from a web server in the same network. They are connected via gigabit ethernet, so the download speed is a lot larger than the playback speed.

It does not happen if I use the URL above directly.
Comment 12 Sebastian Dröge (slomo) 2016-10-26 13:57:45 UTC
Attachment 338491 [details] pushed as 8666b7d - souphttpsrc: reset read_position when reading fails
Comment 13 Sebastian Dröge (slomo) 2016-10-26 13:58:19 UTC
Still should check if there is more to fix here... that it times out at all seems problematic.
Comment 14 Michael Olbrich 2016-10-26 14:46:53 UTC
I think there is a problem in the way we handle the GMainContext. When the error occurs, then gst_soup_http_src_wait_for_data() has not been called for >30 Seconds. With the change below, the problems goes away. I think not calling g_main_context_iteration() for a long time confuses the timeout calculation.

@@ -1610,6 +1612,7 @@ gst_soup_http_src_read_buffer (GstSoupHTTPSrc * src, GstBuffer ** outbuf)
 
   if (src->has_pollable_interface) {
     while (1) {
+      g_main_context_iteration (src->poll_context, FALSE);
       read_bytes =
           g_pollable_input_stream_read_nonblocking ((GPollableInputStream *)
           src->input_stream, mapinfo.data, mapinfo.size, src->cancellable,
Comment 15 Sebastian Dröge (slomo) 2016-10-26 15:19:39 UTC
That also makes sense. The way how all this works with the main context seems a bit fragile in general.

Thiago, what do you think about the above addition? Are there more places where we should shortly check the main context?
Comment 16 Thiago Sousa Santos 2016-10-26 15:56:55 UTC
That line should make sure we call it for every buffer but it sounds weird to timeout even though we're reading data frequently. I'd check if there isn't a bug in lower layers (glib?).

Of course, merging the workaround for the release seems fine to me.

Is this reproducible using standard playbin serving http over localhost?
Comment 17 Michael Olbrich 2016-10-27 06:44:58 UTC
I've always used playbin. And it it reproducible with gst-play.
Reproducing it with http over localhost requires some interference: It only occurs if gst_soup_http_src_wait_for_data() is not called for a long time and is then called again later. With http over localhost it is just called once at the beginning and then never again. The error is set in socket_source_dispatch() (gio/gsocket.c).

If I use 'kill -STOP <pid>; sleep 3 ; kill -CONT <pid>' to pause the webserver for a short time, then I can trigger a call to  gst_soup_http_src_wait_for_data() and with that trigger the error.
Comment 18 Michael Olbrich 2016-10-27 07:09:33 UTC
I guess it's either a glib issue or incorrect glib API usage. Either way, I just upgraded to glib 2.50.1 and the issue still exists.
Comment 19 Michael Olbrich 2016-10-27 07:19:45 UTC
And even with the workaround, pausing playback triggers the problem. I'm not sure where exactly the limit is, but pausing about 15 seconds seems to be enough.
Comment 20 Michael Olbrich 2016-10-28 07:02:29 UTC
This is caused by the way socket_source_dispatch() handles timeouts. If I read the code correctly, then a timeout is detected when the time between two calls to socket_source_dispatch() exceeds timeout + 1 seconds. It does not distinguish between 'source not ready' and 'g_main_context_iteration() was not called'.

I have no idea how to fix this or work around it locally in any sane way.
Comment 21 Sebastian Dröge (slomo) 2016-10-31 10:30:43 UTC
Can you summarize what the remaining problem here is, and when it happens and why? Is that a regression (did it work fine in 1.8?) or a new problem?
Comment 22 Michael Olbrich 2016-10-31 11:17:22 UTC
The commit that introduced the problem is 1.8.0-102-g881676411240. It worked fine in 1.8 and it works if this commit is reverted.

What happens is, that under certain circumstances, a timeout is detected incorrectly and souphttpsrc reconnects to the server.
The (already applied) patch mitigates the problems because it ensures that this reconnect works correctly, so most users won't notice the problem.

So far, I've not understood the problem completely. From what I can tell the problem is this:
The GSocket timeout handling assumes, that the corresponding GMainContext calls the dispatch function with little delay, if something happened. If this assumption is violated, then the timeout calculation may go wrong.

The problematic commit introduces a new GMainContext to handle the socket. I am currently aware of two situations where the assumption above can be violated:

1. Bad timing. This seems to be pretty rare and there is a workaround for this in comment 14.
2. Pause Playback. I don't know any workarounds, that don't involve a new thread or disabling timeouts altogether.

I'm not sure if a user visible regression remains. For live sources, situation 1. is next to impossible and pausing makes no sense. For non-live sources, the user will only notice something, if the HTTP server does not support 'Range' headers. I'm not sure if souphttpsrc can handle that anyways.
Comment 23 Sebastian Dröge (slomo) 2016-10-31 11:24:33 UTC
This sounds like a problem in libsoup / GSocket then. It seems wrong to assume that GSources are immediately called when they are ready.

Can you file a bug there? and then we close this one for the time being?
Comment 24 Tim-Philipp Müller 2016-10-31 11:56:32 UTC
> I'm not sure if a user visible regression remains. For live sources,
> situation 1. is next to impossible and pausing makes no sense. For non-live
> sources, the user will only notice something, if the HTTP server does not
> support 'Range' headers. I'm not sure if souphttpsrc can handle that anyways.

People do pause radio streams and such, and if they pause for too long the server will close the connection, and we'll need to reconnect (and the server might not support range headers in that case).
Comment 25 Michael Olbrich 2016-10-31 12:51:27 UTC
So lets say we have a http radio stream and no range header support.
We pause the stream for lets say 30 seconds. And then want to continue at the same point where we left of (so with 30 seconds extra latency).
This will no longer work, even if the server does not close the connection.
Comment 26 Sebastian Dröge (slomo) 2016-10-31 13:09:21 UTC
Is there any regression we should fix *now* for 1.10? If so, how?

And did you file a bug against glib and/or libsoup? :)
Comment 27 Sebastian Dröge (slomo) 2016-10-31 14:37:04 UTC
Yes there still is a regression. We reconnect in cases that worked fine before. Maybe we should for now revert to not using non-blocking reads?
Comment 28 Jonathan Miles 2016-10-31 14:54:52 UTC
FWIW, I've come across this bug. The symptom was the audio occasionally disappearing mid-stream.

Flailing about in logging, I noticed that "souphttpsrc gstsouphttpsrc.c:1031:gst_soup_http_src_got_headers:<source> got headers" would appear shortly before the audio drop-out, which seem indicative of a reconnect.


With attachment 338491 [details] [review] I still see the "got headers", but the audio didn't drop out.

By reverting 8816764112408766889c8b680a3af51115df4bf5 ("souphttpsrc: reduce reading latency by using non-blocking read") the "got headers"/reconnect disappears.
Comment 29 Sebastian Dröge (slomo) 2016-10-31 16:01:47 UTC
commit 60d30db912a1aedd743e66b9dcd2e21d71fbb24f
Author: Sebastian Dröge <sebastian@centricular.com>
Date:   Mon Oct 31 18:00:07 2016 +0200

    Revert "souphttpsrc: reduce reading latency by using non-blocking read"
    
    This reverts commit 8816764112408766889c8b680a3af51115df4bf5.
    
    It causes issues with the timeouts, and causes connections to be closed
    without actual reason. Needs further investigation.
    
    https://bugzilla.gnome.org/show_bug.cgi?id=773509
Comment 30 Michael Olbrich 2016-10-31 17:53:56 UTC
(In reply to Sebastian Dröge (slomo) from comment #26)
> And did you file a bug against glib and/or libsoup? :)

Not yet. It's on my TODO list. I'm actually on vacation right now. I just kept an eye on this because of the upcoming release. A proper bug report for glib (I think) can wait until I'm back.