GNOME Bugzilla – Bug 760930
adaptivedemux: stream can pause for a long time when server errors encountered
Last modified: 2018-11-03 13:45:36 UTC
During testing of dashdemux/adaptivedemux with live streams we noticed that playback would cease and wait for a long time before attempting to carry on. This was caused by an unexpected 503 error from the DASH server, which caused the adaptivedemux element to immediately wait for the next MPD update. The MPD used had a 1 hour update period, meaning that the stream could remain frozen for up to 1 hour after encountering an HTTP server error. We believe this is a less than ideal situation for playback of a live stream.

We propose that the error recovery logic in gst_adaptive_demux_download_fragment and gst_adaptive_demux_download_loop be changed. For a live stream, we believe a better error recovery scenario would be achieved by:

1. Retry the current media fragment first, up to the retry limit.
2. Once the retry limit is reached for the current media fragment, abandon it and resynchronise the stream to the fragment that should be at the current running time (i.e. bring it back up to date with what would be playing out had the stream continued and not encountered an error).
3. If there is no new fragment to resynchronise to, only then wait for a manifest update.

Step 2 could also just skip to the next fragment should one be available. If this logic seems sound I'll try to produce a patch to implement it. Can any of the maintainers foresee any problems with these proposed changes to adaptivedemux?
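To make the ordering concrete, here is a minimal sketch of the proposed flow. MAX_DOWNLOAD_ERROR_COUNT stands for the existing retry limit in gstadaptivedemux.c; resync_stream_to_current_running_time() and wait_for_manifest_update() are hypothetical helpers standing in for the real logic, not actual API:

/* Sketch of the proposed live-stream recovery flow (helper names
 * hypothetical). */
static GstFlowReturn
download_fragment_with_recovery (GstAdaptiveDemux * demux,
    GstAdaptiveDemuxStream * stream)
{
  GstFlowReturn ret = GST_FLOW_ERROR;
  guint retries = 0;

  /* 1. Retry the failing fragment up to the retry limit. */
  while (retries < MAX_DOWNLOAD_ERROR_COUNT) {
    ret = gst_adaptive_demux_stream_download_fragment (stream);
    if (ret == GST_FLOW_OK)
      return ret;
    retries++;
  }

  if (!gst_adaptive_demux_is_live (demux))
    return ret;                 /* non-live keeps the old behaviour */

  /* 2. Give up on this fragment and resynchronise to the fragment
   * that should be playing out at the current running time. */
  if (resync_stream_to_current_running_time (demux, stream))
    return GST_FLOW_OK;

  /* 3. Nothing newer to resynchronise to: only now block waiting for
   * the next manifest update. */
  wait_for_manifest_update (demux);
  return GST_FLOW_OK;
}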
Sounds like a good solution, but 1) also seems like something that should be done for non-live streams.
Created attachment 320445 [details] [review] Patch to change the way fragments are retried in live streams

Here's my proposed patch. In the patch I needed to know the difference between a stream seek that moved onto a new fragment and one that didn't. Since none of the code checked the return value from stream_seek, I added an extra return code, GST_FLOW_CUSTOM_SUCCESS, to indicate that the seek succeeded but didn't actually change the stream position; GST_FLOW_OK means the seek worked and changed the stream position. Hope this is ok.
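To illustrate the convention at a call site (the exact stream_seek signature here is approximate, not copied from the patch):

/* Caller-side illustration of the proposed return-code convention. */
GstFlowReturn ret;
GstClockTime final_ts;

ret = demux_class->stream_seek (stream, TRUE, 0, target_ts, &final_ts);

if (ret == GST_FLOW_OK) {
  /* Seek succeeded and moved the stream onto a different fragment. */
} else if (ret == GST_FLOW_CUSTOM_SUCCESS) {
  /* Seek succeeded but the stream position did not change, i.e. we
   * are still on the same fragment as before the seek. */
} else {
  /* Seek failed. */
}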
Sebastian, to answer your earlier comment, the behaviour for non-live streams was already to do the retries, so I've left that behaviour as it is.
I like this idea. The new return value needs to be replicated in mssdemux and hlsdemux as well.
Created attachment 320505 [details] [review] Patch to change the way fragments are retried in live streams (dashdemux/hlsdemux/mssdemux)
I've updated to the latest git master HEAD and implemented on top of the new changes to stream_seek from there. I've also included mssdemux and hlsdemux. I had to write the seek function for hlsdemux, as it didn't have one already, so I based it on some of the code in the seek event handling. I can extensively test the dashdemux path, but I've only been able to do a cursory check on hlsdemux and mssdemux using example streams.
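For reference, the new hlsdemux seek function has roughly this shape. This is a sketch based on the playlist walk in the seek-event code, not the patch itself; internal field names (client->current->files, file->duration, etc.) are approximate, and the forward/flags (e.g. snap flag) handling is omitted for brevity:

/* Rough shape of a stream_seek for ext/hls/gsthlsdemux.c. */
static GstFlowReturn
gst_hls_demux_stream_seek (GstAdaptiveDemuxStream * stream, gboolean forward,
    GstSeekFlags flags, GstClockTime ts, GstClockTime * final_ts)
{
  GstHLSDemux *hlsdemux = GST_HLS_DEMUX (stream->demux);
  GList *walk;
  GstClockTime pos = 0;
  GstM3U8MediaFile *file = NULL;

  /* Walk the media playlist, accumulating fragment durations, until we
   * reach the fragment containing the target timestamp. */
  for (walk = hlsdemux->client->current->files; walk; walk = walk->next) {
    GstM3U8MediaFile *candidate = walk->data;

    if (pos <= ts && ts < pos + candidate->duration) {
      file = candidate;
      break;
    }
    pos += candidate->duration;
  }

  if (file == NULL)
    return GST_FLOW_EOS;

  /* Snap to the fragment boundary and report where we landed. */
  hlsdemux->client->sequence = file->sequence;
  hlsdemux->client->sequence_position = pos;
  if (final_ts)
    *final_ts = pos;

  return GST_FLOW_OK;
}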
Review of attachment 320505 [details] [review]:

::: ext/hls/gsthlsdemux.c
@@ +414,3 @@
+}
+
+static GstFlowReturn

Please make sure this doesn't break the unit tests that do snap-seeking.

::: gst-libs/gst/adaptivedemux/gstadaptivedemux.c
@@ +2604,3 @@
+
+  gst_adaptive_demux_get_live_seek_range (demux, &range_start,
+      &range_end);

smoothstreaming doesn't implement this. We need to skip this if it isn't implemented, or just have some sane default that we could try, perhaps the last fragment timestamp in the current manifest version?

@@ +2605,3 @@
+  gst_adaptive_demux_get_live_seek_range (demux, &range_start,
+      &range_end);
+  stream_time = range_end - stream->segment.base;

Do we need to subtract base here? AFAIK the ranges are already in the correct values.
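For the smoothstreaming point, a guard along these lines might work (a sketch; get_last_fragment_timestamp() is a hypothetical stand-in for the suggested fallback):

GstAdaptiveDemuxClass *klass = GST_ADAPTIVE_DEMUX_GET_CLASS (demux);
gint64 range_start, range_end;
GstClockTime stream_time;

if (klass->get_live_seek_range != NULL &&
    gst_adaptive_demux_get_live_seek_range (demux, &range_start, &range_end)) {
  stream_time = range_end;
} else {
  /* Subclass (e.g. mssdemux) has no live seek range: fall back to the
   * last fragment timestamp in the current manifest version. */
  stream_time = get_last_fragment_timestamp (demux);    /* hypothetical */
}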
I'm having a look at the unit tests, but I'm having trouble getting them to execute consistently. One time all tests will pass, another run (without recompiling) and the test will show bus errors and seg-faults. Is there a trick to getting consistent runs for the unit tests?
> I'm having a look at the unit tests, but I'm having trouble getting them to > execute consistently. One time all tests will pass, another run (without > recompiling) and the test will show bus errors and seg-faults. Is there a > trick to getting consistent runs for the unit tests? Not really, how do you run them exactly? On what platform?
The platform I'm trying to test on is a Dell T7400 (Intel Xeon E5420, 8GB ram, NVidia Quadro FX 1700) running up-to-date Fedora 23. I'm using "make check" to run the unit tests after compiling with "make -j4". I'm building in a separate directory to the source (i.e. running the configure script in the gst-plugins-bad directory from a different build directory), if that makes any difference.
Can you share the errors you get? Are they failing before your modifications? This looks like a separate bug.
I did another git pull before trying to get some logs together for you. The tests now seem a lot more stable, although I'm still seeing:

- elements/mplex occasionally (randomly) fails
- elements/audiointerleave and elements/audiomixer always fail
- elements/mssdemux fails when run with other tests, but succeeds when run on its own

I don't believe my code touches mplex or the audio* ones, so is it safe to ignore those failures? The mssdemux error (when run with other tests) is:

0:00:06.241177356 17899 0x7fca88002a80 ERROR adaptivedemux gstadaptivedemux.c:3010:gst_adaptive_demux_stream_download_loop:<mssdemux:audio_00> Download error: A network error occurred, or the server closed the connection unexpectedly.:
/project-mcuk/bcs/davidjw/InetDist/DASM/rd-gst-plugins-bad/tests/check/elements/mssdemux.c(525): test_fragment_download_error_src_create (): /GstPipeline:pipeline/GstMssDemux:mssdemux/GstBin:srcbin-audio_00/GstTestHTTPSrc:testhttpsrc1: A network error occurred, or the server closed the connection unexpectedly.
91%: Checks: 12, Failures: 0, Errors: 1
/project-mcuk/bcs/davidjw/InetDist/DASM/rd-gst-plugins-bad/tests/check/elements/test_http_src.c:477:E:basicTest:testDownloadError:0: (after this point) Test timeout expired
FAIL elements/mssdemux (exit status: 1)

elements/hls_demux is now failing consistently on snap seeks, so I can get on and look at that.
Created attachment 322356 [details] [review] Patch to change the way fragments are retried in live streams (dashdemux/mssdemux) This patch should address issues with the previous patch.
I don't think the logic should be to always jump back to the current live point following an error. One transient network error causing playback to jump forwards would be rather confusing (and probably annoying) to the viewer. Imagine if you were watching a sports event, 30 minutes behind live. Having the stream suddenly jump forward 30 minutes without any user input would be rather annoying. I think the behaviour should be to move forwards, one segment at a time, until it is either at live or has succeeded in getting back into a good state.
Yes, I'd thought of that, but quickly realised skipping forward one segment at a time wouldn't work. I then tried to figure out how far the viewer was from live and use that as an offset to recalculate, but I couldn't work out how far behind live the viewer actually is (due to pauses/skips). That figure doesn't seem to be easily available, and there doesn't seem to be any way to derive it from absolute time, running time, stream time and/or segment information (my first patch had incorrectly tried to use the segment base as this offset, due to me misreading the GstSegment documentation).

The problem with just trying to skip forward one segment at a time is that, due to the retry logic, each fragment takes a fragment duration plus a little extra to retry. So simply skipping to the next fragment will never catch the stream up; it needs to be a jump forward, sometimes 2 or 3 fragments, to get back to the streaming point. As pointed out above, calculating that skip doesn't seem to be an easy task.

I'm open to suggestions to improve this, but I was spending way too much time trying all sorts of combinations of clocks, bases and offsets. I had to make a decision to just drop that behaviour and assume that most viewers would be watching a live stream at the live point. For those not viewing live, there would be a disruption to playback anyway during the retry of the current fragment, followed by a jump back to live. I realise this is not ideal, but I also believe it is better than the current master branch behaviour, which can freeze playback for long periods of time waiting for a manifest update.

After posting this latest patch I did have a thought: try remembering the difference between the current playback stream time and the live stream time upon a seek event or transition to PLAYING, and use that as an offset. But I'm not sure how well this would work and I haven't had a chance to try it yet.
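Something along these lines is what I have in mind (an untested sketch; live_offset is a hypothetical new field, not code from any of the attached patches):

/* On a seek event or a transition to PLAYING, remember how far the
 * chosen position is behind the live edge. */
gint64 range_start, range_end;

if (gst_adaptive_demux_get_live_seek_range (demux, &range_start, &range_end))
  demux->priv->live_offset = range_end - seek_position;

/* Later, when resynchronising after the retry limit is hit, jump to
 * "live minus the remembered distance" instead of straight to the
 * live edge, so a viewer watching behind live stays behind live. */
if (gst_adaptive_demux_get_live_seek_range (demux, &range_start, &range_end))
  stream_time = range_end - demux->priv->live_offset;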
Created attachment 328078 [details] [review] Updated patch which includes better tracking of playback distance from live

Finally had a chance to try out the method from my previous post, and it works quite well. Here's a new version of the patch with changes to track how far playback is behind live and use that adjustment when resynchronising the stream.
Is this related to https://bugzilla.gnome.org/show_bug.cgi?id=776609 ?
Certainly looks like they came across the same situation we did, i.e. a fragment download error causing the download loop to wait for an MPD update which could take a long time.
-- GitLab Migration Automatic Message -- This bug has been migrated to freedesktop.org's GitLab instance and has been closed from further activity. You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.freedesktop.org/gstreamer/gst-plugins-bad/issues/343.