GNOME Bugzilla – Bug 765728
adaptivedemux: Segfault since "use realtime_clock for waiting for a condition" commit
Last modified: 2016-05-06 08:55:17 UTC
Created attachment 326943 [details] Log no crash before the commit Since commit gst-plugins-bad aa58a70d6676f9bc394780a90a39ff47d538fa68 We do have a segfault when we stop the playing stream. With commit gst-plugins-bad 74d62b91449b2f21d417bc67b58792217c185b8d all runs ok. device linux stb settopbox. pli master-next image. including log on the last ok commit followed by log same stream where segaults occurs after hls finalize. But on logs can't be see what it is. The device is not powerfull enough to run gdb. Perhaps another log can bring more
Created attachment 326945 [details] log there segfault occurs up on media stop after hls finalize
For reference, that commit is: ommit aa58a70d6676f9bc394780a90a39ff47d538fa68 Author: Florin Apostol <florin.apostol@oregan.net> Date: Tue Feb 16 14:44:27 2016 +0000 adaptivedemux: use realtime_clock for waiting for a condition There are several places in adaptivedemux where it waits for time to pass, for example to wait until it should next download a fragment. The problem with this approach is that it means that unit tests are forced to execute in realtime. This commit replaces the use of g_cond_wait_until() with single shot GstClockID that signals the condition variable. Under normal usage, this behaves exactly as before. A unit test can replace the system clock with a GstTestClock, allowing the test to control the timing in adaptivedemux. https://bugzilla.gnome.org/show_bug.cgi?id=762147
Perhaps you could use some printing around the _finalize() function to pinpoint where it segfauls? Can you reproduce the failure on a desktop or any other machine? Also, can you share the stream that caused it? Or does it happen with any stream?
Yes if You wish I can try a run where extra debug output is set in. I'm no first recompiling the plugins-bad to commit aa58a70d6676f9bc394780a90a39ff47d538fa68 That is the first who gives segfault. I runned the gstreamer with debug options. Followed with the image start. GST_DEBUG=dvbaudiosink:4,dvbvideosink:4,hls:5,dashdemux:5,adaptivedemux:5,adaptivedemux-1.0:5 enigma2.sh perhaps also an extra module can be brought in. the error seems to come from adaptivedemux. I will also test others streams who are not using adaptivedemux
I doubled checked. It is only crashing up on stream stop by streams who are using adaptif streaming. It does not happen with all other live streams. ok streams: all file media 100 % ok live streams or non live streams. rtmp or whatever. but soon adaptivedemux is used segfault up on media stop. I also tried updating up to very last plugins-bad same issue. If a then just replace the libgstadaptivedemux-1.0.so by that build with commit: 74d62b91449b2f21d417bc67b58792217c185b8d no segfault anymore up on media stop.
Thanks for investigating, is this all on your settopbox? Did you try on a standard computer/desktop? 2 questions: Is this an HLS live stream or a on-demand? Can the URL be shared in case it is available online and publicly? How is it stopped? At the end of the stream or are you shutting down the pipeline in the middle of playback?
here a couple off public ones http://rtmp.infomaniak.ch/livecast/tvm3/PAZ.m3u8 http://live.rpctv.com/rpctv/tm.smil/playlist.m3u8 It is stopped in the middle of playing. When we switch back to channel or gstreamer application player in cpp wil stop gstreamer the playbin will stop the related sink's customed cause we are working with dvb are stopped. we wait on nothing.
apart off course or stop message .
Thanks, those streams doesn't seem to crash here. Did you try reproducing it in another device other than the settopbox?
now I can't since I do not have othet device running gstreamer at that update state. But for ubuntu I will need but that's for in a couple off weeks anyway upgrade to new lts 16.x which are the best programs to uninstall(the original ubuntu ones) and compile step by step to work with gstreamer head ?
You can use the gst-uninstalled script to build and use the latest git master without having to install gstreamer on your system: https://cgit.freedesktop.org/gstreamer/gstreamer/tree/scripts/gst-uninstalled
I still can't have enough info to really trace the bug. But what I well now is that on a stb one may not play around with system clock as this in the best case will lead once back to a channel no decoding anymore due to wrong time settings. But mostly when it's used some where seg fault almost always occurs. I really can see where but gies that the patch with wait function somewhere tries to take owner ship off system clock and ...
Here a log I now took last commit : from gst-plugins-bad the last append is the log. This time with debug off application included. It's really somewhere stuck in wait while gstreamer is stopped and that's why segfaults occurs .
Created attachment 326963 [details] really stuck when gst stops gues the wait state does that.
Created attachment 326964 [details] Again a log before the commit who breaks. Here I set the log, I just used the libgstadaptivedemux-1.0.so build with commit : 74d62b91449b2f21d417bc67b58792217c185b8d The rest is on gstreamer head. here no problem .
(In reply to christophe vr from comment #12) > I still can't have enough info to really trace the bug. But what I well now > is that on a stb one may not play around with system clock as this in the > best case will lead once back to a channel no decoding anymore due to wrong > time settings. But mostly when it's used some where seg fault almost always > occurs. I really can see where but gies that the patch with wait function > somewhere tries to take owner ship off system clock and ... That commit does not "play around with the system clock" or "take ownership", it references it to control waiting for a period of time. Internally g_cond_wait_until() will be using system time, therefore I don't think it is relevant to this discussion. I've not yet managed to reproduce the bug, but my suspicion is that the gst_adaptive_demux_wait_until() function is not thread safe.
Yes think something like that. For info the problem is only on mipsel32el boxes. The stb's who run with sh4 (same customized sink's and application) does not have this segfault.
+static gboolean +gst_adaptive_demux_wait_until (GstClock * clock, GCond * cond, GMutex * mutex, + GstClockTime end_time) +{ + GstAdaptiveDemuxTimer timer; + GstClockReturn res; + + if (G_UNLIKELY (!GST_CLOCK_TIME_IS_VALID (end_time))) { + /* for an invalid time, gst_clock_id_wait_async will try to call + * gst_adaptive_demux_clock_callback from the current thread. + * It still holds the mutex while doing that, so it will deadlock. + * g_cond_wait_until would return immediately with false, so we'll do the s ame. + */ + return FALSE; + } + timer.fired = FALSE; + timer.cond = cond; + timer.mutex = mutex; + timer.clock_id = gst_clock_new_single_shot_id (clock, end_time); + res = + gst_clock_id_wait_async (timer.clock_id, + gst_adaptive_demux_clock_callback, &timer, NULL); What happens if end_time is already in the past? Will the callback get called using this thread? If so, that would cause a deadlock as the mutex is already held. + /* clock does not support asynchronously wait. Assert and return */ + if (res == GST_CLOCK_UNSUPPORTED) { + gst_clock_id_unref (timer.clock_id); + g_return_val_if_reached (TRUE); + } + /* the gst_adaptive_demux_clock_callback will signal the + cond when the clock's single shot timer fires */ + g_cond_wait (cond, mutex); If the condition has been signalled by another thread (e.g. one wanting to stop the task) should gst_clock_id_unschedule() be called? Could the callback fire after this function has returned? + gst_clock_id_unref (timer.clock_id); + return !timer.fired; +}
(In reply to christophe vr from comment #17) > Yes think something like that. > For info the problem is only on mipsel32el boxes. > The stb's who run with sh4 (same customized sink's and application) does not > have this segfault. Interesting, I'm also using mips (BCM7409, BCM7231) boxes and have not yet been bitten by this problem. Of course that's the problem with race conditions, they are very sensitive to the execution environment. At least it's readily reproducible for you, so we have at least one person who can test fixes!
Created attachment 327009 [details] [review] adaptivedemux: unschedule clock callback when tasks are ending Looking at the code, the lack of an unschedule call for the async timer might be a cause of problems during finalize. Christophe, are you be able to try this patch to see if it helps?
Ok just tested and bingo works perfect and no segfaults anymore.
(In reply to christophe vr from comment #21) > Ok just tested and bingo works perfect and no segfaults anymore. Great news Hopefully Tim or Thiago can get this onto master.
Comment on attachment 327009 [details] [review] adaptivedemux: unschedule clock callback when tasks are ending Nice catch. At first glance this whole function looks a bit fishy to me though. There are three question I have: 1) as I understand it, g_cond_wait() may wake up spuriously without anyone having called g_cond_signal() on it, so g_cond_wait() should always be used in a loop really. Here we probably want to do { g_cond_wait(); } while (!timer.fired) or so, no? 2) This means we need to check another variable to see if we were cancelled and should break out of the loop. 3) It's not clear to me if we can use a stack-allocated GstAdaptiveDemuxTimer here. The clock callbacks will be called from a different thread, right? So when we call gst_clock_id_unschedule() as in the patch we have no guarantee that the timer callback is not currently running in the other thread (waiting for us to release the lock perhaps), so we might return from the function and trash our stack before the callback gets a chance to finish if I'm not mistaken.
any news if the last patch will be applied ?
Christophe, as you can see I have reviewed the patch and think it needs more changes. It will be applied as soon as we have consensus on what the right patch is.
Created attachment 327345 [details] [review] adaptivedemux: fix threading issues in gst_adaptive_demux_wait_until function The gst_adaptive_demux_wait_until() function can be woken up either by its end_time being reached, or from other threads that want to interrupt the waiting thread. If the thread is interrupted, it needs to cancel its async clock callback by unscheduling the clock callback. However, the callback task might already have been activated, but is waiting for the mutex to become available. In this case, the call to unschedule does not stop the callback from executing. The solution to this second issue is to use a reference counted object that is decremented by both the gst_adaptive_demux_wait_until() function and the call to gst_clock_id_wait_async (). In this way, the GstAdaptiveDemuxTimer object is only deleted when both the gst_adaptive_demux_wait_until() function and the async callback are finished with the object.
(In reply to Tim-Philipp Müller from comment #23) > Comment on attachment 327009 [details] [review] [review] > adaptivedemux: unschedule clock callback when tasks are ending > > Nice catch. At first glance this whole function looks a bit fishy to me > though. > > There are three question I have: > > 1) as I understand it, g_cond_wait() may wake up spuriously without anyone > having called g_cond_signal() on it, so g_cond_wait() should always be used > in a loop really. Here we probably want to do { g_cond_wait(); } while > (!timer.fired) or so, no? > > 2) This means we need to check another variable to see if we were cancelled > and should break out of the loop. The gst_adaptive_demux_wait_until() function is replacing a call to g_cond_wait(). Therefore that while() { } logic should already exist in the functions that call gst_adaptive_demux_wait_until(). > > 3) It's not clear to me if we can use a stack-allocated > GstAdaptiveDemuxTimer here. The clock callbacks will be called from a > different thread, right? So when we call gst_clock_id_unschedule() as in the > patch we have no guarantee that the timer callback is not currently running > in the other thread (waiting for us to release the lock perhaps), so we > might return from the function and trash our stack before the callback gets > a chance to finish if I'm not mistaken. Yes, I think that is a potential problem. I've created a new patch that uses a reference counted object.
> The gst_adaptive_demux_wait_until() function is replacing a call to > g_cond_wait(). Therefore that while() { } logic should already exist in the > functions that call gst_adaptive_demux_wait_until(). Good point, you're right. > > [whether GstAdaptiveDemuxTimer can be stack-allocated] > > Yes, I think that is a potential problem. I've created a new > patch that uses a reference counted object. Great. Do you think you could make it a mini object instead of a GObject? Or simply a custom ref/unref with g_atomic_int_*() wrapping the freeing?
> > Great. Do you think you could make it a mini object instead of a GObject? Or My first thought was to use a mini object, but according to the documentation, you can only make it writable when the reference count is one. We need to have the callback modify the fired boolean. At that point the reference count is normally two. > simply a custom ref/unref with g_atomic_int_*() wrapping the freeing? True, a custom ref/unref could be used. Is that faster than using GObject's implementation of the same thing?
I think we can simply ignore the writability semantics here and set the boolean to what we need, just as we do now, no? A custom ref/unref is faster because it means we don't have to go through the type system. Not that I think this is going to be performance relevant in practice, but using a GObject feels a bit over the top somehow :)
Created attachment 327367 [details] [review] adaptivedemux: fix threading issues in gst_adaptive_demux_wait_until function Replaced GObject with custom ref / unref functions.
Thanks, pushed (with minor fixup of comment formatting, removing the gtk-doc marker for function and adding '*' for all lines in multi-line comment) commit 792e9a60330d4559e119263400558b5be2f01715 Author: Alex Ashley <bugzilla@ashley-family.net> Date: Fri Apr 29 12:31:01 2016 +0100 adaptivedemux: fix threading issues in gst_adaptive_demux_wait_until function The gst_adaptive_demux_wait_until() function can be woken up either by its end_time being reached, or from other threads that want to interrupt the waiting thread. If the thread is interrupted, it needs to cancel its async clock callback by unscheduling the clock callback. However, the callback task might already have been activated, but is waiting for the mutex to become available. In this case, the call to unschedule does not stop the callback from executing. The solution to this second issue is to use a reference counted object that is decremented by both the gst_adaptive_demux_wait_until() function and the call to gst_clock_id_wait_async (). In this way, the GstAdaptiveDemuxTimer object is only deleted when both the gst_adaptive_demux_wait_until() function and the async callback are finished with the object. https://bugzilla.gnome.org/show_bug.cgi?id=765728