Bug 765728 – adaptivedemux: Segfault since "use realtime_clock for waiting for a condition" commit

After an evaluation, GNOME has moved from Bugzilla to GitLab. Learn more about GitLab.
No new issues can be reported in GNOME Bugzilla anymore.
To report an issue in a GNOME project, go to GNOME GitLab.
Do not go to GNOME Gitlab for: Bluefish, Doxygen, GnuCash, GStreamer, java-gnome, LDTP, NetworkManager, Tomboy.

Bug 765728 - adaptivedemux: Segfault since "use realtime_clock for waiting for a condition" commit


Summary:	adaptivedemux: Segfault since "use realtime_clock for waiting for a condition...


Status:	RESOLVED FIXED

Product:	GStreamer
Classification:	Platform
Component:	gst-plugins-bad
Version:	git master
Hardware:	Other Linux

Importance:	Normal blocker
Target Milestone:	1.9.1
Assigned To:	GStreamer Maintainers
QA Contact:	GStreamer Maintainers

URL:
Whiteboard:

Depends on:
Blocks:

Reported:	2016-04-28 14:20 UTC by christophe vr
Modified:	2016-05-06 08:55 UTC

See Also:	762147
GNOME target:	---
GNOME version:	---

Attachments
Log no crash before the commit (115.86 KB, text/plain) 2016-04-28 14:20 UTC, christophe vr		Details
log there segfault occurs up on media stop after hls finalize (113.36 KB, text/plain) 2016-04-28 14:21 UTC, christophe vr		Details
really stuck when gst stops gues the wait state does that. (133.43 KB, text/plain) 2016-04-28 19:30 UTC, christophe vr		Details
Again a log before the commit who breaks. (146.49 KB, text/plain) 2016-04-28 19:43 UTC, christophe vr		Details
adaptivedemux: unschedule clock callback when tasks are ending (1.34 KB, patch) 2016-04-29 11:39 UTC, A Ashley	none	Details \| Review
adaptivedemux: fix threading issues in gst_adaptive_demux_wait_until function (5.92 KB, patch) 2016-05-05 12:20 UTC, A Ashley	none	Details \| Review
adaptivedemux: fix threading issues in gst_adaptive_demux_wait_until function (5.41 KB, patch) 2016-05-06 08:13 UTC, A Ashley	committed	Details \| Review

Description christophe vr 2016-04-28 14:20:14 UTC

Created attachment 326943 [details]
Log no crash before the commit

Since commit gst-plugins-bad aa58a70d6676f9bc394780a90a39ff47d538fa68
We do have a segfault when we stop the playing stream.
With commit gst-plugins-bad 74d62b91449b2f21d417bc67b58792217c185b8d
all runs ok.

device linux stb settopbox. pli master-next image.

including log on the last ok commit followed by log same stream where segaults occurs after hls finalize. But on logs can't be see what it is. The device is not powerfull enough to run gdb.

Perhaps another log can bring more

Comment 1 christophe vr 2016-04-28 14:21:49 UTC

Created attachment 326945 [details]
log there segfault occurs up on media stop after hls finalize

Comment 2 Tim-Philipp Müller 2016-04-28 14:27:01 UTC

For reference, that commit is:

ommit aa58a70d6676f9bc394780a90a39ff47d538fa68
Author: Florin Apostol <florin.apostol@oregan.net>
Date:   Tue Feb 16 14:44:27 2016 +0000

    adaptivedemux: use realtime_clock for waiting for a condition
    
    There are several places in adaptivedemux where it waits for
    time to pass, for example to wait until it should next download
    a fragment. The problem with this approach is that it means that
    unit tests are forced to execute in realtime.
    
    This commit replaces the use of g_cond_wait_until() with single
    shot GstClockID that signals the condition variable. Under normal
    usage, this behaves exactly as before. A unit test can replace the
    system clock with a GstTestClock, allowing the test to control the
    timing in adaptivedemux.
    
    https://bugzilla.gnome.org/show_bug.cgi?id=762147

Comment 3 Thiago Sousa Santos 2016-04-28 14:27:42 UTC

Perhaps you could use some printing around the _finalize() function to pinpoint where it segfauls?

Can you reproduce the failure on a desktop or any other machine? Also, can you share the stream that caused it? Or does it happen with any stream?

Comment 4 christophe vr 2016-04-28 14:57:34 UTC

Yes if You wish I can try a run where extra debug output is set in.
I'm no first recompiling the plugins-bad to commit aa58a70d6676f9bc394780a90a39ff47d538fa68
That is the first who gives segfault.

I runned the gstreamer with debug options. Followed with the image start.

GST_DEBUG=dvbaudiosink:4,dvbvideosink:4,hls:5,dashdemux:5,adaptivedemux:5,adaptivedemux-1.0:5 enigma2.sh
perhaps also an extra module can be brought in.
the error seems to come from adaptivedemux. I will also test others streams who are not using adaptivedemux

Comment 5 christophe vr 2016-04-28 15:22:23 UTC

I doubled checked.

It is only crashing up on stream stop by streams who are using adaptif streaming.
It does not happen with all other live streams.

ok streams:
all file media 100 % ok
live streams or non live streams.
rtmp or whatever.
but soon adaptivedemux is used segfault up on media stop.
I also tried updating up to very last plugins-bad same issue. If a then just replace the libgstadaptivedemux-1.0.so by that build with  commit:
74d62b91449b2f21d417bc67b58792217c185b8d
no segfault anymore up on media stop.

Comment 6 Thiago Sousa Santos 2016-04-28 15:24:42 UTC

Thanks for investigating, is this all on your settopbox? Did you try on a standard computer/desktop?

2 questions:
Is this an HLS live stream or a on-demand? Can the URL be shared in case it is available online and publicly?

How is it stopped? At the end of the stream or are you shutting down the pipeline in the middle of playback?

Comment 7 christophe vr 2016-04-28 15:41:42 UTC

here a couple off public ones

http://rtmp.infomaniak.ch/livecast/tvm3/PAZ.m3u8
http://live.rpctv.com/rpctv/tm.smil/playlist.m3u8


It is stopped in the middle of playing. When we switch back to channel or gstreamer application player in cpp wil stop gstreamer the playbin will stop the related sink's customed cause we are working with dvb are stopped. we wait on nothing.

Comment 8 christophe vr 2016-04-28 15:43:47 UTC

apart off course or stop message .

Comment 9 Thiago Sousa Santos 2016-04-28 15:51:19 UTC

Thanks, those streams doesn't seem to crash here. Did you try reproducing it in another device other than the settopbox?

Comment 10 christophe vr 2016-04-28 16:09:06 UTC

now I can't since I do not have othet device running gstreamer at that update state.

But for ubuntu I will need but that's for in a couple off weeks anyway upgrade to new lts 16.x which are the best programs to uninstall(the original ubuntu ones) and compile step by step to work with gstreamer head ?

Comment 11 Thiago Sousa Santos 2016-04-28 16:27:06 UTC

You can use the gst-uninstalled script to build and use the latest git master without having to install gstreamer on your system:

https://cgit.freedesktop.org/gstreamer/gstreamer/tree/scripts/gst-uninstalled

Comment 12 christophe vr 2016-04-28 18:42:45 UTC

I still can't have enough info to really trace the bug. But what I well now is that on a stb one may not play around with system clock as this in the best case will lead once back to a channel no decoding anymore due to wrong time settings. But mostly when it's used some where seg fault almost always occurs. I really can see where but gies that the patch with wait function somewhere tries to take owner ship off system clock and ...

Comment 13 christophe vr 2016-04-28 19:29:50 UTC

Here a log I now took last commit : from gst-plugins-bad

the last append is the log. This time with debug off application included.

It's really somewhere stuck  in wait while gstreamer is stopped and that's why segfaults occurs .

Comment 14 christophe vr 2016-04-28 19:30:45 UTC

Created attachment 326963 [details]
really stuck when gst stops gues the wait state does that.

Comment 15 christophe vr 2016-04-28 19:43:25 UTC

Created attachment 326964 [details]
Again a log before the commit who breaks.

Here I set the log, I just used the libgstadaptivedemux-1.0.so
build with commit :
74d62b91449b2f21d417bc67b58792217c185b8d

The rest is on gstreamer head.

here no problem .

Comment 16 A Ashley 2016-04-29 08:54:50 UTC

(In reply to christophe vr from comment #12)
> I still can't have enough info to really trace the bug. But what I well now
> is that on a stb one may not play around with system clock as this in the
> best case will lead once back to a channel no decoding anymore due to wrong
> time settings. But mostly when it's used some where seg fault almost always
> occurs. I really can see where but gies that the patch with wait function
> somewhere tries to take owner ship off system clock and ...

That commit does not "play around with the system clock" or "take ownership", it references it to control waiting for a period of time. Internally g_cond_wait_until() will be using system time, therefore I don't think it is relevant to this discussion.

I've not yet managed to reproduce the bug, but my suspicion is that the gst_adaptive_demux_wait_until() function is not thread safe.

Comment 17 christophe vr 2016-04-29 09:25:01 UTC

Yes think something like that.
For info the problem is only on mipsel32el boxes.
The stb's who run with sh4 (same customized sink's and application) does not have this segfault.

Comment 18 A Ashley 2016-04-29 09:33:40 UTC

+static gboolean
+gst_adaptive_demux_wait_until (GstClock * clock, GCond * cond, GMutex * mutex,
+    GstClockTime end_time)
+{
+  GstAdaptiveDemuxTimer timer;
+  GstClockReturn res;
+
+  if (G_UNLIKELY (!GST_CLOCK_TIME_IS_VALID (end_time))) {
+    /* for an invalid time, gst_clock_id_wait_async will try to call
+     * gst_adaptive_demux_clock_callback from the current thread.
+     * It still holds the mutex while doing that, so it will deadlock.
+     * g_cond_wait_until would return immediately with false, so we'll do the s
ame.
+     */
+    return FALSE;
+  }
+  timer.fired = FALSE;
+  timer.cond = cond;
+  timer.mutex = mutex;
+  timer.clock_id = gst_clock_new_single_shot_id (clock, end_time);
+  res =
+      gst_clock_id_wait_async (timer.clock_id,
+      gst_adaptive_demux_clock_callback, &timer, NULL);

What happens if end_time is already in the past? Will the callback get called using this thread? If so, that would cause a deadlock as the mutex is already held.

+  /* clock does not support asynchronously wait. Assert and return */
+  if (res == GST_CLOCK_UNSUPPORTED) {
+    gst_clock_id_unref (timer.clock_id);
+    g_return_val_if_reached (TRUE);
+  }
+  /* the gst_adaptive_demux_clock_callback will signal the
+     cond when the clock's single shot timer fires */
+  g_cond_wait (cond, mutex);

If the condition has been signalled by another thread (e.g. one wanting to stop the task) should gst_clock_id_unschedule() be called? Could the callback fire after this function has returned?

+  gst_clock_id_unref (timer.clock_id);
+  return !timer.fired;
+}

Comment 19 A Ashley 2016-04-29 09:46:19 UTC

(In reply to christophe vr from comment #17)
> Yes think something like that.
> For info the problem is only on mipsel32el boxes.
> The stb's who run with sh4 (same customized sink's and application) does not
> have this segfault.

Interesting, I'm also using mips (BCM7409, BCM7231) boxes and have not yet been bitten by this problem. Of course that's the problem with race conditions, they are very sensitive to the execution environment. At least it's readily reproducible for you, so we have at least one person who can test fixes!

Comment 20 A Ashley 2016-04-29 11:39:59 UTC

Created attachment 327009 [details] [review]
adaptivedemux: unschedule clock callback when tasks are ending

Looking at the code, the lack of an unschedule call for the async timer might be a cause of problems during finalize.

Christophe, are you be able to try this patch to see if it helps?

Comment 21 christophe vr 2016-04-29 12:43:12 UTC

Ok just tested and bingo works perfect and no segfaults anymore.

Comment 22 A Ashley 2016-04-29 13:40:24 UTC

(In reply to christophe vr from comment #21)
> Ok just tested and bingo works perfect and no segfaults anymore.

Great news

Hopefully Tim or Thiago can get this onto master.

Comment 23 Tim-Philipp Müller 2016-04-29 17:06:19 UTC

Comment on attachment 327009 [details] [review]
adaptivedemux: unschedule clock callback when tasks are ending

Nice catch. At first glance this whole function looks a bit fishy to me though.

There are three question I have:

1) as I understand it, g_cond_wait() may wake up spuriously without anyone having called g_cond_signal() on it, so g_cond_wait() should always be used in a loop really. Here we probably want to do { g_cond_wait(); } while (!timer.fired) or so, no?

2) This means we need to check another variable to see if we were cancelled and should break out of the loop.

3) It's not clear to me if we can use a stack-allocated GstAdaptiveDemuxTimer here. The clock callbacks will be called from a different thread, right? So when we call gst_clock_id_unschedule() as in the patch we have no guarantee that the timer callback is not currently running in the other thread (waiting for us to release the lock perhaps), so we might return from the function and trash our stack before the callback gets a chance to finish if I'm not mistaken.

Comment 24 christophe vr 2016-05-02 13:53:45 UTC

any news if the last patch will be applied ?

Comment 25 Tim-Philipp Müller 2016-05-02 14:01:28 UTC

Christophe, as you can see I have reviewed the patch and think it needs more changes. It will be applied as soon as we have consensus on what the right patch is.

Comment 26 A Ashley 2016-05-05 12:20:39 UTC

Created attachment 327345 [details] [review]
adaptivedemux: fix threading issues in gst_adaptive_demux_wait_until function

The gst_adaptive_demux_wait_until() function can be woken up either by its end_time being reached, or from other threads that want to interrupt the waiting thread.
    
If the thread is interrupted, it needs to cancel its async clock callback by unscheduling the clock callback. However, the callback task might already have been activated, but is waiting for the mutex to become available. In this case, the call to unschedule does not stop the callback from executing.
    
The solution to this second issue is to use a reference counted object that is decremented by both the gst_adaptive_demux_wait_until() function and the call to gst_clock_id_wait_async (). In this way, the GstAdaptiveDemuxTimer object is only deleted when both the gst_adaptive_demux_wait_until() function and the async callback are finished with the object.

Comment 27 A Ashley 2016-05-05 12:24:04 UTC

(In reply to Tim-Philipp Müller from comment #23)
> Comment on attachment 327009 [details] [review] [review]
> adaptivedemux: unschedule clock callback when tasks are ending
> 
> Nice catch. At first glance this whole function looks a bit fishy to me
> though.
> 
> There are three question I have:
> 
> 1) as I understand it, g_cond_wait() may wake up spuriously without anyone
> having called g_cond_signal() on it, so g_cond_wait() should always be used
> in a loop really. Here we probably want to do { g_cond_wait(); } while
> (!timer.fired) or so, no?
> 
> 2) This means we need to check another variable to see if we were cancelled
> and should break out of the loop.

The gst_adaptive_demux_wait_until() function is replacing a call to g_cond_wait(). Therefore that while() { } logic should already exist in the functions that call gst_adaptive_demux_wait_until().
> 
> 3) It's not clear to me if we can use a stack-allocated
> GstAdaptiveDemuxTimer here. The clock callbacks will be called from a
> different thread, right? So when we call gst_clock_id_unschedule() as in the
> patch we have no guarantee that the timer callback is not currently running
> in the other thread (waiting for us to release the lock perhaps), so we
> might return from the function and trash our stack before the callback gets
> a chance to finish if I'm not mistaken.

Yes, I think that is a potential problem. I've created a new patch that uses a reference counted object.

Comment 28 Tim-Philipp Müller 2016-05-05 12:28:55 UTC

> The gst_adaptive_demux_wait_until() function is replacing a call to
> g_cond_wait(). Therefore that while() { } logic should already exist in the
> functions that call gst_adaptive_demux_wait_until().

Good point, you're right.


> > [whether GstAdaptiveDemuxTimer can be stack-allocated]
>
> Yes, I think that is a potential problem. I've created a new
> patch that uses a reference counted object.

Great. Do you think you could make it a mini object instead of a GObject? Or simply a custom ref/unref with g_atomic_int_*() wrapping the freeing?

Comment 29 A Ashley 2016-05-05 12:51:55 UTC

> 
> Great. Do you think you could make it a mini object instead of a GObject? Or

My first thought was to use a mini object, but according to the documentation, you can only make it writable when the reference count is one. We need to have the callback modify the fired boolean. At that point the reference count is normally two.

> simply a custom ref/unref with g_atomic_int_*() wrapping the freeing?

True, a custom ref/unref could be used. Is that faster than using GObject's implementation of the same thing?

Comment 30 Tim-Philipp Müller 2016-05-05 13:00:10 UTC

I think we can simply ignore the writability semantics here and set the boolean to what we need, just as we do now, no?

A custom ref/unref is faster because it means we don't have to go through the type system. Not that I think this is going to be performance relevant in practice, but using a GObject feels a bit over the top somehow :)

Comment 31 A Ashley 2016-05-06 08:13:27 UTC

Created attachment 327367 [details] [review]
adaptivedemux: fix threading issues in gst_adaptive_demux_wait_until function

Replaced GObject with custom ref / unref functions.

Comment 32 Tim-Philipp Müller 2016-05-06 08:52:55 UTC

Thanks, pushed (with minor fixup of comment formatting, removing the gtk-doc marker for function and adding '*' for all lines in multi-line comment)

commit 792e9a60330d4559e119263400558b5be2f01715
Author: Alex Ashley <bugzilla@ashley-family.net>
Date:   Fri Apr 29 12:31:01 2016 +0100

    adaptivedemux: fix threading issues in gst_adaptive_demux_wait_until function
    
    The gst_adaptive_demux_wait_until() function can be woken up either
    by its end_time being reached, or from other threads that want to
    interrupt the waiting thread.
    
    If the thread is interrupted, it needs to cancel its async clock callback
    by unscheduling the clock callback. However, the callback task might already
    have been activated, but is waiting for the mutex to become available. In this
    case, the call to unschedule does not stop the callback from executing.
    
    The solution to this second issue is to use a reference counted object that
    is decremented by both the gst_adaptive_demux_wait_until() function and the
    call to gst_clock_id_wait_async (). In this way, the GstAdaptiveDemuxTimer
    object is only deleted when both the gst_adaptive_demux_wait_until() function
    and the async callback are finished with the object.
    
    https://bugzilla.gnome.org/show_bug.cgi?id=765728