GNOME Bugzilla – Bug 767879
Deadlock in WebKit when deleting the media player while HLS media was loaded
Last modified: 2016-07-25 14:35:02 UTC
It happens when the media player is destroyed by the garbacge collector and pipeline is set to NULL state. This always happens in the main thread, since the garbage colector frees the objects in the main thread. The situation is the following one: Main Thread: - blocked on GST_PAD_STREAM_LOCK (post_activate() gstpad.c) Secondary Thread: - blocked on GST_MANIFEST_LOCK (gst_adaptive_demux_src_event() reconfigure gstadaptivedemux.c) - has EXPOSE_LOCK of gstdecodebin2 taken from source_pad_blocked_cb() Secondary Thread: - blocked on EXPOSE_LOCK (no_more_pads_cb() gstdecodebin2.c) - it seems to have GST_PAD_STREAM_LOCK of gstpad taken from gst_pad_send_event_unchecked() - has GST_MANIFEST_LOCK of gstadaptivedemux taken from _src_event, got fragment EOS Secondary Thread: - infinite loop waiting for fragment download to finish (gst_adaptive_demux_stream_download_uri() I haven't been able to fix or workaround this from WebKit code, the only change the seemed to work was releasing the manifest lock in gst_adaptive_demux_expose_streams before calling gst_element_no_more_pads() and taking it again. But I have no idea if the change is correct or just a workaround, the gst code is full of locks and conditions that is very difficult to follow.
bts of relevant threads
+ Trace 236369
Thread 18 (Thread 0x7f07433fb700 (LWP 16767))
Thread 11 (Thread 0x7f073c9fd700 (LWP 16774))
Created attachment 330093 [details] [review] gstadaptivedemux: Release manifest lock before calling gst_element_no_more_pads()
Comment on attachment 330093 [details] [review] gstadaptivedemux: Release manifest lock before calling gst_element_no_more_pads() This needs to be carefully checked if we can just release the lock there, and if we might have to check again after taking the lock if things have changed. There is also the problem that subclasses might expect it to be not released when advancing fragments, etc.
(In reply to Sebastian Dröge (slomo) from comment #3) > Comment on attachment 330093 [details] [review] [review] > gstadaptivedemux: Release manifest lock before calling > gst_element_no_more_pads() > > This needs to be carefully checked if we can just release the lock there, > and if we might have to check again after taking the lock if things have > changed. There is also the problem that subclasses might expect it to be not > released when advancing fragments, etc. Yes, I really have no idea what I'm doing, TBH. So, I was just expecting you to help me to properly fix these deadlocks, because they are making popular sites like facebook or twitter unusable with WebKit.
It might be better or at least safer to defer the signal until a later point in one of the callers of the function. Also the same problem probably applies to pad-added. OTOH you can't really set the state of any element to NULL from no-more-pads. Is that what you're doing? I'm surprised that doesn't deadlock from other elements too :)
(In reply to Sebastian Dröge (slomo) from comment #5) > It might be better or at least safer to defer the signal until a later point > in one of the callers of the function. Also the same problem probably > applies to pad-added. What signal and function are you talking about exactly? > OTOH you can't really set the state of any element to NULL from > no-more-pads. Is that what you're doing? I'm surprised that doesn't deadlock > from other elements too :) No, we are setting the pipeline to NULL right before deleting it, no more pads happens in a different thread: Secondary Thread: - blocked on EXPOSE_LOCK (no_more_pads_cb() gstdecodebin2.c) - it seems to have GST_PAD_STREAM_LOCK of gstpad taken from gst_pad_send_event_unchecked() - has GST_MANIFEST_LOCK of gstadaptivedemux taken from _src_event, got fragment EOS See Thread 18 in the bt I submitted. Setting the pipeline to NULL is blocking on GST_PAD_STREAM_LOCK in post_activate() gstpad.c
Ok, so the problem is that there is: Thread 11: streaming thread of video decoder, pad probe in decodebin, having decodebin locks, linking things, causing reconfigure event, reconfigure event in adaptivedemux causes manifest lock Thread 18: streaming thread of one of the internal souphttpsrc, manifest lock, no-more-pads signal, decodebin no-more-pads handler, decodebin locks So in any case we need to find a way of moving the no-more-pads signal emission outside of the lock in a safe way.
(In reply to Sebastian Dröge (slomo) from comment #7) > Ok, so the problem is that there is: > > Thread 11: streaming thread of video decoder, pad probe in decodebin, having > decodebin locks, linking things, causing reconfigure event, reconfigure > event in adaptivedemux causes manifest lock > > Thread 18: streaming thread of one of the internal souphttpsrc, manifest > lock, no-more-pads signal, decodebin no-more-pads handler, decodebin locks Yes, and then main thread blocked on post_activate because the pad streams lock is never released. > > So in any case we need to find a way of moving the no-more-pads signal > emission outside of the lock in a safe way. My first attempt to fix this was moving the no more pads emission after the cancellation of the threads, but that didn't work and caused even more blocks. I want to help here, but I don't know gst enough to know how to properly fix this.
(In reply to Carlos Garcia Campos from comment #8) > (In reply to Sebastian Dröge (slomo) from comment #7) > > Ok, so the problem is that there is: > > > > Thread 11: streaming thread of video decoder, pad probe in decodebin, having > > decodebin locks, linking things, causing reconfigure event, reconfigure > > event in adaptivedemux causes manifest lock > > > > Thread 18: streaming thread of one of the internal souphttpsrc, manifest > > lock, no-more-pads signal, decodebin no-more-pads handler, decodebin locks > > Yes, and then main thread blocked on post_activate because the pad streams > lock is never released. Yes, that's expected if there is a deadlock in the streaming threads elsewhere. The main thread can be ignored here, fortunately the problem is only between two threads and not three :) > > > > So in any case we need to find a way of moving the no-more-pads signal > > emission outside of the lock in a safe way. > > My first attempt to fix this was moving the no more pads emission after the > cancellation of the threads, but that didn't work and caused even more > blocks. I want to help here, but I don't know gst enough to know how to > properly fix this. This will need some more thorough analysis of the locking in adaptivedemux unfortunately. Not sure when I'll have time to go through that... Maybe your patch is good, maybe not :) The problem with moving no-more-pads up to the callers is also that there are a few callers, and the caller is already holding the manifest lock.
This is probably fixed by https://cgit.freedesktop.org/gstreamer/gst-plugins-bad/commit/?id=03f96dd7319da1cb6e190f05cd37464404b09b1c Can you test and confirm? Please reopen if it's not fixed, thanks :)
(In reply to Sebastian Dröge (slomo) from comment #10) > This is probably fixed by > https://cgit.freedesktop.org/gstreamer/gst-plugins-bad/commit/ > ?id=03f96dd7319da1cb6e190f05cd37464404b09b1c > > Can you test and confirm? Please reopen if it's not fixed, thanks :) Cool, thanks! I'll do it.
(In reply to Carlos Garcia Campos from comment #11) > (In reply to Sebastian Dröge (slomo) from comment #10) > > This is probably fixed by > > https://cgit.freedesktop.org/gstreamer/gst-plugins-bad/commit/ > > ?id=03f96dd7319da1cb6e190f05cd37464404b09b1c > > > > Can you test and confirm? Please reopen if it's not fixed, thanks :) > > Cool, thanks! I'll do it. If it works, then it'd be great for WebKit if this one could be backported too.