GNOME Bugzilla – Bug 643087
pulsesink: deadlock in gst_pulseringbuffer_open_device
Last modified: 2011-03-08 15:33:09 UTC
I'm using decodebin2 and creating the audio sink when needed from the "new-decoded-pad" signal callback. Quite often I get a deadlock when two pipelines do this simultaneously. One thread is waiting in pulsesink.c:455: g_mutex_lock (pa_shared_resource_mutex); the other thread is waiting in pulsesink.c:511: pa_threaded_mainloop_wait (mainloop). In other words, the latter thread is holding pa_shared_resource_mutex, which it took further up in that function. Here's the full backtrace of the two deadlocked threads, using the git master of all repos:
+ Trace 226073
Thread 6 (Thread 0x7fffe6f5d700 (LWP 24602))
Thread 8 (Thread 0x7fffe5a8c700 (LWP 24604))
Don't be confused by the fact that my pa_shared_resource_mutex is spelled with one "s" less: I just cleaned up the "ressource" typo when I started debugging, and that's not what's causing this deadlock.
It looks very odd that even though thread 8 must have called pa_threaded_mainloop_lock (mainloop), thread 6 was also able to get that lock and instead got stuck waiting in g_mutex_lock (pa_shared_resource_mutex). Perhaps there's a bug in PulseAudio. A speculative fix might be to reverse the order in which the two locks are taken, which would hide such a bug. Will try that.
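For readers without pulsesink.c at hand, here is a stripped-down, hypothetical sketch of the locking sequence in gst_pulseringbuffer_open_device () as described above, based on this report and the diff quoted further down. The names mirror the real code; everything else is simplified, and the wait call is commented out because nothing would signal it here.

/* Build (assuming GLib and libpulse development files are installed):
 *   gcc open_device_order.c $(pkg-config --cflags --libs glib-2.0 libpulse)
 */
#include <glib.h>
#include <pulse/pulseaudio.h>

static GMutex resource_mutex_storage;
static GMutex *pa_shared_resource_mutex = &resource_mutex_storage; /* shared by all instances */
static pa_threaded_mainloop *mainloop;                             /* also shared */

static void
open_device_sketch (void)
{
  pa_threaded_mainloop_lock (mainloop);    /* taken first                                    */
  g_mutex_lock (pa_shared_resource_mutex); /* taken second (around pulsesink.c:455 above)    */

  /* ... context/stream setup elided ... */
  /* pa_threaded_mainloop_wait (mainloop); */ /* where the other thread sits (around :511)   */

  g_mutex_unlock (pa_shared_resource_mutex);
  pa_threaded_mainloop_unlock (mainloop);
}

int
main (void)
{
  mainloop = pa_threaded_mainloop_new ();
  pa_threaded_mainloop_start (mainloop);

  open_device_sketch ();

  pa_threaded_mainloop_stop (mainloop);
  pa_threaded_mainloop_free (mainloop);
  return 0;
}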
Just a guess, but maybe the lock order is different elsewhere, and that's what causes this deadlock?
In gst_pulsesink_change_state, pa_shared_resource_mutex is taken before pa_threaded_mainloop_new (obviously) and pa_threaded_mainloop_start, but I'm not seeing any other thread in the GStreamer pulsesink code at the time of the deadlock, so I doubt that's the problem. There is this thread, though:
+ Trace 226076
Thread 12 (Thread 0x7fffde486700 (LWP 24749))
I see that when _close_device calls _destroy_context, the locks are taken in the same order as in _open_device, so reversing the order in _open_device, as I had planned, without also changing _close_device would just create another deadlock instead.
Created attachment 181725 [details] [review] s/ressource/resource/
Created attachment 181726 [details] [review] experimental fix (introduces another deadlock, probably)
I installed libpulse0-dbg and have some further information. First, a full backtrace of all the threads (except Opera's main thread):

(gdb) thread apply all bt
+ Trace 226090
Thread 8 (Thread 0x7fffe5938700 (LWP 8054))
Thread 6 (Thread 0x7fffe6f5d700 (LWP 8052))
Now, threads 6 and 8 are still the interesting ones. Thread 6 is waiting in pa_threaded_mainloop_wait (m=0x7fffe81a6aa0) at pulse/thread-mainloop.c:212, that is, in pa_cond_wait(m->cond, m->mutex). That m->mutex is held by thread 8. How did that happen? It turns out it's not that mysterious: pa_threaded_mainloop_wait waits on a condition variable, which of course releases the lock! The documentation says so quite clearly: "While waiting the lock will be released, immediately before returning it will be acquired again." So this is a pretty basic deadlock, albeit a bit racy. The fix must be to untangle the use of pa_shared_resource_mutex and pa_threaded_mainloop_lock/unlock.
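To make the "wait releases the lock" point concrete, here is a tiny stand-alone GLib sketch (an analogy, not pulsesink code): the mutex and condition variable play the roles of m->mutex and m->cond, and while the waiter sits in g_cond_wait() the main thread can take the mutex, which is exactly the situation the backtraces show.

/* gcc wait_releases_lock.c $(pkg-config --cflags --libs glib-2.0) */
#include <glib.h>
#include <stdio.h>

static GMutex mutex;     /* stands in for m->mutex */
static GCond cond;       /* stands in for m->cond  */
static gboolean done;

static gpointer
waiter (gpointer data)
{
  g_mutex_lock (&mutex);
  while (!done)
    g_cond_wait (&cond, &mutex);   /* releases 'mutex' while blocked */
  g_mutex_unlock (&mutex);
  return NULL;
}

int
main (void)
{
  GThread *t = g_thread_new ("waiter", waiter, NULL);

  g_usleep (G_USEC_PER_SEC / 10);  /* give the waiter time to block in g_cond_wait() */

  g_mutex_lock (&mutex);           /* succeeds: the waiter does NOT hold it while waiting */
  printf ("main thread got the mutex while the waiter is in g_cond_wait()\n");
  done = TRUE;
  g_cond_signal (&cond);           /* the role a context state callback would play */
  g_mutex_unlock (&mutex);

  g_thread_join (t);
  return 0;
}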
Created attachment 181813 [details] [review] naive patch

Here's a patch that should fix the problem and is probably safe, but I have another one that is probably better but needs careful review.
Created attachment 181814 [details] [review] candidate fix

This patch is the one I'm now using and with it I can't reproduce the deadlock.
Finally, there are some things I find odd about the code, here in the form of comments in a diff (after "s/ressource/resource/" and "candidate fix" have been applied):

diff --git a/ext/pulse/pulsesink.c b/ext/pulse/pulsesink.c
index c45acb9..ddce402 100644
--- a/ext/pulse/pulsesink.c
+++ b/ext/pulse/pulsesink.c
@@ -337,6 +337,7 @@ gst_pulseringbuffer_finalize (GObject * object)
   ringbuffer = GST_PULSERING_BUFFER_CAST (object);
 
+  /* why is this not protected by the mainloop lock? */
   gst_pulsering_destroy_context (ringbuffer);
 
   G_OBJECT_CLASS (ring_parent_class)->finalize (object);
 }
@@ -451,6 +452,8 @@ gst_pulseringbuffer_open_device (GstRingBuffer * buf)
   else
     pbuf->context_name = g_strdup (psink->client_name);
 
+  /* is it really necessary to take this lock here rather than just
+     after the g_mutex_unlock (pa_shared_resource_mutex) below? */
   pa_threaded_mainloop_lock (mainloop);
 
   g_mutex_lock (pa_shared_resource_mutex);
(In reply to comment #12)
>   ringbuffer = GST_PULSERING_BUFFER_CAST (object);
> 
> +  /* why is this not protected by the mainloop lock? */
>   gst_pulsering_destroy_context (ringbuffer);
> 
>   G_OBJECT_CLASS (ring_parent_class)->finalize (object);

Because the mainloop should not be running anymore at this point; everything should already have been cleaned up when going back to NULL state.

> @@ -451,6 +452,8 @@ gst_pulseringbuffer_open_device (GstRingBuffer * buf)
>   else
>     pbuf->context_name = g_strdup (psink->client_name);
> 
> +  /* is it really necessary to take this lock here rather than just
> +     after the g_mutex_unlock (pa_shared_resource_mutex) below? */
>   pa_threaded_mainloop_lock (mainloop);
> 
>   g_mutex_lock (pa_shared_resource_mutex);

I think it is, because it might create a new pulse context with that mainloop if none exists already.
(In reply to comment #13)
> (In reply to comment #12)
> > @@ -451,6 +452,8 @@ gst_pulseringbuffer_open_device (GstRingBuffer * buf)
> >   else
> >     pbuf->context_name = g_strdup (psink->client_name);
> > 
> > +  /* is it really necessary to take this lock here rather than just
> > +     after the g_mutex_unlock (pa_shared_resource_mutex) below? */
> >   pa_threaded_mainloop_lock (mainloop);
> > 
> >   g_mutex_lock (pa_shared_resource_mutex);
> 
> I think it is, because it might create a new pulse context with that mainloop
> if none exists already.

The documentation for pa_context_new and pa_threaded_mainloop_get_api doesn't say that a lock is needed, but I guess there's no harm.
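For reference, the usual pattern in a threaded-mainloop PulseAudio client is to do the pa_threaded_mainloop_get_api()/pa_context_new() calls with the mainloop lock held and to synchronize through the state callback plus pa_threaded_mainloop_signal()/_wait(). Here is a self-contained sketch of that pattern, independent of pulsesink; the program structure and the "lock-order-test" name are made up for illustration, and it needs a running PulseAudio daemon to actually reach PA_CONTEXT_READY.

/* gcc context_under_lock.c $(pkg-config --cflags --libs libpulse) */
#include <pulse/pulseaudio.h>
#include <stdio.h>

static void
context_state_cb (pa_context *c, void *userdata)
{
  pa_threaded_mainloop *m = userdata;
  /* Runs in the mainloop thread; wake up whoever is in pa_threaded_mainloop_wait() */
  pa_threaded_mainloop_signal (m, 0);
}

int
main (void)
{
  pa_threaded_mainloop *m = pa_threaded_mainloop_new ();
  pa_context *ctx;

  pa_threaded_mainloop_start (m);

  pa_threaded_mainloop_lock (m);

  /* Context creation done under the mainloop lock, as pulsesink does */
  ctx = pa_context_new (pa_threaded_mainloop_get_api (m), "lock-order-test");
  pa_context_set_state_callback (ctx, context_state_cb, m);

  if (pa_context_connect (ctx, NULL, PA_CONTEXT_NOFLAGS, NULL) < 0) {
    fprintf (stderr, "pa_context_connect() failed\n");
  } else {
    for (;;) {
      pa_context_state_t state = pa_context_get_state (ctx);
      if (state == PA_CONTEXT_READY) {
        printf ("context is ready\n");
        break;
      }
      if (state == PA_CONTEXT_FAILED || state == PA_CONTEXT_TERMINATED) {
        fprintf (stderr, "context failed or terminated\n");
        break;
      }
      /* Releases the mainloop lock while waiting, reacquires it before returning */
      pa_threaded_mainloop_wait (m);
    }
  }

  pa_context_disconnect (ctx);
  pa_context_unref (ctx);
  pa_threaded_mainloop_unlock (m);

  pa_threaded_mainloop_stop (m);
  pa_threaded_mainloop_free (m);
  return 0;
}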
Created attachment 181825 [details] [review] pulsesink: Avoid deadlock in open_device()

Make sure that we have the resource lock before locking the mainloop to avoid a deadlock when there are two instances opening the device concurrently.
From what I can tell, the deadlock is caused by the following sequence:

1. Thread 1 takes the mainloop lock followed by the resource lock in open_device(), sets up a context, and calls wait()
2. In the meantime, thread 2 enters open_device(), takes the mainloop lock (which is a recursive lock, and holding it blocks any events from being dispatched), and then blocks on the resource lock
3. Since thread 2 is sleeping with the mainloop lock held, the context state callback is never dispatched, so it cannot signal thread 1 to wake up (see the sketch below)

Philip, could you try the patch I've attached to see if the deadlock goes away?
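To make these three steps concrete outside of GStreamer, here is a deliberately simplified, hypothetical GLib analogy: mainloop_lock stands for the shared pa_threaded_mainloop lock, resource_lock for pa_shared_resource_mutex, and a third thread plays the role of the PulseAudio mainloop thread that would dispatch the context state callback. The real pa_threaded_mainloop_wait() waits on the mainloop's own mutex; here the wait is modelled with a separate mutex/cond pair, but the deadlock cycle is the same. The program intentionally gets stuck and just reports that before exiting.

/* gcc three_way_deadlock.c $(pkg-config --cflags --libs glib-2.0) */
#include <glib.h>
#include <stdio.h>
#include <stdlib.h>

static GRecMutex mainloop_lock;     /* the (recursive) mainloop lock     */
static GMutex resource_lock;        /* pa_shared_resource_mutex          */
static GMutex wait_mutex;           /* backs the fake "mainloop wait"    */
static GCond wait_cond;
static gboolean context_ready;

/* Thread 1: open_device() on the first pulsesink */
static gpointer thread1 (gpointer data)
{
  g_rec_mutex_lock (&mainloop_lock);
  g_mutex_lock (&resource_lock);

  /* "pa_threaded_mainloop_wait": drop the mainloop lock, sleep until signalled */
  g_rec_mutex_unlock (&mainloop_lock);
  g_mutex_lock (&wait_mutex);
  while (!context_ready)
    g_cond_wait (&wait_cond, &wait_mutex);  /* never signalled: stuck here */
  g_mutex_unlock (&wait_mutex);

  g_mutex_unlock (&resource_lock);
  return NULL;
}

/* Thread 2: open_device() on the second pulsesink */
static gpointer thread2 (gpointer data)
{
  g_rec_mutex_lock (&mainloop_lock);  /* fine: thread 1 released it in "wait"  */
  g_mutex_lock (&resource_lock);      /* blocks: thread 1 still holds it       */
  g_mutex_unlock (&resource_lock);
  g_rec_mutex_unlock (&mainloop_lock);
  return NULL;
}

/* "Mainloop" thread: would dispatch the context state callback */
static gpointer mainloop_thread (gpointer data)
{
  g_rec_mutex_lock (&mainloop_lock);  /* blocks: thread 2 holds it */
  g_mutex_lock (&wait_mutex);
  context_ready = TRUE;               /* never reached */
  g_cond_signal (&wait_cond);
  g_mutex_unlock (&wait_mutex);
  g_rec_mutex_unlock (&mainloop_lock);
  return NULL;
}

int
main (void)
{
  /* The sleeps just make the described interleaving very likely */
  g_thread_new ("t1", thread1, NULL);
  g_usleep (100 * 1000);
  g_thread_new ("t2", thread2, NULL);
  g_usleep (100 * 1000);
  g_thread_new ("mainloop", mainloop_thread, NULL);

  g_usleep (G_USEC_PER_SEC);
  printf ("still stuck after 1s: t1 waits for the signal, t2 waits for the "
          "resource lock, the mainloop thread waits for the mainloop lock\n");
  exit (0);  /* give up; joining any of the threads would block forever */
}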
Arun, what you describe is correct. Your patch is almost identical to my "experimental fix" in comment #7 <https://bugzilla.gnome.org/show_bug.cgi?id=643087#c7>, on which I commented "introduces another deadlock, probably". The problem is that your code takes the locks in the order "resource lock, mainloop lock", while there's other code that takes them in the reverse order, at least gst_pulseringbuffer_close_device and gst_pulseringbuffer_release + gst_pulsering_destroy_stream. In other words, that fix will introduce another deadlock, which will trigger when pulsesinks are created and destroyed at the same time, and that will eventually happen.
In the interest of figuring out how widespread this deadlock is, I found that the problem was probably introduced by <http://cgit.freedesktop.org/gstreamer/gst-plugins-good/commit/?id=ae8d210fdbad6bd79824bb57c7356f4e10e7c5b5>, a refactoring done by Stefan Kost. I tried CC'ing Stefan but don't appear to have the privileges to do that. Could someone add him? He might have some insight into what exactly the two locks are supposed to protect...
Philip, I am reading bug email anyway. That commit was a fixup of previous patches made by your colleague Phil :) As you can see, the patch does not change the lock order. Are you saying that everything works if you revert the patch?
Stefan, I haven't tried reverting to before your commit; it's just that before it, each pulsesink had its own pa_threaded_mainloop, so the situation where one thread locks the shared pa_threaded_mainloop while another is in pa_threaded_mainloop_wait simply couldn't happen. Anyway, I just thought you might have a better understanding of what exactly each lock is supposed to protect, if you wanted to comment.
To clarify, the deadlock can only be reproduced when you have at least two pulsesink elements. The reason that I can easily reproduce it is because I have a HTML test case with two <video> elements using the same slow-loading data source. This creates two GStreamer pipelines which are fed with the same data at (almost) the same time, so the two pulsesink elements will be created at (almost) the same time.
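For anyone who wants to try reproducing this without a browser, something along the following lines should exercise the same code path: two pipelines, each ending in its own pulsesink, brought to PLAYING from two threads at (almost) the same time. This is a hypothetical stand-alone sketch against the GStreamer 0.10-era API, not the actual test case described above.

/* gcc two_pulsesinks.c $(pkg-config --cflags --libs gstreamer-0.10) */
#include <gst/gst.h>

static gpointer
start_pipeline (gpointer data)
{
  /* audiotestsrc keeps the sketch self-contained; the original test case uses
   * decodebin2 with real media, but anything ending in pulsesink goes through
   * gst_pulseringbuffer_open_device () */
  GstElement *pipeline = gst_parse_launch ("audiotestsrc ! pulsesink", NULL);

  gst_element_set_state (pipeline, GST_STATE_PLAYING);
  g_usleep (2 * G_USEC_PER_SEC);
  gst_element_set_state (pipeline, GST_STATE_NULL);
  gst_object_unref (pipeline);
  return NULL;
}

int
main (int argc, char **argv)
{
  GThread *t1, *t2;

  if (!g_thread_supported ())
    g_thread_init (NULL);
  gst_init (&argc, &argv);

  /* Bring both pulsesinks up as close to simultaneously as possible; run it in
   * a loop if the race does not hit on the first attempt */
  t1 = g_thread_create (start_pipeline, NULL, TRUE, NULL);
  t2 = g_thread_create (start_pipeline, NULL, TRUE, NULL);

  g_thread_join (t1);
  g_thread_join (t2);
  return 0;
}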
(In reply to comment #21) > To clarify, the deadlock can only be reproduced when you have at least two > pulsesink elements. The reason that I can easily reproduce it is because I have > a HTML test case with two <video> elements using the same slow-loading data > source. This creates two GStreamer pipelines which are fed with the same data > at (almost) the same time, so the two pulsesink elements will be created at > (almost) the same time. Yes, I understand that. But the pa_threaded_mainloop was in the *class* structure before. That means all instances where already sharing the same instance. The chain of chainged that lead to the current setup started with the patch http://cgit.freedesktop.org/gstreamer/gst-plugins-good/commit/?id=69a397c32f4baf07a7b2937c610f9e8f383e9ae9 in bug #624338 I was fixing that with http://cgit.freedesktop.org/gstreamer/gst-plugins-good/commit/?id=f62dc6976b611384c98efb37d407b5299daf8c17 some discussion in bug #628996
(In reply to comment #22)
> (In reply to comment #21)
> > To clarify, the deadlock can only be reproduced when you have at least two
> > pulsesink elements. The reason that I can easily reproduce it is because I
> > have a HTML test case with two <video> elements using the same slow-loading
> > data source. This creates two GStreamer pipelines which are fed with the
> > same data at (almost) the same time, so the two pulsesink elements will be
> > created at (almost) the same time.
> 
> Yes, I understand that. But the pa_threaded_mainloop was in the *class*
> structure before. That means all instances were already sharing the same
> instance.

Oh, I wasn't paying attention; I just saw psink->mainloop and assumed that there was one mainloop per instance before.

> The chain of changes that led to the current setup started with the patch
> http://cgit.freedesktop.org/gstreamer/gst-plugins-good/commit/?id=69a397c32f4baf07a7b2937c610f9e8f383e9ae9
> in bug #624338
> 
> I was fixing that with
> http://cgit.freedesktop.org/gstreamer/gst-plugins-good/commit/?id=f62dc6976b611384c98efb37d407b5299daf8c17
> (some discussion in bug #628996)

OK, so it was probably introduced there, which means it has been present since release 0.10.26. Anyway, the problem seems mostly clear; I'm hoping for a code review of my "candidate fix" patch.
Comment on attachment 181825 [details] [review] pulsesink: Avoid deadlock in open_device()

(patch is incorrect, as pointed out)
Comment on attachment 181813 [details] [review] naive patch

Obsoleting "naive patch"; I don't really think we should solve the problem like this.
After looking over the code again, the "candidate fix" looks good. Pushed both your patches - thanks.