GNOME Bugzilla – Bug 766064
basesink: deadlock waiting prerolling while pipeline goes to NULL state
Last modified: 2016-06-23 15:06:40 UTC
Hello, there is a deadlock when a pipeline goes to the NULL state while BaseSink is waiting to preroll. There are 2 threads:

streaming_thread: managed by the task of the src pad of a BaseSrc (in our case a NiceSrc)
app_thread: performs the state change of the pipeline to NULL (gst_element_set_state (pipeline, GST_STATE_NULL);)

The deadlock takes place with this race condition:

1 - [app_thread] When the pipeline is going to NULL, the BaseSink (in our case a FakeSink) is in a state in which it needs to preroll (see [gdb-fakesink_state.txt], where you can see the state).
2 - [streaming_thread] A buffer arrives at the BaseSink.
2.1 - This causes a preroll wait [1] (see [gdb-wait_preroll.txt]).
2.2 - Here the streaming_thread is blocked, and so are all upstream pads [2].
3 - [app_thread] tries to change the state of one of the pads blocked in step (2).
3.1 - It is blocked in [3] because the mutex is locked by the streaming_thread (see [gdb-activate_mode_internal.txt]).

Refs
[1] https://github.com/Kurento/gstreamer/blob/0fb3a083ce04551fbdba7a94f0ac5612515bda67/libs/gst/base/gstbasesink.c#L2212
[2] https://github.com/Kurento/gstreamer/blob/0fb3a083ce04551fbdba7a94f0ac5612515bda67/gst/gstpad.c#L4125
[3] https://github.com/Kurento/gstreamer/blob/0fb3a083ce04551fbdba7a94f0ac5612515bda67/gst/gstpad.c#L1012
Created attachment 327396 [details] GDB info about the state of the BaseSink
Created attachment 327397 [details] GDB trace where the streaming_thread is blocked
Created attachment 327398 [details] GDB trace where the app_thread is blocked
Did you implement the unlock() virtual method properly?
Hello @stomer, I don't see that FakeSink implements the unlock() virtual method [1]. Should FakeSink implement this method? If so, how?

Refs
[1] https://github.com/Kurento/gstreamer/blob/0fb3a083ce04551fbdba7a94f0ac5612515bda67/plugins/elements/gstfakesink.c
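For reference, this is roughly what an unlock()/unlock_stop() pair looks like in a sink that blocks in its own code. This is only a minimal sketch with hypothetical names (MySink and its cond/flushing fields), not FakeSink code; FakeSink only blocks inside basesink itself (see below), and basesink already knows how to interrupt its own preroll wait, so FakeSink does not need to override these vfuncs.

#include <gst/base/gstbasesink.h>

typedef struct {
  GstBaseSink parent;
  GMutex lock;
  GCond cond;
  gboolean flushing;       /* set by unlock(), cleared by unlock_stop() */
  gboolean data_consumed;  /* hypothetical condition render() waits for */
} MySink;

typedef struct {
  GstBaseSinkClass parent_class;
} MySinkClass;

G_DEFINE_TYPE (MySink, my_sink, GST_TYPE_BASE_SINK);

/* Called by basesink when the streaming thread must stop blocking,
 * e.g. on flush or when shutting the sink down. */
static gboolean
my_sink_unlock (GstBaseSink * bsink)
{
  MySink *self = (MySink *) bsink;

  g_mutex_lock (&self->lock);
  self->flushing = TRUE;
  g_cond_broadcast (&self->cond);   /* wake up render() */
  g_mutex_unlock (&self->lock);
  return TRUE;
}

/* Called when the unlock condition should be cleared again. */
static gboolean
my_sink_unlock_stop (GstBaseSink * bsink)
{
  MySink *self = (MySink *) bsink;

  g_mutex_lock (&self->lock);
  self->flushing = FALSE;
  g_mutex_unlock (&self->lock);
  return TRUE;
}

static GstFlowReturn
my_sink_render (GstBaseSink * bsink, GstBuffer * buffer)
{
  MySink *self = (MySink *) bsink;

  g_mutex_lock (&self->lock);
  /* Some other (hypothetical) thread would set data_consumed and signal
   * the cond; unlock() breaks the wait by setting flushing instead. */
  while (!self->data_consumed && !self->flushing)
    g_cond_wait (&self->cond, &self->lock);
  if (self->flushing) {
    g_mutex_unlock (&self->lock);
    return GST_FLOW_FLUSHING;       /* propagate shutdown upstream */
  }
  g_mutex_unlock (&self->lock);
  return GST_FLOW_OK;
}

static void
my_sink_class_init (MySinkClass * klass)
{
  GstBaseSinkClass *bsink_class = (GstBaseSinkClass *) klass;

  bsink_class->render = my_sink_render;
  bsink_class->unlock = my_sink_unlock;
  bsink_class->unlock_stop = my_sink_unlock_stop;
}

static void
my_sink_init (MySink * self)
{
  g_mutex_init (&self->lock);
  g_cond_init (&self->cond);
}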
Could you distill this into a little unit test (in tests/check/libs/basesink.c) that demonstrates the problem?
Fakesink only blocks within basesink. I cannot reproduce the issue with your instructions. Can you provide a test that reproduces this (as Tim also proposed)? Looking at your trace, your deadlock occurs while deactivating a ghostpad, which indicates that you may be doing a complex pipeline manipulation, in which case it is likely that the issue comes from something else, which may or may not be related to the base class. In general, it's better to report the bug itself rather than an explanation for that bug, unless you have a patch and trust the solution. For backtraces, you should also provide a complete backtrace instead of only the parts that you think mean something.
Created attachment 328634 [details] Test that reproduces the deadlock
Created attachment 328635 [details] GDB info threads of the deadlock test
Any idea about how we can deal with this?
The state change should've first shut down the fakesink, then deactivated the fakesink sink pad, and only then gone on to deactivate the ghostpad. The first steps would've unblocked wait_preroll() (and returned FLUSHING), so that the stream lock is unlocked all the way upstream. It's expected that you will deadlock if you just shut down elements upstream while downstream their streaming threads are waiting for something. Something first has to unlock downstream, or you have to wait until downstream is finished waiting for whatever it is waiting for.
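To illustrate why the FLUSHING return matters upstream, here is a small sketch with a hypothetical loop function (not code from basesrc or nicesrc): once the sink is unblocked, the push returns GST_FLOW_FLUSHING and the streaming task can pause, which releases the stream lock so the pending deactivation can proceed.

#include <gst/gst.h>

/* Hypothetical task/loop function of an upstream element's src pad. */
static void
my_src_loop (GstPad * srcpad)
{
  GstBuffer *buf = gst_buffer_new_allocate (NULL, 1024, NULL);
  GstFlowReturn ret;

  /* While the sink is waiting to preroll, this call blocks and the
   * stream lock stays held by the streaming thread. */
  ret = gst_pad_push (srcpad, buf);

  if (ret == GST_FLOW_FLUSHING || ret == GST_FLOW_EOS) {
    /* Downstream was unblocked and is shutting down or flushing:
     * pause the task so the stream lock is released and the pending
     * state change / pad deactivation can finish. */
    gst_pad_pause_task (srcpad);
  }
}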
@slomo, I agree with your explanation, and this is what does NOT happen when gst_element_set_state (pipeline, GST_STATE_NULL) is called; because of that, the deadlock ;(. The test is a simple way of simulating the situation.

If I am not wrong, the root of the problem is that when the pipeline is set to NULL:
1 - All elements of the pipeline are set to PAUSED
2 - All elements of the pipeline are set to READY
3 - All elements of the pipeline are set to NULL

and in step (1) the pipeline deadlocks. So, what can we do to fix this issue? Is there a way of checking that the BaseSink is going to the NULL state so that it does not wait for prerolling?
Your problem is that you don't set the pipeline to NULL state, but some element that is upstream of a sink, while its streaming thread is busy inside the sink. This can't work and will block the state change until something shuts down the sink (or flushes it). If you set the pipeline to NULL state it should work just fine, as the state changes happen from sink to source. It all seems to be behaving like it should, no? If you want to shut down just the upstream part of a pipeline, you need to ensure that none of the streaming threads of that part are blocked somewhere downstream.
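A minimal sketch of that ordering, assuming a hypothetical layout where an upstream bin feeds a separate sink bin inside the same pipeline (the names are illustrative, not from the attached test): shut down or unblock the downstream part first, and only then change the state of the upstream part, so the state change cannot deadlock on the stream lock.

#include <gst/gst.h>

/* Shut down only part of a running pipeline without deadlocking:
 * downstream first, upstream second. */
static void
shutdown_branch (GstElement * pipeline, GstElement * upstream_bin,
    GstElement * sink_bin)
{
  /* Taking the sink part to NULL makes basesink leave wait_preroll()
   * with GST_FLOW_FLUSHING, which releases the stream lock held by the
   * streaming thread all the way upstream. */
  gst_element_set_state (sink_bin, GST_STATE_NULL);

  /* Now no streaming thread of the upstream part is blocked downstream,
   * so this state change can complete. */
  gst_element_set_state (upstream_bin, GST_STATE_NULL);

  gst_bin_remove (GST_BIN (pipeline), upstream_bin);
  gst_bin_remove (GST_BIN (pipeline), sink_bin);
}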
(In reply to Sebastian Dröge (slomo) from comment #13)
> Your problem is that you don't set the pipeline to NULL state, but some
> element that is upstream of a sink while its streaming thread is busy inside
> the sink. This can't work and will block the state change until something is
> shutting down the sink (or flushing it).

No, in the real case I am setting the entire pipeline to NULL (the test is an example that forces the situation in an easy way; in the attached GDB output you can see that the deadlock is the same in both cases).

> If you set the pipeline to NULL state it should work just fine, as the state
> changes are happening from sink to source.

Here is the problem: it happens when all elements of the pipeline are set to the PAUSED state. Take a look at this race condition:

1 - [app_thread] When the pipeline is going to NULL, the BaseSink (in our case a FakeSink) is set to READY, so it needs to preroll.
2 - [streaming_thread] A buffer arrives at the BaseSink.
2.1 - This causes a preroll wait [1].
2.2 - Here the streaming_thread is blocked, and so are all upstream pads.
3 - [app_thread] tries to change the state to PAUSED of the bin that contains the ProxyPad deadlocked in (2.2).
3.1 - It is blocked in [3] because the mutex is locked by the streaming_thread.

Reviewing the GDB traces I have noticed that the transition to be set is PAUSED_TO_READY [1]. This should have been PLAYING_TO_PAUSED, shouldn't it? Can the problem be here?

Refs
[1]
+ Trace 236304
Can you provide a testcase that replicates the real situation?
Created attachment 329313 [details] Test that reproduces the deadlock (changing the whole pipeline state)
Created attachment 329314 [details] [review] bin: set SINK flag for bin parents when a sink element is added
Created attachment 329337 [details] [review] bin: set SINK flag for bin parents when a sink element is added Fix the bug
Patch #329337 fixes test case #329313, but does not fix the bug entirely. What happens if we have more than one bin containing sink elements, and their elements are connected directly to elements in other bins using gst_pad_link_full (srcpad, sinkpad, GST_PAD_LINK_CHECK_CAPS)? (That is, the bins do not have sink pads.) In that case the algorithm that sets the "degree" used for changing the state of the elements in the correct order does not work properly because of [1] and [2].

Refs
[1] https://github.com/Kurento/gstreamer/blob/0fb3a083ce04551fbdba7a94f0ac5612515bda67/gst/gstbin.c#L2052
[2] https://github.com/Kurento/gstreamer/blob/0fb3a083ce04551fbdba7a94f0ac5612515bda67/gst/gstbin.c#L2078
Review of attachment 329337 [details] [review]:

This will need a closer look; in any case thanks for the testcase :) I can't see anything obviously wrong in it, that is supposed to work.

::: gst/gstbin.c
@@ +1146,3 @@
+ GST_DEBUG_OBJECT (parent, "Set as sink");
+ GST_OBJECT_FLAG_SET (parent, GST_ELEMENT_FLAG_SINK);
+ parent = GST_OBJECT_PARENT (parent);

This seems racy, both the setting of the flags and the getting of the parent elements.
Oh actually the testcase *is* wrong. You're linking elements together that are not in the same bin, and this only works because you don't have GST_PAD_LINK_CHECK_HIERARCHY during linking. This is expected to fail in various interesting ways. You need to use ghostpads to link elements from different bins together.
Yes, I am linking elements in this way because I am using bins to add and remove a set of elements dynamically in an easy way; in other words, these bins have NO logic. This shouldn't be wrong if the API allows it, should it?

So, from your explanation and my answer:
- Should we fix the algorithm that sets the "degree" of the elements to take these cases into account?
- Should we mandate users to ALWAYS create ghostpads if they use bins? In that case:
  - What is the easiest way of doing this?
  - We have to take into account that this is less efficient (more and more useless pads :()
(In reply to Miguel París Díaz from comment #22)
> Yes, I am linking elements in this way because I am using bins to add and
> remove a set of elements dynamically in an easy way; in other words, these
> bins have NO logic. This shouldn't be wrong if the API allows it, should it?

The gst_pad_link_full() API allows disabling checks for performance reasons, for when you know that what you're doing is correct and the checks don't have to happen. In your case, what you're doing is not correct, so the result after linking is wrong.

> So, from your explanation and my answer:
> - Should we fix the algorithm that sets the "degree" of the elements to take
>   these cases into account?

No, the pipeline graph that is created is invalid :)

> - Should we mandate users to ALWAYS create ghostpads if they use bins? In
>   that case:
>   - What is the easiest way of doing this?

gst_element_link() and gst_element_link_pads() will automatically create ghostpads for you if needed. Or gst_pad_link_maybe_ghosting().

>   - We have to take into account that this is less efficient (more and
>     more useless pads :()

That would have to be optimized instead then :) If ghostpads are too expensive, then something has to be improved there in general. Ghostpads are not doing anything at all other than proxying.
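As a small sketch of that (the bin and element names here are illustrative, not from the testcase): letting gst_element_link() do the linking across bins creates the needed ghost pads automatically, so the graph stays valid for the sorting GstBin does on state changes.

#include <gst/gst.h>

static gboolean
link_across_bins (GstElement * pipeline)
{
  GstElement *bin_a = gst_bin_new ("bin_a");
  GstElement *bin_b = gst_bin_new ("bin_b");
  GstElement *src = gst_element_factory_make ("fakesrc", NULL);
  GstElement *sink = gst_element_factory_make ("fakesink", NULL);

  gst_bin_add (GST_BIN (bin_a), src);
  gst_bin_add (GST_BIN (bin_b), sink);
  gst_bin_add_many (GST_BIN (pipeline), bin_a, bin_b, NULL);

  /* Instead of gst_pad_link_full() with checks disabled, this creates
   * ghost pads on bin_a and bin_b and links through them. */
  return gst_element_link (src, sink);
}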
Created attachment 329377 [details] [review] utils: fix memory leak in find_common_root This patch fixes a memory leak that I found when using gst_element_link () instead of directly linking the pads of elements in different bins.
I have tried to apply attachment 329377 [details] [review] on the master branch, and I have noticed that it was already fixed by https://bugzilla.gnome.org/show_bug.cgi?id=765961 I should have tested it before, sorry for the noise :(.
Comment on attachment 329377 [details] [review] utils: fix memory leak in find_common_root No problem, shows that the patch was more than correct :)
Review of attachment 329337 [details] [review]:

::: gst/gstbin.c
@@ +1146,3 @@
+ GST_DEBUG_OBJECT (parent, "Set as sink");
+ GST_OBJECT_FLAG_SET (parent, GST_ELEMENT_FLAG_SINK);
+ parent = GST_OBJECT_PARENT (parent);

Why is it racy? How can I fix it?
The parent could go away while you're in the loop for example
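One way to address the parent-lifetime part of that, as a sketch against the loop in the attached patch (not a reviewed fix): take a reference on each parent with gst_object_get_parent() while walking up, instead of reading GST_OBJECT_PARENT() without any reference. Whether taking the object lock around the flag update is enough for the flag race is a separate question this sketch does not answer.

#include <gst/gst.h>

/* Walk up the bin hierarchy, keeping each parent alive with a reference. */
static void
set_sink_flag_on_parents (GstElement * element)
{
  GstObject *parent = gst_object_get_parent (GST_OBJECT (element));

  while (parent != NULL) {
    GstObject *next;

    GST_OBJECT_LOCK (parent);
    GST_OBJECT_FLAG_SET (parent, GST_ELEMENT_FLAG_SINK);
    GST_OBJECT_UNLOCK (parent);

    next = gst_object_get_parent (parent);   /* returns a new ref or NULL */
    gst_object_unref (parent);
    parent = next;
  }
}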
Note that using an invalid pipeline like this is going to cause you more problems than what you noticed so far. Better build valid pipelines and use those in Kurento and if there are performance problems, fix those instead of creating invalid pipelines.
Created attachment 329386 [details] [review] bin: set SINK flag for bin parents when a sink element is added
Comment on attachment 329386 [details] [review] bin: set SINK flag for bin parents when a sink element is added

You probably want to do the same for IS_SOURCE... and probably something is needed to *unset* the flag again. That is, if you have a bin A containing a bin B, and B contains a sink C, then you want A, B and C to have the IS_SINK flag set. But when C is removed from B, neither A nor B should have it set anymore. I think with the current code, only B would get the flag removed, right?
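A rough sketch of the unset logic being described here (hypothetical helper names, not part of the attached patch): after removing a child, walk up and clear the flag on each ancestor bin that no longer contains any sink child, stopping as soon as one still does.

#include <gst/gst.h>

/* Does this bin still have at least one child with the SINK flag set? */
static gboolean
bin_still_contains_sink (GstBin * bin)
{
  GList *l;
  gboolean found = FALSE;

  GST_OBJECT_LOCK (bin);
  for (l = bin->children; l != NULL; l = l->next) {
    if (GST_OBJECT_FLAG_IS_SET (l->data, GST_ELEMENT_FLAG_SINK)) {
      found = TRUE;
      break;
    }
  }
  GST_OBJECT_UNLOCK (bin);

  return found;
}

/* Called after a sink child was removed from 'bin': clear the SINK flag on
 * 'bin' and its ancestors until an ancestor still contains another sink. */
static void
unset_sink_flag_on_parents (GstBin * bin)
{
  GstObject *current = gst_object_ref (GST_OBJECT (bin));

  while (current != NULL && GST_IS_BIN (current)) {
    GstObject *next;

    if (bin_still_contains_sink (GST_BIN (current)))
      break;                      /* another sink keeps the flag alive */

    GST_OBJECT_LOCK (current);
    GST_OBJECT_FLAG_UNSET (current, GST_ELEMENT_FLAG_SINK);
    GST_OBJECT_UNLOCK (current);

    next = gst_object_get_parent (current);   /* new ref or NULL */
    gst_object_unref (current);
    current = next;
  }

  if (current != NULL)
    gst_object_unref (current);
}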
It would also be good to add some tests for this flags handling :) And this should all be done in a different bug report, as the flags handling problems are valid problems while this bug is about a problem that happened because of a problem in your application/testcase
Created attachment 330265 [details] Simple program to force sticky event misordering in tee src's peer pad
Created attachment 330266 [details] [review] pad: simple patch to force race conditions
I am so sorry, I uploaded attachment 330265 [details] and attachment 330266 [details] [review] here by mistake; could you remove them, please?