GNOME Bugzilla – Bug 690420
decodebin: Race between GstBin and decodebin trying to change states of child elements
Last modified: 2015-09-19 12:33:16 UTC
Created attachment 231808 [details] Sample program After several-thousend iterations with the attached program, I end up in a situation like in the backtrace. The file doesn't really matter, mostly it's on some random MP3 or FLAC. Happens a lot more often in Rygel (which is using async discovery, maybe because of that). Also doesn't matter if the discoverer is re-used or re-created. Freeing the information seems to change the timing enough to make the lock appear way less often.
+ Trace 231301
Thread 2 (Thread 0x7fa3d7744700 (LWP 1994))
Created attachment 231809 [details] Async version of sample code The async version seems to lock up "more reliably"
Also checked with 0.10 (from ubuntu 12.04) and 1.0.0 (from ppa) 0.10 happily runs for >500k iterations, while 1.0 managed ~500.
The pitivi people have also been running into these lock-ups (and they happen with 0.10 as well aiui). In any case, a stack trace:
+ Trace 231306
Thread 3 (Thread 0x7f17737fe700 (LWP 15716))
Thread 1 (Thread 0x7f177f55a7a0 (LWP 15691))
So I think this is a state-change / typefind issue. It looks like: THREAD 5: flacparse ! flacdec ! queue ! ... queue waiting for ALLOCATION query to be handled by downstream waiting on WAIT_DEL_CHECK GCond in gst_queue_handle_sink_query() cur_level = { buffers = 0, bytes = 0, time = 0}, max_size = {buffers = 1, bytes = 10485760, time = 1000000000}, THREAD 4: GstTask on typefind:sink pad has gone to PAUSED state waiting on GCond to resume THREAD 3: another GstTask on typefind:sink pad ?!? in "have-type" emission from typefind loop function in decodebin's connect_pad setting flacparse element to READY state stuck in GST_STATE_LOCK (flacparse) THREAD 2: GstBaseSink prerolled, stuck in wait_preroll waiting to go to PLAYING THREAD 1: app/discoverer thread -> ASYNC_DONE message on bus -> discoverer_collect -> "discovered" signal -> old discoverer destroyed, new one set up and start_async -> new uridecodebin set to PAUSED state (after releasing disco lock) -> which will activate typefind pads, which will start a new pad task to do the typefinding from, and from which the 'have-type' signal is being emitted from as well -> at the end stuck in gst_base_parse_sink_activate (flacparse) The big question for me is: how did thread 1 get to change state on flacparse? Typefinding is not done in the application thread any more in 1.0, so if I'm not mistaken what happens is basically that the pad task where the typefinding occurs emits have-type, and decodebin then adds the flacparse element to the bin, and proceeds to set it to READY/PAUSED state, BUT before it gets to that the state change mechanism in the application thread from where we originally started the typefinding picks up on the new element and tries to set its state too, which then gets stuck doing stuff to the typefind pad task holding the flacparse STATE lock, while the typefind pad task is stuck in decodebin waiting for flacparse's STATE_LOCK to be released. Any suggestions how to best fix this?
Nasty workaround: diff --git a/plugins/elements/gsttypefindelement.c b/plugins/elements/gsttypefindelement.c index a94439e..2bac590 100644 --- a/plugins/elements/gsttypefindelement.c +++ b/plugins/elements/gsttypefindelement.c @@ -889,6 +889,17 @@ gst_type_find_element_chain_do_typefinding (GstTypeFindElement * typefind, /* probability is good enough too, so let's make it known ... emiting this * signal calls our object handler which sets the caps. */ + { + GstObject *_parent; + + if ((_parent = gst_object_get_parent (GST_OBJECT (typefind)))) { + /* make sure our bin has finished any state change stuff before + * emitting have-type */ + GST_STATE_LOCK (_parent); + GST_STATE_UNLOCK (_parent); + gst_object_unref (_parent); + } + } g_signal_emit (typefind, gst_type_find_element_signals[HAVE_TYPE], 0, probability, caps); @@ -1048,6 +1059,17 @@ gst_type_find_element_loop (GstPad * pad) } GST_DEBUG ("Emiting found caps %" GST_PTR_FORMAT, found_caps); + { + GstObject *_parent; + + if ((_parent = gst_object_get_parent (GST_OBJECT (typefind)))) { + /* make sure our bin has finished any state change stuff before + * emitting have-type */ + GST_STATE_LOCK (_parent); + GST_STATE_UNLOCK (_parent); + gst_object_unref (_parent); + } + } g_signal_emit (typefind, gst_type_find_element_signals[HAVE_TYPE], 0, probability, found_caps); typefind->mode = MODE_NORMAL;
Thread 1 and 3 are important. Thread 5,4,2 are from a previous run. Thread 1 is doing the state change on the decodebin and has just managed to set the state on typefind. Thread 3 has just finished the typefind and is signaling the type-found signal. Decodebin then added the flacparse element. Thread 1 sees that a new element flacparse3 is added to the pipeline and restarts the state change, it now tries to set the state of the new flacparse to PAUSED. Thread 3 tries to set the flacparse to READY but blocks on the state lock because thread 1 was first. In thread 1: The flacparse tries to activate the peer pad in pull mode and for that it needs to deactivate the stream thread first, this deadlocks because thread 3 holds the stream lock and is blocked in the state-lock. It sounds like we need to prevent the bin from doing the resync and changing the state for us.
commit b163c091047b3766c40229b3d53b3b0beaf341ca Author: Wim Taymans <wim.taymans@collabora.co.uk> Date: Fri Dec 21 16:36:37 2012 +0100 bin: add flag to disable resync state change Add a GST_BIN_FLAG_NO_RESYNC that disables a resync when an element is added, removed or linked in the bin. This is interesting for complex bins that dynamically add elements to themselves and want to manage the state of those elements without interference from resyncs. See https://bugzilla.gnome.org/show_bug.cgi?id=690420 commit 0feecef2754ef208372eb39332b4f6fa2067d3d5 Author: Wim Taymans <wim.taymans@collabora.co.uk> Date: Fri Dec 21 16:38:36 2012 +0100 decodebin2: use NO_RESYNC flag to avoid the state change function from messing with the state of the elements that we add. Fixes https://bugzilla.gnome.org/show_bug.cgi?id=690420
*** Bug 681015 has been marked as a duplicate of this bug. ***
As Xabier found in bug #699124 this is introducing a fancy new bug caused by GstBin *not* resyncing anymore and handling the element states for us. Which is also part of bug #700342 and potentially others I had on my list since some days. What happens here is that we're still autoplugging in decodebin, but a state change back from PAUSED to READY happens. Thread 1 is currently inside decodebin's autoplugging code (which is called from the streaming threads of the elements) and currently adding e.g. a parser as a result of typefind's have-type signal. Thread 2 at this time is inside GstBin's loop to change the decodebin children from PAUSED to READY but does not know about the parser yet. It will block when shutting down typefind, because thread 1 currently holds the stream lock of typefind. Thread 1 adds the parser, sets it to READY and PAUSED (and starts the thread 3 of the parser that pulls data from upstream) and let's the autoplugging continue. Thread 2 unblocks at some point because thread 1 does not hold the stream lock anymore, and stops typefind. Thread 3 continues autoplugging. thread 2 returns from the GstBin state change function to the GstDecodeBin state change function, which now assumes that all data flow has stopped. And now interested deadlocks and crashes are happening everywhere because of broken assumptions everywhere. For example: Thread 3 holding the stream lock of the parser is in the decodebin autoplugging functions and wants to take decodebin's chain mutex. Thread 2 is freeing all chains and holding the chain mutex (thus thread 3 can't take it), and while holding it is setting the parsers state from PAUSED to READY, thus trying to get the stream lock of the parser. Deadlock. I'm not sure yet where all the crashes come from, there are also other bugs involved not caused by this here but e.g. by bug #700806 (sticky events getting dropped from pads before they're stored because of flushes and never resent) or bug #688824 (sticky events getting dropped from queues before they're stored on the srcpads of the queue and never resend... and serialized queries are never answered) or bug #700006 (audio/video codec base classes returning NULL from the allocate_buffer() function when flushing). Basically there seems to be a whole bunch of loosely interconnected bugs that now include this one here too :)
Maybe instead of this NO_RESYNC flag that was added, instead decodebin should just have used gst_element_set_locked_state(TRUE) on the parser as long as it was handling its state manually, and after it was done with that pass it back to GstBin again with gst_element_set_locked_state(FALSE). If I'm not missing anything this would have exactly the same behaviour of the NO_RESYNC flag for the original bug, but without causing the bug mentioned in comment 9: Thread 1 would add the parser, set it to READY and PAUSED, etc (with having the parser in locked state), while Thread 2 is waiting to be able to shut down typefind (which it only can after Thread 1 left the streaming thread of typefind). Then locked state of the parser is unset, thread 1 goes out of the streaming thread of typefind. Then thread 2 is shutting down typefind finally, continuing the iterator and seeing that it changed (the parser was added since it was waiting when changnig typefind's state!), resync, shut down typefind, shut down *all* data flow. And then return to GstDecodeBin's state change function without any dataflow happening anymore. I will experiment with that tomorrow.
Created attachment 245235 [details] [review] 0001-decodebin-Lock-the-state-of-child-elements-as-long-a.patch After reverting the commit to set the NO_RESYNC flag on decodebin (so it is *NOT* set), I can reproduce the bug in about 8000 iterations with the test program. With this patch here it is now already had 14000
It seems to not cause one deadlock of the other bugs anymore (the example I gave above), but all other bugs are still present (crashes, events disappearing, ...). At least an improvement. Jens, can you try if this works properly for you too after reverting the NO_RESYNC decodebin change?
commit fa7ad8743c47ec9256cf4f6e16c1bb8c57d91db7 Author: Sebastian Dröge <sebastian.droege@collabora.co.uk> Date: Fri May 24 13:41:46 2013 +0200 decodebin: Lock the state of child elements as long as we manage their states https://bugzilla.gnome.org/show_bug.cgi?id=690420 commit be4dba23ea1650c69a81e3b153496e51d388d35c Author: Sebastian Dröge <sebastian.droege@collabora.co.uk> Date: Fri May 24 11:47:13 2013 +0200 Revert "decodebin2: use NO_RESYNC flag" This reverts commit 0feecef2754ef208372eb39332b4f6fa2067d3d5.
Should be merged into 1.0 branch after some more testing!
Created attachment 311659 [details] [review] Fix another race condition In our application I run into this issue quite frequently. This patch seems to help, this is what I usually would run into: 1:11:49.314779015 18194 0xb20d9d40 WARN basesrc gstbasesrc.c:3460:gst_base_src_start_complete:<source> pad not activated yet 1:11:49.315069258 18194 0xb20d9d40 WARN GST_PADS gstpad.c:1058:gst_pad_set_active:<qtdemux456:sink> Failed to activate pad 1:11:49.315317966 18194 0xb20d9d40 WARN decodebin gstdecodebin2.c:2218:connect_pad:<decodebin294> Couldn't set qtdemux456 to PAUSED GLib (gthread-posix.c): Unexpected error from C library during 'pthread_mutex_lock': Invalid argument. Aborting. [New LWP 24512] Program received signal SIGABRT, Aborted. [Switching to LWP 24512] 0xffffe424 in __kernel_vsyscall () (gdb) thread apply all bt
+ Trace 235470
Thread 11 (LWP 24512)
Please open a new bug for this, it will get lost otherwise. Also related is probably bug #752651
(In reply to Sebastian Dröge (slomo) from comment #16) > Please open a new bug for this, it will get lost otherwise. Also related is > probably bug #752651 Done. Bug #755260