GNOME Bugzilla – Bug 760584
adaptivedemux: tests: hls seek tests sometimes time out
Last modified: 2016-01-20 11:52:51 UTC
HLS seek tests (testSeekSnapAfterPosition, testReverseSeekSnapBeforePosition, testSeekPosition) sometimes time out; the reproduction rate is around once in 5-10 runs. From my investigation they block in the AppSink preroll, waiting for a switch-to-PLAYING event that never comes.

When the AppSink changes state from PAUSED to PLAYING on the first buffer received, it triggers a pipeline state change from PAUSED to PLAYING that is performed on a separate thread. This races with the thread performing the seek, which flushes the AppSink, making it lose state and drop from PLAYING back to PAUSED. Changing the pipeline to PLAYING means first changing the AppSink element and then the adaptivedemux element.

I identified 3 different scenarios for how the AppSink behaves when it receives the first buffer after the seek:

1. The AppSink has the pending state PLAYING when doing the preroll. I noticed this happens when the pipeline's change from PAUSED to PLAYING has already finished. In this case gst_base_sink_do_preroll changes state to PLAYING and does not call gst_base_sink_wait_preroll. The test finishes successfully.

2. The AppSink has the pending state PAUSED when doing the preroll. In this case it waits for a switch to PLAYING. This event comes, probably because the pipeline had not yet changed the AppSink's state and is doing it now. The test finishes successfully.

3. Similar to 2, but the switch-to-PLAYING event never comes. The test blocks in gst_base_sink_wait_preroll. It seems the pipeline changed the AppSink's state to PLAYING, but this was lost during the flush. Changing the adaptivedemux state blocks during the seek and resumes after it, so in this scenario the pipeline change managed to change the AppSink's state, was interrupted by the seek, and finished changing the adaptivedemux's state after the seek.
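Scenario 3 can be illustrated with a toy model (this is not GStreamer code; all names here are made up for illustration): a sink thread waits for a PLAYING notification, while a flush resets the state. If the pipeline's PAUSED-to-PLAYING change lands before the flush, the notification is effectively consumed and lost, and the preroll wait never returns:

```python
# Hypothetical, simplified model of the lost-state race (scenario 3).
# Not GStreamer API; ToySink and its methods are invented for illustration.
import threading

class ToySink:
    def __init__(self):
        self.cond = threading.Condition()
        self.state = "PAUSED"

    def set_state(self, new):                 # pipeline thread calls this
        with self.cond:
            self.state = new
            self.cond.notify_all()

    def flush(self):                          # seek thread: causes lost-state
        with self.cond:
            self.state = "PAUSED"             # PLAYING set earlier is lost

    def wait_preroll(self, timeout):          # streaming thread, first buffer
        with self.cond:
            return self.cond.wait_for(lambda: self.state == "PLAYING",
                                      timeout=timeout)

sink = ToySink()
sink.set_state("PLAYING")   # pipeline finishes its PAUSED->PLAYING change
sink.flush()                # seek flushes the sink: the PLAYING state is lost
# Nobody sets PLAYING again (the pipeline thinks it is done), so the preroll
# wait times out here; in the real test it blocks forever.
stuck = not sink.wait_preroll(timeout=0.2)
print("preroll stuck:", stuck)
```

The timeout exists only so the demo terminates; the real gst_base_sink_wait_preroll waits indefinitely.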
Got this today:
+ Trace 235903
(backtrace frames not captured here; the trace lists threads 6 through 2, LWPs 18001-18004 and 18006)
This is what happens:

1) The pipeline is doing its change from PAUSED to PLAYING in gstbin.c:2764:gst_bin_change_state_func.
2) The appsink (src_0) is set from PAUSED to PLAYING successfully.
3) Immediately after that, the appsink does a lost-state and posts an async-start message.
4) The pipeline finishes its state change with all returns == SUCCESS, does this check:

   if (ret != GST_STATE_CHANGE_ASYNC && !bin->priv->pending_async_done)

   and commits its state change to VOID_PENDING.
5) The appsink posts its async-done and waits for its state to be set to PLAYING, but:

   GST_STATES gstbin.c:3326:bin_handle_async_done:<pipeline> nothing pending

   The message is ignored and the appsink stays stalled.
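The bookkeeping failure in steps 4-5 can be sketched as a toy model (the class and field names here are illustrative, not GStreamer's actual internals): once the bin has committed its state change to VOID_PENDING, a later async-done finds nothing pending and is dropped:

```python
# Hypothetical toy model of the bin's pending-async bookkeeping.
# Invented names; only the control flow mirrors the steps above.
class ToyBin:
    def __init__(self):
        self.pending = "PLAYING"          # bin is mid PAUSED->PLAYING
        self.pending_async_done = False

    def finish_state_change(self, child_returns):
        # mirrors: if (ret != ASYNC && !priv->pending_async_done) -> commit
        if (all(r == "SUCCESS" for r in child_returns)
                and not self.pending_async_done):
            self.pending = "VOID_PENDING"  # bin considers itself done

    def handle_async_start(self):
        self.pending_async_done = True     # processed too late to matter

    def handle_async_done(self):
        if self.pending == "VOID_PENDING":
            return "ignored: nothing pending"  # sink stays stalled
        return "continue state change"

b = ToyBin()
b.finish_state_change(["SUCCESS", "SUCCESS"])  # step 4: check passes
b.handle_async_start()                         # step 3's message, raced late
result = b.handle_async_done()                 # step 5
print(result)
```

The point of the sketch is the ordering: the async-start posted in step 3 is only processed after the bin has already passed its check and committed, so the matching async-done has nothing to attach to.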
wonder if this is the same as I'm seeing in bug #760532.
(In reply to Tim-Philipp Müller from comment #3) > wonder if this is the same as I'm seeing in bug #760532. Looks very similar, yes
I would say it's exactly the same thing. We have two state changes happening at the same time in different threads. The other bug seems to be fixed by reverting the commit below (as sending the next seek event is then blocked until the state change to PLAYING is finished):

commit b427997119a2b6aacbeb550f729936f8b963e24b
Author: Sebastian Dröge <sebastian@centricular.com>
Date:   Thu Dec 10 11:35:05 2015 +0200

    element: Don't hold state lock all the time while sending an event

    This lock seems to exist only to prevent elements from changing states
    while events are being processed. However, events are going to be
    processed in those elements nonetheless if sent directly via pads, so
    protection must already be implemented inside the elements for event
    handling if it is needed. As such, having the lock here is not very
    useful and is actually causing various deadlocks in different
    situations as described in
    https://bugzilla.gnome.org/show_bug.cgi?id=744040

That only seems like a workaround though, specific to seek events. There seems to be a generic race condition somewhere between finishing asynchronous state changes and loss of state at the same time.
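What the reverted lock bought us can be sketched abstractly (assumed names, not the GStreamer API): if the state change and the seek-event dispatch take the same lock, the two threads serialize and the seek can never flush the sink in the middle of the pipeline's PAUSED-to-PLAYING change:

```python
# Illustrative sketch of serializing a state change and a SEEK via one lock,
# as the pre-revert GstElement code effectively did. Names are invented.
import threading

state_lock = threading.Lock()
log = []

def set_state_playing():
    with state_lock:               # the state change holds the lock...
        log.append("state-change-start")
        log.append("state-change-done")

def send_seek_event():
    with state_lock:               # ...and so does sending the event
        log.append("seek")

t = threading.Thread(target=set_state_playing)
t.start()
send_seek_event()
t.join()
# Whichever thread wins the lock, the other cannot interleave with it:
assert log.index("state-change-done") == log.index("state-change-start") + 1
print("serialized:", log)
```

As noted above, this only serializes SEEK events sent to the element; events sent directly to pads bypass the lock entirely, which is why it is a workaround rather than a fix.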
Also, the lock there would only catch it when actually sending a SEEK event to an element, as compared to sending it to pads. It all seems like a bad workaround :)
This is the reason for everything in bug #760532, including the gst-rtsp-server hangs. Let's track it there then. *** This bug has been marked as a duplicate of bug 760532 ***