GNOME Bugzilla – Bug 760532
bin, pipeline, sink: state change / async-done propagation issue with repeated rtspsrc seeks
Last modified: 2018-11-03 12:32:15 UTC
I'm having a issue at the moment with continuously seeks with RTSP playback. The rtsp client is connected with TCP interleave and I am seeking continuously, lets say every second or so I do a new seek. Then sometimes after seeking 3-5 times the video playback stalls for 20 seconds or so, but sometimes I can seek for 30 times before it happens. Server: rtsp server serving a mkv file, the pipeline looks something like: .---------. .---------------. .------------. | filesrc |->| matroskademux |->| rtph264pay | '---------' '---------------' '------------' But I have also tried adding queues in between the elements. The video file: resolution: 1280x720 bitrate: 4096kb framerate: 30fps duration: 60 min gov length: 10 Client: rtspsrc location=rtsp://127.0.0.1:8554/something protocols=0x4 ! queue ! rtph264depay! avdec_h264 ! autovideosink Test: I have tried to create a test application for this, but it seems to not always happen. #include <gst/gst.h> void do_seek(GstElement *pipeline, gint64 timestamp) { unsigned int seek_flags = GST_SEEK_FLAG_FLUSH | GST_SEEK_FLAG_KEY_UNIT | GST_SEEK_FLAG_SNAP_NEAREST; gst_element_seek_simple(pipeline, GST_FORMAT_TIME, seek_flags, timestamp); } int main() { GstElement *pipeline; GstBus *bus; GstMessage *msg; gint64 position = GST_SECOND * 5; // initialize GStreamer gst_init(NULL, NULL); pipeline = gst_parse_launch("rtspsrc name=rtspsrc ! queue ! rtph264depay ! fakesink", NULL); GstElement *rtspsrc = gst_bin_get_by_name(GST_BIN(pipeline), "rtspsrc"); g_object_set(rtspsrc, "location", "rtsp://localhost:8554/something", "protocols", 4, // 4 = GST_RTSP_LOWER_TRANS_TCP NULL); g_object_unref(rtspsrc); gst_element_set_state(pipeline, GST_STATE_PLAYING); gst_element_get_state(pipeline, NULL, NULL, GST_CLOCK_TIME_NONE); // Wait until error, eos or async done bus = gst_element_get_bus(pipeline); unsigned int bus_flags = GST_MESSAGE_ERROR | GST_MESSAGE_EOS | GST_MESSAGE_ASYNC_DONE; msg = gst_bus_timed_pop_filtered(bus, GST_CLOCK_TIME_NONE, (GstMessageType)(bus_flags)); unsigned int seek_count = 0; while(GST_MESSAGE_TYPE(msg) != GST_MESSAGE_EOS) { g_print("seek (%u) again... :D\n", seek_count++); position += GST_SECOND; do_seek(pipeline, position); gst_message_unref(msg); msg = gst_bus_timed_pop_filtered(bus, GST_CLOCK_TIME_NONE, (GstMessageType)(bus_flags)); } // Free resources gst_message_unref(msg); gst_object_unref(bus); gst_element_set_state(pipeline, GST_STATE_NULL); gst_object_unref(pipeline); g_print("done"); return 0; }
I see what you mean.. At first glance it looks to me like a problem on the client side, and more specifically something in GstBin or GstPipeline. What happens is that the fakesink posts an async-done message but the pipeline drops it and doesn't forward it to the app or post its own async-done message, and then the pipeline and/or sink don't go back to playing it seems.
Interesting, but that would mean that I should not be able to reproduce it with VLC? Update: Just to have a throw at the above statement, so I tried to reproduce it with VLC, and after a bit of seeking I hit the same stall with VLC.
There might be multiple issues then, so perhaps we should clone this bug for the rtsp-server side of things then.
What happens here on the client side is the following: 1) Based on async-done GstBin continues the state change PAUSED->PLAYING from a new thread *and* a new seek is done 2) The new seek is causing state loss in fakesink So at this point we have the seek thread causing a state change to PAUSED, and the state continue thread from GstBin doing a state change to PLAYING. At the same time. 3) state continue thread finishes first, sets pending state to VOID 4) fakesink posts async-done after a while for the state change to PAUSED but void is pending and thus it's never posted This seems the same as bug #760584
*** Bug 760584 has been marked as a duplicate of this bug. ***
Exactly the same, the gst-rtsp-server part was confirmed to be caused by b427997119a2b6aacbeb550f729936f8b963e24b in core too. Looking into a solution now.
So the main problem here is that the async-start that happens because of the lost-state in fakesink is ignored: gstbin.c:3181:bin_handle_async_start:<pipeline0> state change busy And it is ignored because GstBin currently continues to do the state change from PAUSED to PLAYING from the previous time when fakesink lost its state (the previous seek). To confuse more in the logs, based on the seek event rtspsrc also changes states of its rtpbin interleaved with everything else but I think that can be ignored for this bug. Now at some point the state change to PLAYING finishes, a bit later fakesink does async-done (so we actually have the pipeline go to PLAYING although fakesink is not prerolled!) but the async-done is ignored because we have no pending state anymore (we went to PLAYING successfully already and GstBin did not notice that fakesink lost the state in the meantime). So inside handle_async_start() we should probably do something else than just ignoring when a new async-start arrives. Additional to all that, there are many places in gstbin.c and gstelement.c that say they must be called with the GST_STATE_LOCK but they are not. Or that they are always called with it the GST_STATE_LOCK, but they are not. For which I have a patch but that doesn't help with this specific problem and it also seems like it could introduce deadlocks. I think the locking for state changes has to be generally reviewed again. The biggest issue here is now that state changes can be triggered from the top (gst_element_set_state) and from the bottom (lost_state), which just calls from lock order problems if we use the GST_STATE_LOCK of each bin to protect things here.
Created attachment 319561 [details] [review] element: Take state lock in all the places where the comments say it should be done
The patch causes a deadlock when shutting down server-VTS-H264-rtx.sh from gst-plugins-good rtp examples. So more stuff broken here :)
Reverted b427997119a2b6aacbeb550f729936f8b963e24b for now until we fixed this properly.
FWIW, this can also be reproduced with local files :)
Created attachment 322677 [details] [review] Revert "Revert "element: Don't hold state lock all the time while sending an event"" This reverts commit a0b3a7f65878a844b1f983441749afdd3512ee54.
Created attachment 322678 [details] [review] element: Add gst_element_call_async() This calls a function from another thread, asynchronously. This is to be used for cases when a state change has to be performed from a streaming thread, directly via gst_element_set_state() or indirectly e.g. via SEEK events. Calling those functions directly from the streaming thread will cause deadlocks in many situations, as they might involve waiting for the streaming thread to shut down from this very streaming thread. This is mostly a convenience function around a GThreadPool and is for example used by GstBin to continue asynchronous state changes.
Created attachment 322679 [details] [review] element: Add gst_element_recursive_state_lock() and unlock() This recursively takes the state lock top to bottom from the top-level bin to this element. This is useful to ensure that no other state change is happening anywhere in the bin.
Created attachment 322680 [details] [review] WIP: element: Ensure to always take all state locks when changing states ... and also call anything that might change the state from a streaming thread from another thread. TODO: All kinds of unit tests are currently failing because of broken assumptions.
Created attachment 322887 [details] [review] WIP: element: Ensure to always take all state locks when changing states ... and also call anything that might change the state from a streaming thread from another thread. TODO: All kinds of unit tests are currently failing because of broken assumptions.
Created attachment 323269 [details] [review] WIP: element: Ensure to always take all state locks when changing states ... and also ensure that nothing that changes the state of an outer bin happens from the same thread that changes the state of an inner bin. That is, prevent that the outer state lock is taken while the inner state lock is already taken too. Otherwise we are going to run into lock order problems. Also make sure to never take the state lock from the streaming thread, which also causes deadlocks during shutdown now. TODO: All kinds of unit tests are currently failing because of broken assumptions.
Patches in comment 16 and comment 17 are alternatives. The one in comment 17 needs some further debugging, but fixes the problem that the other was running into deadlocks easily when changing state of elements from inside a bin from the streaming thread.
Created attachment 323514 [details] [review] WIP: element: Ensure to always take all state locks when changing states ... and also ensure that nothing that changes the state of an outer bin happens from the same thread that changes the state of an inner bin. That is, prevent that the outer state lock is taken while the inner state lock is already taken too. Otherwise we are going to run into lock order problems. Also make sure to never take the state lock from the streaming thread, which also causes deadlocks during shutdown now. TODO: All kinds of unit tests are currently failing because of broken assumptions.
Created attachment 326877 [details] [review] rtspsrc: Don't propagate spurious state change returns from internal elements further We handle them inside rtspsrc and override them in all other cases anyway, so do the same for "internal" state changes like PAUSED->PAUSED and PLAYING->PLAYING. This keeps unexpected NO_PREROLL to confuse state changes in GstBin.
Created attachment 326878 [details] [review] rtspsrc: Don't propagate spurious state change returns from internal elements further We handle them inside rtspsrc and override them in all other cases anyway, so do the same for "internal" state changes like PAUSED->PAUSED and PLAYING->PLAYING. This keeps unexpected NO_PREROLL to confuse state changes in GstBin.
Comment on attachment 322678 [details] [review] element: Add gst_element_call_async() Attachment 322678 [details] pushed as 8177173 - element: Add gst_element_call_async()
(In reply to Sebastian Dröge (slomo) from comment #21) > Created attachment 326878 [details] [review] [review] > rtspsrc: Don't propagate spurious state change returns from internal > elements further > > We handle them inside rtspsrc and override them in all other cases anyway, so > do the same for "internal" state changes like PAUSED->PAUSED and > PLAYING->PLAYING. > > This keeps unexpected NO_PREROLL to confuse state changes in GstBin. This patch alone fixes the testcase... half. There seems to be an unrelated problem with it freezeing every ~100 seeks for a few 10 seconds, just to continue properly again afterwards. Which is probably something wrong on the RTSP side. Independent of that, there are at least two other problems with state changes. One of which I understand fully now, the other needs further investigation: 1) Pipelines with a NO_PREROLL and ASYNC element. The initial state change to PLAYING will succeed just fine, as the pending state of everything will be PLAYING. The bin will go to PLAYING immediately, the sink will follow soonish afterwards (commit_state there will go to PLAYING). So far so good. Now once the sink loses the PLAYING state, it will get pending=PAUSED. If the bin forgot the NO_PREROLL (because PAUSED->PLAYING returned SUCCESS from the NO_PREROLL element), everything is fine. The bin will go to pending=PAUSED too due to the ASYNC_START message and once the sink has prerolled will move everything over to PLAYING again. Question 1: is it intentional that a) the bin forgets about NO_PREROLL and b) the bin goes to PAUSED too if the sink loses state... but would go to PLAYING immediately without waiting for the sink during the initial state change to PLAYING? It seems inconsistent at best. Now the actual problem is if the bin did not forget about NO_PREROLL because of timing related reasons (which the rtspsrc fix works around for this specific case). The bin will then ignore the ASYNC_START message, the sink will preroll at some point and send ASYNC_DONE which is also ignored ("nothing pending). And that's it, in commit_state in basesink pending=PAUSED, so we just go to PAUSED and wait there for someone else to move us to PLAYING. Usually the bin would be responsible for lifting the element from PAUSED to PLAYING again but the bin ignored ASYNC_DONE (and ASYNC_START to begin with) so won't do it. Question 2: should the bin check on ASYNC_DONE if a) itself has no state change pending but b) the element that posted ASYNC_DONE is not in the target state (note: pending state is PAUSED here, target state is PLAYING) yet, and if so lift the element to PLAYING without affecting any other elements? 2) ASYNC_START being ignored if a state change is already happening. For ASYNC_DONE there is code for handling this case already: if polling==TRUE it will remember that there were ASYNC_DONE pending, and if so will handle the ASYNC_DONE before setting polling to FALSE again. For ASYNC_START there are two cases where we ignore them. If polling==TRUE and if GST_STATE_PENDING() is not VOID_PENDING. The first case can be solved similar to ASYNC_DONE, the second case seems not solveable easily as GST_STATE_PENDING() is outside the control of the bin. E.g. gst_element_state_continue() (as called from gst_element_change_state()) will call this. But from my understanding, the GST_STATE_PENDING() check also does not make sense: if a state is pending, we either would not modify it or rightfully lower the pending state (from PLAYING to PAUSED) because a new async state change is necessary. I think this is how these should be solved, but I'm struggling a bit with that. And 3) is a potential problem I'm not 100% sure about yet. gst_bin_continue_func() and handle_async_start() seem to be racy. If ASYNC_START is received between scheduling gst_bin_continue_func() and actually doing gst_bin_continue_func(), I think we might miss an asynchronous state change altogether.
Created attachment 326882 [details] [review] rtpjitterbuffer: Ensure to not take caps with the wrong pt for getting the clock-rate Especially the caps on the pad might be out of date, and the new caps would be provided for the current pt via the request-pt-map signal.
Created attachment 326883 [details] [review] rtspsrc: Update caps for TCP whenever they change We only changed them for UDP so far, which caused the wrong seqnum-base and other information to be passed to rtpjitterbuffer/etc when seeking. This usually wasn't that much of a problem as the code there is robust enough, but every now and then it causes us to drop up to 32756 packets before we continue doing anything meaningful.
As the RTSP problem seems largely unrelated to this specific bug, let's track it elsewhere. It's solved by these 3 patches though, the first one solving a problem in rtspsrc that would trigger the original state change bug here (which is still valid, just not for this test case).
See the current status of this specific bug here is still comment 23.
*** Bug 766835 has been marked as a duplicate of this bug. ***
*** Bug 766884 has been marked as a duplicate of this bug. ***
-- GitLab Migration Automatic Message -- This bug has been migrated to freedesktop.org's GitLab instance and has been closed from further activity. You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.freedesktop.org/gstreamer/gstreamer/issues/150.