Bug 760532 – bin, pipeline, sink: state change / async-done propagation issue with repeated rtspsrc seeks

After an evaluation, GNOME has moved from Bugzilla to GitLab. Learn more about GitLab.
No new issues can be reported in GNOME Bugzilla anymore.
To report an issue in a GNOME project, go to GNOME GitLab.
Do not go to GNOME Gitlab for: Bluefish, Doxygen, GnuCash, GStreamer, java-gnome, LDTP, NetworkManager, Tomboy.

Bug 760532 - bin, pipeline, sink: state change / async-done propagation issue with repeated rtspsrc seeks


Summary:	bin, pipeline, sink: state change / async-done propagation issue with repeate...


Status:	RESOLVED OBSOLETE

Product:	GStreamer
Classification:	Platform
Component:	gstreamer (core)
Version:	git master
Hardware:	Other Windows

Importance:	Normal critical
Target Milestone:	git master
Assigned To:	GStreamer Maintainers
QA Contact:	GStreamer Maintainers

URL:
Whiteboard:

Duplicates:	760584 766835 766884 (view as bug list)
Depends on:
Blocks:	744040

Reported:	2016-01-12 15:17 UTC by stevenhoving
Modified:	2018-11-03 12:32 UTC

See Also:	768522
GNOME target:	---
GNOME version:	---

Attachments
element: Take state lock in all the places where the comments say it should be done (4.17 KB, patch) 2016-01-22 16:07 UTC, Sebastian Dröge (slomo)	rejected	Details \| Review
Revert "Revert "element: Don't hold state lock all the time while sending an event"" (1.08 KB, patch) 2016-02-29 17:12 UTC, Sebastian Dröge (slomo)	none	Details \| Review
element: Add gst_element_call_async() (9.74 KB, patch) 2016-02-29 17:12 UTC, Sebastian Dröge (slomo)	committed	Details \| Review
element: Add gst_element_recursive_state_lock() and unlock() (3.63 KB, patch) 2016-02-29 17:12 UTC, Sebastian Dröge (slomo)	rejected	Details \| Review
WIP: element: Ensure to always take all state locks when changing states (29.89 KB, patch) 2016-02-29 17:12 UTC, Sebastian Dröge (slomo)	none	Details \| Review
WIP: element: Ensure to always take all state locks when changing states (27.42 KB, patch) 2016-03-02 17:07 UTC, Sebastian Dröge (slomo)	rejected	Details \| Review
WIP: element: Ensure to always take all state locks when changing states (24.07 KB, patch) 2016-03-07 13:12 UTC, Sebastian Dröge (slomo)	none	Details \| Review
WIP: element: Ensure to always take all state locks when changing states (24.68 KB, patch) 2016-03-09 14:14 UTC, Sebastian Dröge (slomo)	rejected	Details \| Review
rtspsrc: Don't propagate spurious state change returns from internal elements further (1.26 KB, patch) 2016-04-27 15:29 UTC, Sebastian Dröge (slomo)	none	Details \| Review
rtspsrc: Don't propagate spurious state change returns from internal elements further (1.39 KB, patch) 2016-04-27 15:32 UTC, Sebastian Dröge (slomo)	none	Details \| Review
rtpjitterbuffer: Ensure to not take caps with the wrong pt for getting the clock-rate (3.04 KB, patch) 2016-04-27 17:48 UTC, Sebastian Dröge (slomo)	none	Details \| Review
rtspsrc: Update caps for TCP whenever they change (3.56 KB, patch) 2016-04-27 17:48 UTC, Sebastian Dröge (slomo)	none	Details \| Review

Description stevenhoving 2016-01-12 15:17:34 UTC

I'm having a issue at the moment with continuously seeks with RTSP playback. The rtsp client is connected with TCP interleave and I am seeking continuously, lets say every second or so I do a new seek. Then sometimes after seeking 3-5 times the video playback stalls for 20 seconds or so, but sometimes I can seek for 30 times before it happens.

Server:
rtsp server serving a mkv file, the pipeline looks something like:
.---------.  .---------------.  .------------.
| filesrc |->| matroskademux |->| rtph264pay |
'---------'  '---------------'  '------------'

But I have also tried adding queues in between the elements.

The video file:
  resolution: 1280x720
  bitrate: 4096kb
  framerate: 30fps
  duration: 60 min
  gov length: 10

Client:
rtspsrc location=rtsp://127.0.0.1:8554/something protocols=0x4 ! queue ! rtph264depay! avdec_h264 ! autovideosink


Test:
I have tried to create a test application for this, but it seems to not always happen.

#include <gst/gst.h>

void do_seek(GstElement *pipeline, gint64 timestamp)
{
    unsigned int seek_flags = GST_SEEK_FLAG_FLUSH | GST_SEEK_FLAG_KEY_UNIT | GST_SEEK_FLAG_SNAP_NEAREST;
    gst_element_seek_simple(pipeline, GST_FORMAT_TIME, seek_flags, timestamp);
}

int main()
{
    GstElement *pipeline;
    GstBus *bus;
    GstMessage *msg;
    gint64 position = GST_SECOND * 5;

    // initialize GStreamer
    gst_init(NULL, NULL);

    pipeline = gst_parse_launch("rtspsrc name=rtspsrc ! queue ! rtph264depay ! fakesink", NULL);

    GstElement *rtspsrc = gst_bin_get_by_name(GST_BIN(pipeline), "rtspsrc");
    g_object_set(rtspsrc,
        "location", "rtsp://localhost:8554/something",
        "protocols", 4, // 4 = GST_RTSP_LOWER_TRANS_TCP
        NULL);
    g_object_unref(rtspsrc);

    gst_element_set_state(pipeline, GST_STATE_PLAYING);
    gst_element_get_state(pipeline, NULL, NULL, GST_CLOCK_TIME_NONE);

    // Wait until error, eos or async done
    bus = gst_element_get_bus(pipeline);

    unsigned int bus_flags = GST_MESSAGE_ERROR | GST_MESSAGE_EOS | GST_MESSAGE_ASYNC_DONE;
    msg = gst_bus_timed_pop_filtered(bus, GST_CLOCK_TIME_NONE, (GstMessageType)(bus_flags));

    unsigned int seek_count = 0;
    while(GST_MESSAGE_TYPE(msg) != GST_MESSAGE_EOS)
    {
        g_print("seek (%u) again... :D\n", seek_count++);
        position += GST_SECOND;
        do_seek(pipeline, position);

        gst_message_unref(msg);
        msg = gst_bus_timed_pop_filtered(bus, GST_CLOCK_TIME_NONE, (GstMessageType)(bus_flags));
    }

    // Free resources
    gst_message_unref(msg);
    gst_object_unref(bus);
    gst_element_set_state(pipeline, GST_STATE_NULL);
    gst_object_unref(pipeline);

    g_print("done");

    return 0;
}

Comment 1 Tim-Philipp Müller 2016-01-12 23:30:03 UTC

I see what you mean..

At first glance it looks to me like a problem on the client side, and more specifically something in GstBin or GstPipeline.

What happens is that the fakesink posts an async-done message but the pipeline drops it and doesn't forward it to the app or post its own async-done message, and then the pipeline and/or sink don't go back to playing it seems.

Comment 2 stevenhoving 2016-01-13 08:17:18 UTC

Interesting, but that would mean that I should not be able to reproduce it with VLC?

Update:
Just to have a throw at the above statement, so I tried to reproduce it with VLC, and after a bit of seeking I hit the same stall with VLC.

Comment 3 Tim-Philipp Müller 2016-01-13 13:45:19 UTC

There might be multiple issues then, so perhaps we should clone this bug for the rtsp-server side of things then.

Comment 4 Sebastian Dröge (slomo) 2016-01-19 17:31:59 UTC

What happens here on the client side is the following:

1) Based on async-done GstBin continues the state change PAUSED->PLAYING from a new thread *and* a new seek is done
2) The new seek is causing state loss in fakesink

So at this point we have the seek thread causing a state change to PAUSED, and the state continue thread from GstBin doing a state change to PLAYING. At the same time.

3) state continue thread finishes first, sets pending state to VOID
4) fakesink posts async-done after a while for the state change to PAUSED but void is pending and thus it's never posted


This seems the same as bug #760584

Comment 5 Sebastian Dröge (slomo) 2016-01-20 11:52:51 UTC

*** Bug 760584 has been marked as a duplicate of this bug. ***

Comment 6 Sebastian Dröge (slomo) 2016-01-20 11:53:21 UTC

Exactly the same, the gst-rtsp-server part was confirmed to be caused by b427997119a2b6aacbeb550f729936f8b963e24b in core too.

Looking into a solution now.

Comment 7 Sebastian Dröge (slomo) 2016-01-22 16:05:22 UTC

So the main problem here is that the async-start that happens because of the lost-state in fakesink is ignored: gstbin.c:3181:bin_handle_async_start:<pipeline0> state change busy

And it is ignored because GstBin currently continues to do the state change from PAUSED to PLAYING from the previous time when fakesink lost its state (the previous seek). To confuse more in the logs, based on the seek event rtspsrc also changes states of its rtpbin interleaved with everything else but I think that can be ignored for this bug. Now at some point the state change to PLAYING finishes, a bit later fakesink does async-done (so we actually have the pipeline go to PLAYING although fakesink is not prerolled!) but the async-done is ignored because we have no pending state anymore (we went to PLAYING successfully already and GstBin did not notice that fakesink lost the state in the meantime).

So inside handle_async_start() we should probably do something else than just ignoring when a new async-start arrives.


Additional to all that, there are many places in gstbin.c and gstelement.c that say they must be called with the GST_STATE_LOCK but they are not. Or that they are always called with it the GST_STATE_LOCK, but they are not. For which I have a patch but that doesn't help with this specific problem and it also seems like it could introduce deadlocks.

I think the locking for state changes has to be generally reviewed again. The biggest issue here is now that state changes can be triggered from the top (gst_element_set_state) and from the bottom (lost_state), which just calls from lock order problems if we use the GST_STATE_LOCK of each bin to protect things here.

Comment 8 Sebastian Dröge (slomo) 2016-01-22 16:07:00 UTC

Created attachment 319561 [details] [review]
element: Take state lock in all the places where the comments say it should be done

Comment 9 Sebastian Dröge (slomo) 2016-02-16 13:32:11 UTC

The patch causes a deadlock when shutting down server-VTS-H264-rtx.sh from gst-plugins-good rtp examples. So more stuff broken here :)

Comment 10 Sebastian Dröge (slomo) 2016-02-17 14:44:31 UTC

Reverted b427997119a2b6aacbeb550f729936f8b963e24b for now until we fixed this properly.

Comment 11 Sebastian Dröge (slomo) 2016-02-24 16:40:37 UTC

FWIW, this can also be reproduced with local files :)

Comment 12 Sebastian Dröge (slomo) 2016-02-29 17:12:39 UTC

Created attachment 322677 [details] [review]
Revert "Revert "element: Don't hold state lock all the time while sending an event""

This reverts commit a0b3a7f65878a844b1f983441749afdd3512ee54.

Comment 13 Sebastian Dröge (slomo) 2016-02-29 17:12:45 UTC

Created attachment 322678 [details] [review]
element: Add gst_element_call_async()

This calls a function from another thread, asynchronously. This is to be
used for cases when a state change has to be performed from a streaming
thread, directly via gst_element_set_state() or indirectly e.g. via SEEK
events.

Calling those functions directly from the streaming thread will cause
deadlocks in many situations, as they might involve waiting for the
streaming thread to shut down from this very streaming thread.

This is mostly a convenience function around a GThreadPool and is for example
used by GstBin to continue asynchronous state changes.

Comment 14 Sebastian Dröge (slomo) 2016-02-29 17:12:50 UTC

Created attachment 322679 [details] [review]
element: Add gst_element_recursive_state_lock() and unlock()

This recursively takes the state lock top to bottom from the top-level bin to
this element. This is useful to ensure that no other state change is happening
anywhere in the bin.

Comment 15 Sebastian Dröge (slomo) 2016-02-29 17:12:56 UTC

Created attachment 322680 [details] [review]
WIP: element: Ensure to always take all state locks when changing states

... and also call anything that might change the state from a streaming thread
from another thread.

TODO: All kinds of unit tests are currently failing because of broken
assumptions.

Comment 16 Sebastian Dröge (slomo) 2016-03-02 17:07:21 UTC

Created attachment 322887 [details] [review]
WIP: element: Ensure to always take all state locks when changing states

... and also call anything that might change the state from a streaming thread
from another thread.

TODO: All kinds of unit tests are currently failing because of broken
assumptions.

Comment 17 Sebastian Dröge (slomo) 2016-03-07 13:12:55 UTC

Created attachment 323269 [details] [review]
WIP: element: Ensure to always take all state locks when changing states

... and also ensure that nothing that changes the state of an outer bin
happens from the same thread that changes the state of an inner bin. That is,
prevent that the outer state lock is taken while the inner state lock is
already taken too. Otherwise we are going to run into lock order problems.

Also make sure to never take the state lock from the streaming thread, which
also causes deadlocks during shutdown now.

TODO: All kinds of unit tests are currently failing because of broken
assumptions.

Comment 18 Sebastian Dröge (slomo) 2016-03-07 13:15:32 UTC

Patches in comment 16 and comment 17 are alternatives. The one in comment 17 needs some further debugging, but fixes the problem that the other was running into deadlocks easily when changing state of elements from inside a bin from the streaming thread.

Comment 19 Sebastian Dröge (slomo) 2016-03-09 14:14:52 UTC

Created attachment 323514 [details] [review]
WIP: element: Ensure to always take all state locks when changing states

... and also ensure that nothing that changes the state of an outer bin
happens from the same thread that changes the state of an inner bin. That is,
prevent that the outer state lock is taken while the inner state lock is
already taken too. Otherwise we are going to run into lock order problems.

Also make sure to never take the state lock from the streaming thread, which
also causes deadlocks during shutdown now.

TODO: All kinds of unit tests are currently failing because of broken
assumptions.

Comment 20 Sebastian Dröge (slomo) 2016-04-27 15:29:17 UTC

Created attachment 326877 [details] [review]
rtspsrc: Don't propagate spurious state change returns from internal elements further

We handle them inside rtspsrc and override them in all other cases anyway, so
do the same for "internal" state changes like PAUSED->PAUSED and
PLAYING->PLAYING.

This keeps unexpected NO_PREROLL to confuse state changes in GstBin.

Comment 21 Sebastian Dröge (slomo) 2016-04-27 15:32:36 UTC

Created attachment 326878 [details] [review]
rtspsrc: Don't propagate spurious state change returns from internal elements further

We handle them inside rtspsrc and override them in all other cases anyway, so
do the same for "internal" state changes like PAUSED->PAUSED and
PLAYING->PLAYING.

This keeps unexpected NO_PREROLL to confuse state changes in GstBin.

Comment 22 Sebastian Dröge (slomo) 2016-04-27 15:54:02 UTC

Comment on attachment 322678 [details] [review]
element: Add gst_element_call_async()

Attachment 322678 [details] pushed as 8177173 - element: Add gst_element_call_async()

Comment 23 Sebastian Dröge (slomo) 2016-04-27 16:13:59 UTC

(In reply to Sebastian Dröge (slomo) from comment #21)
> Created attachment 326878 [details] [review] [review]
> rtspsrc: Don't propagate spurious state change returns from internal
> elements further
> 
> We handle them inside rtspsrc and override them in all other cases anyway, so
> do the same for "internal" state changes like PAUSED->PAUSED and
> PLAYING->PLAYING.
> 
> This keeps unexpected NO_PREROLL to confuse state changes in GstBin.

This patch alone fixes the testcase... half. There seems to be an unrelated problem with it freezeing every ~100 seeks for a few 10 seconds, just to continue properly again afterwards. Which is probably something wrong on the RTSP side.


Independent of that, there are at least two other problems with state changes. One of which I understand fully now, the other needs further investigation:

1) Pipelines with a NO_PREROLL and ASYNC element. The initial state change to PLAYING will succeed just fine, as the pending state of everything will be PLAYING. The bin will go to PLAYING immediately, the sink will follow soonish afterwards (commit_state there will go to PLAYING). So far so good.

Now once the sink loses the PLAYING state, it will get pending=PAUSED. If the bin forgot the NO_PREROLL (because PAUSED->PLAYING returned SUCCESS from the NO_PREROLL element), everything is fine. The bin will go to pending=PAUSED too due to the ASYNC_START message and once the sink has prerolled will move everything over to PLAYING again.

Question 1: is it intentional that a) the bin forgets about NO_PREROLL and b) the bin goes to PAUSED too if the sink loses state... but would go to PLAYING immediately without waiting for the sink during the initial state change to PLAYING? It seems inconsistent at best.

Now the actual problem is if the bin did not forget about NO_PREROLL because of timing related reasons (which the rtspsrc fix works around for this specific case). The bin will then ignore the ASYNC_START message, the sink will preroll at some point and send ASYNC_DONE which is also ignored ("nothing pending). And that's it, in commit_state in basesink pending=PAUSED, so we just go to PAUSED and wait there for someone else to move us to PLAYING. Usually the bin would be responsible for lifting the element from PAUSED to PLAYING again but the bin ignored ASYNC_DONE (and ASYNC_START to begin with) so won't do it.

Question 2: should the bin check on ASYNC_DONE if a) itself has no state change pending but b) the element that posted ASYNC_DONE is not in the target state (note: pending state is PAUSED here, target state is PLAYING) yet, and if so lift the element to PLAYING without affecting any other elements?


2) ASYNC_START being ignored if a state change is already happening. For ASYNC_DONE there is code for handling this case already: if polling==TRUE it will remember that there were ASYNC_DONE pending, and if so will handle the ASYNC_DONE before setting polling to FALSE again.

For ASYNC_START there are two cases where we ignore them. If polling==TRUE and if GST_STATE_PENDING() is not VOID_PENDING. The first case can be solved similar to ASYNC_DONE, the second case seems not solveable easily as GST_STATE_PENDING() is outside the control of the bin. E.g. gst_element_state_continue() (as called from gst_element_change_state()) will call this. But from my understanding, the GST_STATE_PENDING() check also does not make sense: if a state is pending, we either would not modify it or rightfully lower the pending state (from PLAYING to PAUSED) because a new async state change is necessary.
I think this is how these should be solved, but I'm struggling a bit with that.


And 3) is a potential problem I'm not 100% sure about yet. gst_bin_continue_func() and handle_async_start() seem to be racy. If ASYNC_START is received between scheduling gst_bin_continue_func() and actually doing gst_bin_continue_func(), I think we might miss an asynchronous state change altogether.

Comment 24 Sebastian Dröge (slomo) 2016-04-27 17:48:31 UTC

Created attachment 326882 [details] [review]
rtpjitterbuffer: Ensure to not take caps with the wrong pt for getting the clock-rate

Especially the caps on the pad might be out of date, and the new caps would be
provided for the current pt via the request-pt-map signal.

Comment 25 Sebastian Dröge (slomo) 2016-04-27 17:48:36 UTC

Created attachment 326883 [details] [review]
rtspsrc: Update caps for TCP whenever they change

We only changed them for UDP so far, which caused the wrong seqnum-base and
other information to be passed to rtpjitterbuffer/etc when seeking. This
usually wasn't that much of a problem as the code there is robust enough, but
every now and then it causes us to drop up to 32756 packets before we
continue doing anything meaningful.

Comment 26 Sebastian Dröge (slomo) 2016-04-27 17:50:02 UTC

As the RTSP problem seems largely unrelated to this specific bug, let's track it elsewhere. It's solved by these 3 patches though, the first one solving a problem in rtspsrc that would trigger the original state change bug here (which is still valid, just not for this test case).

Comment 27 Sebastian Dröge (slomo) 2016-04-27 17:53:32 UTC

See the current status of this specific bug here is still comment 23.

Comment 28 Sebastian Dröge (slomo) 2016-05-24 12:06:22 UTC

*** Bug 766835 has been marked as a duplicate of this bug. ***

Comment 29 Sebastian Dröge (slomo) 2016-05-25 20:47:53 UTC

*** Bug 766884 has been marked as a duplicate of this bug. ***

Comment 30 GStreamer system administrator 2018-11-03 12:32:15 UTC

-- GitLab Migration Automatic Message --

This bug has been migrated to freedesktop.org's GitLab instance and has been closed from further activity.

You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.freedesktop.org/gstreamer/gstreamer/issues/150.