Bug 734688 - queue: race when receiving flush-stop event during shutdown, task gets re-started
Status: RESOLVED FIXED
Product: GStreamer
Classification: Platform
Component: gstreamer (core)
Version: git master
OS: Other Linux
Importance: Normal normal
Target Milestone: 1.4.1
Assigned To: GStreamer Maintainers
QA Contact: GStreamer Maintainers
Depends on:
Blocks: 735357 735359
 
 
Reported: 2014-08-12 18:41 UTC by Linus Svensson
Modified: 2014-08-25 07:52 UTC
See Also:
GNOME target: ---
GNOME version: ---


Attachments
backtrace (11.78 KB, application/octet-stream)
2014-08-12 18:41 UTC, Linus Svensson
  Details
prints of GstTask and GstQueue (2.72 KB, text/plain)
2014-08-12 18:51 UTC, Linus Svensson
  Details
rtsp-stream: Remove the multicast group udp sources when removing from the bin (1.94 KB, patch)
2014-08-13 14:23 UTC, Sebastian Dröge (slomo)
committed Details | Review
backtrace (16.74 KB, text/plain)
2014-08-15 14:14 UTC, Linus Svensson
  Details
Testcase that triggers a deadlock when changing state of a gstqueue (2.14 KB, patch)
2014-08-20 11:13 UTC, Linus Svensson
committed Details | Review
queue: fix race when flush-stop event comes in whilst shutting down (1.93 KB, patch)
2014-08-21 13:10 UTC, Tim-Philipp Müller
committed Details | Review
pad: don't accept flush-stop on inactive pads (possibly a good idea, but doesn't help with this bug) (3.05 KB, patch)
2014-08-21 13:53 UTC, Wim Taymans
none Details | Review

Description Linus Svensson 2014-08-12 18:41:21 UTC
Created attachment 283230 [details]
backtrace

I have seen a deadlock when a queue element is about to get finalized. For some reason, the task is still running.

Attached a backtrace with relevant threads.
Comment 1 Linus Svensson 2014-08-12 18:51:49 UTC
Created attachment 283231 [details]
prints of GstTask and GstQueue
Comment 2 Sebastian Dröge (slomo) 2014-08-13 14:13:54 UTC
Can you provide a testcase for reproducing this problem? This looks like the bin is removed from the server while it is still in a state > READY and doing data flow... without anything ever shutting it down.
Comment 3 Sebastian Dröge (slomo) 2014-08-13 14:23:24 UTC
Created attachment 283285 [details] [review]
rtsp-stream: Remove the multicast group udp sources when removing from the bin
Comment 4 Sebastian Dröge (slomo) 2014-08-13 14:23:42 UTC
Does this fix it by any chance?
Comment 5 Linus Svensson 2014-08-13 15:22:12 UTC
I'm working on finding a way to reproduce this faster with a test case, but no luck so far.

The pipeline is in state NULL and so is the queue. The udp sources are still in PLAYING state, but I don't think that's the problem here.
Comment 6 Sebastian Dröge (slomo) 2014-08-13 15:34:28 UTC
Does my patch help? And which is the exact line in rtsp-stream.c where it deadlocks? Your line numbers are different than mine :)
Comment 7 Linus Svensson 2014-08-13 15:36:15 UTC
It looks to me like task_func is called for the queue element when it
shouldn't be. There is a blocking probe installed on the queue's sink pad
that looks like this:

gst_pad_add_probe (pad,
    GST_PAD_PROBE_TYPE_BLOCK | GST_PAD_PROBE_TYPE_BUFFER |
    GST_PAD_PROBE_TYPE_BUFFER_LIST, pad_drop, NULL, NULL);

I have no idea if this makes any difference.
Comment 8 Linus Svensson 2014-08-13 15:45:00 UTC
This does not reproduce that easily. I'll have to wait at least one day to know if your patch helps, slomo.

The deadlock occurs here in gst_rtsp_stream_leave_bin:

1843	      gst_bin_remove (bin, priv->appqueue[i]);
(gdb) list
1842	    if (priv->appqueue[i])
1843	      gst_bin_remove (bin, priv->appqueue[i]);
1844	    if (priv->tee[i])
1845	      gst_bin_remove (bin, priv->tee[i]);

In the backtrace, I guess that thread 8 (rtsp thread pool) and thread 10 (task thread for queue) are of interest.

In thread 8, one can see that it waits for a task to finish (namely the streaming thread of the queue's task). But that task is executing the loop function in gstqueue.
Comment 9 Linus Svensson 2014-08-15 14:14:22 UTC
Created attachment 283502 [details]
backtrace
Comment 10 Linus Svensson 2014-08-15 14:32:49 UTC
New backtrace supplied. This is another case, but the same element is involved as in the first backtrace I supplied. It's not very easy to reproduce this, but I'm managing to trigger it more and more frequently.

A couple of interesting observations are that it's always the same element (the queue after the tee in the tcp part of an rtsp-server pipeline, corresponding to RTP). In my test setup I'm streaming both video and audio, but I cannot say that this always happens for the video part of the pipeline. The sync property on the sink is false in this particular pipeline.
Comment 11 Linus Svensson 2014-08-15 14:37:12 UTC
In the last supplied backtrace one can see two tasks that (at least in my opinion) look to be the streaming thread of the same element. Any ideas on how this could happen are very welcome.
Comment 12 Linus Svensson 2014-08-19 13:01:08 UTC
I'm getting different kinds of deadlocks when reproducing this problem. As mentioned before, it's always the same element (the queue between tee and appsink; this is a pipeline in gst-rtsp-server when using tcp as transport) that causes problems.

There always seems to be some problem with the task on the src pad of the queue. In one case I actually had two tasks, both of which looked like the task on the src pad. One task was in state STOPPED and one task in state STARTED. The src pad's task pointer pointed to the STARTED task. It looked like the STOPPED task was stopped from the rtsp server's client thread (currently running with only one thread serving clients, even if it's possible to use more). The STOPPED task was leaving state PAUSED and got stuck when it called g_rec_mutex_lock (lock) on its way out of the while loop in gst_task_func. This lock was held by the STARTED task. The STARTED task waited for buffers in gst_queue_loop. I guess that the bug is that the second task even got started in this case.

I got the output from GST_DEBUG=queue:6,queue_dataflow:6,task:6,rtspclient:6 today. When TEARDOWN is being served, the pipeline is set to GST_STATE_NULL, but a deadlock occurs when finalizing the src pad of the queue (the queue is in state NULL). The task on the pad is set to state STOPPED, but the task's thread is executing gst_queue_loop (in NULL state). It looks like the streaming task is stopped when doing the state transition PAUSED->READY, but a new task is started. From the debug log I can see that gst_pad_start_task() is called when we receive a FLUSH_STOP event.

I suspect that there might be a problem with this flush stop event. I don't know why I see this event and where it comes from.
Comment 13 Linus Svensson 2014-08-19 17:08:11 UTC
I have located the problem.

We serve an RTSP TEARDOWN request, which makes the pipeline change state to NULL. At the same time the src element sends FLUSH_START and FLUSH_STOP. If the pipeline is doing the PAUSED->READY transition and the FLUSH_STOP arrives at the queue, the queue will create a new task that is never stopped.

This src element is a bin element (filesrc ! matroskademux ! multiqueue) that plays a sequence of matroska files and demuxes them to one stream. It will send FLUSH_START+FLUSH_STOP if it's in state PAUSED when it changes file to avoid a deadlock when multiqueue is full. When changing matroska file, the filesrc and matroskademux will temporarily be lowered to state READY in order to change the location property of the file src.
Comment 14 Linus Svensson 2014-08-20 11:13:25 UTC
Created attachment 283963 [details] [review]
Testcase that triggers a deadlock when changing state of a gstqueue

I was able to reproduce this deadlock every time when running this

GST_CHECKS=test_state_change_when_flushing make -C tests/check elements/queue.check
Comment 15 Tim-Philipp Müller 2014-08-21 13:10:14 UTC
Created attachment 284082 [details] [review]
queue: fix race when flush-stop event comes in whilst shutting down

How about this?

    Don't re-start the queue push task on the source pad when a
    flush-stop event comes in and we're in the process of shutting
    down, otherwise that task will never be stopped again.
    
    When the element is set to READY state, the pads get de-activated.
    The source pad gets deactivated before the queue's own activate_mode
    function on the source pads gets called (which will stop the thread),
    so checking whether the pad is active before re-starting the task on
    receving flush-stop should be fine. The problem would happen when the
    flush-stop handler was called just after the queue's activate mode
    function had stopped the task.
Comment 16 Wim Taymans 2014-08-21 13:18:27 UTC
Maybe we should ignore those flush events on inactive pads. Other events are already ignored when the pad is flushing.
Comment 17 Tim-Philipp Müller 2014-08-21 13:27:45 UTC
I was wondering whether we should just make gst_pad_start_task() do nothing if the pad is inactive.

I'm not sure what you suggest would help in this case, since the sink pad might still be active.
Comment 18 Wim Taymans 2014-08-21 13:53:54 UTC
Created attachment 284089 [details] [review]
pad: don't accept flush-stop on inactive pads (possibly a good idea, but doesn't help with this bug)
Comment 19 Wim Taymans 2014-08-21 13:55:01 UTC
(In reply to comment #17)
> I was wondering whether we should just make gst_pad_start_task() do nothing if
> the pad is inactive.

That sounds like a good idea as well.

> 
> I'm not sure what you suggest would help in this case, since the sink pad might
> still be active.

Ah, indeed.
Comment 20 Wim Taymans 2014-08-21 14:00:40 UTC
Well, my patch makes sense but it doesn't solve this queue issue; you still need to explicitly check the active status of the queue srcpad somewhere.
Comment 21 Tim-Philipp Müller 2014-08-23 11:59:10 UTC
Nice work debugging this, and thanks for the unit test!

Pushed these to master now, will cherry-pick the fix into 1.4 as well:

commit ae74a1a83a8a5034c37da283d6176c3abab4b3cc
Author: Linus Svensson <linussn@axis.com>
Date:   Wed Aug 20 12:55:51 2014 +0200

    tests: add test that triggers deadlock in state change of queue
    
    When receiving FLASH_STOP in a state transition to READY, a queue
    element can end up with an active task that will never end.
    
    https://bugzilla.gnome.org/show_bug.cgi?id=734688

commit 93d8679b00c1402aaf44f37bb67487d1a3d0ffc4
Author: Tim-Philipp Müller <tim@centricular.com>
Date:   Thu Aug 21 14:02:16 2014 +0100

    queue: fix race when flush-stop event comes in whilst shutting down
    
    Don't re-start the queue push task on the source pad when a
    flush-stop event comes in and we're in the process of shutting
    down, otherwise that task will never be stopped again.
    
    When the element is set to READY state, the pads get de-activated.
    The source pad gets deactivated before the queue's own activate_mode
    function on the source pads gets called (which will stop the thread),
    so checking whether the pad is active before re-starting the task on
    receiving flush-stop should be fine. The problem would happen when the
    flush-stop handler was called just after the queue's activate mode
    function had stopped the task.
    
    Spotted and debugged by Linus Svensson <linux.svensson@axis.com>
    
    https://bugzilla.gnome.org/show_bug.cgi?id=734688


Wim's patch is independent. We might still want to refuse to start a task on a deactivated pad, but should probably look at the code of the various users a bit before that.
Comment 22 Sebastian Dröge (slomo) 2014-08-25 07:38:41 UTC
(In reply to comment #20)
> Well, my patch makes sense but it doesn't solve this queue issue, you somewhere
> need to explicitly check the active status of the queue srcpad.

Should we create another bug for this or how should we handle this patch?