GNOME Bugzilla – Bug 763711
splitmuxsink: deadlock when one streams doesn't have regular buffers
Last modified: 2016-08-07 15:16:05 UTC
I am recording audio+video+subtitles into small matroska files. For doing that I have a splitmuxsink and matroskamux.

- splitmuxsink calculates the size of buffers entering its multiqueue, using a blocking pad probe on each input pad. Once the threshold is reached, the state becomes SPLITMUX_STATE_ENDING_FILE and all input threads wait for EOS to reach the bus. It sends the EOS event downstream as soon as downstream unblocks to take the next buffer from the queue.
- matroskamux has a GstCollectPads which won't deliver any buffer to the matroska muxer until there is at least one buffer queued on every pad.

This results in a deadlock if the file size threshold is reached before the first subtitle buffer arrives. In that case GstCollectPads is blocked waiting for all its pads to have queued packets, but splitmuxsink blocked its input because the threshold was reached, so even if a subtitle arrives later it won't reach the collect pads.

I can reproduce that deadlock in my application, but it does a lot of other, more complicated stuff. I could work on writing a more minimal test case if needed. I have no idea what the proper fix would be.

Note that, even without splitmuxsink, if I understand correctly how GstCollectPads works, muxing a matroska with few subtitles could result in huge memory usage, since it's going to queue video buffers until it gets a subtitle... Imagine a movie that has long periods with nobody speaking...
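[Editor's note: a rough sketch of the input-side accounting described above, assuming made-up struct and function names - the real code lives in gst/multifile/gstsplitmuxsink.c and blocks by waiting on a condition variable inside the probe callback.]

#include <gst/gst.h>

typedef struct {
  guint64 bytes;           /* bytes seen on this input so far */
  guint64 max_size_bytes;  /* file-size threshold */
} MyStreamCtx;

static GstPadProbeReturn
count_input (GstPad * pad, GstPadProbeInfo * info, gpointer user_data)
{
  MyStreamCtx *ctx = user_data;
  GstBuffer *buf = GST_PAD_PROBE_INFO_BUFFER (info);

  ctx->bytes += gst_buffer_get_size (buf);
  if (ctx->bytes > ctx->max_size_bytes) {
    /* splitmuxsink switches to SPLITMUX_STATE_ENDING_FILE here and
     * waits inside this callback until the muxer has drained the
     * current fragment - the wait that never returns in this bug */
  }
  return GST_PAD_PROBE_OK;  /* let the buffer continue into the queue */
}

static void
install_input_probe (GstPad * sinkpad, MyStreamCtx * ctx)
{
  gst_pad_add_probe (sinkpad, GST_PAD_PROBE_TYPE_BUFFER,
      count_input, ctx, NULL);
}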
Created attachment 324068 [details]
minimal test case app showing the deadlock

Backtrace when the test case deadlocked:
Thread 12 (Thread 0x7f0431d23700 (LWP 877))
Thread 11 (Thread 0x7f0431522700 (LWP 878))
Thread 10 (Thread 0x7f0430d21700 (LWP 879))
Thread 9 (Thread 0x7f0423fff700 (LWP 880))
(In reply to Xavier Claessens from comment #0)
> Note that, even without splitmuxsink, if I understand correctly how
> GstCollectPads works, muxing a matroska with few subtitles could result
> in huge memory usage, since it's going to queue video buffers until it
> gets a subtitle... Imagine a movie that has long periods with nobody
> speaking...

Upstream elements are supposed to send GAP events on subtitle / sparse streams so that downstream elements get timely position updates even when there are large gaps in the data, and avoid that problem.
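[Editor's note: a minimal sketch of what such an upstream element does on a sparse stream; the pad and timestamp variables are illustrative.]

#include <gst/gst.h>

/* Tell downstream that nothing will arrive on this sparse stream for
 * 'duration' ns starting at 'position', so collectors and muxers can
 * keep advancing instead of waiting for a buffer. */
static gboolean
announce_gap (GstPad * srcpad, GstClockTime position, GstClockTime duration)
{
  return gst_pad_push_event (srcpad, gst_event_new_gap (position, duration));
}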
I think the fix for this will be to also handle the multiqueue's drained signal while waiting for the fragment to drain out, and use that to trigger completion.
(In reply to Jan Schmidt from comment #2)
> Upstream elements are supposed to send GAP events on subtitle / sparse
> streams so that downstream elements get timely position updates even when
> there are large gaps in the data, and avoid that problem.

Ok, fair enough, forget about that part then.

(In reply to Jan Schmidt from comment #3)
> I think the fix for this will be to also handle the multiqueue's drained
> signal while waiting for the fragment to drain out, and use that to
> trigger completion.

My understanding is that we should send an EOS event into all empty queues when going to the SPLITMUX_STATE_ENDING_FILE state. I am not sure there is a drain signal that would be emitted if the queue is already empty when we block in handle_mq_input().
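[Editor's note: a sketch of that idea, purely illustrative - it is not what was eventually merged, and the context type and queue-empty flag are hypothetical.]

#include <gst/gst.h>

typedef struct {
  GstPad *q_sinkpad;   /* hypothetical: the multiqueue sink pad */
  gboolean empty;      /* hypothetical: no buffers queued yet */
} StreamCtx;

/* When entering SPLITMUX_STATE_ENDING_FILE, push EOS into any queue
 * that never received a buffer so its output side can finish the file
 * instead of waiting forever. */
static void
eos_empty_queues (GList * contexts)
{
  GList *l;

  for (l = contexts; l != NULL; l = l->next) {
    StreamCtx *ctx = l->data;

    if (ctx->empty)
      gst_pad_send_event (ctx->q_sinkpad, gst_event_new_eos ());
  }
}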
I found a minimal pipeline to reproduce the problem.

test.srt (I think any .srt file will reproduce the issue):

1
00:00:00,000 --> 00:01:00,000
Text

Working pipeline (produces one 1.1MB file at 2.0MB max size):

gst-launch-1.0 filesrc location=test.srt ! subparse ! kateenc category=SUB ! mux.subtitle_0 \
    videotestsrc num-buffers=10 ! splitmuxsink location=foo%02d.mkv muxer=matroskamux name=mux \
    max-size-bytes=2000000

Locking pipeline (tries to produce 1.0MB max size):

gst-launch-1.0 filesrc location=test.srt ! subparse ! kateenc category=SUB ! mux.subtitle_0 \
    videotestsrc num-buffers=10 ! splitmuxsink location=foo%02d.mkv muxer=matroskamux name=mux \
    max-size-bytes=1000000

Backtrace when deadlocked:
Thread 5 (Thread 0x7fffe7fff700 (LWP 4414))
Thread 4 (Thread 0x7fffecb86700 (LWP 4413))
Thread 3 (Thread 0x7fffed387700 (LWP 4412))
Thread 2 (Thread 0x7fffedb88700 (LWP 4411))
Comment on attachment 324068 [details]
minimal test case app showing the deadlock

Marking as obsolete since lubosz posted an easier way to reproduce the deadlock.
Created attachment 324300 [details] [review]
splitmuxsink: Fix deadlock happening when one stream receives few buffers
Review of attachment 324300 [details] [review]:

Please put some explanation in the commit message. It's always easier to find things in the commit history with better commit messages.

::: gst/multifile/gstsplitmuxsink.c
@@ +1141,3 @@
+        splitmux->max_in_running_time);
+    event = gst_event_new_gap (pos, GST_CLOCK_TIME_NONE);
+    gst_pad_send_event (ctx->sinkpad, event);

How well did you test this, and with which mux formats? I'm not entirely sure it won't introduce actual gaps in the playback of the resulting pieces with at least some formats.

@@ +1273,2 @@
     (GstPadProbeCallback) handle_mq_input, ctx, (GDestroyNotify)
     _pad_block_destroy_sink_notify);

Why did you move the pad probe from the multiqueue sink pad to the ghostpad?
Created attachment 324383 [details] [review]
splitmuxsink: Fix occasional deadlock when ending file with subtitle

Deadlock occurs when splitting files if one stream received no buffer during the first GOP of the next file. That can happen in this scenario, for example:

1) The first GOP of video is collected; it has a duration of 10s. max_in_running_time is set to 10s.
2) Other streams catch up and we receive the first subtitle buffer at ts=0; it has a duration of 1min.
3) We receive the 2nd subtitle buffer with ts=1min. in_running_time is set to 1min. That buffer is blocked in handle_mq_input() because max_in_running_time is still 10s.
4) Since all in_running_time are now > 10s, max_out_running_time is now set to 10s. That first GOP gets recorded into the file. The muxer pops buffers out of the mq; when it tries to pop a 2nd subtitle buffer it blocks because the GstDataQueue is empty.
5) A 2nd GOP of video is collected and has a duration of 10s as well. max_in_running_time is now 20s. Since the subtitle's in_running_time is already 1min, that GOP is already complete.
6) But let's say we overran the max file size; we thus set the state to SPLITMUX_STATE_ENDING_FILE now. As soon as a buffer with ts > 10s (end of previous GOP) arrives in handle_mq_output(), an EOS event is sent downstream instead. But since the subtitle queue is empty, that's never going to happen. The pipeline is now deadlocked.

To fix this situation we have to:
- Send a dummy event through the queue to wake up the output thread.
- Update out_running_time to at least max_out_running_time so it sends EOS.
- Respect time order, so we set out_running_time=max_in_running_time because that's bigger than the previous buffer and smaller than the next.
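[Editor's note: a sketch of the "dummy event" wake-up described above; the structure name and field are illustrative and not necessarily what the attached patch uses.]

#include <gst/gst.h>

/* Wake up the output thread of one stream's queue by sending a
 * serialized custom event through it; serialized events travel through
 * the queue like buffers, so popping one unblocks an empty queue. */
static void
send_unblock_event (GstPad * q_sinkpad, GstClockTime max_in_running_time)
{
  GstStructure *s;
  GstEvent *event;

  s = gst_structure_new ("splitmuxsink-unblock",
      "timestamp", G_TYPE_UINT64, (guint64) max_in_running_time, NULL);
  event = gst_event_new_custom (GST_EVENT_CUSTOM_DOWNSTREAM, s);
  gst_pad_send_event (q_sinkpad, event);
}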
(In reply to Jan Schmidt from comment #8)
> Please put some explanation in the commit message. It's always easier to
> find things in the commit history with better commit messages.

Right, done.

> How well did you test this, and with which mux formats? I'm not entirely
> sure it won't introduce actual gaps in the playback of the resulting pieces
> with at least some formats.

I tested with the command from comment #5. It reliably deadlocks without the patch, and works fine with it. I didn't notice problems with GAPs being recorded, but you're right that we should drop those dummy events. Updated the patch to do that.

> Why did you move the pad probe from the multiqueue sink pad to the ghostpad?

Added a comment explaining why. I have to send an event into the mq without triggering handle_mq_input().
(In reply to Xavier Claessens from comment #10)
> Added a comment explaining why. I have to send an event into the mq without
> triggering handle_mq_input().

I don't think that's needed - handle_mq_input() will just pass the custom event anyway (and it works here if I just revert those 2 lines). Moving the probe to the externally visible pad can cause problems if the app or other plugins install their own probes and interfere with the internal one.

There's a different stall I found which we should also address if pads don't all EOS in the same fragment - if pads are already EOS when the new fragment starts, the EOS isn't repeated into the muxer and things never finish up.
Created attachment 324412 [details] [review]
splitmuxsink: Fix occasional deadlock when ending file with subtitle

(Commit message identical to attachment 324383 above.)
(In reply to Jan Schmidt from comment #11)
> I don't think that's needed - handle_mq_input() will just pass the custom
> event anyway (and it works here if I just revert those 2 lines). Moving the
> probe to the externally visible pad can cause problems if the app or other
> plugins install their own probes and interfere with the internal one.

Hm, in some older versions of the patch that was causing deadlocks, but it does seem to work now. I didn't know it was safe to send an event on a pad from within a blocking probe on that same pad.

> There's a different stall I found which we should also address if pads don't
> all EOS in the same fragment - if pads are already EOS when the new fragment
> starts, the EOS isn't repeated into the muxer and things never finish up.

I think my patch fixes that case already. restart_context() resets ctx->out_eos to FALSE and the custom event will wake up that queue.

If you're happy with the patch, could you please push it? I don't have permission for that. Thanks.
(In reply to Xavier Claessens from comment #13)
> I think my patch fixes that case already. restart_context() resets
> ctx->out_eos to FALSE and the custom event will wake up that queue.

No - once there's EOS on a pad, there won't be any more data to trigger handle_mq_input() again. There's also a second problem - once EOS has passed on the mq sink pad, all further events that aren't either flush-start or stream-start are discarded by the pad.

I think when handling the unblock event on any mq srcpad, it needs to iterate the other contexts and check if their mq srcpad has GST_PAD_FLAG_EOS without ctx->out_eos - i.e., the pad was at EOS but hasn't sent an EOS in this fragment. If the iterator then sets ctx->out_eos as normal, we can still be sure that one and only one EOS is sent per pad.

> If you're happy with the patch, could you please push it? I don't have
> permission for that. Thanks.

Are you OK to work on that extra deadlock? A unit test would, of course, also be cool if you're up for that.
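[Editor's note: a sketch of the check being suggested; the context type and field names are illustrative.]

#include <gst/gst.h>

typedef struct {
  GstPad *q_srcpad;    /* hypothetical: the multiqueue source pad */
  gboolean out_eos;    /* EOS already sent downstream this fragment */
} StreamCtx;

/* While handling the unblock event on one srcpad, visit the other
 * contexts: any pad flagged EOS that hasn't yet sent EOS in this
 * fragment gets exactly one EOS pushed downstream. */
static void
resend_eos_for_fragment (GList * contexts)
{
  GList *l;

  for (l = contexts; l != NULL; l = l->next) {
    StreamCtx *ctx = l->data;

    if (GST_PAD_IS_EOS (ctx->q_srcpad) && !ctx->out_eos) {
      ctx->out_eos = TRUE;  /* one and only one EOS per pad */
      gst_pad_push_event (ctx->q_srcpad, gst_event_new_eos ());
    }
  }
}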
I don't think that's a problem: the EOS event is sticky, so restart_context() will resend it for every new file. I tested with this:

test.srt:

1
00:00:00,000 --> 00:00:10,000
Text

$ gst-launch-1.0 filesrc location=test.srt ! subparse ! kateenc category=SUB ! mux.subtitle_0 \
    videotestsrc is-live=true ! splitmuxsink location=foo%02d.mkv muxer=matroskamux name=mux \
    max-size-bytes=20971520

The subtitle stream sends EOS after 10s, and it continues producing files after that.
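[Editor's note: "sticky" means the EOS event stays stored on the pad and can be retrieved later; a minimal sketch of checking for it, with an assumed pad variable.]

#include <gst/gst.h>

/* A sticky EOS can be fetched back from the pad after the fact; this
 * is what lets a new fragment learn that a stream already ended. */
static gboolean
pad_already_ended (GstPad * pad)
{
  GstEvent *eos = gst_pad_get_sticky_event (pad, GST_EVENT_EOS, 0);

  if (eos == NULL)
    return FALSE;
  gst_event_unref (eos);
  return TRUE;
}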
Created attachment 324492 [details]
SRT test file

Perhaps I misunderstood the hang that I see with the attached SRT file - I didn't spend very long on it.
With my patch applied, I get no hang with your file.
I'll take another look at what's going on. For me, the test pipeline above succeeds rarely, but hangs most of the time with that longer test file.
Created attachment 325024 [details] [review]
splitmuxsink: Fix occasional deadlock when ending file with subtitle

(Commit message identical to attachment 324383 above.)
Created attachment 325025 [details] [review]
splitmux: Handle a hang draining out at EOS

Make sure that all data is drained out when the reference pad goes EOS. Fixes a problem where data that arrives on other pads after the reference pad finishes can stall forever and never pass EOS.
Finally figured out the hang that I was seeing at EOS. Between them, these 2 patches fix both things for me.
Makes sense to me, thanks.
My 2nd patch caused the regression in bug #765072, but reverting that part reintroduces a hang at EOS for me, so this bug needs revisiting. This pipeline hangs again:

gst-launch-1.0 filesrc location=~/media/srt/test-short.srt ! subparse ! kateenc category=SUB ! mux.subtitle_0 \
    videotestsrc num-buffers=10 ! splitmuxsink location=foo%02d.mkv muxer=matroskamux name=mux \
    max-size-bytes=1000000

using the srt file attached to this bug, or the one in comment #15.
I can't reproduce the hang now. Looks like it was fixed since.