GNOME Bugzilla – Bug 763711
splitmuxsink: deadlock when one streams doesn't have regular buffers
Last modified: 2016-08-07 15:16:05 UTC
I am recording audio+video+subtitles into small matroska files. For doing that I have a splitmuxsink and matroskamux.

- splitmuxsink calculates the size of buffers entering its multiqueue, using a blocking pad probe on each input pad. Once the threshold is reached, the state becomes SPLITMUX_STATE_ENDING_FILE and all input threads wait for EOS to reach the bus. It sends the EOS event downstream as soon as downstream unblocks to take the next buffer from the queue.
- matroskamux has a GstCollectPads which won't deliver any buffer to the matroska muxer until there is at least one buffer queued on every pad.

This results in a deadlock if the file size threshold is reached before the first subtitle buffer arrives. In that case GstCollectPads is blocked waiting for all its pads to have queued packets, but splitmuxsink blocked its input because the threshold was reached, so even if a subtitle arrives later it won't reach the collect pads.

I can reproduce that deadlock in my application, but it does a lot of other, more complicated stuff. I could work on writing a more minimal test case if needed. I have no idea what the proper fix would be.

Note that, even without splitmuxsink, if I understand correctly how GstCollectPads works, muxing a matroska with few subtitles could result in huge memory usage, since it's going to queue video buffers until it gets a subtitle... Imagine a movie that has long periods with nobody speaking...
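[Editor's note: a rough sketch of the input-side accounting described above, assuming made-up struct and function names - the real code lives in gst/multifile/gstsplitmuxsink.c and blocks by waiting on a condition variable inside the probe callback.]

#include <gst/gst.h>

typedef struct {
  guint64 bytes;           /* bytes seen on this input so far */
  guint64 max_size_bytes;  /* file-size threshold */
} MyStreamCtx;

static GstPadProbeReturn
count_input (GstPad * pad, GstPadProbeInfo * info, gpointer user_data)
{
  MyStreamCtx *ctx = user_data;
  GstBuffer *buf = GST_PAD_PROBE_INFO_BUFFER (info);

  ctx->bytes += gst_buffer_get_size (buf);
  if (ctx->bytes > ctx->max_size_bytes) {
    /* splitmuxsink switches to SPLITMUX_STATE_ENDING_FILE here and
     * waits inside this callback until the muxer has drained the
     * current fragment - the wait that never returns in this bug */
  }
  return GST_PAD_PROBE_OK;  /* let the buffer continue into the queue */
}

static void
install_input_probe (GstPad * sinkpad, MyStreamCtx * ctx)
{
  gst_pad_add_probe (sinkpad, GST_PAD_PROBE_TYPE_BUFFER,
      count_input, ctx, NULL);
}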
Created attachment 324068 [details]
minimal test case app showing the deadlock

Backtrace when the test case deadlocked:
Thread 12 (Thread 0x7f0431d23700 (LWP 877))
Thread 11 (Thread 0x7f0431522700 (LWP 878))
Thread 10 (Thread 0x7f0430d21700 (LWP 879))
Thread 9 (Thread 0x7f0423fff700 (LWP 880))
(In reply to Xavier Claessens from comment #0)
> Note that, even without splitmuxsink, if I understand correctly how
> GstCollectPads works, muxing a matroska with few subtitles could result
> in huge memory usage, since it's going to queue video buffers until it
> gets a subtitle... Imagine a movie that has long periods with nobody
> speaking...

Upstream elements are supposed to send GAP events on subtitle / sparse streams so that downstream elements get timely position updates even when there are large gaps in the data, and avoid that problem.
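[Editor's note: a minimal sketch of what such an upstream element does on a sparse stream; the pad and timestamp variables are illustrative.]

#include <gst/gst.h>

/* Tell downstream that nothing will arrive on this sparse stream for
 * 'duration' ns starting at 'position', so collectors and muxers can
 * keep advancing instead of waiting for a buffer. */
static gboolean
announce_gap (GstPad * srcpad, GstClockTime position, GstClockTime duration)
{
  return gst_pad_push_event (srcpad, gst_event_new_gap (position, duration));
}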
I think the fix for this will be to also handle the multiqueue's drained signal while waiting for the fragment to drain out, and use that to trigger completion.
(In reply to Jan Schmidt from comment #2)
> Upstream elements are supposed to send GAP events on subtitle / sparse
> streams so that downstream elements get timely position updates even when
> there are large gaps in the data, and avoid that problem.

Ok, fair enough, forget about that part then.

(In reply to Jan Schmidt from comment #3)
> I think the fix for this will be to also handle the multiqueue's drained
> signal while waiting for the fragment to drain out, and use that to
> trigger completion.

My understanding is that we should send an EOS event into all empty queues when going to the SPLITMUX_STATE_ENDING_FILE state. I am not sure there is a drain signal that would be emitted if the queue is already empty when we block in handle_mq_input().
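[Editor's note: a sketch of that idea, purely illustrative - it is not what was eventually merged, and the context type and queue-empty flag are hypothetical.]

#include <gst/gst.h>

typedef struct {
  GstPad *q_sinkpad;   /* hypothetical: the multiqueue sink pad */
  gboolean empty;      /* hypothetical: no buffers queued yet */
} StreamCtx;

/* When entering SPLITMUX_STATE_ENDING_FILE, push EOS into any queue
 * that never received a buffer so its output side can finish the file
 * instead of waiting forever. */
static void
eos_empty_queues (GList * contexts)
{
  GList *l;

  for (l = contexts; l != NULL; l = l->next) {
    StreamCtx *ctx = l->data;

    if (ctx->empty)
      gst_pad_send_event (ctx->q_sinkpad, gst_event_new_eos ());
  }
}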
I found a minimal pipeline to reproduce the problem.

test.srt (I think any .srt file will reproduce the issue):

1
00:00:00,000 --> 00:01:00,000
Text

Working pipeline (produces one 1.1MB file at 2.0MB max size):

gst-launch-1.0 filesrc location=test.srt ! subparse ! kateenc category=SUB ! mux.subtitle_0 \
    videotestsrc num-buffers=10 ! splitmuxsink location=foo%02d.mkv muxer=matroskamux name=mux \
    max-size-bytes=2000000

Locking pipeline (tries to produce 1.0MB max size):

gst-launch-1.0 filesrc location=test.srt ! subparse ! kateenc category=SUB ! mux.subtitle_0 \
    videotestsrc num-buffers=10 ! splitmuxsink location=foo%02d.mkv muxer=matroskamux name=mux \
    max-size-bytes=1000000

Backtrace when deadlocked:
Thread 5 (Thread 0x7fffe7fff700 (LWP 4414))
Thread 4 (Thread 0x7fffecb86700 (LWP 4413))
Thread 3 (Thread 0x7fffed387700 (LWP 4412))
Thread 2 (Thread 0x7fffedb88700 (LWP 4411))
Comment on attachment 324068 [details]
minimal test case app showing the deadlock

Marking as obsolete since lubosz posted an easier way to reproduce the deadlock.
Created attachment 324300 [details] [review]
splitmuxsink: Fix deadlock happening when one stream receives few buffers
Review of attachment 324300 [details] [review]:

Please put some explanation in the commit message. It's always easier to find things in the commit history with better commit messages.

::: gst/multifile/gstsplitmuxsink.c
@@ +1141,3 @@
+        splitmux->max_in_running_time);
+    event = gst_event_new_gap (pos, GST_CLOCK_TIME_NONE);
+    gst_pad_send_event (ctx->sinkpad, event);

How well did you test this, and with which mux formats? I'm not entirely sure it won't introduce actual gaps in the playback of the resulting pieces with at least some formats.

@@ +1273,2 @@
     (GstPadProbeCallback) handle_mq_input, ctx, (GDestroyNotify)
     _pad_block_destroy_sink_notify);

Why did you move the pad probe from the multiqueue sink pad to the ghostpad?
Created attachment 324383 [details] [review]
splitmuxsink: Fix occasional deadlock when ending file with subtitle

Deadlock occurs when splitting files if one stream received no buffer during the first GOP of the next file. That can happen in this scenario, for example:

1) The first GOP of video is collected; it has a duration of 10s. max_in_running_time is set to 10s.
2) Other streams catch up and we receive the first subtitle buffer at ts=0; it has a duration of 1min.
3) We receive the 2nd subtitle buffer with ts=1min. in_running_time is set to 1min. That buffer is blocked in handle_mq_input() because max_in_running_time is still 10s.
4) Since all in_running_time are now > 10s, max_out_running_time is now set to 10s. That first GOP gets recorded into the file. The muxer pops buffers out of the mq; when it tries to pop a 2nd subtitle buffer it blocks because the GstDataQueue is empty.
5) A 2nd GOP of video is collected and has a duration of 10s as well. max_in_running_time is now 20s. Since the subtitle's in_running_time is already 1min, that GOP is already complete.
6) But let's say we overran the max file size; we thus set the state to SPLITMUX_STATE_ENDING_FILE now. As soon as a buffer with ts > 10s (end of previous GOP) arrives in handle_mq_output(), an EOS event is sent downstream instead. But since the subtitle queue is empty, that's never going to happen. The pipeline is now deadlocked.

To fix this situation we have to:
- Send a dummy event through the queue to wake up the output thread.
- Update out_running_time to at least max_out_running_time so it sends EOS.
- Respect time order, so we set out_running_time=max_in_running_time because that's bigger than the previous buffer and smaller than the next.
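[Editor's note: a sketch of the "dummy event" wake-up described above; the structure name and field are illustrative and not necessarily what the attached patch uses.]

#include <gst/gst.h>

/* Wake up the output thread of one stream's queue by sending a
 * serialized custom event through it; serialized events travel through
 * the queue like buffers, so popping one unblocks an empty queue. */
static void
send_unblock_event (GstPad * q_sinkpad, GstClockTime max_in_running_time)
{
  GstStructure *s;
  GstEvent *event;

  s = gst_structure_new ("splitmuxsink-unblock",
      "timestamp", G_TYPE_UINT64, (guint64) max_in_running_time, NULL);
  event = gst_event_new_custom (GST_EVENT_CUSTOM_DOWNSTREAM, s);
  gst_pad_send_event (q_sinkpad, event);
}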
(In reply to Jan Schmidt from comment #8)
> Please put some explanation in the commit message. It's always easier to
> find things in the commit history with better commit messages.

Right, done.

> How well did you test this, and with which mux formats? I'm not entirely
> sure it won't introduce actual gaps in the playback of the resulting pieces
> with at least some formats.

I tested with the command from comment #5. It reliably deadlocks without the patch, and works fine with it. I didn't notice problems with GAPs being recorded, but you're right that we should drop those dummy events. Updated the patch to do that.

> Why did you move the pad probe from the multiqueue sink pad to the ghostpad?

Added a comment explaining why. I have to send an event into the mq without triggering handle_mq_input().
(In reply to Xavier Claessens from comment #10)
> Added a comment explaining why. I have to send an event into the mq without
> triggering handle_mq_input().

I don't think that's needed - handle_mq_input() will just pass the custom event anyway (and it works here if I just revert those 2 lines). Moving the probe to the externally visible pad can cause problems if the app or other plugins install their own probes and interfere with the internal one.

There's a different stall I found which we should also address if pads don't all EOS in the same fragment - if pads are already EOS when the new fragment starts, the EOS isn't repeated into the muxer and things never finish up.
Created attachment 324412 [details] [review]
splitmuxsink: Fix occasional deadlock when ending file with subtitle

(Commit message identical to attachment 324383 above.)
(In reply to Jan Schmidt from comment #11)
> I don't think that's needed - handle_mq_input() will just pass the custom
> event anyway (and it works here if I just revert those 2 lines). Moving the
> probe to the externally visible pad can cause problems if the app or other
> plugins install their own probes and interfere with the internal one.

Hm, in some older versions of the patch that was causing deadlocks, but it does seem to work now. I didn't know it was safe to send an event on a pad from within a blocking probe on that same pad.

> There's a different stall I found which we should also address if pads don't
> all EOS in the same fragment - if pads are already EOS when the new fragment
> starts, the EOS isn't repeated into the muxer and things never finish up.

I think my patch fixes that case already. restart_context() resets ctx->out_eos to FALSE and the custom event will wake up that queue.

If you're happy with the patch, could you please push it? I don't have permission for that. Thanks.
(In reply to Xavier Claessens from comment #13)
> I think my patch fixes that case already. restart_context() resets
> ctx->out_eos to FALSE and the custom event will wake up that queue.

No - once there's EOS on a pad, there won't be any more data to trigger handle_mq_input() again. There's also a second problem - once EOS has passed on the mq sink pad, all further events that aren't either flush-start or stream-start are discarded by the pad.

I think when handling the unblock event on any mq srcpad, it needs to iterate the other contexts and check if their mq srcpad has GST_PAD_FLAG_EOS without ctx->out_eos - i.e., the pad was at EOS but hasn't sent an EOS in this fragment. If the iterator then sets ctx->out_eos as normal, we can still be sure that one and only one EOS is sent per pad.

> If you're happy with the patch, could you please push it? I don't have
> permission for that. Thanks.

Are you OK to work on that extra deadlock? A unit test would, of course, also be cool if you're up for that.
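[Editor's note: a sketch of the check being suggested; the context type and field names are illustrative.]

#include <gst/gst.h>

typedef struct {
  GstPad *q_srcpad;    /* hypothetical: the multiqueue source pad */
  gboolean out_eos;    /* EOS already sent downstream this fragment */
} StreamCtx;

/* While handling the unblock event on one srcpad, visit the other
 * contexts: any pad flagged EOS that hasn't yet sent EOS in this
 * fragment gets exactly one EOS pushed downstream. */
static void
resend_eos_for_fragment (GList * contexts)
{
  GList *l;

  for (l = contexts; l != NULL; l = l->next) {
    StreamCtx *ctx = l->data;

    if (GST_PAD_IS_EOS (ctx->q_srcpad) && !ctx->out_eos) {
      ctx->out_eos = TRUE;  /* one and only one EOS per pad */
      gst_pad_push_event (ctx->q_srcpad, gst_event_new_eos ());
    }
  }
}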
I don't think that's a problem: the EOS event is sticky, so restart_context() will resend it for every new file. I tested with this:

test.srt:

1
00:00:00,000 --> 00:00:10,000
Text

$ gst-launch-1.0 filesrc location=test.srt ! subparse ! kateenc category=SUB ! mux.subtitle_0 \
    videotestsrc is-live=true ! splitmuxsink location=foo%02d.mkv muxer=matroskamux name=mux \
    max-size-bytes=20971520

The subtitle stream sends EOS after 10s, and it continues producing files after that.
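[Editor's note: "sticky" means the EOS event stays stored on the pad and can be retrieved later; a minimal sketch of checking for it, with an assumed pad variable.]

#include <gst/gst.h>

/* A sticky EOS can be fetched back from the pad after the fact; this
 * is what lets a new fragment learn that a stream already ended. */
static gboolean
pad_already_ended (GstPad * pad)
{
  GstEvent *eos = gst_pad_get_sticky_event (pad, GST_EVENT_EOS, 0);

  if (eos == NULL)
    return FALSE;
  gst_event_unref (eos);
  return TRUE;
}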
Created attachment 324492 [details]
SRT test file

Perhaps I misunderstood the hang that I see with the attached SRT file - I didn't spend very long on it.
With my patch applied, I get no hang with your file.
I'll take another look at what's going on. For me, the test pipeline above succeeds rarely, but hangs most of the time with that longer test file.
Created attachment 325024 [details] [review]
splitmuxsink: Fix occasional deadlock when ending file with subtitle

(Commit message identical to attachment 324383 above.)
Created attachment 325025 [details] [review]
splitmux: Handle a hang draining out at EOS

Make sure that all data is drained out when the reference pad goes EOS. Fixes a problem where data that arrives on other pads after the reference pad finishes can stall forever and never pass EOS.
Finally figured out the hang that I was seeing at EOS. Between them, these 2 patches fix both things for me.
Makes sense to me, thanks.
My 2nd patch caused the regression in bug #765072, but reverting that part reintroduces a hang at EOS for me, so this bug needs revisiting. This pipeline hangs again:

gst-launch-1.0 filesrc location=~/media/srt/test-short.srt ! subparse ! kateenc category=SUB ! mux.subtitle_0 \
    videotestsrc num-buffers=10 ! splitmuxsink location=foo%02d.mkv muxer=matroskamux name=mux \
    max-size-bytes=1000000

using the srt file attached to this bug, or the one in comment #15.
I can't reproduce the hang now. Looks like it was fixed since.