Bug 752651 – decodebin: segfault on setting to NULL

After an evaluation, GNOME has moved from Bugzilla to GitLab. Learn more about GitLab.
No new issues can be reported in GNOME Bugzilla anymore.
To report an issue in a GNOME project, go to GNOME GitLab.
Do not go to GNOME Gitlab for: Bluefish, Doxygen, GnuCash, GStreamer, java-gnome, LDTP, NetworkManager, Tomboy.

Bug 752651 - decodebin: segfault on setting to NULL


Summary:	decodebin: segfault on setting to NULL


Status:	RESOLVED FIXED

Product:	GStreamer
Classification:	Platform
Component:	gst-plugins-base
Version:	git master
Hardware:	Other All

Importance:	Normal major
Target Milestone:	1.7.1
Assigned To:	GStreamer Maintainers
QA Contact:	GStreamer Maintainers

URL:
Whiteboard:

Depends on:	752623
Blocks:

Reported:	2015-07-21 03:26 UTC by Thiago Sousa Santos
Modified:	2015-12-02 18:20 UTC

See Also:
GNOME target:	---
GNOME version:	---

Attachments
lock the expose lock around decode chain use (1.41 KB, patch) 2015-08-20 16:31 UTC, Vincent Penquerc'h	accepted-commit_now	Details \| Review
early out on pad-added if the pad is inactive (1.02 KB, patch) 2015-08-20 16:32 UTC, Vincent Penquerc'h	none	Details \| Review
early out on pad-added if the pad is inactive (1.32 KB, patch) 2015-08-21 12:13 UTC, Vincent Penquerc'h	none	Details \| Review
fix deadlock in discoverer (993 bytes, patch) 2015-08-26 13:44 UTC, Vincent Penquerc'h	accepted-commit_now	Details \| Review
early out on pad-added if the pad is inactive (1.34 KB, patch) 2015-10-21 11:20 UTC, Vincent Penquerc'h	none	Details \| Review
WIP trying to get rid of races (4.97 KB, patch) 2015-10-23 12:40 UTC, Vincent Penquerc'h	none	Details \| Review
WIP trying to get rid of races (7.04 KB, patch) 2015-10-23 14:57 UTC, Vincent Penquerc'h	none	Details \| Review
lock the expose lock around decode chain use (4.01 KB, patch) 2015-11-03 15:22 UTC, Vincent Penquerc'h	none	Details \| Review
lock the expose lock around decode chain use (4.01 KB, patch) 2015-12-01 14:57 UTC, Vincent Penquerc'h	committed	Details \| Review
early out on pad-added if the pad is inactive (1.35 KB, patch) 2015-12-01 14:58 UTC, Vincent Penquerc'h	committed	Details \| Review

Description Thiago Sousa Santos 2015-07-21 03:26:31 UTC

+++ This bug was initially created as a clone of Bug #752623 +++

With the sample application from the #752623 bug decodebin segfaults. The trace is as follows:

Program received signal SIGSEGV, Segmentation fault.

+ Trace 235290

Thread 140737050691328 (LWP 26464)

#0 connect_pad
at gstdecodebin2.c line 2047
#1 analyze_new_pad
at gstdecodebin2.c line 1770
#2 pad_added_cb
at gstdecodebin2.c line 2837
#3 g_cclosure_marshal_VOID__OBJECTv
at /tmp/buildd/glib2.0-2.44.1/./gobject/gmarshal.c line 2102
#4 _g_closure_invoke_va
at /tmp/buildd/glib2.0-2.44.1/./gobject/gclosure.c line 831
#5 g_signal_emit_valist
at /tmp/buildd/glib2.0-2.44.1/./gobject/gsignal.c line 3214
#6 g_signal_emit
at /tmp/buildd/glib2.0-2.44.1/./gobject/gsignal.c line 3361
#7 gst_element_add_pad
at gstelement.c line 694
#8 qtdemux_expose_streams
at qtdemux.c line 6476
#9 qtdemux_expose_streams
at qtdemux.c line 9695
#10 gst_qtdemux_loop_state_header
at qtdemux.c line 3480
#11 gst_qtdemux_loop
at qtdemux.c line 4780
#12 gst_task_func
at gsttask.c line 331
#13 g_thread_pool_thread_proxy
at /tmp/buildd/glib2.0-2.44.1/./glib/gthreadpool.c line 307
#14 g_thread_proxy
at /tmp/buildd/glib2.0-2.44.1/./glib/gthread.c line 764
#15 start_thread
at pthread_create.c line 309
#16 clone
at ../sysdeps/unix/sysv/linux/x86_64/clone.S line 111

Comment 1 Vincent Penquerc'h 2015-08-06 17:09:30 UTC

Not quite sure what to make of this.
In gst_decode_group_control_demuxer_pad, group->parent is NULL. This seems totally expected, as gst_decode_chain_new can be called with an explicitely NULL parent in type_found. However, CHAIN_MUTEX_LOCK does seem to protect an awful lot of stuff, so I'm not confident it's safe just to omit the lock when group->parent is NULL...

Comment 2 Vincent Penquerc'h 2015-08-20 11:16:38 UTC

Actually, the above turns out to be wrong, a group isn't a chain, and gst_decode_group_new is always called with a parent that must non non-NULL as it was always dereferenced before the call.

Comment 3 Vincent Penquerc'h 2015-08-20 12:27:01 UTC

Running the test case from the linked bug, I saw a couple other related crashes.

+ Trace 235358

#0 gst_decode_group_new
at gstdecodebin2.c line 3591
#1 gst_decode_chain_get_current_group
at gstdecodebin2.c line 3122
#2 analyze_new_pad
at gstdecodebin2.c line 1523
#3 pad_added_cb
at gstdecodebin2.c line 2835
#4 g_cclosure_marshal_VOID__OBJECTv
at gmarshal.c line 1312
#5 _g_closure_invoke_va
at gclosure.c line 831
#6 g_signal_emit_valist
at gsignal.c line 3215
#7 g_signal_emit
at gsignal.c line 3363
#8 gst_element_add_pad
at gstelement.c line 691
#9 gst_ogg_demux_activate_chain
at gstoggdemux.c line 2722
#10 gst_ogg_demux_perform_seek_pull
at gstoggdemux.c line 3358
#11 gst_ogg_demux_loop
at gstoggdemux.c line 4526
#12 gst_task_func
at gsttask.c line 331
#13 g_thread_pool_thread_proxy
at gthreadpool.c line 307
#14 g_thread_proxy
at gthread.c line 764
#15 start_thread
at pthread_create.c line 308
#16 clone
at ../sysdeps/unix/sysv/linux/x86_64/clone.S line 112
#0 __GI_raise
at ../nptl/sysdeps/unix/sysv/linux/raise.c line 64
#0 __GI_raise
at ../nptl/sysdeps/unix/sysv/linux/raise.c line 64
#1 __GI_abort
at abort.c line 91
#2 __assert_fail_base
at assert.c line 94
#3 __GI___assert_fail
at assert.c line 103
#4 __pthread_mutex_lock
at pthread_mutex_lock.c line 62
#5 g_mutex_lock
at gthread-posix.c line 209
#6 analyze_new_pad
at gstdecodebin2.c line 1865
#7 pad_added_cb
at gstdecodebin2.c line 2835
#8 g_cclosure_marshal_VOID__OBJECTv
at gmarshal.c line 1312
#9 _g_closure_invoke_va
at gclosure.c line 831
#10 g_signal_emit_valist
at gsignal.c line 3215
#11 g_signal_emit
at gsignal.c line 3363
#12 gst_element_add_pad
at gstelement.c line 691
#13 gst_ogg_demux_activate_chain
at gstoggdemux.c line 2722
#14 gst_ogg_demux_perform_seek_pull
at gstoggdemux.c line 3358
#15 gst_ogg_demux_loop
at gstoggdemux.c line 4526
#16 gst_task_func
at gsttask.c line 331
#17 g_thread_pool_thread_proxy
at gthreadpool.c line 307
#18 g_thread_proxy
at gthread.c line 764
#19 start_thread
at pthread_create.c line 308
#20 clone
at ../sysdeps/unix/sysv/linux/x86_64/clone.S line 112

Comment 4 Vincent Penquerc'h 2015-08-20 12:37:26 UTC

Trying to see if some previous locking patches were the cause, I can still get this crash with 2813e08210c04ae7de60e67af3ed81c5b65b1723. However, it's hard to get because we get deadlocks sooner most of the time.

(gdb) bt

+ Trace 235359

#0 __GI_raise
at ../nptl/sysdeps/unix/sysv/linux/raise.c line 64
#1 __GI_abort
at abort.c line 91
#2 g_thread_abort
at gthread-posix.c line 75
#3 g_mutex_lock
at gthread-posix.c line 210
#4 analyze_new_pad
at gstdecodebin2.c line 1859
#5 pad_added_cb
at gstdecodebin2.c line 2826
#6 g_cclosure_marshal_VOID__OBJECTv
at gmarshal.c line 1312
#7 _g_closure_invoke_va
at gclosure.c line 831
#8 g_signal_emit_valist
at gsignal.c line 3215
#9 g_signal_emit
at gsignal.c line 3363
#10 gst_element_add_pad
at gstelement.c line 691
#11 gst_ogg_demux_activate_chain
at gstoggdemux.c line 2722
#12 gst_ogg_demux_perform_seek_pull
at gstoggdemux.c line 3358
#13 gst_ogg_demux_loop
at gstoggdemux.c line 4526
#14 gst_task_func
at gsttask.c line 331
#15 g_thread_pool_thread_proxy
at gthreadpool.c line 307
#16 g_thread_proxy
at gthread.c line 764
#17 start_thread
at pthread_create.c line 308
#18 clone
at ../sysdeps/unix/sysv/linux/x86_64/clone.S line 112
#19 ??

Comment 5 Sebastian Dröge (slomo) 2015-08-20 13:03:54 UTC

Can you get a backtrace of all threads of the deadlock?

Comment 6 Vincent Penquerc'h 2015-08-20 13:09:13 UTC

I got these while testing a version of -base from before patches that fixed those deadlocks, to be clear. This was to see if I could still get the crash then, or whether the crash was introduced by those locking patches. So I'm not seeing any deadlock with current gstreamer.

Or did you want to see the stack traces anyway ?

Comment 7 Sebastian Dröge (slomo) 2015-08-20 13:29:53 UTC

No, I only care about problems with latest GIT master ;) So it only crashes now?

Comment 8 Vincent Penquerc'h 2015-08-20 13:30:40 UTC

Yes, just the crashes now.

Comment 9 Sebastian Dröge (slomo) 2015-08-20 13:42:55 UTC

Can you provide a backtrace with up-to-date GStreamer then? And check why it crashes in that line, e.g. if whatever there is NULL or some random pointer.

Comment 10 Vincent Penquerc'h 2015-08-20 13:46:44 UTC

The main backtrace is in the first comment from Thiago. It is up to date AFAIK. The parent pointer was NULL (not sure why, as none of the code seems to allow it to be NULL). Comment 3 has a couple other traces that seem related, which I put in for reference.

Comment 11 Sebastian Dröge (slomo) 2015-08-20 13:55:09 UTC

The line numbers don't match anymore, that's why I was asking for a new backtrace :)

Comment 12 Sebastian Dröge (slomo) 2015-08-20 13:56:16 UTC

Also gst_decode_group_new() handles parent==NULL just fine. For the very first demuxer in a stream (the most upstream one) that happens IIRC.

Comment 13 Vincent Penquerc'h 2015-08-20 14:03:29 UTC

I can't seem to be able to get the original crash, I always end up with one of the two ones I posted traces earlier today.

I do recall getting the crash when I first looked at it a few weeks ago, and group->parent was NULL.

The only calls I see to gst_decode_group_new are from gst_decode_chain_get_current_group, which calls them with the chain variable, which was always dereferenced earlier, here:

  /* Now we know that we can really return something useful */
  if (!chain->active_group) {

So I don't see how it can end up NULL in _group_new.

Comment 14 Sebastian Dröge (slomo) 2015-08-20 14:11:16 UTC

The line numbers in the two backtraces you pasted earlier today also don't map to anything useful in the latest gstdecodebin2.c :)

Comment 15 Vincent Penquerc'h 2015-08-20 14:44:38 UTC

Sorry, I had been using different trees indeed. Here's new traces, including a similar but different one.

+ Trace 235361

#0 __GI_raise
at ../nptl/sysdeps/unix/sysv/linux/raise.c line 64
#1 __GI_abort
at abort.c line 91
#2 g_thread_abort
at gthread-posix.c line 75
#3 g_mutex_lock
at gthread-posix.c line 210
#4 analyze_new_pad
at gstdecodebin2.c line 1867
#5 pad_added_cb
at gstdecodebin2.c line 2838
#6 g_cclosure_marshal_VOID__OBJECTv
at gmarshal.c line 1312
#7 _g_closure_invoke_va
at gclosure.c line 831
#8 g_signal_emit_valist
at gsignal.c line 3215
#9 g_signal_emit
at gsignal.c line 3363
#10 gst_element_add_pad
at gstelement.c line 694
#11 gst_ogg_demux_activate_chain
at gstoggdemux.c line 2923
#12 gst_ogg_demux_perform_seek_pull
at gstoggdemux.c line 3582
#13 gst_ogg_demux_loop
at gstoggdemux.c line 4757
#14 gst_task_func
at gsttask.c line 331
#15 g_thread_pool_thread_proxy
at gthreadpool.c line 307
#16 g_thread_proxy
at gthread.c line 764
#17 start_thread
at pthread_create.c line 308
#18 clone
at ../sysdeps/unix/sysv/linux/x86_64/clone.S line 112
#19 ??
#0 gst_decode_group_new
at gstdecodebin2.c line 3605
#1 gst_decode_chain_get_current_group
at gstdecodebin2.c line 3125
#2 analyze_new_pad
at gstdecodebin2.c line 1525
#3 pad_added_cb
at gstdecodebin2.c line 2838
#4 g_cclosure_marshal_VOID__OBJECTv
at gmarshal.c line 1312
#5 _g_closure_invoke_va
at gclosure.c line 831
#6 g_signal_emit_valist
at gsignal.c line 3215
#7 g_signal_emit
at gsignal.c line 3363
#8 gst_element_add_pad
at gstelement.c line 694
#9 gst_ogg_demux_activate_chain
at gstoggdemux.c line 2923
#10 gst_ogg_demux_perform_seek_pull
at gstoggdemux.c line 3582
#11 gst_ogg_demux_loop
at gstoggdemux.c line 4757
#12 gst_task_func
at gsttask.c line 331
#13 g_thread_pool_thread_proxy
at gthreadpool.c line 307
#14 g_thread_proxy
at gthread.c line 764
#15 start_thread
at pthread_create.c line 308
#16 clone
at ../sysdeps/unix/sysv/linux/x86_64/clone.S line 112
#17 ??
#0 __lll_lock_wait
at ../nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S line 132
#1 _L_lock_1006
from /lib/x86_64-linux-gnu/libpthread.so.0
#2 __pthread_mutex_lock
at pthread_mutex_lock.c line 101
#3 g_mutex_lock
at gthread-posix.c line 209
#4 analyze_new_pad
at gstdecodebin2.c line 1524
#5 pad_added_cb
at gstdecodebin2.c line 2838
#6 g_cclosure_marshal_VOID__OBJECTv
at gmarshal.c line 1312
#7 _g_closure_invoke_va
at gclosure.c line 831
#8 g_signal_emit_valist
at gsignal.c line 3215
#9 g_signal_emit
at gsignal.c line 3363
#10 gst_element_add_pad
at gstelement.c line 694
#11 gst_ogg_demux_activate_chain
at gstoggdemux.c line 2923
#12 gst_ogg_demux_perform_seek_pull
at gstoggdemux.c line 3582
#13 gst_ogg_demux_loop
at gstoggdemux.c line 4757
#14 gst_task_func
at gsttask.c line 331
#15 g_thread_pool_thread_proxy
at gthreadpool.c line 307
#16 g_thread_proxy
at gthread.c line 764
#17 start_thread
at pthread_create.c line 308
#18 clone
at ../sysdeps/unix/sysv/linux/x86_64/clone.S line 112
#19 ??
#0 g_mutex_get_impl
at gthread-posix.c line 120
#1 g_mutex_lock
at gthread-posix.c line 209
#2 gst_decode_group_control_demuxer_pad
at gstdecodebin2.c line 3678
#3 connect_pad
at gstdecodebin2.c line 2052
#4 analyze_new_pad
at gstdecodebin2.c line 1770
#5 pad_added_cb
at gstdecodebin2.c line 2838
#6 g_cclosure_marshal_VOID__OBJECTv
at gmarshal.c line 1312
#7 _g_closure_invoke_va
at gclosure.c line 831
#8 g_signal_emit_valist
at gsignal.c line 3215
#9 g_signal_emit
at gsignal.c line 3363
#10 gst_element_add_pad
at gstelement.c line 694
#11 gst_ogg_demux_activate_chain
at gstoggdemux.c line 2923
#12 gst_ogg_demux_perform_seek_pull
at gstoggdemux.c line 3582
#13 gst_ogg_demux_loop
at gstoggdemux.c line 4757
#14 gst_task_func
at gsttask.c line 331
#15 g_thread_pool_thread_proxy
at gthreadpool.c line 307
#16 g_thread_proxy
at gthread.c line 764
#17 start_thread
at pthread_create.c line 308
#18 clone
at ../sysdeps/unix/sysv/linux/x86_64/clone.S line 112
#19 ??

Comment 16 Vincent Penquerc'h 2015-08-20 14:45:43 UTC

Interestingly, while trying to repro the original trace, which I can't so far, I had a lone deadlock:

+ Trace 235362

Thread 2 (Thread 0x7fcd008c3700 (LWP 31266))

#0 __lll_lock_wait
at ../nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S line 132
#1 _L_lock_858
from /lib/x86_64-linux-gnu/libpthread.so.0
#2 __pthread_mutex_lock
at pthread_mutex_lock.c line 61
#3 g_mutex_lock
at gthread-posix.c line 209
#4 analyze_new_pad
at gstdecodebin2.c line 1867
#5 pad_added_cb
at gstdecodebin2.c line 2838
#6 g_cclosure_marshal_VOID__OBJECTv
at gmarshal.c line 1312
#7 _g_closure_invoke_va
at gclosure.c line 831
#8 g_signal_emit_valist
at gsignal.c line 3215
#9 g_signal_emit
at gsignal.c line 3363
#10 gst_element_add_pad
at gstelement.c line 694
#11 gst_ogg_demux_activate_chain
at gstoggdemux.c line 2923
#12 gst_ogg_demux_perform_seek_pull
at gstoggdemux.c line 3582
#13 gst_ogg_demux_loop
at gstoggdemux.c line 4757
#14 gst_task_func
at gsttask.c line 331
#15 g_thread_pool_thread_proxy
at gthreadpool.c line 307
#16 g_thread_proxy
at gthread.c line 764
#17 start_thread
at pthread_create.c line 308
#18 clone
at ../sysdeps/unix/sysv/linux/x86_64/clone.S line 112
#19 ??

Comment 17 Vincent Penquerc'h 2015-08-20 14:56:11 UTC

Another crash for completeness. All happening in the same vicinity and
pointing to some missing locking. Still can't get the original one.

+ Trace 235363

#0 g_list_prepend
at glist.c line 314
#1 analyze_new_pad
at gstdecodebin2.c line 1876
#2 pad_added_cb
at gstdecodebin2.c line 2838
#3 g_cclosure_marshal_VOID__OBJECTv
at gmarshal.c line 1312
#4 _g_closure_invoke_va
at gclosure.c line 831
#5 g_signal_emit_valist
at gsignal.c line 3215
#6 g_signal_emit
at gsignal.c line 3363
#7 gst_element_add_pad
at gstelement.c line 694
#8 gst_ogg_demux_activate_chain
at gstoggdemux.c line 2923
#9 gst_ogg_demux_perform_seek_pull
at gstoggdemux.c line 3582
#10 gst_ogg_demux_loop
at gstoggdemux.c line 4757
#11 gst_task_func
at gsttask.c line 331
#12 g_thread_pool_thread_proxy
at gthreadpool.c line 307
#13 g_thread_proxy
at gthread.c line 764
#14 start_thread
at pthread_create.c line 308
#15 clone
at ../sysdeps/unix/sysv/linux/x86_64/clone.S line 112
#16 ??

Comment 18 Vincent Penquerc'h 2015-08-20 16:31:53 UTC

Created attachment 309745 [details] [review]
lock the expose lock around decode chain use

Comment 19 Vincent Penquerc'h 2015-08-20 16:32:24 UTC

Created attachment 309746 [details] [review]
early out on pad-added if the pad is inactive

Comment 20 Vincent Penquerc'h 2015-08-20 16:33:45 UTC

These help. I can still get a crash, but I have to wait a while longer before I see it. I think there needs to be some more locking in the deactivated pad early out, as the deactivation might be happening in the window just after the check. And possibly other places also need locking :/

Comment 21 Vincent Penquerc'h 2015-08-20 16:48:24 UTC

The expose lock would be tempting to use here, as it is used in PAUSED->READY to protect the decode_chain destruction, but it'd need to bring in the whole of analyze_pad, which is pretty massive (and does use the expose lock itself for relevant bits).

Comment 22 Vincent Penquerc'h 2015-08-21 12:13:40 UTC

Created attachment 309811 [details] [review]
early out on pad-added if the pad is inactive

This seems to completely fix the problem. The test case has been running for six minutes now.

Comment 23 Sebastian Dröge (slomo) 2015-08-21 12:30:26 UTC

The changes seem to make sense, but with everything in decodebin they might also break other things. Please run gst-validate and see if it causes any regressions :)

Maybe more checks for now-inactive pads are needed inside decodebin? Also why are inactive pads actually causing problems?

Comment 24 Vincent Penquerc'h 2015-08-21 12:45:12 UTC

Ha ha, prudent course of action :D
There may be other places that could benefit from similar changes, but I'm scared to touch other places tbh.

AFAICT, the issue here is that another thread is switching state downwards while the streaming thread is advertising a newly found pad after a very recent upwards state change. The downwards state change causes the pad to become deactivated between the moment it was activated and pad-added called. and the moment where the pad-added callback is done autoplugging. So by taking the stream lock, we ensure the pad can't be deactivated till we're done, as well as catch the case where it got deactivated just before we got the stream lock.

After the pad gets deactivated, I expect all manners of free/deinit code to be run, which explains the odd "corrupted" like crashes. I did not try to take a ref to the pad for the duration of pad-added, but I wouldn't expect that to work, since there must be a ref at that point, held by the caller (or even the deactivated check would bomb).

Anyway, latest gst-validate is running now.

Comment 25 Vincent Penquerc'h 2015-08-21 14:28:55 UTC

There is indeed a regression, tests starting from circa 210 (transcoding) hang in gst-validate-media-check-1.0. I've just seen that a build without those patches above gets past those.

Comment 26 Vincent Penquerc'h 2015-08-26 13:44:19 UTC

Created attachment 310031 [details] [review]
fix deadlock in discoverer

This fixes that issue.
gst-validate-launcher has failures, but they also happen after a test run with the patches in the bug removed.

Comment 27 GstBlub 2015-09-19 12:35:29 UTC

I believe one of the root causes of this issue is that the GstDecodeChain is actually freed while the type_found signal handler is running.  See the patch I submitted in bug #755260

Comment 28 Sebastian Dröge (slomo) 2015-10-21 09:55:18 UTC

Vincent, GstBlub, what's the status here? Which patches are ready to be merged/reviewed?

Comment 29 Vincent Penquerc'h 2015-10-21 11:20:44 UTC

Created attachment 313803 [details] [review]
early out on pad-added if the pad is inactive

Comment 30 Vincent Penquerc'h 2015-10-21 11:21:49 UTC

I could still get a deadlock without the patches, with the patch from bug #755260. I'm running the test case now and it's going pretty well up to now. One of the patches did not apply anymore, I just updated it.

Comment 31 Sebastian Dröge (slomo) 2015-10-21 11:28:38 UTC

Comment on attachment 313803 [details] [review]
early out on pad-added if the pad is inactive

This looks racy, and why is it a problem? If the pad was deactivated already, later code should just fail gracefully and return

Comment 32 Sebastian Dröge (slomo) 2015-10-21 11:30:29 UTC

Ok, see my comments about that one patch... the others seem good to go already. Would you mind updating the one in bug #755260 and checking if it makes sense, and if it does also merge it? :)

Comment 33 Vincent Penquerc'h 2015-10-21 12:31:33 UTC

Pad deactivation can occur without the stream lock being taken, so it's indeed racy. The stream lock is briefly taken when stopping the task, but that doesn't help here.

I tried with the two OK'd patches here and the one from bug #755260, and I got a crash in analyze_pad, to which I'll attach the stack trace in a minute. The two bins are different. The chain used in the thread that crashes seems valid, and with a ref of 1, so it's unclear why pthread is barfing over this.

So almost there (it took some time for the test case to break), but not quite.

Comment 34 Vincent Penquerc'h 2015-10-21 12:32:39 UTC

+ Trace 235602

Thread 3 (Thread 0x7f221a75c700 (LWP 14560))

#0 gst_element_factory_get_type
at gstelementfactory.c line 87
#1 gst_element_factory_get_static_pad_templates
at gstelementfactory.c line 599
#2 gst_element_factory_list_filter
at gstelementfactory.c line 850
#3 gst_uri_decode_bin_autoplug_factories
at gsturidecodebin.c line 342
#4 ffi_call_unix64
from /usr/lib/x86_64-linux-gnu/libffi.so.6
#5 ffi_call
from /usr/lib/x86_64-linux-gnu/libffi.so.6
#6 g_cclosure_marshal_generic
at gclosure.c line 1448
#7 g_closure_invoke
at gclosure.c line 768
#8 signal_emit_unlocked_R
at gsignal.c line 3589
#9 g_signal_emit_valist
at gsignal.c line 3317
#10 g_signal_emit
at gsignal.c line 3363
#11 proxy_autoplug_factories_signal
at gsturidecodebin.c line 1722
#12 ffi_call_unix64
from /usr/lib/x86_64-linux-gnu/libffi.so.6
#13 ffi_call
from /usr/lib/x86_64-linux-gnu/libffi.so.6
#14 g_cclosure_marshal_generic
at gclosure.c line 1448
#15 g_closure_invoke
at gclosure.c line 768
#16 signal_emit_unlocked_R
at gsignal.c line 3551
#17 g_signal_emit_valist
at gsignal.c line 3317
#18 g_signal_emit
at gsignal.c line 3363
#19 analyze_new_pad
at gstdecodebin2.c line 1610
#20 pad_added_cb
at gstdecodebin2.c line 2906
#21 g_cclosure_marshal_VOID__OBJECTv
at gmarshal.c line 1312
#22 _g_closure_invoke_va
at gclosure.c line 831
#23 g_signal_emit_valist
at gsignal.c line 3215
#24 g_signal_emit
at gsignal.c line 3363
#25 gst_element_add_pad
at gstelement.c line 697
#26 gst_ogg_demux_activate_chain
at gstoggdemux.c line 2923
#27 gst_ogg_demux_perform_seek_pull
at gstoggdemux.c line 3582
#28 gst_ogg_demux_loop
at gstoggdemux.c line 4757
#29 gst_task_func
at gsttask.c line 331
#30 g_thread_pool_thread_proxy
at gthreadpool.c line 307
#31 g_thread_proxy
at gthread.c line 764
#32 start_thread
at pthread_create.c line 308
#33 clone
at ../sysdeps/unix/sysv/linux/x86_64/clone.S line 112
#34 ??

Thread 1 (Thread 0x7f2211321700 (LWP 14565))

#0 __GI_raise
at ../nptl/sysdeps/unix/sysv/linux/raise.c line 64
#1 __GI_abort
at abort.c line 91
#2 __assert_fail_base
#3 __GI___assert_fail
#4 __pthread_tpp_change_priority
at tpp.c line 61
#5 __pthread_mutex_lock_full
at pthread_mutex_lock.c line 415
#6 g_mutex_lock
at gthread-posix.c line 209
#7 analyze_new_pad
at gstdecodebin2.c line 1873
#8 pad_added_cb
at gstdecodebin2.c line 2906
#9 g_cclosure_marshal_VOID__OBJECTv
at gmarshal.c line 1312
#10 _g_closure_invoke_va
at gclosure.c line 831
#11 g_signal_emit_valist
at gsignal.c line 3215
#12 g_signal_emit
at gsignal.c line 3363
#13 gst_element_add_pad
at gstelement.c line 697
#14 gst_ogg_demux_activate_chain
at gstoggdemux.c line 2923
#15 gst_ogg_demux_perform_seek_pull
at gstoggdemux.c line 3582
#16 gst_ogg_demux_loop
at gstoggdemux.c line 4757
#17 gst_task_func
at gsttask.c line 331
#18 g_thread_pool_thread_proxy
at gthreadpool.c line 307
#19 g_thread_proxy
at gthread.c line 764
#20 start_thread
at pthread_create.c line 308
#21 clone
at ../sysdeps/unix/sysv/linux/x86_64/clone.S line 112
#22 ??

Comment 35 Sebastian Dröge (slomo) 2015-10-21 13:25:01 UTC

So it asserts because the chain mutex can't be taken because the chain is already gone? That sounds like the refcounting patch from the other bug can help here. You could connect to the signals with taking another ref on the chain, and having a destroy notify that unrefs the chain again. g_signal_connect_data().

Should probably be used for the other signals too.

Comment 36 Vincent Penquerc'h 2015-10-21 14:21:05 UTC

The chain isn't gone, it has a ref of 1.

print *chain
$1 = {parent = 0x0, dbin = 0xb89610, refs = 1, lock = {p = 0x7f2208017fd0, i = {134315984, 32546}}, pad = 0x0, drained = 0, 
  demuxer = 1, adaptive_demuxer = 0, seekable = 1, elements = 0x0, active_group = 0x0, next_groups = 0x0, pending_pads = 0x0, 
  current_pad = 0x0, endpad = 0x0, deadend = 0, deadend_details = 0x0, endcaps = 0x0, old_groups = 0x0}

Comment 37 Vincent Penquerc'h 2015-10-21 15:07:32 UTC

(In reply to Sebastian Dröge (slomo) from comment #35)
> So it asserts because the chain mutex can't be taken because the chain is
> already gone? That sounds like the refcounting patch from the other bug can
> help here.

For clarity: I am using that refcouting patch, and the chain is still alive when this happens. Adding logs shows that it is indeed not yet destroyed when this happens (and the logs cause the problem to manifest itself more often).

Comment 38 Vincent Penquerc'h 2015-10-21 15:32:04 UTC

(In reply to Sebastian Dröge (slomo) from comment #35)
> You could connect to the signals with taking another ref on the
> chain, and having a destroy notify that unrefs the chain again.
> g_signal_connect_data().

If I take an extra ref on the chain for all the 4 places where a signal is connected with the chain (I never release those refs), I still get crashes. This is consistent with the chain being not yet destroyed above, so this is not the problem. It looks like the mutex is somehow getting wrecked, but I don't see other calls to g_mutex_clear that could explain this.

Comment 39 Vincent Penquerc'h 2015-10-21 16:19:53 UTC

I'm not 100% sure this is related, but for completeness: I've been running with valgrind, and while I'm not seeing that crash so far, I've seen another one, twice now:

I get the message:

(uridecodebin-test:17558): GStreamer-WARNING **: 
Trying to join task 0xfc47120 from its thread would deadlock.
You cannot change the state of an element from its streaming
thread. Use g_idle_add() or post a GstMessage on the bus to
schedule the state change from the main thread.

from:

==17558== 
==17558== Process terminating with default action of signal 5 (SIGTRAP): dumping core
==17558==    at 0x53F2A91: _g_log_abort (gmessages.c:311)
==17558==    by 0x53F36E3: g_logv (gmessages.c:1033)
==17558==    by 0x53F3851: g_log (gmessages.c:1071)
==17558==    by 0x4ED7B60: gst_task_join (gsttask.c:847)
==17558==    by 0x4EA36BC: gst_pad_finalize (gstpad.c:730)
==17558==    by 0x5166902: g_object_unref (gobject.c:3112)
==17558==    by 0x4E8D08F: gst_element_remove_pad (gstelement.c:811)
==17558==    by 0x4E8D442: gst_element_dispose (gstelement.c:2934)
==17558==    by 0x516687C: g_object_unref (gobject.c:3075)
==17558==    by 0x4EA389D: do_stream_status.isra.0 (gstpad.c:5793)
==17558==    by 0x4ED6982: gst_task_func (gsttask.c:302)
==17558==    by 0x5410B67: g_thread_pool_thread_proxy (gthreadpool.c:307)
==17558==    by 0x5410304: g_thread_proxy (gthread.c:764)
==17558==    by 0x56DAE99: start_thread (pthread_create.c:308)
==17558== 

It looks like the pad in question was unreffed while do_stream_status was running, and holding the last ref, so unreffing it causes the pad to be destroyed, which causes its task to stop and join. I'm not familiar with this code, but it seems like it should not happen.

This might need a new bug for it if it turns out to be a different issue.

Comment 40 Vincent Penquerc'h 2015-10-23 12:40:58 UTC

Created attachment 313938 [details] [review]
WIP trying to get rid of races

This is what I have at the moment. I don't get any issues with mutexes, as the lifetime of the chain is now OK. I keep a dead boolean in the chain to avoid calling analyze_new_pad while it's being destroyed. This is still racy though, as the chain lock would need keeping over analyze_pad, and it can't.

I don't really get the comment about the ref on the signal connect. I don't think that'd help anyway, as the signal might have been just triggered at the time the chain starts getting destroyed (ie, the chain starts getting destroyed just after pad_added_cb released the chain mutex lock, but before analyze_new_pad is called.

I'm still seeing the "joining from same thread" errors though. Looks like oggdemux still has a ref of 1, while its bin is now destroyed.

Comment 41 Sebastian Dröge (slomo) 2015-10-23 13:20:36 UTC

(In reply to Vincent Penquerc'h from comment #40)

> I don't really get the comment about the ref on the signal connect. I don't
> think that'd help anyway, as the signal might have been just triggered at
> the time the chain starts getting destroyed (ie, the chain starts getting
> destroyed just after pad_added_cb released the chain mutex lock, but before
> analyze_new_pad is called.

My assumption was that destroying the chain would only happen then once the signal is actually disconnected (i.e. the element we connected to gets destroyed). If you use g_signal_connect_data() and pass as GDestroyNotify the unref function of the chain, then the last reference of the chain will never get to refcount 0 while the signal is still connected (or being emitted).

Comment 42 Vincent Penquerc'h 2015-10-23 13:30:46 UTC

Right. The signals are disconnected in the free function though (in the hide step). Which might mean we need to have three steps of chain deletion, but that seems like too complex.

Comment 43 Vincent Penquerc'h 2015-10-23 14:57:33 UTC

Created attachment 313944 [details] [review]
WIP trying to get rid of races

This should perform a similar check. A ref is added to the chain when connecting to pad-added, removed when the padd-added signal is removed, and a first "hide" step is performed before the actual free, so that the signal can't get a chain that's just started to be freed.

Comment 44 Vincent Penquerc'h 2015-11-03 15:22:00 UTC

Created attachment 314739 [details] [review]
lock the expose lock around decode chain use

The previous expose locking patches caused deadlocks in the unit tests, fixed in this version.

Comment 45 Vincent Penquerc'h 2015-12-01 14:57:22 UTC

Created attachment 316606 [details] [review]
lock the expose lock around decode chain use

latest version

Comment 46 Vincent Penquerc'h 2015-12-01 14:58:19 UTC

Created attachment 316607 [details] [review]
early out on pad-added if the pad is inactive

latest version

Comment 47 Vincent Penquerc'h 2015-12-01 15:02:08 UTC

I just tried again on the latest tree. With those patches above and from https://bugzilla.gnome.org/show_bug.cgi?id=755260, the test case from https://bugzilla.gnome.org/show_bug.cgi?id=752623 runs for minutes before I get bored and ^C it. I think there's still something iffy since I was getting rare-ish memory corruption like issues before, even with those patches in, but it's still a noticeable improvement. So I'm going to push those after a couple hours to allow for any NAK to those latest versions.

Comment 48 Vincent Penquerc'h 2015-12-01 17:43:29 UTC

Pushed, alongside the patch from https://bugzilla.gnome.org/show_bug.cgi?id=755260

commit 870c6df489238b0ad9ee988ca41e92babc9a8256
Author: Vincent Penquerc'h <vincent.penquerch@collabora.co.uk>
Date:   Thu Aug 20 17:30:38 2015 +0100

    decodebin: early out on pad-added when the pad is inactive
    
    The pad may be recently deactivated if the element is switched
    back down very quickly.
    
    https://bugzilla.gnome.org/show_bug.cgi?id=752651

commit ee44337fc3e3030a5155d28b3561af157e6c6003
Author: Vincent Penquerc'h <vincent.penquerch@collabora.co.uk>
Date:   Thu Aug 20 17:29:36 2015 +0100

    decodebin: lock the expose lock around decode_chain use
    
    Helps with a crash in decodebin when quickly switching states.
    
    https://bugzilla.gnome.org/show_bug.cgi?id=752651

Comment 49 Tim-Philipp Müller 2015-12-02 18:20:09 UTC

Follow-up:

commit 71505dfa24fc402bbe44b61851270cd618556f25
Author: Tim-Philipp Müller <tim@centricular.com>
Date:   Wed Dec 2 18:16:05 2015 +0000

    decodebin2: fix "Attempt to unlock mutex that was not locked"
    
    Introduced in commit ee44337f, caused the decodebin
    test_text_plain_streams unit test to abort.
    
    https://bugzilla.gnome.org/show_bug.cgi?id=752651