Bug 755123 – tagdemux: protect internal state for concurrent access

After an evaluation, GNOME has moved from Bugzilla to GitLab. Learn more about GitLab.
No new issues can be reported in GNOME Bugzilla anymore.
To report an issue in a GNOME project, go to GNOME GitLab.
Do not go to GNOME Gitlab for: Bluefish, Doxygen, GnuCash, GStreamer, java-gnome, LDTP, NetworkManager, Tomboy.

Bug 755123 - tagdemux: protect internal state for concurrent access


Summary:	tagdemux: protect internal state for concurrent access


Status:	RESOLVED DUPLICATE of bug 775687

Product:	GStreamer
Classification:	Platform
Component:	gst-plugins-base
Version:	git master
Hardware:	Other Linux

Importance:	Normal critical
Target Milestone:	NONE
Assigned To:	GStreamer Maintainers
QA Contact:	GStreamer Maintainers

URL:
Whiteboard:

Depends on:
Blocks:	754375

Reported:	2015-09-16 16:19 UTC by GstBlub
Modified:	2016-12-06 18:44 UTC

See Also:
GNOME target:	---
GNOME version:	---

Attachments
Patch file (26.63 KB, patch) 2015-09-16 16:19 UTC, GstBlub	none	Details \| Review

Description GstBlub 2015-09-16 16:19:07 UTC

Created attachment 311484 [details] [review]
Patch file

The tagdemux doesn't appear to do any kind of locking.  This causes crashes if the pipeline is torn down just at the "right" time, especially while a typefind is in progress.  In my case it causes a crash in about 5-10% of the cases, and when that happens it appears to be very similar to what was reported in bug #700340.  It appear that bug #754375 would likely be solved with this patch as well.

Comment 1 GstBlub 2015-09-16 16:25:50 UTC

This is the typical back trace I got whenever I run into this condition (1.4):

Program received signal SIGSEGV, Segmentation fault.
[Switching to LWP 23999]
0xb75ad545 in g_type_check_value_holds () from /usr/lib/libgobject-2.0.so.0
(gdb) bt

+ Trace 235460

#0 g_type_check_value_holds
from /usr/lib/libgobject-2.0.so.0
#1 gst_tag_list_add_value_internal
at gsttaglist.c line 1030
#2 gst_tag_list_copy_foreach
at gsttaglist.c line 1101
#3 gst_structure_foreach
at gststructure.c line 1130
#4 gst_tag_list_insert
at gsttaglist.c line 1130
#5 gst_tag_list_merge
at gsttaglist.c line 1167
#6 gst_tag_demux_send_tag_event
at gsttagdemux.c line 1789
#7 gst_tag_demux_src_getrange
at gsttagdemux.c line 1681
#8 gst_pad_get_range_unchecked
at gstpad.c line 4263
#9 gst_pad_pull_range
at gstpad.c line 4502
#10 gst_base_parse_pull_range
at gstbaseparse.c line 2951
#11 gst_base_parse_scan_frame
at gstbaseparse.c line 3086
#12 gst_base_parse_loop
at gstbaseparse.c line 3194
#13 gst_task_func
at gsttask.c line 317
#14 start_thread
from /lib/libpthread.so.0
#15 clone
from /lib/libc.so.6

Comment 2 Tim-Philipp Müller 2015-09-16 18:15:48 UTC

Thanks for the patch, you might be onto something here, even if at first glance it looks like you might be doing a bit more than necessary. Haven't looked carefully yet though.

Comment 3 GstBlub 2015-09-16 18:57:17 UTC

(In reply to Tim-Philipp Müller from comment #2)
> Thanks for the patch, you might be onto something here, even if at first
> glance it looks like you might be doing a bit more than necessary. Haven't
> looked carefully yet though.
Yeah I'm honestly not sure if what I'm going is necessarily correct, but it reliably fixed these crashes.  I haven't been able to crash it anymore.  It's a little ugly, I agree, and maybe using a separate mutex would simplify things, but I do know for sure that a lot of the ->priv members are accessed/used by various threads and none of which is protected, resulting in this particular crash.  I wouldn't be shocked if it also caused some other memory corruption triggering other bugs down the road.

Comment 4 GstBlub 2015-09-16 20:39:08 UTC

I've done some more testing and while I haven't run into any tagdemux crashes anymore, I found another similar issue in decodebin:

1:11:49.314779015 18194 0xb20d9d40 WARN                 basesrc gstbasesrc.c:3460:gst_base_src_start_complete:<source> pad not activated yet
1:11:49.315069258 18194 0xb20d9d40 WARN                GST_PADS gstpad.c:1058:gst_pad_set_active:<qtdemux456:sink> Failed to activate pad
1:11:49.315317966 18194 0xb20d9d40 WARN               decodebin gstdecodebin2.c:2218:connect_pad:<decodebin294> Couldn't set qtdemux456 to PAUSED
GLib (gthread-posix.c): Unexpected error from C library during 'pthread_mutex_lock': Invalid argument.  Aborting.
[New LWP 24512]

Program received signal SIGABRT, Aborted.
[Switching to LWP 24512]
0xffffe424 in __kernel_vsyscall ()
(gdb) thread apply all bt

+ Trace 235461

Thread 11 (LWP 24512)

#0 __kernel_vsyscall
#1 raise
from /lib/libc.so.6
#2 abort
from /lib/libc.so.6
#3 ??
from /usr/lib/libglib-2.0.so.0
#4 connect_pad
at gstdecodebin2.c line 2222
#5 analyze_new_pad
at gstdecodebin2.c line 1742
#6 type_found
at gstdecodebin2.c line 2535
#7 ffi_call_SYSV
from /usr/lib/libffi.so.6
#8 ffi_call
from /usr/lib/libffi.so.6
#9 g_cclosure_marshal_generic
from /usr/lib/libgobject-2.0.so.0
#10 g_closure_invoke
from /usr/lib/libgobject-2.0.so.0
#11 ??
from /usr/lib/libgobject-2.0.so.0
#12 g_signal_emit_valist
from /usr/lib/libgobject-2.0.so.0
#13 g_signal_emit
from /usr/lib/libgobject-2.0.so.0
#14 gst_type_find_element_loop
at gsttypefindelement.c line 1091
#15 gst_task_func
at gsttask.c line 317
#16 start_thread
from /lib/libpthread.so.0
#17 clone
from /lib/libc.so.6

Comment 5 GstBlub 2015-09-16 21:03:56 UTC

And another weird one that somehow feels like it is related:

0:00:34.631547171 25594  0x81d4b00 DEBUG              decodebin gstdecodebin2.c:1205:gst_decode_bin_set_subs_encoding:<decodebin3> Setting new encoding: (NULL)
0:00:34.631695202 25594  0x81d4b00 LOG                decodebin gstdecodebin2.c:1207:gst_decode_bin_set_subs_encoding:<decodebin3> subtitle locking from thread 0x81d4b00
0:00:34.631815614 25594  0x81d4b00 LOG                decodebin gstdecodebin2.c:1207:gst_decode_bin_set_subs_encoding:<decodebin3> subtitle lock from thread 0x81d4b00
0:00:34.631936792 25594  0x81d4b00 LOG                decodebin gstdecodebin2.c:1216:gst_decode_bin_set_subs_encoding:<decodebin3> subtitle unlocking from thread 0x81d4b00
0:00:34.632353356 25594  0x81d4b00 LOG                decodebin gstdecodebin2.c:4516:gst_decode_bin_change_state:<decodebin3> expose locking from thread 0x81d4b00
0:00:34.632478906 25594  0x81d4b00 LOG                decodebin gstdecodebin2.c:4516:gst_decode_bin_change_state:<decodebin3> expose locked from thread 0x81d4b00
0:00:34.632599859 25594  0x81d4b00 LOG                decodebin gstdecodebin2.c:4521:gst_decode_bin_change_state:<decodebin3> expose unlocking from thread 0x81d4b00
0:00:34.632718111 25594  0x81d4b00 LOG                decodebin gstdecodebin2.c:4522:gst_decode_bin_change_state:<decodebin3> dynlocking from thread 0x81d4b00
0:00:34.632835199 25594  0x81d4b00 LOG                decodebin gstdecodebin2.c:4522:gst_decode_bin_change_state:<decodebin3> dynlocked from thread 0x81d4b00
0:00:34.632947862 25594  0x81d4b00 LOG                decodebin gstdecodebin2.c:4523:gst_decode_bin_change_state:<decodebin3> clearing shutdown flag
0:00:34.633063050 25594  0x81d4b00 LOG                decodebin gstdecodebin2.c:4525:gst_decode_bin_change_state:<decodebin3> dynunlocking from thread 0x81d4b00

(xxxxxxxxxxxx:25594): GStreamer-CRITICAL **: chain on pad typefind:sink but it was not in push mode
[New LWP 25674]

Program received signal SIGTRAP, Trace/breakpoint trap.
[Switching to LWP 25674]
0xb74d8de7 in ?? () from /usr/lib/libglib-2.0.so.0
(gdb) bt

+ Trace 235462

Thread 5 (LWP 25674)

#0 ??
from /usr/lib/libglib-2.0.so.0
#1 g_logv
from /usr/lib/libglib-2.0.so.0
#2 g_log
from /usr/lib/libglib-2.0.so.0
#3 gst_pad_chain_data_unchecked
at gstpad.c line 3882
#4 gst_pad_push_data
at gstpad.c line 4069
#5 gst_pad_push
at gstpad.c line 4180
#6 gst_proxy_pad_chain_default
at gstghostpad.c line 126
#7 gst_pad_chain_data_unchecked
at gstpad.c line 3836
#8 gst_pad_push_data
at gstpad.c line 4069
#9 gst_pad_push
at gstpad.c line 4180
#10 gst_base_src_loop
at gstbasesrc.c line 2835
#11 gst_task_func
at gsttask.c line 317
#12 start_thread
from /lib/libpthread.so.0
#13 clone
from /lib/libc.so.6
#0 __kernel_vsyscall
#1 pthread_cond_timedwait
from /lib/libpthread.so.0
#2 g_cond_wait_until
from /usr/lib/libglib-2.0.so.0
#3 ??
from /usr/lib/libglib-2.0.so.0
#4 ??
from /usr/lib/libglib-2.0.so.0
#5 ??
from /usr/lib/libglib-2.0.so.0
#6 start_thread
from /lib/libpthread.so.0
#7 clone
from /lib/libc.so.6

Comment 6 GstBlub 2015-09-16 22:21:13 UTC

(In reply to GstBlub from comment #5)
> (xxxxxxxxxxxx:25594): GStreamer-CRITICAL **: chain on pad typefind:sink but
> it was not in push mode

This one I now run into quite a lot, but only if I increase the logging output for decodebin...

Still haven't seen any crashes in the tagdemux, though.

Comment 7 Tim-Philipp Müller 2015-09-16 22:35:18 UTC

I think there is another bug for decodebin, let's stick to tagdemux for this one.

Comment 8 Sebastian Dröge (slomo) 2015-09-17 09:50:43 UTC

(In reply to GstBlub from comment #6)
> (In reply to GstBlub from comment #5)
> > (xxxxxxxxxxxx:25594): GStreamer-CRITICAL **: chain on pad typefind:sink but
> > it was not in push mode
> 
> This one I now run into quite a lot, but only if I increase the logging
> output for decodebin...

This one would seem like a race condition in typefind, probably while switching between pull and push mode.

Comment 9 GstBlub 2015-09-17 13:35:08 UTC

(In reply to Tim-Philipp Müller from comment #7)
> I think there is another bug for decodebin, let's stick to tagdemux for this
> one.

I think you're right.  The sad thing is, as soon as I bump up the log level to GST_DEBUG=decodebin:9 I run into this issue *a lot*, at a rate of probably 85% or more :-(

Comment 10 GstBlub 2015-09-17 13:44:18 UTC

(In reply to GstBlub from comment #9)
> (In reply to Tim-Philipp Müller from comment #7)
> > I think there is another bug for decodebin, let's stick to tagdemux for this
> > one.
> 
> I think you're right.  The sad thing is, as soon as I bump up the log level
> to GST_DEBUG=decodebin:9 I run into this issue *a lot*, at a rate of
> probably 85% or more :-(

The only reason why I bumped that log level was because I wanted to reproduce the crasher in comment #4, but that consistently led to the crasher in comment #5 and I haven't been able to reproduce the one in comment #4 with the bumped log level.  The one I originally wanted to reproduce (#4) happened only after several hours of me testing with my patch.

Comment 11 GstBlub 2015-09-18 22:52:18 UTC

(In reply to Tim-Philipp Müller from comment #7)
> I think there is another bug for decodebin, let's stick to tagdemux for this
> one.

I suspect you mean bug #690420?  I have a patch that appears to be fixing this, just waiting for approval before I can submit it.

Once I fixed this decodebin bug, I ran into yet another very similar crash in baseparse:

+ Trace 235469

Comment 12 GstBlub 2015-09-18 22:54:00 UTC

I omitted the assertion warning:

** (xxxxxxxxx:24472): CRITICAL **: gst_pb_utils_add_codec_description_to_tag_list: assertion 'caps != NULL' failed

Comment 13 Sebastian Dröge (slomo) 2015-09-19 16:02:11 UTC

Can you try to simplify the locking patch a bit, maybe by refactoring the locking in general, taking locks a bit longer, adding new locks? :) Also we usually just call functions blabla_locked() if they are expected to be called with a lock, and blabla() a wrapper around it that just does the locking. Instead of adding a "locked" parameter

Comment 14 GstBlub 2015-09-19 20:40:51 UTC

The reason I added a "locked" parameter is that the function in some instances unlocks the object without aquiring the lock again.  This avoids having to lock it again and then as soon as the function returns unlocking it.

Yes, it's a little ugly but I don't know if another mutex would help simplifying it much.  I am worried about all kinds of functions (like gst_pad_*, some gst_element_*) ending up calling back into the code causing mutexes that are already locked to be locked again.  Of course we could use a recursive mutex, but that comes with overhead, too.

FYI, I still haven't run into any crashes in tagdemux anymore.

Comment 15 Tim-Philipp Müller 2016-12-06 18:44:25 UTC

I strongly suspect this was fixed by

commit 7c1a32e28b9d145ee25e3d3141dee7821ba8a7e2
Author: Sebastian Dröge <sebastian@centricular.com>
Date:   Tue Dec 6 16:29:23 2016 +0200

    tagdemux: Fix crash when shutting down element during getrange()
    
    Ensure that nothing is in any of the streaming thread functions
    anymore when going from PAUSED to READY. While the parent's state change
    function has deactivated all pads, there is nothing preventing
    downstream from activating our srcpad again and calling the getrange()
    function. Although we're in READY!
    
    https://bugzilla.gnome.org/show_bug.cgi?id=775687

looks like the same trace at least.

If not please re-open, thanks!

*** This bug has been marked as a duplicate of bug 775687 ***