GNOME Bugzilla – Bug 755123
tagdemux: protect internal state for concurrent access
Last modified: 2016-12-06 18:44:25 UTC
Created attachment 311484 [details] [review] Patch file The tagdemux doesn't appear to do any kind of locking. This causes crashes if the pipeline is torn down just at the "right" time, especially while a typefind is in progress. In my case it causes a crash in about 5-10% of the cases, and when that happens it appears to be very similar to what was reported in bug #700340. It appear that bug #754375 would likely be solved with this patch as well.
This is the typical back trace I got whenever I run into this condition (1.4): Program received signal SIGSEGV, Segmentation fault. [Switching to LWP 23999] 0xb75ad545 in g_type_check_value_holds () from /usr/lib/libgobject-2.0.so.0 (gdb) bt
+ Trace 235460
Thanks for the patch, you might be onto something here, even if at first glance it looks like you might be doing a bit more than necessary. Haven't looked carefully yet though.
(In reply to Tim-Philipp Müller from comment #2) > Thanks for the patch, you might be onto something here, even if at first > glance it looks like you might be doing a bit more than necessary. Haven't > looked carefully yet though. Yeah I'm honestly not sure if what I'm going is necessarily correct, but it reliably fixed these crashes. I haven't been able to crash it anymore. It's a little ugly, I agree, and maybe using a separate mutex would simplify things, but I do know for sure that a lot of the ->priv members are accessed/used by various threads and none of which is protected, resulting in this particular crash. I wouldn't be shocked if it also caused some other memory corruption triggering other bugs down the road.
I've done some more testing and while I haven't run into any tagdemux crashes anymore, I found another similar issue in decodebin: 1:11:49.314779015 18194 0xb20d9d40 WARN basesrc gstbasesrc.c:3460:gst_base_src_start_complete:<source> pad not activated yet 1:11:49.315069258 18194 0xb20d9d40 WARN GST_PADS gstpad.c:1058:gst_pad_set_active:<qtdemux456:sink> Failed to activate pad 1:11:49.315317966 18194 0xb20d9d40 WARN decodebin gstdecodebin2.c:2218:connect_pad:<decodebin294> Couldn't set qtdemux456 to PAUSED GLib (gthread-posix.c): Unexpected error from C library during 'pthread_mutex_lock': Invalid argument. Aborting. [New LWP 24512] Program received signal SIGABRT, Aborted. [Switching to LWP 24512] 0xffffe424 in __kernel_vsyscall () (gdb) thread apply all bt
+ Trace 235461
Thread 11 (LWP 24512)
And another weird one that somehow feels like it is related: 0:00:34.631547171 25594 0x81d4b00 DEBUG decodebin gstdecodebin2.c:1205:gst_decode_bin_set_subs_encoding:<decodebin3> Setting new encoding: (NULL) 0:00:34.631695202 25594 0x81d4b00 LOG decodebin gstdecodebin2.c:1207:gst_decode_bin_set_subs_encoding:<decodebin3> subtitle locking from thread 0x81d4b00 0:00:34.631815614 25594 0x81d4b00 LOG decodebin gstdecodebin2.c:1207:gst_decode_bin_set_subs_encoding:<decodebin3> subtitle lock from thread 0x81d4b00 0:00:34.631936792 25594 0x81d4b00 LOG decodebin gstdecodebin2.c:1216:gst_decode_bin_set_subs_encoding:<decodebin3> subtitle unlocking from thread 0x81d4b00 0:00:34.632353356 25594 0x81d4b00 LOG decodebin gstdecodebin2.c:4516:gst_decode_bin_change_state:<decodebin3> expose locking from thread 0x81d4b00 0:00:34.632478906 25594 0x81d4b00 LOG decodebin gstdecodebin2.c:4516:gst_decode_bin_change_state:<decodebin3> expose locked from thread 0x81d4b00 0:00:34.632599859 25594 0x81d4b00 LOG decodebin gstdecodebin2.c:4521:gst_decode_bin_change_state:<decodebin3> expose unlocking from thread 0x81d4b00 0:00:34.632718111 25594 0x81d4b00 LOG decodebin gstdecodebin2.c:4522:gst_decode_bin_change_state:<decodebin3> dynlocking from thread 0x81d4b00 0:00:34.632835199 25594 0x81d4b00 LOG decodebin gstdecodebin2.c:4522:gst_decode_bin_change_state:<decodebin3> dynlocked from thread 0x81d4b00 0:00:34.632947862 25594 0x81d4b00 LOG decodebin gstdecodebin2.c:4523:gst_decode_bin_change_state:<decodebin3> clearing shutdown flag 0:00:34.633063050 25594 0x81d4b00 LOG decodebin gstdecodebin2.c:4525:gst_decode_bin_change_state:<decodebin3> dynunlocking from thread 0x81d4b00 (xxxxxxxxxxxx:25594): GStreamer-CRITICAL **: chain on pad typefind:sink but it was not in push mode [New LWP 25674] Program received signal SIGTRAP, Trace/breakpoint trap. [Switching to LWP 25674] 0xb74d8de7 in ?? () from /usr/lib/libglib-2.0.so.0 (gdb) bt
+ Trace 235462
Thread 5 (LWP 25674)
(In reply to GstBlub from comment #5) > (xxxxxxxxxxxx:25594): GStreamer-CRITICAL **: chain on pad typefind:sink but > it was not in push mode This one I now run into quite a lot, but only if I increase the logging output for decodebin... Still haven't seen any crashes in the tagdemux, though.
I think there is another bug for decodebin, let's stick to tagdemux for this one.
(In reply to GstBlub from comment #6) > (In reply to GstBlub from comment #5) > > (xxxxxxxxxxxx:25594): GStreamer-CRITICAL **: chain on pad typefind:sink but > > it was not in push mode > > This one I now run into quite a lot, but only if I increase the logging > output for decodebin... This one would seem like a race condition in typefind, probably while switching between pull and push mode.
(In reply to Tim-Philipp Müller from comment #7) > I think there is another bug for decodebin, let's stick to tagdemux for this > one. I think you're right. The sad thing is, as soon as I bump up the log level to GST_DEBUG=decodebin:9 I run into this issue *a lot*, at a rate of probably 85% or more :-(
(In reply to GstBlub from comment #9) > (In reply to Tim-Philipp Müller from comment #7) > > I think there is another bug for decodebin, let's stick to tagdemux for this > > one. > > I think you're right. The sad thing is, as soon as I bump up the log level > to GST_DEBUG=decodebin:9 I run into this issue *a lot*, at a rate of > probably 85% or more :-( The only reason why I bumped that log level was because I wanted to reproduce the crasher in comment #4, but that consistently led to the crasher in comment #5 and I haven't been able to reproduce the one in comment #4 with the bumped log level. The one I originally wanted to reproduce (#4) happened only after several hours of me testing with my patch.
(In reply to Tim-Philipp Müller from comment #7) > I think there is another bug for decodebin, let's stick to tagdemux for this > one. I suspect you mean bug #690420? I have a patch that appears to be fixing this, just waiting for approval before I can submit it. Once I fixed this decodebin bug, I ran into yet another very similar crash in baseparse:
+ Trace 235469
I omitted the assertion warning: ** (xxxxxxxxx:24472): CRITICAL **: gst_pb_utils_add_codec_description_to_tag_list: assertion 'caps != NULL' failed
Can you try to simplify the locking patch a bit, maybe by refactoring the locking in general, taking locks a bit longer, adding new locks? :) Also we usually just call functions blabla_locked() if they are expected to be called with a lock, and blabla() a wrapper around it that just does the locking. Instead of adding a "locked" parameter
The reason I added a "locked" parameter is that the function in some instances unlocks the object without aquiring the lock again. This avoids having to lock it again and then as soon as the function returns unlocking it. Yes, it's a little ugly but I don't know if another mutex would help simplifying it much. I am worried about all kinds of functions (like gst_pad_*, some gst_element_*) ending up calling back into the code causing mutexes that are already locked to be locked again. Of course we could use a recursive mutex, but that comes with overhead, too. FYI, I still haven't run into any crashes in tagdemux anymore.
I strongly suspect this was fixed by commit 7c1a32e28b9d145ee25e3d3141dee7821ba8a7e2 Author: Sebastian Dröge <sebastian@centricular.com> Date: Tue Dec 6 16:29:23 2016 +0200 tagdemux: Fix crash when shutting down element during getrange() Ensure that nothing is in any of the streaming thread functions anymore when going from PAUSED to READY. While the parent's state change function has deactivated all pads, there is nothing preventing downstream from activating our srcpad again and calling the getrange() function. Although we're in READY! https://bugzilla.gnome.org/show_bug.cgi?id=775687 looks like the same trace at least. If not please re-open, thanks! *** This bug has been marked as a duplicate of bug 775687 ***