GNOME Bugzilla – Bug 752651
decodebin: segfault on setting to NULL
Last modified: 2015-12-02 18:20:09 UTC
+++ This bug was initially created as a clone of Bug #752623 +++ With the sample application from the #752623 bug decodebin segfaults. The trace is as follows: Program received signal SIGSEGV, Segmentation fault.
+ Trace 235290
Thread 140737050691328 (LWP 26464)
Not quite sure what to make of this. In gst_decode_group_control_demuxer_pad, group->parent is NULL. This seems totally expected, as gst_decode_chain_new can be called with an explicitely NULL parent in type_found. However, CHAIN_MUTEX_LOCK does seem to protect an awful lot of stuff, so I'm not confident it's safe just to omit the lock when group->parent is NULL...
Actually, the above turns out to be wrong, a group isn't a chain, and gst_decode_group_new is always called with a parent that must non non-NULL as it was always dereferenced before the call.
Running the test case from the linked bug, I saw a couple other related crashes.
+ Trace 235358
Trying to see if some previous locking patches were the cause, I can still get this crash with 2813e08210c04ae7de60e67af3ed81c5b65b1723. However, it's hard to get because we get deadlocks sooner most of the time. (gdb) bt
+ Trace 235359
Can you get a backtrace of all threads of the deadlock?
I got these while testing a version of -base from before patches that fixed those deadlocks, to be clear. This was to see if I could still get the crash then, or whether the crash was introduced by those locking patches. So I'm not seeing any deadlock with current gstreamer. Or did you want to see the stack traces anyway ?
No, I only care about problems with latest GIT master ;) So it only crashes now?
Yes, just the crashes now.
Can you provide a backtrace with up-to-date GStreamer then? And check why it crashes in that line, e.g. if whatever there is NULL or some random pointer.
The main backtrace is in the first comment from Thiago. It is up to date AFAIK. The parent pointer was NULL (not sure why, as none of the code seems to allow it to be NULL). Comment 3 has a couple other traces that seem related, which I put in for reference.
The line numbers don't match anymore, that's why I was asking for a new backtrace :)
Also gst_decode_group_new() handles parent==NULL just fine. For the very first demuxer in a stream (the most upstream one) that happens IIRC.
I can't seem to be able to get the original crash, I always end up with one of the two ones I posted traces earlier today. I do recall getting the crash when I first looked at it a few weeks ago, and group->parent was NULL. The only calls I see to gst_decode_group_new are from gst_decode_chain_get_current_group, which calls them with the chain variable, which was always dereferenced earlier, here: /* Now we know that we can really return something useful */ if (!chain->active_group) { So I don't see how it can end up NULL in _group_new.
The line numbers in the two backtraces you pasted earlier today also don't map to anything useful in the latest gstdecodebin2.c :)
Sorry, I had been using different trees indeed. Here's new traces, including a similar but different one.
+ Trace 235361
Interestingly, while trying to repro the original trace, which I can't so far, I had a lone deadlock:
+ Trace 235362
Thread 2 (Thread 0x7fcd008c3700 (LWP 31266))
Another crash for completeness. All happening in the same vicinity and pointing to some missing locking. Still can't get the original one.
+ Trace 235363
Created attachment 309745 [details] [review] lock the expose lock around decode chain use
Created attachment 309746 [details] [review] early out on pad-added if the pad is inactive
These help. I can still get a crash, but I have to wait a while longer before I see it. I think there needs to be some more locking in the deactivated pad early out, as the deactivation might be happening in the window just after the check. And possibly other places also need locking :/
The expose lock would be tempting to use here, as it is used in PAUSED->READY to protect the decode_chain destruction, but it'd need to bring in the whole of analyze_pad, which is pretty massive (and does use the expose lock itself for relevant bits).
Created attachment 309811 [details] [review] early out on pad-added if the pad is inactive This seems to completely fix the problem. The test case has been running for six minutes now.
The changes seem to make sense, but with everything in decodebin they might also break other things. Please run gst-validate and see if it causes any regressions :) Maybe more checks for now-inactive pads are needed inside decodebin? Also why are inactive pads actually causing problems?
Ha ha, prudent course of action :D There may be other places that could benefit from similar changes, but I'm scared to touch other places tbh. AFAICT, the issue here is that another thread is switching state downwards while the streaming thread is advertising a newly found pad after a very recent upwards state change. The downwards state change causes the pad to become deactivated between the moment it was activated and pad-added called. and the moment where the pad-added callback is done autoplugging. So by taking the stream lock, we ensure the pad can't be deactivated till we're done, as well as catch the case where it got deactivated just before we got the stream lock. After the pad gets deactivated, I expect all manners of free/deinit code to be run, which explains the odd "corrupted" like crashes. I did not try to take a ref to the pad for the duration of pad-added, but I wouldn't expect that to work, since there must be a ref at that point, held by the caller (or even the deactivated check would bomb). Anyway, latest gst-validate is running now.
There is indeed a regression, tests starting from circa 210 (transcoding) hang in gst-validate-media-check-1.0. I've just seen that a build without those patches above gets past those.
Created attachment 310031 [details] [review] fix deadlock in discoverer This fixes that issue. gst-validate-launcher has failures, but they also happen after a test run with the patches in the bug removed.
I believe one of the root causes of this issue is that the GstDecodeChain is actually freed while the type_found signal handler is running. See the patch I submitted in bug #755260
Vincent, GstBlub, what's the status here? Which patches are ready to be merged/reviewed?
Created attachment 313803 [details] [review] early out on pad-added if the pad is inactive
I could still get a deadlock without the patches, with the patch from bug #755260. I'm running the test case now and it's going pretty well up to now. One of the patches did not apply anymore, I just updated it.
Comment on attachment 313803 [details] [review] early out on pad-added if the pad is inactive This looks racy, and why is it a problem? If the pad was deactivated already, later code should just fail gracefully and return
Ok, see my comments about that one patch... the others seem good to go already. Would you mind updating the one in bug #755260 and checking if it makes sense, and if it does also merge it? :)
Pad deactivation can occur without the stream lock being taken, so it's indeed racy. The stream lock is briefly taken when stopping the task, but that doesn't help here. I tried with the two OK'd patches here and the one from bug #755260, and I got a crash in analyze_pad, to which I'll attach the stack trace in a minute. The two bins are different. The chain used in the thread that crashes seems valid, and with a ref of 1, so it's unclear why pthread is barfing over this. So almost there (it took some time for the test case to break), but not quite.
+ Trace 235602
Thread 3 (Thread 0x7f221a75c700 (LWP 14560))
Thread 1 (Thread 0x7f2211321700 (LWP 14565))
So it asserts because the chain mutex can't be taken because the chain is already gone? That sounds like the refcounting patch from the other bug can help here. You could connect to the signals with taking another ref on the chain, and having a destroy notify that unrefs the chain again. g_signal_connect_data(). Should probably be used for the other signals too.
The chain isn't gone, it has a ref of 1. print *chain $1 = {parent = 0x0, dbin = 0xb89610, refs = 1, lock = {p = 0x7f2208017fd0, i = {134315984, 32546}}, pad = 0x0, drained = 0, demuxer = 1, adaptive_demuxer = 0, seekable = 1, elements = 0x0, active_group = 0x0, next_groups = 0x0, pending_pads = 0x0, current_pad = 0x0, endpad = 0x0, deadend = 0, deadend_details = 0x0, endcaps = 0x0, old_groups = 0x0}
(In reply to Sebastian Dröge (slomo) from comment #35) > So it asserts because the chain mutex can't be taken because the chain is > already gone? That sounds like the refcounting patch from the other bug can > help here. For clarity: I am using that refcouting patch, and the chain is still alive when this happens. Adding logs shows that it is indeed not yet destroyed when this happens (and the logs cause the problem to manifest itself more often).
(In reply to Sebastian Dröge (slomo) from comment #35) > You could connect to the signals with taking another ref on the > chain, and having a destroy notify that unrefs the chain again. > g_signal_connect_data(). If I take an extra ref on the chain for all the 4 places where a signal is connected with the chain (I never release those refs), I still get crashes. This is consistent with the chain being not yet destroyed above, so this is not the problem. It looks like the mutex is somehow getting wrecked, but I don't see other calls to g_mutex_clear that could explain this.
I'm not 100% sure this is related, but for completeness: I've been running with valgrind, and while I'm not seeing that crash so far, I've seen another one, twice now: I get the message: (uridecodebin-test:17558): GStreamer-WARNING **: Trying to join task 0xfc47120 from its thread would deadlock. You cannot change the state of an element from its streaming thread. Use g_idle_add() or post a GstMessage on the bus to schedule the state change from the main thread. from: ==17558== ==17558== Process terminating with default action of signal 5 (SIGTRAP): dumping core ==17558== at 0x53F2A91: _g_log_abort (gmessages.c:311) ==17558== by 0x53F36E3: g_logv (gmessages.c:1033) ==17558== by 0x53F3851: g_log (gmessages.c:1071) ==17558== by 0x4ED7B60: gst_task_join (gsttask.c:847) ==17558== by 0x4EA36BC: gst_pad_finalize (gstpad.c:730) ==17558== by 0x5166902: g_object_unref (gobject.c:3112) ==17558== by 0x4E8D08F: gst_element_remove_pad (gstelement.c:811) ==17558== by 0x4E8D442: gst_element_dispose (gstelement.c:2934) ==17558== by 0x516687C: g_object_unref (gobject.c:3075) ==17558== by 0x4EA389D: do_stream_status.isra.0 (gstpad.c:5793) ==17558== by 0x4ED6982: gst_task_func (gsttask.c:302) ==17558== by 0x5410B67: g_thread_pool_thread_proxy (gthreadpool.c:307) ==17558== by 0x5410304: g_thread_proxy (gthread.c:764) ==17558== by 0x56DAE99: start_thread (pthread_create.c:308) ==17558== It looks like the pad in question was unreffed while do_stream_status was running, and holding the last ref, so unreffing it causes the pad to be destroyed, which causes its task to stop and join. I'm not familiar with this code, but it seems like it should not happen. This might need a new bug for it if it turns out to be a different issue.
Created attachment 313938 [details] [review] WIP trying to get rid of races This is what I have at the moment. I don't get any issues with mutexes, as the lifetime of the chain is now OK. I keep a dead boolean in the chain to avoid calling analyze_new_pad while it's being destroyed. This is still racy though, as the chain lock would need keeping over analyze_pad, and it can't. I don't really get the comment about the ref on the signal connect. I don't think that'd help anyway, as the signal might have been just triggered at the time the chain starts getting destroyed (ie, the chain starts getting destroyed just after pad_added_cb released the chain mutex lock, but before analyze_new_pad is called. I'm still seeing the "joining from same thread" errors though. Looks like oggdemux still has a ref of 1, while its bin is now destroyed.
(In reply to Vincent Penquerc'h from comment #40) > I don't really get the comment about the ref on the signal connect. I don't > think that'd help anyway, as the signal might have been just triggered at > the time the chain starts getting destroyed (ie, the chain starts getting > destroyed just after pad_added_cb released the chain mutex lock, but before > analyze_new_pad is called. My assumption was that destroying the chain would only happen then once the signal is actually disconnected (i.e. the element we connected to gets destroyed). If you use g_signal_connect_data() and pass as GDestroyNotify the unref function of the chain, then the last reference of the chain will never get to refcount 0 while the signal is still connected (or being emitted).
Right. The signals are disconnected in the free function though (in the hide step). Which might mean we need to have three steps of chain deletion, but that seems like too complex.
Created attachment 313944 [details] [review] WIP trying to get rid of races This should perform a similar check. A ref is added to the chain when connecting to pad-added, removed when the padd-added signal is removed, and a first "hide" step is performed before the actual free, so that the signal can't get a chain that's just started to be freed.
Created attachment 314739 [details] [review] lock the expose lock around decode chain use The previous expose locking patches caused deadlocks in the unit tests, fixed in this version.
Created attachment 316606 [details] [review] lock the expose lock around decode chain use latest version
Created attachment 316607 [details] [review] early out on pad-added if the pad is inactive latest version
I just tried again on the latest tree. With those patches above and from https://bugzilla.gnome.org/show_bug.cgi?id=755260, the test case from https://bugzilla.gnome.org/show_bug.cgi?id=752623 runs for minutes before I get bored and ^C it. I think there's still something iffy since I was getting rare-ish memory corruption like issues before, even with those patches in, but it's still a noticeable improvement. So I'm going to push those after a couple hours to allow for any NAK to those latest versions.
Pushed, alongside the patch from https://bugzilla.gnome.org/show_bug.cgi?id=755260 commit 870c6df489238b0ad9ee988ca41e92babc9a8256 Author: Vincent Penquerc'h <vincent.penquerch@collabora.co.uk> Date: Thu Aug 20 17:30:38 2015 +0100 decodebin: early out on pad-added when the pad is inactive The pad may be recently deactivated if the element is switched back down very quickly. https://bugzilla.gnome.org/show_bug.cgi?id=752651 commit ee44337fc3e3030a5155d28b3561af157e6c6003 Author: Vincent Penquerc'h <vincent.penquerch@collabora.co.uk> Date: Thu Aug 20 17:29:36 2015 +0100 decodebin: lock the expose lock around decode_chain use Helps with a crash in decodebin when quickly switching states. https://bugzilla.gnome.org/show_bug.cgi?id=752651
Follow-up: commit 71505dfa24fc402bbe44b61851270cd618556f25 Author: Tim-Philipp Müller <tim@centricular.com> Date: Wed Dec 2 18:16:05 2015 +0000 decodebin2: fix "Attempt to unlock mutex that was not locked" Introduced in commit ee44337f, caused the decodebin test_text_plain_streams unit test to abort. https://bugzilla.gnome.org/show_bug.cgi?id=752651