GNOME Bugzilla – Bug 641928
gst_pad_push fast path races with pad deactivation
Last modified: 2011-02-15 11:47:01 UTC
It seems that the PadPushCache stuff has introduced a race condition with pad flushing/deactivation. Pad deactivation is implemented under the promise that once a (sink) pad is set to flushing and the stream lock was successfully acquired, the chain function will not be executed again until the pad is set to non-flushing again. This is at least what basetransform is assuming. It has the usual code pattern like: activate_push(pad, active): { if (active) { ...do startup init stuff... } else { /* Make sure that streaming has stopped: */ PAD_STREAM_LOCK(pad); PAD_STREAM_UNLOCK(pad); ...do cleanup/deinit stuff here... } } The stream lock/unlock cycle is actually performed in gstpad.c as well (right after calling such a pad_activate function). With this in mind, consider the new fast path in gst_pad_push, also in pseudo-code here, with an additional comment added: gst_pad_push(pad, buffer): { cache = take_cache(pad); if (cache_invalid) goto slow_path; if (caps_changed) goto slow_path; /* This is racy, peer pad could become flushing right now! */ PAD_STREAM_LOCK(peer); ret = PAD_CHAINFUNC(peer, buffer); PAD_STREAM_UNLOCK(peer); put_cache(pad, cache); return ret; slow_path: .... } You can see that there is a (very rare) possible race such that during the line marked with the comment, the whole pad deactivation phase could be executed (including the lock/unlock cycle). The effect is that the chain function gets still called afterwards, which will obviously lead to random crashes. Well it's either that, or I stared at the code for too long, since I couldn't make a conclusive test for this so far. I have however actual crash reports that pointed me to what appears to be this issue :( I suppose an easy fix would be to take the stream lock earlier (before acquiring the cache), and release it when falling back to slow path (an extra unlock/lock pair, which probably could be avoided with some more refactoring later).
> > You can see that there is a (very rare) possible race such that during the line > marked with the comment, the whole pad deactivation phase could be executed > (including the lock/unlock cycle). The effect is that the chain function gets > still called afterwards, which will obviously lead to random crashes. You are right. <snip> > > I suppose an easy fix would be to take the stream lock earlier (before > acquiring the cache), and release it when falling back to slow path (an extra > unlock/lock pair, which probably could be avoided with some more refactoring > later). That would not work because we need to take the stream_lock of the peer pad, which we can only get after getting the cache. Also the flushing flag is on the peer pad and with the object lock. It seems that there is no other way than checking the flushing flag after getting the stream lock.. That would add one more object lock/unlock because the flags are currently not atomic.
Can also be done without locks because the srcpad cache is set to invalid when the sinkpad is set to flushing. commit 12d2d016634fa9fa83a6de6b341bdfc8d71f7fff Author: Wim Taymans <wim.taymans@collabora.co.uk> Date: Mon Feb 14 17:31:25 2011 +0100 pad: Check sinkpad for flushing Check the sinkpad for the flushing state before calling the chainfunction on the pad. We do this by checking the cache (which is also cleared on the srcpad when the sink is set to flushing). Fixes #641928
That's clever. But if I'm not mistaken, the whole thing still races with gst_pad_unlink, since the object lock is not taken anywhere on the push fast path and unlinking doesn't take the streaming lock :/
(In reply to comment #3) > That's clever. > > But if I'm not mistaken, the whole thing still races with gst_pad_unlink, since > the object lock is not taken anywhere on the push fast path and unlinking > doesn't take the streaming lock :/ gst_pad_unlink() never waits (not before the cache either) for streaming to complete, you have to manually use pad blocks if you want to be sure. The pads have their refcount increased while streaming (srcpad because the caller has a ref, sinkpad because we ref the sinkpad) so things can never go away and crash. This is why the push cache needs to artificially keep a ref to the peer to make sure it doesn't disappear while streaming.
Ah, yes indeed. That made me wonder if there is still a race with pad blocking, but apparently that is safe as well now. I guess I'm happy :)