Bug 641928 – gst_pad_push fast path races with pad deactivation

After an evaluation, GNOME has moved from Bugzilla to GitLab. Learn more about GitLab.
No new issues can be reported in GNOME Bugzilla anymore.
To report an issue in a GNOME project, go to GNOME GitLab.
Do not go to GNOME Gitlab for: Bluefish, Doxygen, GnuCash, GStreamer, java-gnome, LDTP, NetworkManager, Tomboy.

Bug 641928 - gst_pad_push fast path races with pad deactivation


Summary:	gst_pad_push fast path races with pad deactivation


Status:	RESOLVED FIXED

Product:	GStreamer
Classification:	Platform
Component:	gstreamer (core)
Version:	git master
Hardware:	Other Linux

Importance:	Normal critical
Target Milestone:	0.10.33
Assigned To:	GStreamer Maintainers
QA Contact:	GStreamer Maintainers

URL:
Whiteboard:

Depends on:
Blocks:

Reported:	2011-02-09 13:36 UTC by René Stadler
Modified:	2011-02-15 11:47 UTC

See Also:
GNOME target:	---
GNOME version:	---

Description René Stadler 2011-02-09 13:36:03 UTC

It seems that the PadPushCache stuff has introduced a race condition with pad flushing/deactivation. Pad deactivation is implemented under the promise that once a (sink) pad is set to flushing and the stream lock was successfully acquired, the chain function will not be executed again until the pad is set to non-flushing again. This is at least what basetransform is assuming. It has the usual code pattern like:

activate_push(pad, active):
{
  if (active) {
    ...do startup init stuff...
  } else {
    /* Make sure that streaming has stopped: */
    PAD_STREAM_LOCK(pad);
    PAD_STREAM_UNLOCK(pad);

    ...do cleanup/deinit stuff here...
  }
}

The stream lock/unlock cycle is actually performed in gstpad.c as well (right after calling such a pad_activate function). With this in mind, consider the new fast path in gst_pad_push, also in pseudo-code here, with an additional comment added:

gst_pad_push(pad, buffer):
{
  cache = take_cache(pad);

  if (cache_invalid)
    goto slow_path;
  if (caps_changed)
    goto slow_path;

  /* This is racy, peer pad could become flushing right now! */

  PAD_STREAM_LOCK(peer);
  ret = PAD_CHAINFUNC(peer, buffer);
  PAD_STREAM_UNLOCK(peer);

  put_cache(pad, cache);
  return ret;

slow_path:
  ....
}

You can see that there is a (very rare) possible race such that during the line marked with the comment, the whole pad deactivation phase could be executed (including the lock/unlock cycle). The effect is that the chain function gets still called afterwards, which will obviously lead to random crashes.

Well it's either that, or I stared at the code for too long, since I couldn't make a conclusive test for this so far. I have however actual crash reports that pointed me to what appears to be this issue :(

I suppose an easy fix would be to take the stream lock earlier (before acquiring the cache), and release it when falling back to slow path (an extra unlock/lock pair, which probably could be avoided with some more refactoring later).

Comment 1 Wim Taymans 2011-02-09 14:30:48 UTC

> 
> You can see that there is a (very rare) possible race such that during the line
> marked with the comment, the whole pad deactivation phase could be executed
> (including the lock/unlock cycle). The effect is that the chain function gets
> still called afterwards, which will obviously lead to random crashes.

You are right.

<snip>

> 
> I suppose an easy fix would be to take the stream lock earlier (before
> acquiring the cache), and release it when falling back to slow path (an extra
> unlock/lock pair, which probably could be avoided with some more refactoring
> later).

That would not work because we need to take the stream_lock of the peer pad, which we can only get after getting the cache. Also the flushing flag is on the peer pad and with the object lock.

It seems that there is no other way than checking the flushing flag after getting the stream lock.. That would add one more object lock/unlock because the flags are currently not atomic.

Comment 2 Wim Taymans 2011-02-14 16:34:09 UTC

Can also be done without locks because the srcpad cache is set to invalid when the sinkpad is set to flushing.

commit 12d2d016634fa9fa83a6de6b341bdfc8d71f7fff
Author: Wim Taymans <wim.taymans@collabora.co.uk>
Date:   Mon Feb 14 17:31:25 2011 +0100

    pad: Check sinkpad for flushing
    
    Check the sinkpad for the flushing state before calling the chainfunction on the
    pad. We do this by checking the cache (which is also cleared on the srcpad when
    the sink is set to flushing).
    
    Fixes #641928

Comment 3 René Stadler 2011-02-14 19:32:54 UTC

That's clever.

But if I'm not mistaken, the whole thing still races with gst_pad_unlink, since the object lock is not taken anywhere on the push fast path and unlinking doesn't take the streaming lock :/

Comment 4 Wim Taymans 2011-02-15 09:56:54 UTC

(In reply to comment #3)
> That's clever.
> 
> But if I'm not mistaken, the whole thing still races with gst_pad_unlink, since
> the object lock is not taken anywhere on the push fast path and unlinking
> doesn't take the streaming lock :/

gst_pad_unlink() never waits (not before the cache either) for streaming to complete, you have to manually use pad blocks if you want to be sure. The pads have their refcount increased while streaming (srcpad because the caller has a ref, sinkpad because we ref the sinkpad) so things can never go away and crash. 

This is why the push cache needs to artificially keep a ref to the peer to make sure it doesn't disappear while streaming.

Comment 5 René Stadler 2011-02-15 11:47:01 UTC

Ah, yes indeed. That made me wonder if there is still a race with pad blocking, but apparently that is safe as well now. I guess I'm happy :)