GNOME Bugzilla – Bug 503592
gstpad.c does many ref/unref of peer pad in dataflow
Last modified: 2010-12-05 18:17:31 UTC
In gst_pad_push, gst_pad_check_pull_range and gst_pad_pull_range one can find code like the snippet below. This costs a ref/unref pair of atomic ops for every piece of dataflow. To give an idea of the volume: I added the GST_WARNING to count them, and for a 2 min avi I got 91949 ref/unrefs.

  GST_OBJECT_LOCK (pad);
  if (G_UNLIKELY ((peer = GST_PAD_PEER (pad)) == NULL))
    goto not_linked;
  gst_object_ref (peer);
  GST_WARNING ("!ref peer pad!");
  GST_OBJECT_UNLOCK (pad);

  ret = gst_pad_chain_unchecked (peer, buffer);

  gst_object_unref (peer);
Created attachment 100950 [details] [review] convert refs into lazy refs
Some extra explanation: GstPad has a ->peer field, which is accessed via

  #define GST_PAD_PEER(pad) (GST_PAD_CAST(pad)->peer)

We cannot simply turn that macro into one that takes a reference.
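For context, a short sketch contrasting the two access patterns (standard GStreamer API usage, not part of any patch here): the macro only yields a borrowed pointer that is valid while the object lock is held, whereas gst_pad_get_peer() returns a new reference; a reffing macro would leave no place for the matching unref.

  #include <gst/gst.h>

  static void
  peer_access_patterns (GstPad * pad)
  {
    GstPad *peer;

    /* GST_PAD_PEER() just reads the field: only valid with the object
     * lock held, and the peer may be unreffed once the lock is dropped */
    GST_OBJECT_LOCK (pad);
    peer = GST_PAD_PEER (pad);
    if (peer)
      gst_object_ref (peer);    /* must ref before unlocking */
    GST_OBJECT_UNLOCK (pad);
    if (peer)
      gst_object_unref (peer);

    /* gst_pad_get_peer() wraps exactly that pattern: NULL or a new ref */
    peer = gst_pad_get_peer (pad);
    if (peer)
      gst_object_unref (peer);
  }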
Created attachment 100951 [details] [review] convert refs into lazy refs
Not very practical: the unref happens at an undefined time and can keep the sinkpad lingering for as long as the srcpad is alive. Just as a brain dump:

  _pad_push (pad, buffer)
  {
    LOCK (pad);
    peer = pad->peer;
    if (!peer)
      goto no_peer;

    /* refresh the cached chain function when it went stale */
    chain_func = pad->peer_chain;
    if (chain_func != peer->chain) {
      chain_func = peer->chain;
      pad->peer_chain = chain_func;
    }

    GST_STREAM_LOCK (peer);
    UNLOCK (pad);

    res = chain_func (peer, buffer);

    GST_STREAM_UNLOCK (peer);
    return res;

  no_peer:
    {
      UNLOCK (pad);
      return NOT_LINKED;
    }
  }

This example uses a nested lock: the pad cannot get unlinked while the pad lock is taken, and we can be sure the pad is not unreffed while we hold the stream lock. Not sure this is correct...
What's up with this? Do we still care about this?
Created attachment 169667 [details] [review] use nested lock to avoid reffing the peer pad

Implements the suggestion above.
$ time gst-launch-0.10 -q fakesrc sizetype=empty silent=true num-buffers=1000000 ! fakesink silent=true

before:
  real 0m3.895s  user 0m3.876s  sys 0m0.012s
  real 0m3.674s  user 0m3.652s  sys 0m0.024s

after:
  real 0m3.441s  user 0m3.384s  sys 0m0.012s
  real 0m3.583s  user 0m3.480s  sys 0m0.008s
Created attachment 169738 [details] [review] use nested lock to avoid reffing the peer pad

Proper patch; measurements rechecked and confirmed.
Created attachment 169752 [details] [review] use nested lock to avoid reffing the peer pad
Looks good in general, but please document that chain_data_unchecked must be called with the stream lock held. And taking the stream lock instead of getting a new reference only works because in post_activate() the stream lock is taken when the pad is deactivated, and the pad can't go away before it is deactivated?
The nested lock can cause a deadlock, because now the object lock can be taken before the stream lock.
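For illustration, a minimal standalone program (hypothetical; plain GMutexes standing in for the pad's object lock and stream lock, not code from the patch) that reproduces this kind of lock-order inversion; run it and it almost always hangs:

  #include <glib.h>

  static GMutex object_lock;    /* stands in for GST_OBJECT_LOCK (pad) */
  static GMutex stream_lock;    /* stands in for the peer's stream lock */

  /* patched push path: object lock first, then stream lock */
  static gpointer
  streaming_thread (gpointer data)
  {
    g_mutex_lock (&object_lock);
    g_usleep (1000);            /* widen the race window */
    g_mutex_lock (&stream_lock);
    g_mutex_unlock (&stream_lock);
    g_mutex_unlock (&object_lock);
    return NULL;
  }

  /* e.g. pad deactivation: stream lock first, then object lock */
  static gpointer
  app_thread (gpointer data)
  {
    g_mutex_lock (&stream_lock);
    g_usleep (1000);
    g_mutex_lock (&object_lock);
    g_mutex_unlock (&object_lock);
    g_mutex_unlock (&stream_lock);
    return NULL;
  }

  int
  main (void)
  {
    GThread *a = g_thread_new ("streaming", streaming_thread, NULL);
    GThread *b = g_thread_new ("app", app_thread, NULL);

    /* each thread now holds the lock the other one needs */
    g_thread_join (a);
    g_thread_join (b);
    return 0;
  }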
Created attachment 172173 [details] [review] fast path prototype

Here is a small prototype of the pad link cache. The idea is that when there are no flushing pads, no probes and no pad blocks, we can avoid taking any locks (except for the streaming lock) and avoid the refcounting shown in the patches above. The downside is that we do a few more atomic compare-and-swaps to get sole access to the cache.

The next thing to implement is invalidation code for the cache, and code to recreate the cache when the fast-path conditions are met again. This should be done by tagging each modification to pads (flushing, probed, blocked) with an incrementing cookie. The caps on the buffer should also be checked against the cached ones so that we can fall back to the slow path there as well.
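To make the cache idea concrete, here is a rough reconstruction of the take/put/invalidate helpers (the function names appear in the review comments below, but the bodies and the PadCache layout are a sketch of the idea, not the committed code): the pad owns one cache slot; a pushing thread steals it with an atomic exchange, uses it without taking the object lock, and puts it back afterwards.

  #include <gst/gst.h>

  typedef struct {
    GstPad *peer;               /* borrowed; validity is tied to the cookie */
    GstCaps *caps;              /* caps the cache was built against */
    GstPadChainFunction chainfunc;
    guint cookie;               /* pad-modification cookie at build time */
  } PadCache;

  /* sentinel left in the slot while fast-path conditions don't hold */
  #define PAD_CACHE_INVALID ((PadCache *) 0x1)

  static inline void
  pad_free_cache (PadCache * cache)
  {
    if (cache->caps)
      gst_caps_unref (cache->caps);
    g_slice_free (PadCache, cache);
  }

  /* swap the slot to NULL so we get sole access to the cache */
  static inline PadCache *
  pad_take_cache (gpointer * cache_ptr)
  {
    PadCache *cache;

    do {
      cache = g_atomic_pointer_get (cache_ptr);
    } while (!g_atomic_pointer_compare_and_exchange (cache_ptr, cache, NULL));

    if (cache == PAD_CACHE_INVALID)
      return NULL;              /* fall back to the slow path */

    return cache;
  }

  /* put the cache back unless the slot was invalidated or refilled */
  static inline void
  pad_put_cache (PadCache * cache, gpointer * cache_ptr)
  {
    if (!g_atomic_pointer_compare_and_exchange (cache_ptr, NULL, cache))
      pad_free_cache (cache);
  }

  /* leave the sentinel behind so concurrent users take the slow path
   * until the cache is rebuilt there */
  static inline void
  pad_invalidate_cache (gpointer * cache_ptr)
  {
    PadCache *cache;

    do {
      cache = g_atomic_pointer_get (cache_ptr);
    } while (!g_atomic_pointer_compare_and_exchange (cache_ptr, cache,
            PAD_CACHE_INVALID));

    if (cache != NULL && cache != PAD_CACHE_INVALID)
      pad_free_cache (cache);
  }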
Created attachment 172226 [details] [review] more evolved patch A more evolved version of the patch. This one dynamically builds the cache in the slow path and invalidates the cache when modifications are done to the pad link. The only thing missing is the invalidation when a pad probe is installed. It passes make check, playbin2 works and fakesrc ! fakesink seems to be around 10% faster.
Created attachment 172227 [details] [review] more evolved patch
I think this patch is great, definitely a big leap forward! If I may, I'd point out just a couple of trivial things:

- in gst_pad_chain_data_unchecked you're passing the argument "cache" and then reassigning it to NULL. For readability's sake, wouldn't it be better to use a local variable for this?

- even though the functions "pad_take_cache", "pad_free_cache", "pad_put_cache" and "pad_invalidate_cache" should be implicitly inlined by gcc, I suggest explicitly adding an "inline" keyword to make this optimisation more portable and make the intent clear to the reader.

P.S. I'll try and run performance checks on x86 and Atom when possible, hopefully already this week.
(In reply to comment #15)
> - in gst_pad_chain_data_unchecked you're passing the argument "cache" and
> then reassigning it to NULL. For readability's sake, wouldn't it be better
> to use a local variable for this?

I experimented with a local variable and with a field in the cache struct. Reassigning the cache parameter resulted in the smallest code, but I guess a local variable would be clearer.

> - even though the functions "pad_take_cache", "pad_free_cache",
> "pad_put_cache" and "pad_invalidate_cache" should be implicitly inlined by
> gcc, I suggest explicitly adding an "inline" keyword to make this
> optimisation more portable and make the intent clear to the reader.

Will do.

> P.S. I'll try and run performance checks on x86 and Atom when possible,
> hopefully already this week.

OK, sweet.
New version at http://cgit.freedesktop.org/~slomo/gstreamer in the dataflow-locking branch. The only changes are that the cache is now also invalidated when adding buffer/data/event probes.
Breaks the tee unit test:

  elements/tee.c:551:F:general:test_flow_aggregation:0: Assertion 'gst_pad_push (mysrc, gst_buffer_ref (buffer)) == GST_FLOW_ERROR' failed
Some simple, synthetic benchmarks:

24% speedup:
  gst-launch-0.10 fakesrc silent=true num-buffers=1000000 ! identity silent=true ! identity silent=true ! queue silent=true ! identity silent=true ! fakesink silent=true

22.5% speedup:
  gst-launch-0.10 fakesrc silent=true num-buffers=1000000 ! identity silent=true ! identity silent=true ! identity silent=true ! fakesink silent=true

For a real playbin2 pipeline there was no measurable difference (yes, the sinks were "fakesink sync=false silent=true").
With my simulation of the send side of a Farsight2 call, I get around a 5% improvement. This seems like a pretty nice gain.

  audiotestsrc num-buffers=1000000 samplesperbuffer=160 ! audio/x-raw-int, rate=8000 ! tee ! identity ! alawenc ! rtppcmapay ! application/x-rtp, payload=96,ptime=20 ! fsfunnel ! fsfunnel ! udpsink port=8888 host=127.0.0.1 sync=false
I ran my futex test and the results are: queue ~12%, direct ~31%
Felipe, are those numbers the speedup from those patches? If so, it's already a great start.

The issue with queue is a separate one I'm working on (based on brainstorming/prototyping/past experience from both Wim and myself), involving:

* not taking *any* lock when the queue is neither full nor empty (goodbye context switching for cases where the rate of data coming in is similar to the rate of data going out, which is the case for pipelines with live sources/sinks);

* if needed (queue empty/full), only taking a lock to signal the other thread of a change in levels, i.e. you only take the lock to wake up/wait on a conditional and nothing else, as opposed to right now where a mutex is held over the whole queue processing, limiting the amount of context switching (see the sketch at the end of this comment);

* being as allocation-less as possible, relying on grow-only arrays with sane default sizes as a replacement for glib's GQueue. This feature alone introduces a 33% speedup in throughput due to the reduced cost of passing data from one thread to another, for a minimal extra expense in memory.

I'll open a bug report about this on Monday once I've cleaned up my work branch and commented the code. In the end, we should end up with the overhead of queue being the same as the overhead of a 'dumb' element like identity, while benefiting from (and relying on) the kernel for proper load balancing (beneficial in single-core, multi-core, with/without hw-accelerated use cases).

Other ideas that could be worked on afterwards would be using the min-threshold properties to throttle the cases where data comes in (much) faster than it is pulled out, thereby reducing the wakeups on the producer (file read / demuxer / decoder) side, and only waking one side or the other when the levels are within X units of the min/max.
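As referenced in the second bullet above, a minimal sketch of that locking scheme (all names are made up; this is a single-producer/single-consumer fixed-size ring, not the actual work branch): pushes and pops move data with atomic index updates only, and the mutex is taken solely to park on or signal the condition at the empty boundary.

  #include <string.h>
  #include <glib.h>

  #define RING_SIZE 64          /* must be a power of two */

  typedef struct {
    gpointer items[RING_SIZE];
    gint head;                  /* written only by the consumer */
    gint tail;                  /* written only by the producer */
    gint waiting;               /* is the consumer parked on the cond? */
    GMutex lock;                /* used only around g_cond_wait/signal */
    GCond cond;
  } FastQueue;

  static void
  fast_queue_init (FastQueue * q)
  {
    memset (q, 0, sizeof (*q));
    g_mutex_init (&q->lock);
    g_cond_init (&q->cond);
  }

  /* returns FALSE when full; a real queue would park the producer the
   * same way the consumer parks below */
  static gboolean
  fast_queue_push (FastQueue * q, gpointer item)
  {
    gint tail = g_atomic_int_get (&q->tail);

    if (tail - g_atomic_int_get (&q->head) == RING_SIZE)
      return FALSE;

    q->items[tail & (RING_SIZE - 1)] = item;
    g_atomic_int_set (&q->tail, tail + 1);      /* publish the item */

    /* take the lock only when the consumer is (or may be) asleep */
    if (g_atomic_int_get (&q->waiting)) {
      g_mutex_lock (&q->lock);
      g_cond_signal (&q->cond);
      g_mutex_unlock (&q->lock);
    }
    return TRUE;
  }

  static gpointer
  fast_queue_pop (FastQueue * q)
  {
    gint head = g_atomic_int_get (&q->head);
    gpointer item;

    while (g_atomic_int_get (&q->tail) == head) {
      /* empty: park on the cond; re-check with 'waiting' set so a push
       * between the check and the wait cannot be missed */
      g_mutex_lock (&q->lock);
      g_atomic_int_set (&q->waiting, TRUE);
      if (g_atomic_int_get (&q->tail) == head)
        g_cond_wait (&q->cond, &q->lock);
      g_atomic_int_set (&q->waiting, FALSE);
      g_mutex_unlock (&q->lock);
    }

    item = q->items[head & (RING_SIZE - 1)];
    g_atomic_int_set (&q->head, head + 1);
    return item;
  }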
(In reply to comment #22)
> Felipe, are those numbers the speedup from those patches? If so, it's
> already a great start.

Yes, an improvement in CPU time per buffer pushed, on an ARM CPU.
I wonder if we should have a queue helper object (like we have GstAdapter). IMHO, most elements that need a queue after them could just add one internally. That way e.g. tee could get away with one queue per tee instance instead of one queue per tee srcpad, we'd save on pad-push overhead, and people couldn't forget to add the queue.
Eventually I had time to cross-check the patch on ARM (OMAP 3430). Executing my simple test:

  time gst-launch --gst-disable-registry-update audiotestsrc num-buffers=60000 blocksize=128 ! "audio/x-raw-int, rate=8000, width=16" ! audioconvert ! audioconvert ! audioconvert ! fakesink

I measured a mean 3.6% improvement (about 26.5 s against 27.5 s). I'd say this patch is even better than my testing one, where most of the locks had been removed. Testing it with telepathy-stream-engine I found no perceivable improvement with GTalk, but the baseline CPU load was around 20%, so the gain would be below 1%, far lower than top's sensitivity.
> time gst-launch --gst-disable-registry-update audiotestsrc num-buffers=60000
> blocksize=128 ! "audio/x-raw-int, rate=8000, width=16" ! audioconvert !
> audioconvert ! audioconvert ! fakesink

For what it's worth, fakesink silent=true will generally yield more accurate results.
(In reply to comment #26)
> For what it's worth, fakesink silent=true will generally yield more
> accurate results.

Fair enough. The results when using:

  time gst-launch --gst-disable-registry-update audiotestsrc num-buffers=60000 blocksize=128 ! "audio/x-raw-int, rate=8000, width=16" ! audioconvert ! audioconvert ! audioconvert ! fakesink silent=true

give 20.88 s against 22.04 s, that is an (even better) 5.3%. Results on VoIP tests should still remain the same, even though I'd like to run more ad-hoc tests with varied pipeline setups on stream-engine when possible. I don't have a comparison with my testing patch in this case, but I assume it would again be lower than 5.3%.
Now we have to look at shaving a few percent off the base classes and probably the common objects like the clock. Edward has some pending patches to improve queue performance too. I've got some branches here with some more things:

  http://cgit.freedesktop.org/~wtay/gstreamer/log/?h=less-task-lock
  http://cgit.freedesktop.org/~wtay/gstreamer/log/?h=lockfree-clocks
The push-cache and less-task-lock branches are merged in master now.
Thanks for all those good improvements.