Bug 503592 - gstpad.c does many ref/unref of peer pad in dataflow
Status: RESOLVED FIXED
Product: GStreamer
Classification: Platform
Component: gstreamer (core)
Version: git master
OS: Other Linux
Priority: Normal
Severity: normal
Target Milestone: 0.10.32
Assigned To: Stefan Sauer (gstreamer, gtkdoc dev)
QA Contact: GStreamer Maintainers
Depends on:
Blocks:
Reported: 2007-12-14 13:11 UTC by Stefan Sauer (gstreamer, gtkdoc dev)
Modified: 2010-12-05 18:17 UTC
See Also:
GNOME target: ---
GNOME version: ---


Attachments
- convert refs into lazy refs (2.80 KB, patch), 2007-12-14 13:22 UTC, Stefan Sauer (gstreamer, gtkdoc dev)
- convert refs into lazy refs (3.30 KB, patch), 2007-12-14 13:25 UTC, Stefan Sauer (gstreamer, gtkdoc dev)
- use nested lock to avoid reffing the peer pad (4.09 KB, patch), 2010-09-07 13:58 UTC, Stefan Sauer (gstreamer, gtkdoc dev)
- use nested lock to avoid reffing the peer pad (4.23 KB, patch), 2010-09-08 07:25 UTC, Stefan Sauer (gstreamer, gtkdoc dev)
- use nested lock to avoid reffing the peer pad (4.28 KB, patch), 2010-09-08 10:51 UTC, Stefan Sauer (gstreamer, gtkdoc dev), review: rejected
- fast path prototype (3.80 KB, patch), 2010-10-12 10:40 UTC, Wim Taymans
- more evolved patch (11.12 KB, patch), 2010-10-13 00:53 UTC, Wim Taymans
- more evolved patch (11.12 KB, patch), 2010-10-13 00:54 UTC, Wim Taymans

Description Stefan Sauer (gstreamer, gtkdoc dev) 2007-12-14 13:11:15 UTC
In gst_pad_push, gst_pad_check_pull_range and gst_pad_pull_range one can find code like the snippet below. This causes atomic ops for all of the dataflow. To give an idea: I added the GST_WARNING to count them, and for a 2 min avi I got 91949 refs/unrefs.

GST_OBJECT_LOCK (pad);

if (G_UNLIKELY ((peer = GST_PAD_PEER (pad)) == NULL))
  goto not_linked;

gst_object_ref (peer);
GST_WARNING("!ref peer pad!");

GST_OBJECT_UNLOCK (pad);

ret = gst_pad_chain_unchecked (peer, buffer);

gst_object_unref (peer);
Comment 1 Stefan Sauer (gstreamer, gtkdoc dev) 2007-12-14 13:22:32 UTC
Created attachment 100950 [details] [review]
convert refs into lazy refs
Comment 2 Stefan Sauer (gstreamer, gtkdoc dev) 2007-12-14 13:24:56 UTC
Some extra explanations. GstPad has ->peer, which is used via
#define GST_PAD_PEER(pad)		(GST_PAD_CAST(pad)->peer)

We cannot turn that into a reffing one.
Comment 3 Stefan Sauer (gstreamer, gtkdoc dev) 2007-12-14 13:25:33 UTC
Created attachment 100951 [details] [review]
convert refs into lazy refs
Comment 4 Wim Taymans 2007-12-14 15:35:19 UTC
Not very practical: the unref happens at an undefined time and can keep the sinkpad lingering for as long as the srcpad is alive.

Just as a brain dump:

_pad_push (pad, buffer)
{
  LOCK (pad)
  peer = pad->peer;
  if (!peer)
    goto no_peer;
  /* refresh the cached chain function if the peer's changed */
  chain_func = pad->peer_chain;
  if (chain_func != peer->chain)
    pad->peer_chain = chain_func = peer->chain;
  GST_STREAM_LOCK (peer);
  UNLOCK (pad)

  res = chain_func (peer, buffer);

  GST_STREAM_UNLOCK (peer);

  return res;

no_peer:
  {
    UNLOCK (pad);
    return NOT_LINKED;
  }
}

This example uses a nested lock, there is no way the pad can get unlinked when the pad lock is taken and we can be sure that the pad is not unreffed when we have the stream_lock. Not sure this is correct...
Comment 5 Tim-Philipp Müller 2009-04-16 22:46:25 UTC
What's up with this? Do we still care about this?
Comment 6 Stefan Sauer (gstreamer, gtkdoc dev) 2010-09-07 13:58:51 UTC
Created attachment 169667 [details] [review]
use nested lock to avoid reffing the peer pad

Implement suggestion.
Comment 7 Stefan Sauer (gstreamer, gtkdoc dev) 2010-09-07 14:07:14 UTC
$ time gst-launch-0.10 -q fakesrc sizetype=empty silent=true num-buffers=1000000 ! fakesink silent=true

before:
real		user		sys	
0m3.895s	0m3.876s	0m0.012s
0m3.674s	0m3.652s	0m0.024s

after:
real		user		sys	
0m3.441s	0m3.384s	0m0.012s
0m3.583s	0m3.480s	0m0.008s
Comment 8 Stefan Sauer (gstreamer, gtkdoc dev) 2010-09-08 07:25:18 UTC
Created attachment 169738 [details] [review]
use nested lock to avoid reffing the peer pad

Proper patch. Measurements rechecked and confirmed.
Comment 9 Stefan Sauer (gstreamer, gtkdoc dev) 2010-09-08 10:51:45 UTC
Created attachment 169752 [details] [review]
use nested lock to avoid reffing the peer pad
Comment 10 Sebastian Dröge (slomo) 2010-09-21 17:31:46 UTC
Looks good in general but please document that chain_data_unchecked must be called with the stream lock.

And taking the stream lock instead of getting a new ref only works because in post_activate() the stream lock is taken when the pad is deactivated and the pad can't go away before it is deactivated?
Comment 11 Wim Taymans 2010-09-21 17:35:19 UTC
The nested lock can cause a deadlock because now the object lock can be taken before the stream-lock.
Comment 12 Wim Taymans 2010-10-12 10:40:24 UTC
Created attachment 172173 [details] [review]
fast path prototype

Here is a small prototype of the pad link cache.

The idea is that when there are no flushing pads, no probes and no pad blocks, we can avoid taking any locks (except for the streaming lock) or doing any refcounting, as shown in the above patch. The downside is that we do a few more atomic compare-and-swap operations to get sole access to the cache.

The next thing to implement is invalidation code for the cache, and code to recreate the cache when the fast-path conditions are met again. This should be done by tagging each modification to a pad (flushing, probed, blocked) with an incrementing cookie.

Also, the caps on the buffer should be checked against the cached ones so that we can fall back to the slow path as well.
Comment 13 Wim Taymans 2010-10-13 00:53:26 UTC
Created attachment 172226 [details] [review]
more evolved patch

A more evolved version of the patch.

This one dynamically builds the cache in the slow path and invalidates the cache when modifications are done to the pad link. The only thing missing is the invalidation when a pad probe is installed.

It passes make check, playbin2 works and fakesrc ! fakesink seems to be around 10% faster.
Comment 14 Wim Taymans 2010-10-13 00:54:52 UTC
Created attachment 172227 [details] [review]
more evolved patch

A more evolved version of the patch.

This one dynamically builds the cache in the slow path and invalidates the cache when modifications are done to the pad link. The only thing missing is the invalidation when a pad probe is installed.

It passes make check, playbin2 works and fakesrc ! fakesink seems to be around 10% faster.
Comment 15 Marco Ballesio 2010-10-13 07:41:14 UTC
I think this patch is great, definitely a big leap forward!

If it's allowed, I'd point out just a couple of trivial things:

- in gst_pad_chain_data_unchecked you're passing the argument "cache" and then reassigning it to NULL. For code readability's sake, wouldn't it be better to use a local variable for this?

- even though the functions "pad_take_cache", "pad_free_cache", "pad_put_cache" and "pad_invalidate_cache" should be implicitly inlined by gcc, I suggest explicitly adding the "inline" keyword to make this optimisation more portable and to make it clear to the reader.

P.S. I'll try and run performance checks on x86 and Atom when possible, hopefully already this week.
Comment 16 Wim Taymans 2010-10-13 09:36:18 UTC
(In reply to comment #15)
> I think this patch is great, definitely a big leap forward!
> 
> If it's allowed, I'd point out just a couple of trivial things:
> 
> - in gst_pad_chain_data_unchecked you're passing the argument "cache" and then
> reassigning it to NULL. In code readability's sake, wouldn't it be better to
> use a local variable for this?

I experimented with a local variable and with a field in the cache struct. Reassigning the cache parameter resulted in the smallest code, but I guess a local variable would be clearer.

> 
> - even though the functions "pad_take_cache", "pad_free_cache", "pad_put_cache"
> and "pad_invalidate_cache" should be implicitly inlined by the gcc, I suggest
> to explicitly add an "inline" statement to make this optimisation more portable
> and make it clear to the reader.

will do

> 
> P.S. I'll try and run performance checks on x86 and Atom when possible,
> hopefully already this week.

ok, sweet
Comment 17 Sebastian Dröge (slomo) 2010-10-13 10:18:44 UTC
New version at http://cgit.freedesktop.org/~slomo/gstreamer in the dataflow-locking branch. The only change is that the cache is now also invalidated when adding buffer/data/event probes.
Comment 18 Sebastian Dröge (slomo) 2010-10-13 10:33:20 UTC
Breaks the tee unit test:
elements/tee.c:551:F:general:test_flow_aggregation:0: Assertion 'gst_pad_push (mysrc, gst_buffer_ref (buffer)) == GST_FLOW_ERROR' failed
Comment 19 Sebastian Dröge (slomo) 2010-10-13 10:44:00 UTC
Some simple, synthetic benchmarks:

24% speedup: gst-launch-0.10 fakesrc silent=true num-buffers=1000000 ! identity silent=true ! identity silent=true ! queue silent=true ! identity silent=true ! fakesink silent=true

22.5% speedup: gst-launch-0.10 fakesrc silent=true num-buffers=1000000 ! identity silent=true ! identity silent=true ! identity silent=true ! fakesink silent=true


For a real playbin2 pipeline there was no measurable difference (yes, sinks were "fakesink sync=false silent=true")
Comment 20 Olivier Crête 2010-10-14 02:14:35 UTC
With my simulation of the send side of a Farsight2 call, I get around 5% improvement. This seems like a pretty nice gain.

audiotestsrc num-buffers=1000000  samplesperbuffer=160 ! audio/x-raw-int, rate=8000 ! tee ! identity ! alawenc ! rtppcmapay ! application/x-rtp, payload=96,ptime=20 ! fsfunnel ! fsfunnel ! udpsink port=8888 host=127.0.0.1  sync=false
Comment 21 Felipe Contreras (banned) 2010-10-16 14:54:20 UTC
I ran my futex test and the results are:
queue ~12%, direct ~31%
Comment 22 Edward Hervey 2010-10-16 16:42:44 UTC
Felipe, are those numbers the speedup using those patches? If so, it's already a great start.


The issue with queue is a separate one I'm working on (based on brainstorming, prototyping and past experience from both Wim and myself), involving:

* not taking *any* lock when the queue is neither full nor empty (goodbye context switching for cases where the rate of data coming in is similar to the rate of data going out, which is the case for pipelines with live sources/sinks).

* if needed (queue empty/full), only taking a lock to signal the other thread of a change in levels (i.e. you only take the lock to wake up or wait on a condition variable and nothing else, as opposed to right now where a mutex is held over the whole queue processing), limiting the amount of context switching.

* being as allocation-less as possible (relying on grow-only arrays with sane default sizes as a replacement for glib's GQueue). This feature alone introduces a 33% speedup in throughput due to the reduced cost of passing data from one thread to another, for a minimal extra expense in memory used.

I'll open a bug report about this on monday once I've cleaned up my work branch and commented the code.

In the end, we should end up with the overhead of queue being the same as the overhead of a 'dumb' element like identity, while benefiting from (and relying on) the kernel for proper load balancing (beneficial on single core, multi-core, and with/without hw-accelerated use cases).

Other ideas that could be worked on afterwards would be using the min-threshold properties to throttle the cases where data comes in (much) faster than the speed at which it is pulled out, thereby reducing the wakeups on the producer (file read / demuxer / decoder) side and only waking up one side or the other when the levels are within X units of the min/max.
Comment 23 Felipe Contreras (banned) 2010-10-16 19:43:28 UTC
(In reply to comment #22)
> Felipe, are those numbers speedup using those patches ? If so, it's already a
> great start.

Yes, improvement in CPU time per buffer push on an ARM cpu.
Comment 24 Stefan Sauer (gstreamer, gtkdoc dev) 2010-10-16 21:28:45 UTC
I wonder if we should have a queue helper object (like we have gst-adapter). IMHO most elements that need a queue afterwards could just add this internally. This way e.g. tee could get away with one queue per tee instead of one queue per tee srcpad. And we would save on pad-push overhead. And people couldn't forget to add the queue.
Comment 25 Marco Ballesio 2010-10-23 16:50:55 UTC
Eventually I had time to cross-check the patch on ARM (OMAP 3430). Executing my simple test:

time gst-launch --gst-disable-registry-update audiotestsrc num-buffers=60000 blocksize=128 ! "audio/x-raw-int, rate=8000, width=16"  ! audioconvert ! audioconvert ! audioconvert ! fakesink

I measured a mean 3.6% improvement (about 26.5s against 27.5s). I'd say this patch is even better than my testing one where most of the locks have been removed.

Testing it with telepathy-stream-engine I found no perceivable improvements with GTalk, but the default CPU load was something around 20%, so the optimisation would be below 1%, far lower than top's sensitivity.
Comment 26 Tim-Philipp Müller 2010-10-23 17:00:02 UTC
> time gst-launch --gst-disable-registry-update audiotestsrc num-buffers=60000
> blocksize=128 ! "audio/x-raw-int, rate=8000, width=16"  ! audioconvert !
> audioconvert ! audioconvert ! fakesink

For what it's worth, fakesink silent=true will generally yield more accurate results.
Comment 27 Marco Ballesio 2010-10-23 18:28:22 UTC
(In reply to comment #26)
 
> For what it's worth, fakesink silent=true will generally yield more accurate
> results.

Fair enough, the results when using:

time gst-launch --gst-disable-registry-update audiotestsrc num-buffers=60000 blocksize=128 ! "audio/x-raw-int, rate=8000, width=16"  ! audioconvert ! audioconvert ! audioconvert ! fakesink silent=true

gives 20.88 s against 22.04 s, that is an (even better) 5.3% improvement. Results on VoIP tests should still remain the same, even though I'd like to run more ad-hoc tests with varied pipeline setups on stream-engine when possible.

I don't have a comparison in this case with my testing patch, but I assume it would anyway be lower than 5.3%.
Comment 28 Wim Taymans 2010-10-24 08:25:17 UTC
Now we have to look at shaving off a few percent from the base classes and probably the common objects like the clock. 

Edward has some pending patches to improve queue performance too.

I've got some branches here with some more things:

http://cgit.freedesktop.org/~wtay/gstreamer/log/?h=less-task-lock
http://cgit.freedesktop.org/~wtay/gstreamer/log/?h=lockfree-clocks
Comment 29 Wim Taymans 2010-12-03 10:33:28 UTC
The push-cache and less-task-lock branches are merged in master now.
Comment 30 Stefan Sauer (gstreamer, gtkdoc dev) 2010-12-05 18:16:53 UTC
Thanks for all those good improvements.