GNOME Bugzilla – Bug 503592
gstpad.c does many ref/unref of peer pad in dataflow
Last modified: 2010-12-05 18:17:31 UTC
In gst_pad_push, gst_pad_check_pull_range and gst_pad_pull_range one can find code like the snippet below. This costs a ref/unref pair of atomic ops for every piece of dataflow. To give an idea of the volume: I added the GST_WARNING to count them, and for a 2 min avi I got 91949 ref/unrefs.

  GST_OBJECT_LOCK (pad);
  if (G_UNLIKELY ((peer = GST_PAD_PEER (pad)) == NULL))
    goto not_linked;
  gst_object_ref (peer);
  GST_WARNING ("!ref peer pad!");
  GST_OBJECT_UNLOCK (pad);

  ret = gst_pad_chain_unchecked (peer, buffer);

  gst_object_unref (peer);
Created attachment 100950 [details] [review] convert refs into lazy refs
Some extra explanation: GstPad has a ->peer field, which is accessed via

  #define GST_PAD_PEER(pad) (GST_PAD_CAST(pad)->peer)

We cannot simply turn that macro into one that takes a reference.
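For context, a short sketch contrasting the two access patterns (standard GStreamer API usage, not part of any patch here): the macro only yields a borrowed pointer that is valid while the object lock is held, whereas gst_pad_get_peer() returns a new reference; a reffing macro would leave no place for the matching unref.

  #include <gst/gst.h>

  static void
  peer_access_patterns (GstPad * pad)
  {
    GstPad *peer;

    /* GST_PAD_PEER() just reads the field: only valid with the object
     * lock held, and the peer may be unreffed once the lock is dropped */
    GST_OBJECT_LOCK (pad);
    peer = GST_PAD_PEER (pad);
    if (peer)
      gst_object_ref (peer);    /* must ref before unlocking */
    GST_OBJECT_UNLOCK (pad);
    if (peer)
      gst_object_unref (peer);

    /* gst_pad_get_peer() wraps exactly that pattern: NULL or a new ref */
    peer = gst_pad_get_peer (pad);
    if (peer)
      gst_object_unref (peer);
  }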
Created attachment 100951 [details] [review] convert refs into lazy refs
Not very practical: the unref happens at an undefined time and can keep the sinkpad lingering for as long as the srcpad is alive. Just as a brain dump:

  _pad_push (pad, buffer)
  {
    LOCK (pad);
    peer = pad->peer;
    if (!peer)
      goto no_peer;

    /* refresh the cached chain function when it went stale */
    chain_func = pad->peer_chain;
    if (chain_func != peer->chain) {
      chain_func = peer->chain;
      pad->peer_chain = chain_func;
    }

    GST_STREAM_LOCK (peer);
    UNLOCK (pad);

    res = chain_func (peer, buffer);

    GST_STREAM_UNLOCK (peer);
    return res;

  no_peer:
    {
      UNLOCK (pad);
      return NOT_LINKED;
    }
  }

This example uses a nested lock: the pad cannot get unlinked while the pad lock is taken, and we can be sure the pad is not unreffed while we hold the stream lock. Not sure this is correct...
What's up with this? Do we still care about this?
Created attachment 169667 [details] [review] use nested lock to avoid reffing the peer pad

Implements the suggestion above.
$ time gst-launch-0.10 -q fakesrc sizetype=empty silent=true num-buffers=1000000 ! fakesink silent=true

before:
  real 0m3.895s  user 0m3.876s  sys 0m0.012s
  real 0m3.674s  user 0m3.652s  sys 0m0.024s

after:
  real 0m3.441s  user 0m3.384s  sys 0m0.012s
  real 0m3.583s  user 0m3.480s  sys 0m0.008s
Created attachment 169738 [details] [review] use nested lock to avoid reffing the peer pad

Proper patch; measurements rechecked and confirmed.
Created attachment 169752 [details] [review] use nested lock to avoid reffing the peer pad
Looks good in general, but please document that chain_data_unchecked must be called with the stream lock held. And taking the stream lock instead of getting a new reference only works because in post_activate() the stream lock is taken when the pad is deactivated, and the pad can't go away before it is deactivated?
The nested lock can cause a deadlock, because now the object lock can be taken before the stream lock.
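For illustration, a minimal standalone program (hypothetical; plain GMutexes standing in for the pad's object lock and stream lock, not code from the patch) that reproduces this kind of lock-order inversion; run it and it almost always hangs:

  #include <glib.h>

  static GMutex object_lock;    /* stands in for GST_OBJECT_LOCK (pad) */
  static GMutex stream_lock;    /* stands in for the peer's stream lock */

  /* patched push path: object lock first, then stream lock */
  static gpointer
  streaming_thread (gpointer data)
  {
    g_mutex_lock (&object_lock);
    g_usleep (1000);            /* widen the race window */
    g_mutex_lock (&stream_lock);
    g_mutex_unlock (&stream_lock);
    g_mutex_unlock (&object_lock);
    return NULL;
  }

  /* e.g. pad deactivation: stream lock first, then object lock */
  static gpointer
  app_thread (gpointer data)
  {
    g_mutex_lock (&stream_lock);
    g_usleep (1000);
    g_mutex_lock (&object_lock);
    g_mutex_unlock (&object_lock);
    g_mutex_unlock (&stream_lock);
    return NULL;
  }

  int
  main (void)
  {
    GThread *a = g_thread_new ("streaming", streaming_thread, NULL);
    GThread *b = g_thread_new ("app", app_thread, NULL);

    /* each thread now holds the lock the other one needs */
    g_thread_join (a);
    g_thread_join (b);
    return 0;
  }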
Created attachment 172173 [details] [review] fast path prototype

Here is a small prototype of the pad link cache. The idea is that when there are no flushing pads, no probes and no pad blocks, we can avoid taking any locks (except for the streaming lock) and avoid the refcounting shown in the patches above. The downside is that we do a few more atomic compare-and-swaps to get sole access to the cache.

The next thing to implement is invalidation code for the cache, and code to recreate the cache when the fast-path conditions are met again. This should be done by tagging each modification to pads (flushing, probed, blocked) with an incrementing cookie. The caps on the buffer should also be checked against the cached ones so that we can fall back to the slow path there as well.
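To make the cache idea concrete, here is a rough reconstruction of the take/put/invalidate helpers (the function names appear in the review comments below, but the bodies and the PadCache layout are a sketch of the idea, not the committed code): the pad owns one cache slot; a pushing thread steals it with an atomic exchange, uses it without taking the object lock, and puts it back afterwards.

  #include <gst/gst.h>

  typedef struct {
    GstPad *peer;               /* borrowed; validity is tied to the cookie */
    GstCaps *caps;              /* caps the cache was built against */
    GstPadChainFunction chainfunc;
    guint cookie;               /* pad-modification cookie at build time */
  } PadCache;

  /* sentinel left in the slot while fast-path conditions don't hold */
  #define PAD_CACHE_INVALID ((PadCache *) 0x1)

  static inline void
  pad_free_cache (PadCache * cache)
  {
    if (cache->caps)
      gst_caps_unref (cache->caps);
    g_slice_free (PadCache, cache);
  }

  /* swap the slot to NULL so we get sole access to the cache */
  static inline PadCache *
  pad_take_cache (gpointer * cache_ptr)
  {
    PadCache *cache;

    do {
      cache = g_atomic_pointer_get (cache_ptr);
    } while (!g_atomic_pointer_compare_and_exchange (cache_ptr, cache, NULL));

    if (cache == PAD_CACHE_INVALID)
      return NULL;              /* fall back to the slow path */

    return cache;
  }

  /* put the cache back unless the slot was invalidated or refilled */
  static inline void
  pad_put_cache (PadCache * cache, gpointer * cache_ptr)
  {
    if (!g_atomic_pointer_compare_and_exchange (cache_ptr, NULL, cache))
      pad_free_cache (cache);
  }

  /* leave the sentinel behind so concurrent users take the slow path
   * until the cache is rebuilt there */
  static inline void
  pad_invalidate_cache (gpointer * cache_ptr)
  {
    PadCache *cache;

    do {
      cache = g_atomic_pointer_get (cache_ptr);
    } while (!g_atomic_pointer_compare_and_exchange (cache_ptr, cache,
            PAD_CACHE_INVALID));

    if (cache != NULL && cache != PAD_CACHE_INVALID)
      pad_free_cache (cache);
  }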
Created attachment 172226 [details] [review] more evolved patch A more evolved version of the patch. This one dynamically builds the cache in the slow path and invalidates the cache when modifications are done to the pad link. The only thing missing is the invalidation when a pad probe is installed. It passes make check, playbin2 works and fakesrc ! fakesink seems to be around 10% faster.
Created attachment 172227 [details] [review] more evolved patch
I think this patch is great, definitely a big leap forward! If I may, I'd point out just a couple of trivial things:

- in gst_pad_chain_data_unchecked you're passing the argument "cache" and then reassigning it to NULL. For readability's sake, wouldn't it be better to use a local variable for this?

- even though the functions "pad_take_cache", "pad_free_cache", "pad_put_cache" and "pad_invalidate_cache" should be implicitly inlined by gcc, I suggest explicitly adding an "inline" keyword to make this optimisation more portable and make the intent clear to the reader.

P.S. I'll try and run performance checks on x86 and Atom when possible, hopefully already this week.
(In reply to comment #15)
> - in gst_pad_chain_data_unchecked you're passing the argument "cache" and
> then reassigning it to NULL. For readability's sake, wouldn't it be better
> to use a local variable for this?

I experimented with a local variable and with a field in the cache struct. Reassigning the cache parameter resulted in the smallest code, but I guess a local variable would be clearer.

> - even though the functions "pad_take_cache", "pad_free_cache",
> "pad_put_cache" and "pad_invalidate_cache" should be implicitly inlined by
> gcc, I suggest explicitly adding an "inline" keyword to make this
> optimisation more portable and make the intent clear to the reader.

Will do.

> P.S. I'll try and run performance checks on x86 and Atom when possible,
> hopefully already this week.

OK, sweet.
New version at http://cgit.freedesktop.org/~slomo/gstreamer in the dataflow-locking branch. The only changes are that the cache is now also invalidated when adding buffer/data/event probes.
Breaks the tee unit test:

  elements/tee.c:551:F:general:test_flow_aggregation:0: Assertion 'gst_pad_push (mysrc, gst_buffer_ref (buffer)) == GST_FLOW_ERROR' failed
Some simple, synthetic benchmarks:

24% speedup:
  gst-launch-0.10 fakesrc silent=true num-buffers=1000000 ! identity silent=true ! identity silent=true ! queue silent=true ! identity silent=true ! fakesink silent=true

22.5% speedup:
  gst-launch-0.10 fakesrc silent=true num-buffers=1000000 ! identity silent=true ! identity silent=true ! identity silent=true ! fakesink silent=true

For a real playbin2 pipeline there was no measurable difference (yes, the sinks were "fakesink sync=false silent=true").
With my simulation of the send side of a Farsight2 call, I get around a 5% improvement. This seems like a pretty nice gain.

  audiotestsrc num-buffers=1000000 samplesperbuffer=160 ! audio/x-raw-int, rate=8000 ! tee ! identity ! alawenc ! rtppcmapay ! application/x-rtp, payload=96,ptime=20 ! fsfunnel ! fsfunnel ! udpsink port=8888 host=127.0.0.1 sync=false
I ran my futex test and the results are: queue ~12%, direct ~31%
Felipe, are those numbers the speedup from those patches? If so, it's already a great start.

The issue with queue is a separate one I'm working on (based on brainstorming/prototyping/past experience from both Wim and myself), involving:

* not taking *any* lock when the queue is neither full nor empty (goodbye context switching for cases where the rate of data coming in is similar to the rate of data going out, which is the case for pipelines with live sources/sinks);

* if needed (queue empty/full), only taking a lock to signal the other thread of a change in levels, i.e. you only take the lock to wake up/wait on a conditional and nothing else, as opposed to right now where a mutex is held over the whole queue processing, limiting the amount of context switching (see the sketch at the end of this comment);

* being as allocation-less as possible, relying on grow-only arrays with sane default sizes as a replacement for glib's GQueue. This feature alone introduces a 33% speedup in throughput due to the reduced cost of passing data from one thread to another, for a minimal extra expense in memory.

I'll open a bug report about this on Monday once I've cleaned up my work branch and commented the code. In the end, we should end up with the overhead of queue being the same as the overhead of a 'dumb' element like identity, while benefiting from (and relying on) the kernel for proper load balancing (beneficial in single-core, multi-core, with/without hw-accelerated use cases).

Other ideas that could be worked on afterwards would be using the min-threshold properties to throttle the cases where data comes in (much) faster than it is pulled out, thereby reducing the wakeups on the producer (file read / demuxer / decoder) side, and only waking one side or the other when the levels are within X units of the min/max.
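As referenced in the second bullet above, a minimal sketch of that locking scheme (all names are made up; this is a single-producer/single-consumer fixed-size ring, not the actual work branch): pushes and pops move data with atomic index updates only, and the mutex is taken solely to park on or signal the condition at the empty boundary.

  #include <string.h>
  #include <glib.h>

  #define RING_SIZE 64          /* must be a power of two */

  typedef struct {
    gpointer items[RING_SIZE];
    gint head;                  /* written only by the consumer */
    gint tail;                  /* written only by the producer */
    gint waiting;               /* is the consumer parked on the cond? */
    GMutex lock;                /* used only around g_cond_wait/signal */
    GCond cond;
  } FastQueue;

  static void
  fast_queue_init (FastQueue * q)
  {
    memset (q, 0, sizeof (*q));
    g_mutex_init (&q->lock);
    g_cond_init (&q->cond);
  }

  /* returns FALSE when full; a real queue would park the producer the
   * same way the consumer parks below */
  static gboolean
  fast_queue_push (FastQueue * q, gpointer item)
  {
    gint tail = g_atomic_int_get (&q->tail);

    if (tail - g_atomic_int_get (&q->head) == RING_SIZE)
      return FALSE;

    q->items[tail & (RING_SIZE - 1)] = item;
    g_atomic_int_set (&q->tail, tail + 1);      /* publish the item */

    /* take the lock only when the consumer is (or may be) asleep */
    if (g_atomic_int_get (&q->waiting)) {
      g_mutex_lock (&q->lock);
      g_cond_signal (&q->cond);
      g_mutex_unlock (&q->lock);
    }
    return TRUE;
  }

  static gpointer
  fast_queue_pop (FastQueue * q)
  {
    gint head = g_atomic_int_get (&q->head);
    gpointer item;

    while (g_atomic_int_get (&q->tail) == head) {
      /* empty: park on the cond; re-check with 'waiting' set so a push
       * between the check and the wait cannot be missed */
      g_mutex_lock (&q->lock);
      g_atomic_int_set (&q->waiting, TRUE);
      if (g_atomic_int_get (&q->tail) == head)
        g_cond_wait (&q->cond, &q->lock);
      g_atomic_int_set (&q->waiting, FALSE);
      g_mutex_unlock (&q->lock);
    }

    item = q->items[head & (RING_SIZE - 1)];
    g_atomic_int_set (&q->head, head + 1);
    return item;
  }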
(In reply to comment #22)
> Felipe, are those numbers the speedup from those patches? If so, it's
> already a great start.

Yes, an improvement in CPU time per buffer pushed, on an ARM CPU.
I wonder if we should have a queue helper object (like we have GstAdapter). IMHO, most elements that need a queue after them could just add one internally. That way e.g. tee could get away with one queue per tee instance instead of one queue per tee srcpad, we'd save on pad-push overhead, and people couldn't forget to add the queue.
Eventually I had time to cross-check the patch on ARM (OMAP 3430). Executing my simple test:

  time gst-launch --gst-disable-registry-update audiotestsrc num-buffers=60000 blocksize=128 ! "audio/x-raw-int, rate=8000, width=16" ! audioconvert ! audioconvert ! audioconvert ! fakesink

I measured a mean 3.6% improvement (about 26.5 s against 27.5 s). I'd say this patch is even better than my testing one, where most of the locks had been removed. Testing it with telepathy-stream-engine I found no perceivable improvement with GTalk, but the baseline CPU load was around 20%, so the gain would be below 1%, far lower than top's sensitivity.
> time gst-launch --gst-disable-registry-update audiotestsrc num-buffers=60000
> blocksize=128 ! "audio/x-raw-int, rate=8000, width=16" ! audioconvert !
> audioconvert ! audioconvert ! fakesink

For what it's worth, fakesink silent=true will generally yield more accurate results.
(In reply to comment #26)
> For what it's worth, fakesink silent=true will generally yield more
> accurate results.

Fair enough. The results when using:

  time gst-launch --gst-disable-registry-update audiotestsrc num-buffers=60000 blocksize=128 ! "audio/x-raw-int, rate=8000, width=16" ! audioconvert ! audioconvert ! audioconvert ! fakesink silent=true

give 20.88 s against 22.04 s, that is an (even better) 5.3%. Results on VoIP tests should still remain the same, even though I'd like to run more ad-hoc tests with varied pipeline setups on stream-engine when possible. I don't have a comparison with my testing patch in this case, but I assume it would again be lower than 5.3%.
Now we have to look at shaving a few percent off the base classes and probably the common objects like the clock. Edward has some pending patches to improve queue performance too. I've got some branches here with some more things:

  http://cgit.freedesktop.org/~wtay/gstreamer/log/?h=less-task-lock
  http://cgit.freedesktop.org/~wtay/gstreamer/log/?h=lockfree-clocks
The push-cache and less-task-lock branches are merged in master now.
Thanks for all those good improvements.