GNOME Bugzilla – Bug 702380
basertppayload: provide a GstBufferPool for output rtp buffers
Last modified: 2013-07-16 19:50:25 UTC
Right now most (all?) payloaders create new RTP buffers for their output. The size of those buffers is always limited by the MTU, so we could have a convenience GstBufferPool which those payloaders could use, providing pre-allocated buffers of that size with the proper Meta set on them, thereby avoiding creating/freeing buffers all the time. A quick glance at profiles shows that creating/freeing all those buffers (instead of re-using them) accounts for 25-50% of the time spent in payloaders :(
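A minimal sketch (not from the bug report) of what such a convenience pool could look like, assuming the payloader knows its MTU; create_rtp_pool() and get_output_buffer() are illustrative names, not existing API:

#include <gst/gst.h>

/* Sketch: a pool handing out MTU-sized buffers that a payloader could
 * acquire instead of allocating a fresh output buffer per packet. */
static GstBufferPool *
create_rtp_pool (guint mtu)
{
  GstBufferPool *pool = gst_buffer_pool_new ();
  GstStructure *config = gst_buffer_pool_get_config (pool);

  /* every buffer in the pool is mtu bytes; no min/max buffer limits */
  gst_buffer_pool_config_set_params (config, NULL, mtu, 0, 0);
  gst_buffer_pool_set_config (pool, config);
  gst_buffer_pool_set_active (pool, TRUE);

  return pool;
}

/* ...in the packetizing loop... */
static GstBuffer *
get_output_buffer (GstBufferPool * pool)
{
  GstBuffer *outbuf = NULL;

  if (gst_buffer_pool_acquire_buffer (pool, &outbuf, NULL) != GST_FLOW_OK)
    return NULL;
  /* write RTP header + payload; unreffing later returns it to the pool */
  return outbuf;
}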
Most payloaders should just allocate a new header and, for the payload, use the GstMemory from the incoming buffer.
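For reference, a hedged sketch of that approach (not from the bug): allocate a header-only RTP buffer and append a zero-copy slice of the input buffer's memory. The payload type, offset and size are placeholders:

#include <gst/gst.h>
#include <gst/rtp/gstrtpbuffer.h>

/* Build one RTP packet without copying the payload bytes: a header-only
 * RTP buffer plus (a slice of) the input buffer's GstMemory appended to it.
 * offset/size delimit the payload inside inbuf. */
static GstBuffer *
make_packet (GstBuffer * inbuf, gsize offset, gsize size)
{
  GstBuffer *outbuf, *paybuf;
  GstRTPBuffer rtp = GST_RTP_BUFFER_INIT;

  /* header only: 0 bytes of payload, no padding, no CSRCs */
  outbuf = gst_rtp_buffer_new_allocate (0, 0, 0);

  gst_rtp_buffer_map (outbuf, GST_MAP_WRITE, &rtp);
  gst_rtp_buffer_set_payload_type (&rtp, 96);
  /* ...seqnum, timestamp, marker, etc... */
  gst_rtp_buffer_unmap (&rtp);

  /* share the input memory instead of copying it */
  paybuf = gst_buffer_copy_region (inbuf, GST_BUFFER_COPY_MEMORY, offset, size);
  return gst_buffer_append (outbuf, paybuf);
}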
(In reply to comment #1)
> Most payloaders should just allocate a new header and, for the payload, use
> the GstMemory from the incoming buffer.

Some do, others don't. One of the patterns is to accumulate data in an adapter and then copy it into the target RTP packet; there's no good adapter API yet to get a block of memory and append it to the RTP packet.

I'm curious what the difference is now between allocating fresh buffers (+memory) and recycling them from a pool.
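For context, the accumulate-then-copy pattern being described looks roughly like this (a sketch; adapter and payload_len are assumed to come from the payloader):

#include <string.h>
#include <gst/base/gstadapter.h>
#include <gst/rtp/gstrtpbuffer.h>

/* Sketch of the copying pattern: the payload bytes are memcpy()d out of
 * the adapter into the freshly allocated RTP buffer's payload area. */
static GstBuffer *
take_packet_by_copy (GstAdapter * adapter, guint payload_len)
{
  GstBuffer *outbuf = gst_rtp_buffer_new_allocate (payload_len, 0, 0);
  GstRTPBuffer rtp = GST_RTP_BUFFER_INIT;
  const guint8 *data;

  gst_rtp_buffer_map (outbuf, GST_MAP_WRITE, &rtp);
  data = gst_adapter_map (adapter, payload_len);
  memcpy (gst_rtp_buffer_get_payload (&rtp), data, payload_len);
  gst_adapter_unmap (adapter);
  gst_adapter_flush (adapter, payload_len);
  gst_rtp_buffer_unmap (&rtp);

  return outbuf;
}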
Recycling buffers from a pool is about 4 times slower than just allocating and freeing them:

*** total 0:00:01.009697402 - average 0:00:00.000000504 - Done creating 2000000 buffers
*** total 0:00:04.009326025 - average 0:00:00.000002004 - Done creating 2000000 buffers
(In reply to comment #3)
> Recycling buffers from a pool is about 4 times slower than just allocating
> and freeing them:
>
> *** total 0:00:01.009697402 - average 0:00:00.000000504 - Done creating 2000000 buffers
> *** total 0:00:04.009326025 - average 0:00:00.000002004 - Done creating 2000000 buffers

See tests/benchmarks/gstpoolstress.c
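The shape of that benchmark is roughly the following (a sketch only; the real code is in tests/benchmarks/gstpoolstress.c, and the buffer size here is an assumption):

#include <gst/gst.h>

#define N_BUFFERS 2000000
#define BUF_SIZE  1400          /* roughly one MTU */

/* time N alloc/free cycles */
static GstClockTime
bench_alloc (void)
{
  GstClockTime start = gst_util_get_timestamp ();
  gint i;

  for (i = 0; i < N_BUFFERS; i++) {
    GstBuffer *buf = gst_buffer_new_allocate (NULL, BUF_SIZE, NULL);
    gst_buffer_unref (buf);
  }
  return gst_util_get_timestamp () - start;
}

/* time N acquire/release cycles against an active pool */
static GstClockTime
bench_pool (GstBufferPool * pool)
{
  GstClockTime start = gst_util_get_timestamp ();
  gint i;

  for (i = 0; i < N_BUFFERS; i++) {
    GstBuffer *buf = NULL;

    if (gst_buffer_pool_acquire_buffer (pool, &buf, NULL) != GST_FLOW_OK)
      break;
    gst_buffer_unref (buf);     /* returns the buffer to the pool */
  }
  return gst_util_get_timestamp () - start;
}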
Ok, when making sure we never completely empty the pool (and need to read the socket) we can get somewhat faster (but it jitters, sometimes slower):

*** total 0:00:00.960529176 - average 0:00:00.000000480 - Done creating 2000000 buffers
*** total 0:00:00.671104842 - average 0:00:00.000000335 - Done creating 2000000 buffers
(In reply to comment #5)
> Ok, when making sure we never completely empty the pool (and need to read the
> socket) we can get somewhat faster (but it jitters, sometimes slower):
>
> *** total 0:00:00.960529176 - average 0:00:00.000000480 - Done creating 2000000 buffers
> *** total 0:00:00.671104842 - average 0:00:00.000000335 - Done creating 2000000 buffers

That looks more realistic than the first one (because you were mostly creating buffers all the time instead of re-using, so it obviously adds more overhead).

It does make it 45% faster, which seems on par with the overhead I was seeing when profiling rtpmp2tpay for example.
(In reply to comment #6)
> (In reply to comment #5)
> > Ok, when making sure we never completely empty the pool (and need to read
> > the socket) we can get somewhat faster (but it jitters, sometimes slower):
> >
> > *** total 0:00:00.960529176 - average 0:00:00.000000480 - Done creating 2000000 buffers
> > *** total 0:00:00.671104842 - average 0:00:00.000000335 - Done creating 2000000 buffers
>
> That looks more realistic than the first one (because you were mostly creating
> buffers all the time instead of re-using, so it obviously adds more overhead).

No, it was writing to and reading from the socket to signal pool full/empty; only one buffer was allocated.

> It does make it 45% faster, which seems on par with the overhead I was seeing
> when profiling rtpmp2tpay for example.

Here's another typical run; I now added (2nd line) simply doing gst_buffer_new() without memory:

*** total 0:00:01.250591428 - average 0:00:00.000000625 - Done creating 2000000 buffers
*** total 0:00:00.397951348 - average 0:00:00.000000198 - Done creating 2000000 buffers
*** total 0:00:01.017563303 - average 0:00:00.000000508 - Done creating 2000000 buffers
In short, for optimizing allocations the bufferpool is quite terrible unless you can avoid the empty-pool case. We should optimize that case more, it's quite common. I think it's also all the atomic operations that slow it down; maybe it would be faster to use a lock there... So if you really need it fast, the best option right now, I think, is to simply do gst_buffer_new() and add methods on the adapter to move the memory onto the buffer or so.
Just moving memories from the adapter to buffers would also avoid copying the input memories into a newly allocated, larger memory... potentially speeding things up even more.
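With existing core API, "moving" a memory rather than copying its bytes looks like this (a sketch, not taken from any attachment here; inbuf/outbuf are placeholders):

/* Share the bytes instead of duplicating them: take a reference to the
 * input buffer's first GstMemory and append it to the output buffer;
 * no payload data is copied. */
static void
move_payload (GstBuffer * inbuf, GstBuffer * outbuf)
{
  GstMemory *mem = gst_buffer_get_memory (inbuf, 0);    /* returns a ref */

  gst_buffer_append_memory (outbuf, mem);               /* takes ownership */
}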
Created attachment 249238 [details] [review]
adapter: Add function to return buffer composed of multiple memories

One option here is to have the adapter return a GstBuffer with multiple GstMemory objects in it, see attached patch; I haven't tested the performance.
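The attachment itself isn't reproduced here, but judging by the commits below the call ended up as gst_adapter_take_buffer_fast(); a payloader would then use it roughly like this (a sketch, same headers as the earlier snippets):

/* Sketch: take the pending payload as a (possibly multi-memory) buffer
 * straight from the adapter and append it after a header-only RTP buffer. */
static GstBuffer *
take_packet_no_copy (GstAdapter * adapter, guint payload_len)
{
  GstBuffer *outbuf, *paybuf;

  paybuf = gst_adapter_take_buffer_fast (adapter, payload_len);
  outbuf = gst_rtp_buffer_new_allocate (0, 0, 0);
  /* ...fill in the RTP header fields as usual... */

  return gst_buffer_append (outbuf, paybuf);
}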
Created attachment 249239 [details] [review]
rtpbaseaudiopayload: Avoid copying the data
Review of attachment 249238 [details] [review]:

Looks good to me

::: libs/gst/base/gstadapter.c
@@ +804,3 @@
  * This function is potentially more performant than gst_adapter_take()
  * since it can reuse the memory in pushed buffers by subbuffering
 + * or merging. This function will always return a contiguous buffer.

Maybe "always return a buffer with a single memory region"?
..and the same for rtpbasevideopayload? :)
(In reply to comment #13)
> ..and the same for rtpbasevideopayload? :)

No such thing, but I do plan to go over every video payloader as I guess they're the biggest offenders.
Core:

commit 592049159773d5377d1ce85f322c5751a4fac597
Author: Olivier Crête <olivier.crete@collabora.com>
Date:   Mon Jul 15 15:41:44 2013 -0400

    adapter: Add function to return buffer composed of multiple memories

    API: gst_adapter_take_buffer_fast()

Base:

commit c599c45b584188d48c7469c1743d55c2edc60538
Author: Olivier Crête <olivier.crete@collabora.com>
Date:   Mon Jul 15 16:13:11 2013 -0400

    rtpbaseaudiopayload: Avoid copying the data

Good:

commit 54c5a7f69019376c20e5304371209e4e1df37b86
Author: Olivier Crête <olivier.crete@collabora.com>
Date:   Tue Jul 16 15:37:49 2013 -0400

    rtp: Use gst_adapter_take_buffer_fast() where possible in RTP payloaders