GNOME Bugzilla – Bug 773770
rtpjitterbuffer: high CPU usage with low latency audio
Last modified: 2018-11-03 15:13:45 UTC
rtpjitterbuffer in combination with udpsrc seems to be very CPU demanding with a low latency audio stream. With a 1 kpps audio stream rtpjitterbuffer alone requires roughly 20-25% CPU time on a 500 MHz Cortex A7.

Usecase 1:

gst-launch-1.0 udpsrc retrieve-sender-address=false uri="udp://224.1.2.3:5004" caps="application/x-rtp,media=(string)audio,encoding-name=(string)L24,clock-rate=(int)48000,encoding-params=(string)2,channels=(int)2" ! rtpjitterbuffer latency=10000 ! fakesink

In this usecase two things can be observed:
* while buffering (the first 10 seconds) the CPU usage is 25% (for udpsrc)
* then the CPU jumps briefly to 60% and stabilizes at about 45%

Usecase 2:

gst-launch-1.0 audiotestsrc is-live=true wave=sine samplesperbuffer=48 ! "audio/x-raw,format=(string)S24BE,rate=(int)48000,channels=(int)2,channel-mask=(int)0x3" ! rtpL24pay ! rtpjitterbuffer latency=10000 ! fakesink

In the second usecase the CPU usage is 4-5% while buffering and 7-9% after buffering is done.

I would expect similar CPU usage (5-8%) also on top of the UDP source (however inefficient udpsrc currently is, see bug 772841), but 20% CPU is clearly too much. I have been looking through the logs for the usual suspects like extra buffer copies, but have not found any. The only clear difference I could see between the two logs is that in Usecase 2 the packets are processed immediately, whereas in Usecase 1 many timers are processed per packet. However, it is hard to debug that way, since the debug messages introduce a significant extra lag that disturbs the measurement.
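For reference, the packet rate follows from the payloader settings in Usecase 2: with samplesperbuffer=48 at a 48 kHz sample rate the payloader produces 48000 / 48 = 1000 packets per second, i.e. one RTP packet per millisecond, which is the 1 kpps figure mentioned above.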
10000ms is not really low latency. My guess is that it's the list of timers that uses that much CPU here. Can you get proper profiling data here? Run with a tool like perf to find the hot spots.
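A minimal profiling recipe along those lines (assuming perf is available on the target; the PID and the 10-second window are placeholders) could be:

  # record call graphs for the running pipeline for ~10 seconds
  perf record -g -p <pid> -- sleep 10
  # then inspect the hot spots
  perf report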
I agree, the 10000ms is completely artificial, just to see the difference in CPU usage when buffering and when not, which might indicate something. It doesn't seem to really matter what the latency is; I see the same CPU load with 10ms, with 100ms and with 10s.

I have been doing some analysis in the last few days, but my valgrind and perf profiling hasn't really led to anything so distinct between the good and the bad case that it would be worth mentioning. Partly this is due to some "unknown"s in the profiles, and partly because I suspect the high CPU usage is not caused by heavy computation but by cache misses and waiting for DRAM. But in both cases I'm seeing about 20% L1 misses and 6% LL misses, so again nothing really distinct in one case or the other.
Looking at the following test it really seems to be a cache issue:

1) first run Usecase 2 -> the CPU usage is around 7-9%
2) then run Usecase 1 in parallel -> the CPU usage of __both__ goes up to 40-50%
3) stop the UDP stream for Usecase 1 (but the command is still running) -> the CPU usage of UC1 goes to 0% and UC2 to 7-9%
4) start the UDP stream again and we're back in point 2
Here are some cache statistics:

UC1 (udpsrc)

 Performance counter stats for process id '738':

       5661.705299      task-clock (msec)         #    0.708 CPUs utilized
             33199      context-switches          #    0.006 M/sec
                 0      cpu-migrations            #    0.000 K/sec
                 0      page-faults               #    0.000 K/sec
        2636575647      cycles                    #    0.466 GHz                     (63.44%)
   <not supported>      stalled-cycles-frontend
   <not supported>      stalled-cycles-backend
         104954645      instructions              #    0.04  insns per cycle         (62.53%)
          13825670      branches                  #    2.442 M/sec                   (62.11%)
           6442396      branch-misses             #   46.60% of all branches         (61.54%)
          42112109      L1-dcache-loads           #    7.438 M/sec                   (62.13%)
           4642094      L1-dcache-load-misses     #   11.02% of all L1-dcache hits   (50.35%)
          28051277      LLC-loads                 #    4.955 M/sec                   (50.64%)
           9292795      LLC-load-misses           #   33.13% of all LL-cache hits    (50.77%)

       7.998025020 seconds time elapsed

-----

And UC2 (no udpsrc)

 Performance counter stats for process id '799':

       6215.619905      task-clock (msec)         #    0.689 CPUs utilized
             44273      context-switches          #    0.007 M/sec
                 0      cpu-migrations            #    0.000 K/sec
                 0      page-faults               #    0.000 K/sec
        2832839267      cycles                    #    0.456 GHz                     (54.66%)
   <not supported>      stalled-cycles-frontend
   <not supported>      stalled-cycles-backend
         240020862      instructions              #    0.08  insns per cycle         (62.25%)
          27402150      branches                  #    4.409 M/sec                   (66.28%)
           9297390      branch-misses             #   33.93% of all branches         (57.62%)
          82000459      L1-dcache-loads           #   13.193 M/sec                   (53.71%)
           6656276      L1-dcache-load-misses     #    8.12% of all L1-dcache hits   (48.78%)
          37442118      LLC-loads                 #    6.024 M/sec                   (54.09%)
          10521617      LLC-load-misses           #   28.10% of all LL-cache hits    (57.52%)

       9.023314498 seconds time elapsed
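Counters like these can be collected with something along the following lines (the PID and the 8-second window are placeholders; the exact event list depends on what the kernel/PMU on the Cortex A7 exposes):

  perf stat -p <pid> -e task-clock,context-switches,cpu-migrations,page-faults,cycles,instructions,branches,branch-misses,L1-dcache-loads,L1-dcache-load-misses,LLC-loads,LLC-load-misses sleep 8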
Created attachment 338947 [details] Call-graph of Usecase 1 - i.e. with udpsrc
Created attachment 338948 [details] Call-graph of Usecase 2 - i.e. with RTP generated from audiotestsrc via rtpL24pay
From a short look, it seems to be mostly syscalls: in both cases mainly mutexes/futexes, and in the udpsrc case also recvmsg. But nothing looks very odd here.
In the udpsrc case there is also more malloc/free.
(In reply to Petr Kulhavy from comment #8)
> In the udpsrc case there is also more malloc/free.

Could adding a buffer pool or something help with that?
Some analysis of rtpjitterbuffer.c and gstrtpjitterbuffer.c:
------------------------------------------------------------

1) RTPJitterBufferItem allocation - every single insert/remove triggers a malloc/free.
   -> This is not necessary: normally the buffer stabilizes at a certain number of frames, and periodically a frame is added or removed. So the items could be left in the data structure, just marked as unused and reused on the next insert (see the free-list sketch after this list).

2) gstrtpjitterbuffer.c:wait_next_timeout() allocates/frees a new clock ID for every single timeout.
   -> Since only one timer is waited on at a time this is redundant and could be done outside of the thread loop.
   -> Unfortunately this seems not to be that trivial due to timer unscheduling, which says the clock ID cannot be used afterwards :-(

3) The data structure in rtpjitterbuffer.c (a linked list) is not very efficient:
   * insert() is O(n) due to the initial sort-in and update_buffer_level()
   * get_buffer_level()/update_buffer_level() is O(n)
   -> I will see if I can think of a better, more memory-friendly data structure.

4) The data structure for the timers is not very efficient either and seems to be used even more intensively.
   -> This could be an issue, see more below.

5) The timers are stored in an array, and one TimerData structure is just a little bit larger than a cache line. I have tried to reduce it to exactly one cache line (64 B), but if there was any effect visible with perf it was <1%, which I consider a measurement error.

In general the objects are not very cache optimal: there are holes, and the members are not sorted by hotness. I have tried to cache-optimize GstRtpJitterBufferPrivate, which has a 4 kB window array in the middle of the structure :-o, but that didn't seem to have any effect either...

Analysis of the timer array usage
---------------------------------

The array is used in:

* find_timer()
  - operation: linear search for seqno
  - complexity: O(n)
  - called from: jitter_buffer_chain (called JBC below), set_timer, calculate_expected (called by JBC)

* add_timer()
  - operation: expand and add to the end
  - complexity: O(1)? but calls realloc?
  - called from: update_timers (called by JBC), set_timer (called by JBC, update_est_eos), calculate_expected (called by JBC)

* remove_timer()
  - operation: remove_index_fast
  - complexity: O(1)?

* remove_all_timers()
  - operation: set len to 0, i.e. discard all elements
  - complexity: O(1)?

* already_lost()
  - operation: linear search
  - complexity: O(n)
  - called from: jitter_buffer_chain

* update_timers()
  - operation: linear processing of all timers
  - complexity: O(n)
  - called from: jitter_buffer_chain

* wait_next_timeout()
  - operation: linear search for the lowest time/lowest seqno + removal of lost timers
  - complexity: O(n)
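A minimal sketch of the item-reuse idea from point 1, assuming a hypothetical free list kept next to the jitterbuffer (the PooledItem type, item_acquire() and item_release() are made-up names for illustration, not existing jitterbuffer API):

#include <glib.h>

/* Hypothetical pooled item; the real RTPJitterBufferItem carries more fields. */
typedef struct _PooledItem {
  struct _PooledItem *next;     /* free-list link while the item is unused */
  gpointer data;                /* the buffer/event carried by the item */
  guint16 seqnum;
} PooledItem;

static PooledItem *free_list = NULL;    /* items returned by remove, reused by insert */

/* Take an item from the free list instead of allocating on every insert. */
static PooledItem *
item_acquire (void)
{
  PooledItem *item;

  if (free_list != NULL) {
    item = free_list;
    free_list = item->next;
  } else {
    item = g_new0 (PooledItem, 1);
  }
  return item;
}

/* Instead of freeing, park the item on the free list for the next insert. */
static void
item_release (PooledItem * item)
{
  item->data = NULL;
  item->next = free_list;
  free_list = item;
}

int
main (void)
{
  PooledItem *a = item_acquire ();
  item_release (a);
  /* the second acquire reuses the first item's memory, no new allocation */
  g_assert (item_acquire () == a);
  return 0;
}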
> Could adding a buffer pool or something help with that?

Yes, that's something that's on my list for this cycle, fwiw.

Since this is only audio, I wonder how many packets per second this is? 50-100 packets per second or so should not have such a huge impact really, even on a low-powered machine.

If we hit the kernel with futex syscalls, that means there's a lot of contention on some locks (which is not unexpected for the jitterbuffer I suppose, with one thread pushing packets in and another pushing them out?).
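For the record, a minimal sketch of how a plain GstBufferPool could replace per-packet malloc/free with acquire/release on a pool of pre-allocated buffers. This is not how udpsrc is currently wired up; the 1500-byte size and the min/max buffer counts are illustrative values only:

#include <gst/gst.h>

int
main (int argc, char **argv)
{
  GstBufferPool *pool;
  GstStructure *config;
  GstBuffer *buf;

  gst_init (&argc, &argv);

  /* A plain pool pre-allocating fixed-size buffers for incoming packets. */
  pool = gst_buffer_pool_new ();
  config = gst_buffer_pool_get_config (pool);
  gst_buffer_pool_config_set_params (config, NULL, 1500, 16, 0);
  gst_buffer_pool_set_config (pool, config);
  gst_buffer_pool_set_active (pool, TRUE);

  if (gst_buffer_pool_acquire_buffer (pool, &buf, NULL) == GST_FLOW_OK) {
    /* ... fill the buffer with a received packet and push it downstream ... */
    gst_buffer_unref (buf);     /* unref returns the buffer to the pool */
  }

  gst_buffer_pool_set_active (pool, FALSE);
  gst_object_unref (pool);
  return 0;
}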
I think there was already another bug about the timer array usage where I've put down the same analysis before, but I couldn't think of a meaningful data structure that can do all these operations in a fast way and then had other things to do :)
(In reply to Tim-Philipp Müller from comment #11)
> > Could adding a buffer pool or something help with that?
>
> Yes, that's something that's on my list for this cycle, fwiw.
>
> Since this is only audio, I wonder how many packets per second this is?
> 50-100 packets per second or so should not have such a huge impact really,
> even on a low-powered machine.

Normally yes, but this is a low latency stream with 1k packets per second, which seems to be quite a challenge for the code.
Created attachment 339397 [details] Design proposal

I have written down some rough ideas on how to implement the jitterbuffer using a sliding window and a circular buffer. Please let me know your thoughts. In particular, I'm not sure whether I have covered all the aspects that are currently implemented with timers.

Originally I wanted to write a short document describing the current solution and its weaknesses and then proposing the new solution - that's why it is in this form. Unfortunately the reverse engineering of the timer part seems to be taking rather a long time. That's why I'm bringing just the proposal part up for discussion, which might be more efficient in terms of time and effort than a complete reverse engineering of the current solution.
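To make the circular buffer idea a bit more concrete, here is a minimal sketch of a seqnum-indexed ring; the slot count, the PacketSlot type and the function names are invented for illustration, the attached proposal remains the authoritative description:

#include <glib.h>

/* Power-of-two sized ring indexed by RTP sequence number, so insert and
 * lookup are O(1) and the storage is one contiguous, cache-friendly block. */
#define RING_SIZE 1024          /* power of two, larger than the latency in packets */

typedef struct {
  gboolean in_use;
  guint16 seqnum;
  gpointer buffer;              /* the packet stored in this slot */
} PacketSlot;

typedef struct {
  PacketSlot slots[RING_SIZE];
  guint16 head_seqnum;          /* next sequence number expected to be popped */
} PacketRing;

/* Insert a packet; out-of-order packets simply land in their own slot. */
static gboolean
ring_insert (PacketRing * ring, guint16 seqnum, gpointer buffer)
{
  PacketSlot *slot = &ring->slots[seqnum & (RING_SIZE - 1)];

  if (slot->in_use)
    return FALSE;               /* duplicate, or the ring is too small */

  slot->in_use = TRUE;
  slot->seqnum = seqnum;
  slot->buffer = buffer;
  return TRUE;
}

/* Pop the next packet in sequence order, or NULL if it has not arrived yet
 * (in the real element this is where a lost-packet decision or the sliding
 * window deadline would kick in). */
static gpointer
ring_pop (PacketRing * ring)
{
  PacketSlot *slot = &ring->slots[ring->head_seqnum & (RING_SIZE - 1)];

  if (!slot->in_use || slot->seqnum != ring->head_seqnum)
    return NULL;

  slot->in_use = FALSE;
  ring->head_seqnum++;
  return slot->buffer;
}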
We can probably learn something from the following papers for a useful timer data structure: https://blog.acolyer.org/2015/11/23/hashed-and-hierarchical-timing-wheels/ http://www.cs.columbia.edu/%7Enahum/w6998/papers/sosp87-timing-wheels.pdf
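The core of the hashed timing wheel described in those papers is just a small ring of per-tick timer lists. A minimal sketch of the idea (slot count, tick semantics and all names are illustrative, not an existing GStreamer API):

#include <glib.h>

#define WHEEL_SLOTS 256                 /* power of two, one slot per tick */

typedef struct _WheelTimer {
  struct _WheelTimer *next;
  guint64 expires;                      /* absolute expiry time, in ticks */
} WheelTimer;

typedef struct {
  WheelTimer *slots[WHEEL_SLOTS];       /* each slot holds the timers hashing to that tick */
  guint64 now;                          /* current tick */
} TimingWheel;

/* Insert is O(1): hash the expiry tick into a slot. */
static void
wheel_add (TimingWheel * wheel, WheelTimer * t, guint64 expires)
{
  guint slot = expires & (WHEEL_SLOTS - 1);

  t->expires = expires;
  t->next = wheel->slots[slot];
  wheel->slots[slot] = t;
}

/* Advancing by one tick only inspects a single slot, instead of scanning the
 * whole timer array the way wait_next_timeout() does today. */
static void
wheel_tick (TimingWheel * wheel, void (*fire) (WheelTimer *))
{
  guint slot = wheel->now & (WHEEL_SLOTS - 1);
  WheelTimer **prev = &wheel->slots[slot];

  while (*prev != NULL) {
    WheelTimer *t = *prev;

    if (t->expires <= wheel->now) {
      *prev = t->next;                  /* unlink and fire the expired timer */
      fire (t);
    } else {
      prev = &t->next;                  /* belongs to a later lap of the wheel */
    }
  }
  wheel->now++;
}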
-- GitLab Migration Automatic Message -- This bug has been migrated to freedesktop.org's GitLab instance and has been closed from further activity. You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.freedesktop.org/gstreamer/gst-plugins-good/issues/318.