Bug 773770 - rtpjitterbuffer: high CPU usage with low latency audio
Status: RESOLVED OBSOLETE
Product: GStreamer
Classification: Platform
Component: gst-plugins-good
Version: 1.9.90
OS: Other Linux
Priority: Normal
Severity: normal
Target Milestone: git master
Assigned To: GStreamer Maintainers
QA Contact: GStreamer Maintainers
Depends on:
Blocks:
 
 
Reported: 2016-11-01 05:44 UTC by Petr Kulhavy
Modified: 2018-11-03 15:13 UTC
See Also:
GNOME target: ---
GNOME version: ---


Attachments:
Call-graph of the Usecase 1 - i.e. with udpsrc (48.06 KB, application/pdf) - 2016-11-02 11:42 UTC, Petr Kulhavy
Call-graph of the Usecase 2 - i.e. with RTP generated from audiotestsrc via rtpL24pay (37.07 KB, application/pdf) - 2016-11-02 11:44 UTC, Petr Kulhavy
Design proposal (30.74 KB, application/pdf) - 2016-11-09 13:51 UTC, Petr Kulhavy

Description Petr Kulhavy 2016-11-01 05:44:23 UTC
rtpjitterbuffer in combination with udpsrc seems to be very CPU demanding with a low-latency audio stream. With a 1 kpps audio stream, rtpjitterbuffer alone requires roughly 20-25% CPU time on a 500 MHz Cortex-A7.

Usecase 1:
gst-launch-1.0 udpsrc  retrieve-sender-address=false uri="udp://224.1.2.3:5004" caps="application/x-rtp,media=(string)audio,encoding-name=(string)L24,clock-rate=(int)48000,encoding-params=(string)2,channels=(int)2" ! rtpjitterbuffer latency=10000 ! fakesink

In this use case two things can be observed:
* while buffering (the first 10 seconds) the CPU usage is 25% (for udpsrc)
* then the CPU briefly jumps to 60% and stabilizes at about 45%

Usecase 2:
gst-launch-1.0 audiotestsrc is-live=true wave=sine samplesperbuffer=48 ! "audio/x-raw,format=(string)S24BE,rate=(int)48000,channels=(int)2,channel-mask=(int)0x3" ! rtpL24pay ! rtpjitterbuffer latency=10000  ! fakesink

In the second use case the CPU usage is 4-5% while buffering and 7-9% after buffering is done.

I would expect similar CPU usage (5-8%) on top of the UDP source as well (however inefficient udpsrc currently is, see bug 772841), but 20% CPU is clearly too much.

I was investigating the logs for the usual suspects like extra buffer copies, but have not found any. The only clear difference I could see between the two logs is that in Usecase 2 the packets are processed immediately, whereas in Usecase 1 many timers are processed per packet. However, it is hard to debug this way, since the debug messages introduce significant extra lag, which disturbs the measurement.
Comment 1 Sebastian Dröge (slomo) 2016-11-01 10:09:50 UTC
10000ms is not really low latency. My guess is that it's the list of timers that uses that much CPU here.

Can you get proper profiling data here? Run with a tool like perf to find the hot spots.
Comment 2 Petr Kulhavy 2016-11-01 10:43:35 UTC
I agree, the 10000 ms is completely artificial, just to see the difference in CPU usage between buffering and not buffering, which might indicate something. The actual latency doesn't really seem to matter; I see the same CPU load with 10 ms, 100 ms or 10 s.

I have been doing some analysis over the last few days, but my valgrind and perf profiling haven't really led to anything distinct enough between the good and the bad case to be worth mentioning. Partly this is due to some "unknown"s in the profiles, and partly I suspect the high CPU usage is caused not by computation but by cache misses and waiting for DRAM. But in both cases I'm seeing about 20% L1 misses and 6% LL misses, so again nothing really distinguishes one case from the other.
Comment 3 Petr Kulhavy 2016-11-02 11:20:15 UTC
Looking at the following test, it really does seem to be a cache issue:
1) first run Usecase 2 -> the CPU is around 7-9%.
2) then run Usecase 1 in parallel -> the CPU usage of __both__ goes up to 40-50%.
3) stop the UDP stream for Usecase 1 (but the command is still running) -> the CPU usage of UC1 goes to 0% and UC2 to 7-9%
4) start the UDP stream again and we're back in point 2
Comment 4 Petr Kulhavy 2016-11-02 11:30:44 UTC
Here are some cache statistics:
UC1 (udpsrc)

 Performance counter stats for process id '738':

       5661.705299      task-clock (msec)         #    0.708 CPUs utilized          
             33199      context-switches          #    0.006 M/sec                  
                 0      cpu-migrations            #    0.000 K/sec                  
                 0      page-faults               #    0.000 K/sec                  
        2636575647      cycles                    #    0.466 GHz                      (63.44%)
   <not supported>      stalled-cycles-frontend  
   <not supported>      stalled-cycles-backend   
         104954645      instructions              #    0.04  insns per cycle          (62.53%)
          13825670      branches                  #    2.442 M/sec                    (62.11%)
           6442396      branch-misses             #   46.60% of all branches          (61.54%)
          42112109      L1-dcache-loads           #    7.438 M/sec                    (62.13%)
           4642094      L1-dcache-load-misses     #   11.02% of all L1-dcache hits    (50.35%)
          28051277      LLC-loads                 #    4.955 M/sec                    (50.64%)
           9292795      LLC-load-misses           #   33.13% of all LL-cache hits     (50.77%)

       7.998025020 seconds time elapsed

-----
And UC2 (no udpsrc)

 Performance counter stats for process id '799':

       6215.619905      task-clock (msec)         #    0.689 CPUs utilized          
             44273      context-switches          #    0.007 M/sec                  
                 0      cpu-migrations            #    0.000 K/sec                  
                 0      page-faults               #    0.000 K/sec                  
        2832839267      cycles                    #    0.456 GHz                      (54.66%)
   <not supported>      stalled-cycles-frontend  
   <not supported>      stalled-cycles-backend   
         240020862      instructions              #    0.08  insns per cycle          (62.25%)
          27402150      branches                  #    4.409 M/sec                    (66.28%)
           9297390      branch-misses             #   33.93% of all branches          (57.62%)
          82000459      L1-dcache-loads           #   13.193 M/sec                    (53.71%)
           6656276      L1-dcache-load-misses     #    8.12% of all L1-dcache hits    (48.78%)
          37442118      LLC-loads                 #    6.024 M/sec                    (54.09%)
          10521617      LLC-load-misses           #   28.10% of all LL-cache hits     (57.52%)

       9.023314498 seconds time elapsed
Comment 5 Petr Kulhavy 2016-11-02 11:42:58 UTC
Created attachment 338947 [details]
Call-graph of the Usecase 1 - i.e. with udpsrc
Comment 6 Petr Kulhavy 2016-11-02 11:44:06 UTC
Created attachment 338948 [details]
Call-graph of the Usecase 2 - i.e. with RTP generated from audiotestsrc via rtpL24pay
Comment 7 Sebastian Dröge (slomo) 2016-11-02 11:47:59 UTC
From a short look, it seems like it's generally syscalls. In both cases mostly mutexes/futexes, in the udpsrc case also recvmsg. But generally nothing looks very odd here.
Comment 8 Petr Kulhavy 2016-11-02 12:05:47 UTC
In the udpsrc case there is also more malloc/free.
Comment 9 Nicolas Dufresne (ndufresne) 2016-11-02 14:19:11 UTC
(In reply to Petr Kulhavy from comment #8)
> In the udpsrc case there is also more malloc/free.

Could adding a buffer pool or something help with that?
Comment 10 Petr Kulhavy 2016-11-02 14:34:30 UTC
Some analysis of rtpjitterbuffer.c and gstrtpjitterbuffer.c:
------------------------------------------------------------
1) RTPJitterBufferItem allocation - every single insert/remove triggers a malloc/free
    -> this is not necessary, because normally the buffer stabilizes at a certain number of frames and a frame is periodically added or removed. So the items could be left in the data structure, just marked as unused and reused on the next insert (see the sketch after this list).

2)  gstrtpjitterbuffer.c:wait_next_timeout() allocates/frees a new clock ID for every single timeout.
    -> since there is only one timer waiting at a time this is redundant and could be done outside of the thread loop
    -> unfortunately it seems not to be that trivial due to timer unschedule, which says the clock ID cannot be used afterwards :-(

3) the data structure in rtpjitterbuffer.c (a linked list) is not very efficient:
* insert() is O(n) due to the initial sort-in and update_buffer_level()
* get_buffer_level()/update_buffer_level() are O(n)
    -> I will see if I can think of a better data structure that is more memory-friendly

4) the data structure for timers is not very efficient either and seems to be used even more intensively.
    -> this could be an issue, see more below

5) The timers are stored in an array and one TimerData structure is just a little larger than a cache line. I have tried to reduce it to exactly one cache line (64 B), but any effect visible in perf was <1%, which I consider measurement error.

In general the objects are not very cache-optimal. There are holes and the members are not sorted by hotness. I have tried to cache-optimize GstRtpJitterBufferPrivate, which has a 4 kB window array in the middle of the structure :-o but that didn't seem to have any effect either...
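
The item reuse suggested in point 1 could look roughly like the following. This is only a sketch with made-up names (ItemPool, PooledItem), not the existing RTPJitterBufferItem code, and it ignores locking and the event/query item types the real jitterbuffer handles:

#include <glib.h>

/* Hypothetical free list to avoid a malloc/free per packet (point 1 above).
 * Instead of freeing an item when it leaves the jitterbuffer, it is parked
 * on a singly-linked free list and handed out again on the next insert. */

typedef struct _PooledItem PooledItem;
typedef struct _ItemPool ItemPool;

struct _PooledItem {
  PooledItem *next;          /* link for the free list                 */
  gpointer data;             /* the GstBuffer / event carried by it    */
  guint16 seqnum;
  /* ... other RTPJitterBufferItem-like fields ... */
};

struct _ItemPool {
  PooledItem *free_items;    /* items currently unused                 */
};

static PooledItem *
item_pool_acquire (ItemPool *pool)
{
  PooledItem *item = pool->free_items;

  if (item != NULL)
    pool->free_items = item->next;      /* reuse a parked item */
  else
    item = g_new0 (PooledItem, 1);      /* pool empty: grow it  */

  item->next = NULL;
  return item;
}

static void
item_pool_release (ItemPool *pool, PooledItem *item)
{
  /* don't free: park the item for the next insert */
  item->data = NULL;
  item->next = pool->free_items;
  pool->free_items = item;
}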

Analysis of the timer array usage
---------------------------------
The array is used in:

* find_timer()
    - operation: linear search for seqno
    - complexity: O(n)
    - called from: jitter_buffer_chain (called JBC below), set_timer, calculate_expected (called by JBC)
* add_timer()
    - operation expand and add to the end
    - complexity:  O(1) ? but calls realloc?
    - called from: update_timers (called by JBC), set_timer (called by JBC, update_est_eos), calculate_expected (called by JBC)
* remove_timer()
    - operation: remove_index_fast
    - complexity: O(1) ?
* remove_all_timers()
    - operation: set len to 0, i.e. discard all elements
    - complexity: O(1) ?
* already_lost()
    - operation: linear search
    - complexity: O(n)
    - called from: jitter_buffer_chain
* update_timers()
    - operation: linear processing of all timers
    - complexity: O(n)
    - called from: jitter_buffer_chain
* wait_next_timeout()
    - operation: linear search for the lowest time/lowest seqno + remove lost timers
    - complexity: O(n)
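
As a purely illustrative aside on the O(n) lookups above: a seqnum-keyed side index (a GHashTable here) next to the existing timer storage would make find_timer() and already_lost() O(1). The names (TimerIndex, timer_index_*) are made up, and this assumes the timers are individually heap-allocated so the stored pointers stay valid, which is not how the current array-of-values storage with remove_index_fast works:

#include <gst/gst.h>

/* Hypothetical seqnum -> TimerData index kept next to the timer storage,
 * so find_timer()/already_lost() become a hash lookup instead of a scan.
 * TimerData stands in for the structure used in gstrtpjitterbuffer.c. */

typedef struct {
  guint16 seqnum;
  GstClockTime timeout;
  /* ... */
} TimerData;

typedef struct {
  GHashTable *by_seqnum;     /* guint seqnum -> TimerData *            */
} TimerIndex;

static void
timer_index_init (TimerIndex *idx)
{
  idx->by_seqnum = g_hash_table_new (g_direct_hash, g_direct_equal);
}

static void
timer_index_add (TimerIndex *idx, TimerData *timer)
{
  g_hash_table_insert (idx->by_seqnum,
      GUINT_TO_POINTER ((guint) timer->seqnum), timer);
}

static TimerData *
timer_index_find (TimerIndex *idx, guint16 seqnum)
{
  /* O(1) instead of a linear search over the array */
  return g_hash_table_lookup (idx->by_seqnum,
      GUINT_TO_POINTER ((guint) seqnum));
}

static void
timer_index_remove (TimerIndex *idx, TimerData *timer)
{
  g_hash_table_remove (idx->by_seqnum,
      GUINT_TO_POINTER ((guint) timer->seqnum));
}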
Comment 11 Tim-Philipp Müller 2016-11-02 14:37:03 UTC
> Could adding a buffer pool or something help with that?

Yes, that's something that's on my list for this cycle fwiw.

Since this is only audio, I wonder how many packets per second this is? 50-100 packets per second or so should not have such a huge impact really, even on a low-powered machine.

If we hit the kernel with futex syscalls that means there's a lot of contention on some locks (which is not unexpected for the jitterbuffer I suppose, with one thread pushing packets in and another pushing them out?).
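
For reference, a minimal sketch of the buffer-pool idea being discussed, using the existing GstBufferPool API; the 1500-byte size and pool depth are arbitrary here and error handling is omitted:

#include <gst/gst.h>

/* Pre-allocate fixed-size buffers once, then acquire/release them instead
 * of doing a malloc/free per packet. */
static GstBufferPool *
create_packet_pool (void)
{
  GstBufferPool *pool = gst_buffer_pool_new ();
  GstStructure *config = gst_buffer_pool_get_config (pool);

  /* NULL caps, 1500 bytes per buffer, min 0, max 128 buffers */
  gst_buffer_pool_config_set_params (config, NULL, 1500, 0, 128);
  gst_buffer_pool_set_config (pool, config);
  gst_buffer_pool_set_active (pool, TRUE);

  return pool;
}

/* Acquire a buffer for an incoming packet; unreffing it later returns it
 * to the pool instead of freeing it. */
static GstBuffer *
get_packet_buffer (GstBufferPool *pool)
{
  GstBuffer *buf = NULL;

  if (gst_buffer_pool_acquire_buffer (pool, &buf, NULL) != GST_FLOW_OK)
    return NULL;
  return buf;
}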
Comment 12 Sebastian Dröge (slomo) 2016-11-02 14:40:59 UTC
I think there was already another bug about the timer array usage where I've put down the same analysis before, but I couldn't think of a meaningful data structure that can do all these operations in a fast way and then had other things to do :)
Comment 13 Petr Kulhavy 2016-11-02 14:48:36 UTC
(In reply to Tim-Philipp Müller from comment #11)
> > Could adding a buffer pool or something help with that?
> 
> Yes, that's something that's on my list for this cycle fwiw.
> 
> Since this is only audio, I wonder how many packets per second this is?
> 50-100 packets per second or so should not have such a huge impact really,
> even on a low-powered machine.

Normally yes, but this is a low-latency stream with 1k packets per second, which seems to be quite a challenge for the code.
Comment 14 Petr Kulhavy 2016-11-09 13:51:13 UTC
Created attachment 339397 [details]
Design proposal

I have written down some rough ideas on how to implement the jitterbuffer using a sliding window and a circular buffer. Please let me know your thoughts. In particular, I'm not sure whether I have covered all the aspects that are currently implemented with timers.

Originally I wanted to write a short document describing the current solution and its weaknesses and then proposing the new one - hence this form. Unfortunately, reverse-engineering the timer part is taking rather too long. That's why I'm bringing just the proposal part up for discussion, which might be more efficient in terms of time and effort than completely reverse-engineering the current solution.
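
Since the attached proposal is a PDF, here is only a tiny illustration of the general "circular buffer indexed by seqnum" idea, not the attached design itself: a slot store where the RTP seqnum picks the slot directly, making insert O(1). All names and the window size are made up:

#include <gst/gst.h>

#define WINDOW_SIZE 512                 /* must be a power of two */

/* Illustrative packet store addressed by RTP seqnum modulo a power-of-two
 * window. Insert needs no sort-in; gaps show up as NULL slots. */
typedef struct {
  GstBuffer *slots[WINDOW_SIZE];        /* NULL = no packet for that seqnum */
  guint16    base_seqnum;               /* seqnum of the oldest expected slot */
} SeqWindow;

static gboolean
seq_window_insert (SeqWindow *win, guint16 seqnum, GstBuffer *buf)
{
  guint16 offset = seqnum - win->base_seqnum;   /* wraps correctly mod 2^16 */

  if (offset >= WINDOW_SIZE)
    return FALSE;                               /* too old or too far ahead */

  win->slots[seqnum % WINDOW_SIZE] = buf;       /* O(1) insert */
  return TRUE;
}

/* Pop the oldest slot (or NULL for a gap) and slide the window by one. */
static GstBuffer *
seq_window_pop (SeqWindow *win)
{
  guint idx = win->base_seqnum % WINDOW_SIZE;
  GstBuffer *buf = win->slots[idx];

  win->slots[idx] = NULL;
  win->base_seqnum++;
  return buf;
}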
Comment 15 Sebastian Dröge (slomo) 2017-09-27 17:11:01 UTC
We can probably learn something from the following papers for a useful timer data structure: https://blog.acolyer.org/2015/11/23/hashed-and-hierarchical-timing-wheels/ http://www.cs.columbia.edu/%7Enahum/w6998/papers/sosp87-timing-wheels.pdf
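
For illustration, a toy version of the hashed-timing-wheel idea from those papers; it hard-codes a 1 ms tick and 256 slots, ignores locking and GstClockTime wrap-around, and leaves ownership of fired timers to the caller:

#include <gst/gst.h>

#define WHEEL_SLOTS 256
#define TICK_NS     (1 * GST_MSECOND)   /* wheel resolution: 1 ms */

/* Toy hashed timing wheel: timers are hashed into slots by expiry time.
 * Each tick only the current slot is scanned; entries whose deadline has
 * not been reached yet (a later "round") simply stay in the slot. */
typedef struct {
  GstClockTime deadline;
  void (*callback) (gpointer user_data);
  gpointer user_data;
} WheelTimer;

typedef struct {
  GSList *slots[WHEEL_SLOTS];
} TimingWheel;

static void
wheel_add (TimingWheel *wheel, WheelTimer *timer)
{
  guint slot = (guint) ((timer->deadline / TICK_NS) % WHEEL_SLOTS);

  /* O(1) insert, no sorting */
  wheel->slots[slot] = g_slist_prepend (wheel->slots[slot], timer);
}

/* Called once per tick with the current time; fires due timers in the
 * current slot only, so the cost per tick is proportional to that slot. */
static void
wheel_advance (TimingWheel *wheel, GstClockTime now)
{
  guint slot = (guint) ((now / TICK_NS) % WHEEL_SLOTS);
  GSList *item = wheel->slots[slot];
  GSList *keep = NULL;

  while (item != NULL) {
    WheelTimer *timer = item->data;
    GSList *next = item->next;

    if (timer->deadline <= now) {
      timer->callback (timer->user_data);       /* expired: fire it */
      g_slist_free_1 (item);
    } else {
      item->next = keep;                        /* not yet due: keep */
      keep = item;
    }
    item = next;
  }
  wheel->slots[slot] = keep;
}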
Comment 16 GStreamer system administrator 2018-11-03 15:13:45 UTC
-- GitLab Migration Automatic Message --

This bug has been migrated to freedesktop.org's GitLab instance and has been closed from further activity.

You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.freedesktop.org/gstreamer/gst-plugins-good/issues/318.