GNOME Bugzilla – Bug 772841
udpsrc: high CPU usage at high packet rate
Last modified: 2018-11-03 15:12:54 UTC
A simple pipeline:

  gst-launch-1.0 udpsrc uri="udp://224.100.1.1:5004" ! fakesink

is causing 30% CPU usage on a Cortex A7 @ 528 MHz at 1000 packets/second. Such high CPU usage is unacceptable, especially when a C program doing the same thing - reading from a socket and discarding the data - needs just 3-4% CPU. This use case is crucial for handling low-latency audio.
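For comparison, the kind of minimal read-and-discard C program referred to above might look like this (an illustrative sketch; the group and port are taken from the pipeline, error handling omitted):

#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>

int main (void)
{
  char buf[1500];
  struct sockaddr_in sa;
  struct ip_mreq mreq;
  int fd = socket (AF_INET, SOCK_DGRAM, 0);

  memset (&sa, 0, sizeof (sa));
  sa.sin_family = AF_INET;
  sa.sin_addr.s_addr = htonl (INADDR_ANY);
  sa.sin_port = htons (5004);
  bind (fd, (struct sockaddr *) &sa, sizeof (sa));

  /* join the multicast group used by the pipeline above */
  mreq.imr_multiaddr.s_addr = inet_addr ("224.100.1.1");
  mreq.imr_interface.s_addr = htonl (INADDR_ANY);
  setsockopt (fd, IPPROTO_IP, IP_ADD_MEMBERSHIP, &mreq, sizeof (mreq));

  for (;;)
    recv (fd, buf, sizeof (buf), 0);    /* read and discard */
}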
I've also tested the "retrieve-sender-address" patch and disabling the sender address saves about 2%.
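For reference, assuming the property name from that patch, the test pipeline with sender-address retrieval disabled would be:

  gst-launch-1.0 udpsrc uri="udp://224.100.1.1:5004" retrieve-sender-address=false ! fakesink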
I have been working on this and hope to land some improvements early in the 1.11 cycle (once 1.10 is out).
Created attachment 337560 [details]
Callgrind output

I've done some profiling with valgrind and it turned out that udpsrc allocates a buffer for every packet. In fact, there are about 8 malloc calls per g_socket_receive_message call.
(In reply to Tim-Philipp Müller from comment #2)
> I have been working on this and hope to land some improvements early in
> the 1.11 cycle (once 1.10 is out).

How much improvement do you expect? Would you mind sharing the patches here? I'm kind of stuck and can't wait, as I need to significantly reduce the CPU usage to get my use case working. So I will be happy to test and possibly optimize further. It would save time and avoid duplicating effort. Thanks!
Created attachment 339726 [details] [review]
Optimization to receive messages only in multicast

Here is a small optimization for non-multicast traffic. It does not get control messages from g_socket_receive_message() and therefore does not release them.
Created attachment 339727 [details] [review]
Buffer pool patch

Here is my attempt at the buffer pool. So far with a non-configurable buffer size of 1500 (for the usual MTU). However, it has the problem that the buffer is freed due to the resize to the actual packet size. Any idea how to avoid this issue and take full advantage of the buffer pool?
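For context, the interaction that frees the buffer looks roughly like this (an illustrative fragment, not the exact patch):

/* res = number of bytes g_socket_receive_message() actually read */
gst_buffer_resize (outbuf, 0, res);   /* shrink the 1500-byte buffer to res */
/* on release, the default pool compares the buffer against its configured
 * size, sees a mismatch and frees the buffer instead of recycling it */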
(In reply to Petr Kulhavy from comment #5)
> Created attachment 339726 [details] [review]
> Optimization to receive messages only in multicast
>
> Here is a small optimization for non-multicast traffic. It does not get
> control messages from g_socket_receive_message() and therefore does not
> release them.

See also https://bugzilla.gnome.org/show_bug.cgi?id=769994
Comment on attachment 339726 [details] [review]
Optimization to receive messages only in multicast

commit 1fc572d7408c339a922817a1f99579d6228533df
Author: Petr Kulhavy <brain@jikos.cz>
Date:   Sat Nov 12 23:34:23 2016 +0100

    udpsrc: receive control messages only in multicast

    Control messages are used only in multicast mode - to detect if the
    destination address is not ours and possibly drop the packet. However
    in non-multicast modes the messages are still allocated and freed
    even if not used. Therefore request control messages from
    g_socket_receive_message() only in multicast mode.

    https://bugzilla.gnome.org/show_bug.cgi?id=772841
(In reply to Petr Kulhavy from comment #6)
> Created attachment 339727 [details] [review]
> Buffer pool patch
>
> Here is my attempt at the buffer pool. So far with a non-configurable
> buffer size of 1500 (for the usual MTU). However, it has the problem that
> the buffer is freed due to the resize to the actual packet size. Any idea
> how to avoid this issue and take full advantage of the buffer pool?

The buffer pool should probably resize the buffer back (if it can) and check the memory maxsize instead of the size.
(In reply to Sebastian Dröge (slomo) from comment #9)
> (In reply to Petr Kulhavy from comment #6)
> > Created attachment 339727 [details] [review]
> > Buffer pool patch
> >
> > Here is my attempt at the buffer pool. So far with a non-configurable
> > buffer size of 1500 (for the usual MTU). However, it has the problem
> > that the buffer is freed due to the resize to the actual packet size.
> > Any idea how to avoid this issue and take full advantage of the buffer
> > pool?
>
> The buffer pool should probably resize the buffer back (if it can) and
> check the memory maxsize instead of the size.

That is what I have actually done in my local copy. Was there any particular reason for releasing it and not resizing it back? Can the maxsize change? How?
(In reply to Petr Kulhavy from comment #10)
> (In reply to Sebastian Dröge (slomo) from comment #9)
> > The buffer pool should probably resize the buffer back (if it can) and
> > check the memory maxsize instead of the size.
>
> That is what I have actually done in my local copy. Was there any
> particular reason for releasing it and not resizing it back?

I don't think so. You also need to check that it's still the same memory, btw, but maybe the UDP buffer pool should be able to be less strict here.

> Can the maxsize change? How?

Different memory
Created attachment 339793 [details] [review]
Remove redundant saddr unref

This is not causing any performance issue, but I found it during the optimizations. The unref of saddr before calling receive message seems to be redundant, because saddr is already unref-ed just before the goto retry.
Comment on attachment 339793 [details] [review]
Remove redundant saddr unref

It would probably make more sense to move it right after the "retry:" label and remove it from before the "goto retry" places.
Created attachment 339798 [details] [review]
Remove redundant saddr unref

Agreed, makes sense. Here it is, moved to just after the retry: label.
Comment on attachment 339798 [details] [review]
Remove redundant saddr unref

commit 89ad2de93ec6b5834e2111e58fc77868033f7ada
Author: Petr Kulhavy <brain@jikos.cz>
Date:   Mon Nov 14 12:13:14 2016 +0100

    udpsrc: remove redundant saddr unref

    The g_object_unref (saddr) before receiving a message seems to be
    redundant, as it is done just before jumping to retry.

    Though not directly related, part of
    https://bugzilla.gnome.org/show_bug.cgi?id=772841
Created attachment 339819 [details] [review]
Buffer pool patch

Here is the updated version of the buffer pool modifications, with a configurable buffer size (as a parameter).
Created attachment 339820 [details] [review]
GstBufferPool modification

Here is a related modification of the default_reset_buffer() function in GstBufferPool, to reset the buffer size if the memory hasn't changed. To be applied on gstreamer.
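The shape of that change, roughly (a sketch of the idea rather than the verbatim diff; GST_BUFFER_FLAG_TAG_MEMORY is the flag GStreamer sets whenever a buffer's memory was replaced, so its absence means the memory is intact):

/* inside gstbufferpool.c, so pool->priv is accessible */
static void
default_reset_buffer (GstBufferPool * pool, GstBuffer * buffer)
{
  GST_BUFFER_FLAGS (buffer) &= GST_BUFFER_FLAG_TAG_MEMORY;
  GST_BUFFER_PTS (buffer) = GST_CLOCK_TIME_NONE;
  GST_BUFFER_DTS (buffer) = GST_CLOCK_TIME_NONE;
  GST_BUFFER_DURATION (buffer) = GST_CLOCK_TIME_NONE;
  GST_BUFFER_OFFSET (buffer) = GST_BUFFER_OFFSET_NONE;
  GST_BUFFER_OFFSET_END (buffer) = GST_BUFFER_OFFSET_NONE;

  /* if the memory is intact, grow the buffer back to its full size so the
   * pool can reuse it even after a downstream resize */
  if (!GST_BUFFER_FLAG_IS_SET (buffer, GST_BUFFER_FLAG_TAG_MEMORY))
    gst_buffer_resize (buffer, 0, pool->priv->size);
}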
I'm also wondering if reducing the number of syscalls would bring any further performance boost. I mean try to combine the wait (poll) and receive, e.g. by setting SO_RCVTIMEO on the socket. What do you think?
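For illustration, the SO_RCVTIMEO idea would look roughly like this (a sketch, not a proposed patch; with this socket option a blocking recv() returns with EAGAIN/EWOULDBLOCK once the timeout expires, so the separate per-packet poll() could be dropped):

#include <sys/socket.h>
#include <sys/time.h>

/* make blocking reads on fd give up after 'millis' milliseconds */
static int
set_recv_timeout (int fd, int millis)
{
  struct timeval tv;

  tv.tv_sec = millis / 1000;
  tv.tv_usec = (millis % 1000) * 1000;
  return setsockopt (fd, SOL_SOCKET, SO_RCVTIMEO, &tv, sizeof (tv));
}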
> I'm also wondering if reducing the number of syscalls would bring any
> further performance boost. I mean try to combine the wait (poll) and
> receive, e.g. by setting SO_RCVTIMEO on the socket. What do you think?

We should use recvmmsg(). I have patches for that which I need to rebase; this is next on my list, give me a day or so please :)
(In reply to Tim-Philipp Müller from comment #19)
> We should use recvmmsg(). I have patches for that which I need to rebase;
> this is next on my list, give me a day or so please :)

I have been looking at recvmmsg(). But the timeout seems to be buggy:

man recvmmsg:
"BUGS: The timeout argument does not work as intended. The timeout is checked only after the receipt of each datagram, so that if up to vlen-1 datagrams are received before the timeout expires, but then no further datagrams are received, the call will block forever."

So I'm afraid it is useless :-(
(In reply to Petr Kulhavy from comment #20)
> I have been looking at recvmmsg(). But the timeout seems to be buggy:
>
> man recvmmsg:
> "BUGS: The timeout argument does not work as intended. The timeout is
> checked only after the receipt of each datagram, so that if up to vlen-1
> datagrams are received before the timeout expires, but then no further
> datagrams are received, the call will block forever."
>
> So I'm afraid it is useless :-(

Not totally true. When using it non-blocking, you can first poll, and then you call recvmmsg with N messages allocated. As it's non-blocking, it will give you back M messages (where M <= N). You don't need to use the timeout.
You can just optimise for the high-packet-rate case, in which case you don't poll first but just try to read packets, and if you don't get anything (or not enough) then you wait with poll.
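A sketch of that read-first strategy with recvmmsg() (illustrative only; assumes msgs points to a pre-initialized array of N_MSGS struct mmsghdr, and MSG_DONTWAIT keeps the reads non-blocking regardless of the socket mode):

#define _GNU_SOURCE
#include <errno.h>
#include <poll.h>
#include <sys/socket.h>

#define N_MSGS 32

/* try to drain up to N_MSGS datagrams in one syscall; only fall back to
 * poll() when the non-blocking read reports that no data is queued */
static int
drain_packets (int fd, struct mmsghdr *msgs)
{
  int n = recvmmsg (fd, msgs, N_MSGS, MSG_DONTWAIT, NULL);

  if (n < 0 && (errno == EAGAIN || errno == EWOULDBLOCK)) {
    struct pollfd pfd = { .fd = fd, .events = POLLIN };

    poll (&pfd, 1, -1);   /* block until data arrives, then retry */
    n = recvmmsg (fd, msgs, N_MSGS, MSG_DONTWAIT, NULL);
  }
  return n;               /* number of datagrams received, or -1 */
}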
(In reply to Nicolas Dufresne (stormer) from comment #21)
> (In reply to Petr Kulhavy from comment #20)
> > I have been looking at recvmmsg(). But the timeout seems to be buggy:
> >
> > man recvmmsg:
> > "BUGS: The timeout argument does not work as intended. The timeout is
> > checked only after the receipt of each datagram, so that if up to vlen-1
> > datagrams are received before the timeout expires, but then no further
> > datagrams are received, the call will block forever."
> >
> > So I'm afraid it is useless :-(
>
> Not totally true. When using it non-blocking, you can first poll, and then
> you call recvmmsg with N messages allocated. As it's non-blocking, it will
> give you back M messages (where M <= N). You don't need to use the timeout.

That might make sense for use cases where latency is not critical. However, for low-latency streams you possibly want to receive individual packets (i.e. M == 1), so there recvmmsg won't gain anything.
Not sure I follow. The bug summary talks about a high packet rate. If that's the case, you will likely be able to retrieve multiple packets with a single system call, which improves throughput, latency and cpu usage. And in no case should you be worse off than with the current implementation.
Petr, basically we don't need the timeout, and not using the timeout is what will prevent adding latency. Tim, I totally agree with you that we should recv first.
(In reply to Nicolas Dufresne (stormer) from comment #25)
> Petr, basically we don't need the timeout, and not using the timeout is
> what will prevent adding latency. Tim, I totally agree with you that we
> should recv first.

Nicolas, I agree with that, calling recv first is the right approach.
(In reply to Tim-Philipp Müller from comment #24)
> Not sure I follow. The bug summary talks about a high packet rate. If
> that's the case, you will likely be able to retrieve multiple packets
> with a single system call, which improves throughput, latency and cpu
> usage. And in no case should you be worse off than with the current
> implementation.

Tim, the use case I'm talking about is for example AES67 with 1k packets/second, 1 ms each. To reduce the latency as much as possible, ideally you would receive every single packet as fast as possible and send it down the pipe for further processing individually. So things like interrupt coalescing or receiving multiple packets are not desired in the first place, since they increase the latency. The primary goal is then not to gather packets but to make the processing lightweight, so it can sustain the high packet rate.

Now that is the theory. The other thing is how much realtime Linux is able to provide, how much overhead the syscalls are going to create, and eventually how many packets per second the system is able to process with the given MIPS. So when it turns out the system is not responsive enough to handle that packet rate, things like recvmmsg would actually help to save the situation (where packets have accumulated). But ideally it should never come to that point.

Does it make more sense now?
Created attachment 339900 [details] [review]
Buffer pool patch

Here is a modified version, which reads first and polls only on EWOULDBLOCK.
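The control flow of that change, roughly (an assumed sketch, not the attached patch itself; it requires the socket to be in non-blocking mode):

#include <gio/gio.h>

/* receive immediately; only if the read would block, wait for G_IO_IN
 * and try again */
static gssize
receive_read_first (GSocket * sock, GInputVector * ivec, gint n_vec,
    GCancellable * cancellable, GError ** error)
{
  gssize res;
  gint flags = 0;
  GError *err = NULL;

retry:
  res = g_socket_receive_message (sock, NULL, ivec, n_vec,
      NULL, NULL, &flags, cancellable, &err);
  if (res < 0 && g_error_matches (err, G_IO_ERROR, G_IO_ERROR_WOULD_BLOCK)) {
    g_clear_error (&err);
    /* only now pay for the poll, then retry the read */
    if (!g_socket_condition_wait (sock, G_IO_IN, cancellable, error))
      return -1;
    goto retry;
  }
  if (err != NULL)
    g_propagate_error (error, err);
  return res;
}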
Created attachment 339902 [details] [review]
Optimize GstUdpSrc cache performance

Here is a small patch which optimizes GstUdpSrc for cache performance (schematic layout below):
- hot properties (used in the read path) are moved to the top
- fields are reordered to avoid holes where needed
- the unused property @ttl is removed
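Schematically, the reordering idea looks like this (illustrative only, not the complete structure; the hot field set is the one listed in the commit message further below, the cold fields and exact types are assumptions):

struct _GstUdpSrc
{
  GstPushSrc parent;

  /* hot: touched for every received packet, kept together at the top
   * so they share as few cache lines as possible */
  GSocket *used_socket;
  GInetSocketAddress *addr;
  GCancellable *cancellable;
  gint skip_first_bytes;
  gboolean retrieve_sender_address;
  guint64 timeout;        /* 64-bit member after two 32-bit ones: no hole */

  /* cold: configuration properties, only read at setup time */
  gchar *address;
  gint port;
  /* ... remaining cold properties ... */
};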
With all these changes I'm at 1-3% CPU for unicast (udpsrc ! fakesink @ 1kpps), with occasional temporary jumps to 12% (using top). Multicast moves between 2-9% and 17-22% due to the heavy allocations described in bug 769994. It is a significant improvement, but to sort it out completely we should get rid of the kernel message allocations as described in bug 769994 - because the typical use case with RTSP is multicast :-S
Created attachment 340677 [details] [review]
Process control messages with a callback

I have created a patch which uses my GSocket API improvement from bug 769994 and processes the control messages with a callback rather than getting them via the receive_message() interface. This reduces the CPU usage significantly, as the overhead of serialization and memory allocation/deallocation is bypassed. The patch works with both versions of the API.
I have recently encountered similar performance-related issues with udpsrc. In my case, the visible issue was that it couldn't handle high bitrates (50-60 Mbps) with the default MTU of 1500... I had to enable jumbo frames to get just a small improvement, and then realized that udpsrc is really inefficient at reading packets from the kernel buffer.

In order to improve the situation, I wrote a series of improvements:
https://git.collabora.com/cgit/user/gkiagia/gst-plugins-good.git/log/?h=udpsrc-highbitrate

These basically include (a sketch of the GSocket batching API follows below):
* Use recvmmsg(), through the recent GSocket API
* Use a separate reader thread to ensure we return execution to recvmmsg() as fast as possible
* Use a buffer pool to avoid a lot of new allocations

It's possible that some of it is overkill; however, it is **still** sometimes losing a very small percentage of packets with a 57 Mbps RTP stream that I have been testing with (on a link-local connection). Thankfully, it is too small to notice (on the order of tens of packets out of a few million - with retransmission enabled it's not noticeable at all).
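A sketch of the batching read through GSocket (assumes g_socket_receive_messages(), available since GLib 2.48; the function name receive_batch and buffer sizes are illustrative):

#include <gio/gio.h>
#include <string.h>

#define N_MSGS 32

/* receive up to N_MSGS datagrams with a single recvmmsg() underneath */
static gint
receive_batch (GSocket * sock, GCancellable * cancellable, GError ** error)
{
  static guint8 bufs[N_MSGS][1500];
  GInputVector vecs[N_MSGS];
  GInputMessage msgs[N_MSGS];
  gint i;

  for (i = 0; i < N_MSGS; i++) {
    vecs[i].buffer = bufs[i];
    vecs[i].size = sizeof (bufs[i]);
    /* address left NULL: the sender address is not retrieved, which is
     * cheaper (cf. the retrieve-sender-address discussion above) */
    memset (&msgs[i], 0, sizeof (msgs[i]));
    msgs[i].vectors = &vecs[i];
    msgs[i].num_vectors = 1;
  }

  /* on a non-blocking socket this returns however many datagrams are
   * already queued, without waiting for the full batch */
  return g_socket_receive_messages (sock, msgs, N_MSGS, G_SOCKET_MSG_NONE,
      cancellable, error);
}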
For the extra thread part, why couldn't you just add a queue after udpsrc? Having 2 threads per source all the time seems like significant overhead for the other use cases.
(In reply to Nicolas Dufresne (stormer) from comment #33)
> For the extra thread part, why couldn't you just add a queue after udpsrc?
> Having 2 threads per source all the time seems like significant overhead
> for the other use cases.

Perhaps it is overhead, yes. My initial idea was also to use a queue, but it failed my benchmark application, so I went for the separate thread... After implementing the separate reader thread my benchmark was still failing, though, but later I realized that it works on real ethernet (the benchmark runs on the lo interface). Still, I decided to keep the thread anyway because it feels safer... and since it's still losing some packets (!) I didn't want to risk it for this application.

You have a good point, though. I can do some real tests without the separate thread and see if it makes any real impact.
Created attachment 358691 [details]
benchmark application

This is the benchmark I used initially. It's not good enough, but it shows the difference between using recvmmsg() and not using it.
Created attachment 369368 [details] [review]
udpsrc: switch to using a buffer pool

This exposes a new property, mtu, which is used to determine the initial size of buffers from the buffer pool. If received data exceeds this, the element gracefully handles that in a manner similar to what we had previously: a large memory gets filled and reallocated at the next call to "fill". The default size is set to 1500, which should cover most use cases.

With contributions from Mathieu Duponchelle <mathieu@centricular.com>
I updated the buffer pool patch to make it upstreamable, with the following changes:

* At least on its own, the proposed patch incorrectly blocked on receive_message calls, as the udpsrc socket is blocking. We cannot make it non-blocking, at least not by default, as it might be used by other code such as udpsink when sending; it would be possible to follow two different code paths depending on the socket's refcount, but that's complicated and IMO unrelated to the bufferpool work.

* I changed the property name to MTU

* I removed the conditions for calling config_set_params; caps are not required to be non-NULL there, and this made the udpsrc tests fail

* I added support for larger-than-MTU messages (sketched below), by falling back to a larger memory, very similar to what we had before, except we no longer cache the allocator / allocation params

@Petr, I kept you as the author and added myself as a contributor in the commit message, I hope that's fine by you :)
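Schematically, that larger-than-MTU fallback works like this (a simplified sketch; identifiers such as extra_mem, ivec and MAX_IPV4_UDP_PACKET_SIZE appear in the patch under review below, but the exact code differs):

/* vector 0 maps the pooled MTU-sized buffer; vector 1 maps a spare
 * memory that is only attached when a datagram overflows the MTU */
ivec[0].buffer = info.data;
ivec[0].size = info.size;

if (udpsrc->extra_mem == NULL)
  udpsrc->extra_mem =
      gst_allocator_alloc (NULL, MAX_IPV4_UDP_PACKET_SIZE, NULL);
gst_memory_map (udpsrc->extra_mem, &extra_info, GST_MAP_WRITE);
ivec[1].buffer = extra_info.data;
ivec[1].size = extra_info.size;

/* ... g_socket_receive_message () fills both vectors, returning res ... */

gst_memory_unmap (udpsrc->extra_mem, &extra_info);
if (res > udpsrc->mtu) {
  /* the datagram spilled into the spare memory: hand it to the buffer
   * and allocate a fresh one next time (this is also what makes the pool
   * free the buffer on release, see the review below) */
  gst_buffer_append_memory (outbuf, udpsrc->extra_mem);
  udpsrc->extra_mem = NULL;
}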
Review of attachment 369368 [details] [review]:

Even though I'm sure the fallback is unlikely to be used, I think the implementation of that fallback should be better. As we request a pool from downstream, we should also deal better with buffers that have multiple memories. A unit test with downstream providing the pool would also be nice.

::: gst/udp/gstudpsrc.c
@@ +837,3 @@
   p_saddr = (udpsrc->retrieve_sender_address) ? &saddr : NULL;

+  if (!gst_buffer_map (outbuf, &info, GST_MAP_READWRITE))

That is not ideal: if downstream sets up buffers with multiple memories in them, this will merge them. You can use the ivec to prevent this.

@@ +843,3 @@
+  ivec[0].size = info.size;
+
+  /* Prepare memory in case the data size exceeds mtu */

s/mtu/buffer size/, because the size could have been set by the downstream allocation reply.

@@ +855,3 @@
+
+      udpsrc->extra_mem =
+          gst_allocator_alloc (allocator, MAX_IPV4_UDP_PACKET_SIZE, &params);

I'm not sure what the size limit for the vector is, but wouldn't it be better to add multiple buffers? This way, if you have a 1400 MTU and downstream allocates 700, you have two zero-copy buffers. All we need is to set up enough buffers in the vector so that MAX_IPV4_UDP_PACKET_SIZE is reached.

@@ +983,2 @@
+  if (res > udpsrc->mtu) {
+    gst_buffer_append_memory (outbuf, udpsrc->extra_mem);

The way our pool works, this will kill it: all memory will be freed when the buffers come back, because the "memory" got modified. Better to use a buffer list.
Seems like a bit of work; what about dropping downstream pool support? This bug seems to care about caching, not about doing fancy zero-copy.
(In reply to Nicolas Dufresne (stormer) from comment #38)
> @@ +983,2 @@
> +  if (res > udpsrc->mtu) {
> +    gst_buffer_append_memory (outbuf, udpsrc->extra_mem);
>
> The way our pool works, this will kill it: all memory will be freed when
> the buffers come back, because the "memory" got modified. Better to use a
> buffer list.

It can't be a buffer list, otherwise you break the packetization. Each buffer in a buffer list is separate.
(In reply to Mathieu Duponchelle from comment #37)
> * At least on its own, the proposed patch incorrectly blocked on
> receive_message calls, as the udpsrc socket is blocking. We cannot make it
> non-blocking, at least not by default, as it might be used by other code
> such as udpsink when sending; it would be possible to follow two different
> code paths depending on the socket's refcount, but that's complicated and
> IMO unrelated to the bufferpool work.

Maybe add a property for that? It seems like a valuable optimization (in a separate patch) that would prevent one syscall per packet if packets are arriving fast.

> * I added support for larger-than-MTU messages (sketched below), by
> falling back to a larger memory, very similar to what we had before,
> except we no longer cache the allocator / allocation params

As Nicolas mentioned already, this breaks the pool. We need to do something more clever here. Maybe only ever use a bufferpool if MTU is set, and don't set it by default. If MTU is set we always allocate that much and don't need to do any magic with different memories.
(In reply to Sebastian Dröge (slomo) from comment #41)
> As Nicolas mentioned already, this breaks the pool. We need to do
> something more clever here. Maybe only ever use a bufferpool if MTU is
> set, and don't set it by default. If MTU is set we always allocate that
> much and don't need to do any magic with different memories.

My thinking here was to optimize the default case while not breaking the more exotic ones, meaning that the buffer pool can be used by default without any user interaction. This doesn't really "break" or "kill" anything; it just means the larger-than-1500-MTU case still requires an allocation, which we can't really work around anyway?
My main concern is your use of the downstream pool; I would drop that part. You don't really need it, it's not aligned with the MTU, and it is also totally untested. The rest is a bit ugly; the fact that the fallback will cause the memory to be freed is definitely worth a comment in the code.
(In reply to Nicolas Dufresne (stormer) from comment #43)
> My main concern is your use of the downstream pool; I would drop that
> part. You don't really need it, it's not aligned with the MTU, and it is
> also totally untested.

FWIW I didn't change that from the originally proposed patch; I don't mind dropping it :)

> The rest is a bit ugly; the fact that the fallback will cause the memory
> to be freed is definitely worth a comment in the code.

Sure, though I think the ugliness is actually toned down with this patch compared to the currently existing code ;)
Created attachment 369386 [details] [review]
udpsrc: switch to using a buffer pool

This exposes a new property, mtu, which is used to determine the initial size of buffers from the buffer pool. If received data exceeds this, the element gracefully handles that in a manner similar to what we had previously: a large memory gets filled and reallocated at the next call to "fill". The default size is set to 1500, which should cover most use cases.

With contributions from Mathieu Duponchelle <mathieu@centricular.com>
Created attachment 369387 [details] [review]
udpsrc: switch to using a buffer pool

This exposes a new property, mtu, which is used to determine the initial size of buffers from the buffer pool. If received data exceeds this, the element gracefully handles that in a manner similar to what we had previously: a large memory gets filled and reallocated at the next call to "fill". The default size is set to 1500, which should cover most use cases.

With contributions from Mathieu Duponchelle <mathieu@centricular.com>
Review of attachment 339820 [details] [review]:

I think this one we can take.
Comment on attachment 339900 [details] [review]
Buffer pool patch

Mathieu reworked this one.
Review of attachment 339902 [details] [review]:

Does that have a measurable impact? Also, removing the unused ttl should be a separate patch.
(In reply to Nicolas Dufresne (stormer) from comment #49)
> Review of attachment 339902 [details] [review]:
>
> Does that have a measurable impact? Also, removing the unused ttl should
> be a separate patch.

Thanks. I did observe an improvement with George's test application: the throughput is improved and the percentage of lost packets measurably goes down; I'm unsure why, to be honest. Regarding CPU usage, I didn't notice an improvement, but that's with a simplistic test case. I expect improvements should be more noticeable in scenarios with more complex allocation patterns; I'll let you know :)
(In reply to Mathieu Duponchelle from comment #50)
> Thanks. I did observe an improvement with George's test application: the
> throughput is improved and the percentage of lost packets measurably goes
> down; I'm unsure why, to be honest.

A wild guess: we are jumping between locations in the object and end up with cache misses; as the loop gets slower, we get preempted and the socket's internal buffer overruns. It's well known that we need to read the socket data ASAP to avoid overruns, though you could in theory hide all this by using a larger socket buffer. If it does improve things with such a trivial patch, I'd say let's merge it. On second thought, the ttl removal is fine as part of this patch, and it's mentioned in the commit message.
Review of attachment 340677 [details] [review]:

::: gst/udp/gstudpsrc.c
@@ +889,3 @@
+#if GLIB_CHECK_VERSION(2,52,0)
+  /* use callback from receive_message() */
+  g_socket_set_cmsg_handler (udpsrc->used_socket, gst_udpsrc_check_dst_addr,

This never made it into 2.52; I believe it requires a tad more work upstream. Have you pinged Philip Withnall about this?
Review of attachment 369387 [details] [review]:

No more comments, I think it's good to merge in the next dev window.
Comment on attachment 339820 [details] [review]
GstBufferPool modification

commit 7d75d3ef4181de6658fe7d4dbbe633082257315d (HEAD -> master)
Author: Petr Kulhavy <brain@jikos.cz>
Date:   Mon Nov 14 15:35:50 2016 +0100

    gstbuffer: reset buffer to its original size if intact

    Enhance default_reset_buffer() to resize the buffer to its full size
    if the memory hasn't changed. This allows reusing the buffer even if
    the offset has changed or the size has shrunk, rather than freeing
    the buffer.

    Change related to:
    https://bugzilla.gnome.org/show_bug.cgi?id=772841
commit 589019d8f5072d4470ea2ed76cfffa75f45e9426
Author: Petr Kulhavy <brain@jikos.cz>
Date:   Tue Nov 15 09:39:31 2016 +0100

    udpsrc: optimize GstUdpSrc object for cache performance

    Optimize GstUdpSrc for cache performance. Move the hot properties,
    which are used by the read function, to the top: @used_socket, @addr,
    @cancellable, @skip_first_bytes, @timeout, @retrieve_sender_address.
    Remove the unused property @ttl. Where needed, reorder so that holes
    are avoided (the 64-bit @timeout).

    https://bugzilla.gnome.org/show_bug.cgi?id=772841

Attachment 369387 [details] pushed as 738eb0d - udpsrc: switch to using a buffer pool
So only the callback-based control-message handling is missing, but that's blocked upstream.
-- GitLab Migration Automatic Message -- This bug has been migrated to freedesktop.org's GitLab instance and has been closed from further activity. You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.freedesktop.org/gstreamer/gst-plugins-good/issues/309.