GNOME Bugzilla – Bug 747270
avfvideosrc: BUFFER_QUEUE_SIZE too small
Last modified: 2018-11-03 13:33:12 UTC
In avfvideosrc.m, we have:

  #define BUFFER_QUEUE_SIZE 2

This buffer queue is used in a producer-consumer fashion between the AV capture callbacks (on AVFoundation's thread) and the element (on the main thread). With such a small queue, we were constantly:

a) losing performance due to the main thread going to sleep constantly
b) periodically dropping frames in this code:

  if ([bufQueue count] == BUFFER_QUEUE_SIZE)
    [bufQueue removeLastObject];

Increasing the queue size to e.g. 32 improves performance significantly. On iPhone 5s, an improvement of ~7% is seen in the following pipeline:

  avfvideosrc device-index=0 ! video/x-raw,width=1280,height=720,framerate=30/1 ! vtenc_h264 bitrate=400 ! fakesink
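The drop behavior described above can be reproduced with a small standalone model. The C sketch below is hypothetical (the real element uses an NSMutableArray, not this ring buffer): it implements a fixed-capacity queue that drops the oldest frame on overflow, and simulates a capture thread delivering frames faster than the consumer drains them.

```c
#include <assert.h>

/* Hypothetical model of avfvideosrc's bounded queue: when the queue is
 * full, the oldest frame is dropped, mirroring
 *   if ([bufQueue count] == BUFFER_QUEUE_SIZE) [bufQueue removeLastObject];
 */
#define QUEUE_CAP 2

typedef struct {
    int frames[QUEUE_CAP];
    int head;       /* index of the oldest frame */
    int count;
    int dropped;    /* frames lost to overflow */
} FrameQueue;

static void queue_init(FrameQueue *q) {
    q->head = 0; q->count = 0; q->dropped = 0;
}

static void queue_push(FrameQueue *q, int frame) {
    if (q->count == QUEUE_CAP) {                /* full: drop the oldest */
        q->head = (q->head + 1) % QUEUE_CAP;
        q->count--;
        q->dropped++;
    }
    q->frames[(q->head + q->count) % QUEUE_CAP] = frame;
    q->count++;
}

static int queue_pop(FrameQueue *q) {           /* returns -1 if empty */
    if (q->count == 0) return -1;
    int f = q->frames[q->head];
    q->head = (q->head + 1) % QUEUE_CAP;
    q->count--;
    return f;
}

/* Simulate a producer delivering 3 frames per consumer wakeup: with a
 * capacity of 2 the consumer can never keep up and frames are dropped. */
static int simulate_drops(int bursts) {
    FrameQueue q;
    queue_init(&q);
    int frame = 0;
    for (int i = 0; i < bursts; i++) {
        queue_push(&q, frame++);
        queue_push(&q, frame++);
        queue_push(&q, frame++);
        queue_pop(&q);                          /* consumer drains one frame */
    }
    return q.dropped;
}
```

Note that a deeper queue can only absorb transient bursts; if the consumer is permanently slower than the producer, drops still occur, which is part of why the later patches remove the queue rather than merely enlarging it.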
How did you come up with 32? Also, are you attaching a patch?
V4L2 uses 4 in the case where we don't know. I don't know if this element is pushing buffers downstream, but if so, it should be negotiated.
(In reply to Reynaldo H. Verdejo Pinochet from comment #1)
> How did you come up with 32?

I said "e.g. 32", meaning it's just an arbitrary number I chose that's not too small. :) I think it should be the smallest one that works.

> also, are you attaching a patch?

Not yet, I'm still looking for feedback.

> I don't know if this element is pushing buffers downstream, but if so, it should be negotiated.

avfvideosrc is obviously pushing buffers downstream -- it's a video capture source (e.g. from the iPhone camera).

What do you mean by negotiating? Isn't it a private implementation detail of this element? The existence of the AV capture thread is an AVFoundation implementation detail.
(In reply to Ilya Konstantinov from comment #3)
> avfvideosrc is obviously pushing buffers downstream -- it's a video capture
> source (e.g. from the iPhone camera).
>
> What do you mean by negotiating? Isn't it a private implementation detail of
> this element? The existence of the AV capture thread is an AVFoundation
> implementation detail.

I'm asking since many stacks have limitations and simply copy the data and push. When you push your own data downstream and have a fixed number of buffers, it is important to provide at least the minimum number of buffers the pipeline needs to work. This value can be obtained through the allocation query. For sources, this is done by implementing the virtual method decide_allocation(). This method isn't trivial to implement, though. You can refer to v4l2src for inspiration. Ideally you should wrap this up in a GstBufferPool (and maybe a GstAllocator) if you want to properly track the memory.
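The allocation-query idea can be summarized independently of the actual GStreamer API: downstream reports the minimum number of buffers it must keep in flight, and the source sizes its pool/queue from that instead of a hard-coded constant. A simplified, hypothetical C model (this is not the real decide_allocation() signature or GstQuery interface):

```c
#include <assert.h>

/* Hypothetical simplification of what decide_allocation() computes:
 * combine the source's own requirement with the minimum that downstream
 * reported through the allocation query, instead of hard-coding
 * BUFFER_QUEUE_SIZE. */
typedef struct {
    unsigned min_buffers;   /* downstream's minimum, from the query */
    unsigned max_buffers;   /* 0 means "unlimited" */
} AllocParams;

static unsigned decide_queue_size(AllocParams downstream, unsigned own_min)
{
    unsigned size = downstream.min_buffers > own_min
                  ? downstream.min_buffers : own_min;
    /* respect a downstream-imposed ceiling, if any */
    if (downstream.max_buffers != 0 && size > downstream.max_buffers)
        size = downstream.max_buffers;
    return size;
}
```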
(In reply to Reynaldo H. Verdejo Pinochet from comment #1)
> How did you come up with 32? Also, are you attaching a patch?

I'd like to know how you guys came up with 2.
(In reply to Cecill Etheredge (ijsf) from comment #5)
> I'd like to know how you guys came up with 2.

Two makes sense if you don't share the data: basically 1 for the driver to fill and 1 for user space to copy to some other memory location, and then it loops like this. But since the code now pushes the buffers downstream, this is just wrong.

Looking at the code, it's a bit bad. It wraps the memory into a buffer, but uses a GstMeta to actually track the memory lifetime. This is not completely safe. It also sets the NO_SHARE flag, which will end up causing copies all over. So yes, there is a lot of work to do.
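The "1 for the driver, 1 for user space" rationale is classic double buffering, which only works because the consumer copies the data out before the driver reuses the buffer. A minimal single-threaded C sketch with hypothetical names (real drivers do this with DMA and synchronization, omitted here):

```c
#include <assert.h>
#include <string.h>

/* Double buffering as described: the "driver" fills one buffer while
 * "user space" copies the other, then the two swap roles. Two buffers
 * suffice only because the consumer copies the data out instead of
 * keeping a reference -- which is exactly what breaks once buffers are
 * pushed downstream by reference. */
typedef struct {
    char bufs[2][64];
    int fill_idx;        /* buffer the driver writes next */
} DoubleBuffer;

static void driver_fill(DoubleBuffer *db, const char *data) {
    strncpy(db->bufs[db->fill_idx], data, sizeof db->bufs[0] - 1);
    db->bufs[db->fill_idx][sizeof db->bufs[0] - 1] = '\0';
    db->fill_idx ^= 1;   /* swap: the next fill goes to the other buffer */
}

static const char *user_read(const DoubleBuffer *db) {
    /* user space reads the buffer the driver is NOT currently filling */
    return db->bufs[db->fill_idx ^ 1];
}
```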
Thanks, that makes sense for a buffered system. On a different note.. why use a buffer at all and not push directly to wherever it needs to go?
Right now I'm working on a patch to eliminate the buffer queue completely. Queueing is GstQueue's job, and it's perfectly fine for avfvideosrc to push buffers from the dispatch thread.

As for GstMeta and all that business, please see my work on bug 747216. As far as safety goes, though, I don't see a problem -- the CVPixelBuffer's lifetime is exactly the GstBuffer's lifetime.

NO_SHARE has indeed been causing perf trouble for OpenWebRTC with the intervideo* components. Though my current patch for 747216 is "no-share" too, I hope to introduce sharing.
(In reply to Cecill Etheredge (ijsf) from comment #7)
> On a different note.. why use a buffer at all and not push directly to
> wherever it needs to go?

Queues are useful to ensure parallelism between the driver processing and the GStreamer pipeline: while one buffer is being pushed downstream, the driver should be preparing the next buffer. (Assuming you mean why use queues.)

AVF has quite some code, and I suppose it's not entirely needed. It is likely done this way to fit the push src model (which is the easiest way to write a source in GStreamer).
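The parallelism argument can be made concrete with a little arithmetic: with a queue decoupling the two sides, the steady-state cost per frame is max(prepare, push) rather than prepare + push. A tiny illustrative C helper (hypothetical; just the arithmetic, in arbitrary time units):

```c
#include <assert.h>

/* Steady-state per-frame cost.
 * Serial: the driver prepares a frame, THEN the pipeline pushes it.
 * Overlapped (a queue in between): the driver prepares frame N+1 while
 * the pipeline pushes frame N, so the slower stage dominates. */
static int per_frame_cost_serial(int prepare, int push) {
    return prepare + push;
}

static int per_frame_cost_overlapped(int prepare, int push) {
    return prepare > push ? prepare : push;
}
```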
(In reply to Nicolas Dufresne (stormer) from comment #9)
> AVF has quite some code, and I suppose it's not entirely needed.

That's the main point here, I believe. AVF does things like pooling sample buffers and already takes care of many of these concerns. Also, Darwin's dispatch queue is itself a producer-consumer queue.

> It is likely done this way to fit the push src model (which is the easiest way
> to write a source in GST).

We could replace NSMutableArray+NSConditionLock with GAsyncQueue, but to get the best performance, I think we should learn to push buffers from the AVF dispatch queue. I remember either you or Tim said that this is basically re-doing GstBaseSrc, but perhaps it's not as complicated, seeing that a lot of what GstBaseSrc does deals with non-live streams.
I don't get the point; GstBaseSrc does not force having queues, btw -- as we said, create() can wait for an event. The context switch has nothing to do with the performance cost, I'm sure. And to be clear, I will not sponsor work that moves this object away from GstBaseSrc. It's already barely maintained; that would only make the situation worse.

One of the things I noticed is that we force AVFoundation into doing conversion to BGRA all the time, even if a texture is being used downstream. That's probably where we take the biggest hit. Implement an AVF-specific GstMemory, make sure we don't request memory pointers unless map() is called, and you will probably already have enough performance for "truly embedded" iOS apps.
(Oh, and obviously, negotiate the allocation and use the obtained number when setting up the queue instead of hard-coding it, so we don't starve AVF's internal queues.)
(In reply to Nicolas Dufresne (stormer) from comment #11)
> I don't get the point; GstBaseSrc does not force having queues, btw -- as we
> said, create() can wait for an event.

The "point" I refer to was: if a queue facility is needed between the driver and user space, AVF is providing it. Our own queueing facility would be redundant, as far as driver-userspace communication goes.

> The context switch has nothing to do with the performance cost, I'm sure.

I have a rudimentary avfvideosrc which pushes from the dispatch queue. It (kinda, sorta) works, enough to get performance metrics. If we can have e.g. a GAsyncQueue-based avfvideosrc with the same performance, I'll gladly drop my proposal. I like my patches small. (If I could lower CPU usage by 2% through a solution that doesn't exactly fit with GstBaseSrc, I'd do it, even as a fork.)

> And to be clear, I will not sponsor work that moves this object away from
> GstBaseSrc. It's already barely maintained; that would only make the
> situation worse.

I understand that, but do understand too that if performance is worse than a pure AVF solution, it'll continue to be underused and under-maintained.

> ... we force AVFoundation into doing conversion to BGRA all the time.

Are you sure? I see avfvideosrc supports NV12, UYVY, YUY2 and BGRA. (In any case, my test is avfvideosrc ! video/x-raw,format=BGRA,width=1280,height=720,framerate=30/1 ! fakesink, and I didn't notice any difference when I changed the format to NV12.)

> Implement an AVF-specific GstMemory, make sure we don't request memory
> pointers unless map() is called

In an "avfvideosrc ! vtenc_h264" pipeline, the current code *already* passes the CVPixelBuffer directly. As for GstMemory and mapping, that's what I'm doing in bug 747216. Surprisingly, it didn't bring much performance improvement.
Created attachment 300916 [details] [review]
avfvideosrc: Change NSMutableArray to GQueue

EXPERIMENT 1

Use a proven GLib implementation, taken almost verbatim from GstDecklinkVideoSrc (Sebastian Dröge's code).

CPU usage was 19% and remained 19%.
Clarifying: increasing BUFFER_QUEUE_SIZE from 2 (to whatever) reduces CPU usage from 32% to 26% only on this pipeline:

  avfvideosrc device-index=0 ! video/x-raw,width=1280,height=720,framerate=30/1 ! vtenc_h264 bitrate=400 ! fakesink

On a simpler:

  avfvideosrc device-index=0 ! video/x-raw,width=1280,height=720,framerate=30/1 ! fakesink

the CPU usage was 19% and remains 19% after the BUFFER_QUEUE_SIZE increase (probably since the frame-drop scenario never occurs).
Created attachment 300920 [details] [review]
avfvideosrc.m: Get rid of the buffer queue

EXPERIMENT 2

Keep a single copy of the sample buffer at a time -- no queues, no mallocs. (If downstream cannot keep up, GstQueues can be used.)

BENCHMARK: On iPhone 5s (iOS 8.0), the following pipeline:

  avfvideosrc device-index=0 ! video/x-raw,format=BGRA,width=1280,height=720,framerate=30/1 ! fakesink enable-last-sample=0

is down from 19% to 15%.
Created attachment 300921 [details] [review]
avfvideosrc: Get rid of the buffer queue

My previous patch was bogus. This patch actually works.

BENCHMARK: Seeing a more modest improvement, from 19% to 18%, on:

  avfvideosrc device-index=0 ! video/x-raw,format=BGRA,width=1280,height=720,framerate=30/1 ! fakesink enable-last-sample=0
A more realistic workload:

  avfvideosrc device-index=0 ! video/x-raw,width=1280,height=720,framerate=30/1 ! queue ! vtenc_h264 bitrate=400 ! fakesink enable-last-sample=0

performs at ~26% (while pre-patch it was 32%). (Without the "queue !" we get constant frame drops in the AV capture thread, as anticipated.)
Created attachment 300932 [details] [review]
avfvideosrc: Get rid of the buffer queue

CHANGES

* Really let AVFoundation do its job.
  Instead of dropping frames in the delegate, I now simply hog the delegate until the main thread finishes its job. AVFoundation is built to handle this by queueing and dropping.
* Protect against spurious wakeups.
* Rename currentFrame -> delegateFrame.
* Consolidate variables into the CaptureFrame.
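"Protect against spurious wakeups" presumably refers to the standard pattern of re-checking the wait predicate in a loop, since condition-wait primitives (pthread_cond_wait, which underlies NSCondition/NSConditionLock) may return without being signalled. A generic C sketch of the pattern with hypothetical names, not the patch itself:

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

/* Generic spurious-wakeup guard: the predicate is re-checked in a loop,
 * because pthread_cond_wait() may return even though no one signalled
 * (or another consumer already took the frame). */
typedef struct {
    pthread_mutex_t lock;
    pthread_cond_t cond;
    bool frame_ready;
    int frame;
} FrameSlot;

static void slot_init(FrameSlot *s) {
    pthread_mutex_init(&s->lock, NULL);
    pthread_cond_init(&s->cond, NULL);
    s->frame_ready = false;
}

static void slot_put(FrameSlot *s, int frame) {
    pthread_mutex_lock(&s->lock);
    s->frame = frame;
    s->frame_ready = true;
    pthread_cond_signal(&s->cond);
    pthread_mutex_unlock(&s->lock);
}

static int slot_take(FrameSlot *s) {
    pthread_mutex_lock(&s->lock);
    while (!s->frame_ready)             /* loop, not `if`: spurious wakeups */
        pthread_cond_wait(&s->cond, &s->lock);
    int frame = s->frame;
    s->frame_ready = false;
    pthread_mutex_unlock(&s->lock);
    return frame;
}
```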
Created attachment 300933 [details] [review]
applemedia: gst_core_media_buffer_init

Change gst_core_media_buffer_new into gst_core_media_buffer_init, to allow e.g. acquiring the buffer from a buffer pool. (Prelude to the next patch.)
Created attachment 300934 [details] [review] avfvideosrc: Use buffer pool We configure a buffer pool to create empty GstBuffers without any backing memory, since we'll be filling them ourselves. We discard buffer pools from upstream.
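A pool handing out "empty" buffer shells (to be filled with externally owned memory) can be modeled as a simple free-list. The actual patch uses GstBufferPool; the C sketch below is only a conceptual stand-in with hypothetical names:

```c
#include <assert.h>
#include <stdlib.h>

/* Conceptual free-list pool: acquire returns a recycled empty buffer
 * shell when one is available, otherwise allocates a new one; release
 * drops the shell's reference to its backing memory and puts it back
 * for reuse. */
#define POOL_CAP 8

typedef struct {
    void *backing;              /* filled in later, e.g. a CVPixelBuffer */
} BufferShell;

typedef struct {
    BufferShell *free_list[POOL_CAP];
    int n_free;
    int n_allocated;            /* total shells ever created */
} ShellPool;

static BufferShell *pool_acquire(ShellPool *p) {
    if (p->n_free > 0)
        return p->free_list[--p->n_free];
    p->n_allocated++;
    return calloc(1, sizeof(BufferShell));
}

static void pool_release(ShellPool *p, BufferShell *b) {
    b->backing = NULL;          /* drop the reference to the filled memory */
    if (p->n_free < POOL_CAP)
        p->free_list[p->n_free++] = b;
    else
        free(b);                /* pool full: really free it */
}
```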
Stop the presses! "avfvideosrc: Get rid of the buffer queue" is definitely not working as expected. I know it's crazy but I actually didn't test with a simple glimagesink, and it really sucks.
-- GitLab Migration Automatic Message -- This bug has been migrated to freedesktop.org's GitLab instance and has been closed from further activity. You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.freedesktop.org/gstreamer/gst-plugins-bad/issues/230.