GNOME Bugzilla – Bug 772872
audioconvert: passthrough conversion causing high CPU load
Last modified: 2018-11-03 11:50:47 UTC
While measuring the audioconvert plugin performance I found that converting S24LE to S24LE format (i.e. no change of the format at all) causes superfluous quantization. This results in extra CPU load. The reason for this peculiar behaviour is that the first and mandatory step in an audio conversion is to unpack the data to either 16 or 32 bits. So 24-bit data is internally expanded to 32-bits. Then in the following if statement the chain_quantize() function logic decides that a quantization is needed: /* we still want to run the quantization step when reducing bits to get * the rounding correct */ if (out_int && out_depth < 32 && convert->current_format == GST_AUDIO_FORMAT_S32) Obviously, if 24-bits are expanded to 32-bits and then again compacted to 24-bits, no quantization is needed. Not sure how exactly the if statement should be altered, but shouldn't it also take the in_depth into account (like: ... && in_depth != out_depth)? My test pipeline: gst-launch-1.0 audiotestsrc samplesperbuffer=48 ! "audio/x-raw,format=(string)S24LE,rate=(int)48000,channels=(int)2" ! audioconvert ! "audio/x-raw,format=(string)S24LE,rate=(int)48000,channels=(int)2" ! alsasink can-activate-pull=true provide-clock=false buffer-time=8000 slave-method=none blocksize=192
So you are saying that the data is still passed through the converter even though the format is the same ? I would have expected passthrough to be enabled, the rest is not relevant, since it should not have happened.
I think this is a typo: he probably means S24LE to S24BE? For the endianness-only conversions we can definitely do better and not go through the quantization parts of the algorithm. It just has to be implemented.
Mmh, I'm sure I've seen a patch that fixes this somewhere! :) (Not useful, I know)
(In reply to Nicolas Dufresne (stormer) from comment #1) > So you are saying that the data is still passed through the converter even > though the format is the same ? I would have expected passthrough to be > enabled, the rest is not relevant, since it should not have happened. Yes that's what I mean and it's not a typo :-) One would expect no conversion and no extra CPU at all, but the data is passed through conversion even if in-format==out-format. Here is a bit more from my investigation, all using the same test pipeline described above, just changing the input and output format string: * S16LE -> S16LE: 21% CPU * S24LE -> S24LE: 25% CPU * S32LE -> S32LE: 21% CPU * no conversion (i.e. remove the audioconvert and the second format description): 19% CPU And for comparison BE->LE conversion: in all three cases 27% CPU. That is IMHO also too much and I will come to that once I understand more what's going on there ...
So for "same format" we should never ever do any conversion. For endianness-change-only, we should have a fast-path that does not first blow up the samples to 32/64 bits and then does quantization again just to swap the bytes.
(In reply to Sebastian Dröge (slomo) from comment #5) > So for "same format" we should never ever do any conversion. > > For endianness-change-only, we should have a fast-path that does not first > blow up the samples to 32/64 bits and then does quantization again just to > swap the bytes. Exactly, that's what I would expect. Plus the endian conversion should use the __builtin_bswap() functions or similar to take the advantage of the hardware (not sure if the there are instructions for 24-bit conversion though). I will investigate more and propose a patch.
(In reply to Petr Kulhavy from comment #6) > (In reply to Sebastian Dröge (slomo) from comment #5) > > So for "same format" we should never ever do any conversion. > > > > For endianness-change-only, we should have a fast-path that does not first > > blow up the samples to 32/64 bits and then does quantization again just to > > swap the bytes. > > Exactly, that's what I would expect. Plus the endian conversion should use > the __builtin_bswap() functions or similar to take the advantage of the > hardware (not sure if the there are instructions for 24-bit conversion > though). We use orc for such things
Ok, I think I found why this example does not go in passthrough, it's missing channel-mask field. Adding channel-mask=0x03 make it go in passthrough. Now, 0x03 should be the default mask anyway no ?
(In reply to Nicolas Dufresne (stormer) from comment #8) > Ok, I think I found why this example does not go in passthrough, it's > missing channel-mask field. Adding channel-mask=0x03 make it go in > passthrough. Now, 0x03 should be the default mask anyway no ? I do see the message "audio-converter audio-converter.c:881:gst_audio_converter_new: same formats and passthrough mixing -> passthrough" even without adding the channel-mask=0x03. However the CPU usage is still higher for S24LE than for S32LE, which I don't really understand why...
(In reply to Petr Kulhavy from comment #9) > (In reply to Nicolas Dufresne (stormer) from comment #8) > > Ok, I think I found why this example does not go in passthrough, it's > > missing channel-mask field. Adding channel-mask=0x03 make it go in > > passthrough. Now, 0x03 should be the default mask anyway no ? > > I do see the message > "audio-converter audio-converter.c:881:gst_audio_converter_new: same formats > and passthrough mixing -> passthrough" > > even without adding the channel-mask=0x03. However the CPU usage is still > higher for S24LE than for S32LE, which I don't really understand why... In the call map I do see an extra function for S24LE in audiotestsrc, which is not present in S32LE or S16LE. This could be the pack_func because the sine generator is implemented only for 16 and 32 bits, not for 24 bits. That would explain the extra CPU load. Hmmm, does that mean, that I'm chasing a red herring here? ... Summary: -------- 1) It seems the pass-through is correctly enabled in all cases as Nicolas said in comment 1. I was just trusting the debug messages too much - sorry for causing the confusion. 2) The different CPU profile @ 24-bits could be explained by extra conversion 32->24bits in the SINE GENERATOR (and not in the conversion). So my measurement was disturbed - sorry for that too. So before this gets closed as invalid, my question though is: do you see as justifiable the +2% CPU usage for doing virtually nothing? I mean there might be some overhead caused by the extra element in the pipeline, but 2% of 500MHz Cortex A7 for doing nothing is a lot. The +8% CPU usage for BE->LE conversion would be the next thing to look at...
S24 is always an inefficient format. That's because you need to read per byte since data is not aligned. Other then that, audiotestsrc is not really fast generator, it's just a test tool. Though, feel free to propose a patch, and rename the title accordingly (e.g. by mentioning this is about audiotestsrc instead of audioconvert). p.s. You'll notice that sine waves have repetitive pattern, it's kind of silly not to cache the generated pattern ;-P
Sort of silly, except that the samplerate and generated frequency can be relatively prime, or have few factors in common - so to output the signal with the same accuracy you might need to pre-cache quite a few samples. 440Hz @ 48khz repeats every 11 seconds.
That's why we have the sine-table setting in audiotestsrc.
Is there anything left to be done here? Should this be closed and the actual problems be handled in specific bugs? AFAIU there is at least the problem here that same-caps-conversion is not passthrough if one side is stereo with channel-mask, and the other side has no channel-mask. This should be fixed and seems to be the essence of this bug report, correct? Optimizations for endianess conversion and other things should be discussed in other bugs.
(In reply to Sebastian Dröge (slomo) from comment #14) > Is there anything left to be done here? Should this be closed and the actual > problems be handled in specific bugs? The last thing for me that is still not justified is why pass-through conversion requires 2% CPU and why it needs to do memcpy. Ideally it would not add any CPU usage and simply pass the buffer down the pipe as is.
It shouldn't copy, correct. Is this with exactly the same caps on both sides, or the case with the stereo caps where one side is missing the channel-mask?
With exactly the same caps on both sides audio-converter.c:converter_passthrough() is not called at all, as oppose to what I originally thought. So this is OK. But there is still a 2% CPU difference between: gst-launch-1.0 audiotestsrc is-live=true samplesperbuffer=48 ! "audio/x-raw,format=(string)S32LE,rate=(int)48000,channels=(int)2,channel-mask=(int)0x3" ! audioconvert ! "audio/x-raw,format=(string)S32LE,rate=(int)48000,channels=(int)2,channel-mask=(int) 0x3" ! fakesink and gst-launch-1.0 audiotestsrc is-live=true samplesperbuffer=48 ! "audio/x-raw,format=(string)S32LE,rate=(int)48000,channels=(int)2,channel-mask=(int)0x3" ! fakesink which I don't see a particular reason for. After some deeper profiling I'm starting to believe that this is due to heavy cache misses and lot of processing in gst_audio_convert_submit_input_buffer(): DLmr Ir -- line 790 ---------------------------------------- . . return TRUE; . . . . return FALSE; . . } . . . . static GstFlowReturn . . gst_audio_convert_submit_input_buffer (GstBaseTransform * base, . . gboolean is_discont, GstBuffer * input) . 207,660 { . . GstAudioConvert *this = GST_AUDIO_CONVERT (base); . . 2 155,745 if (base->segment.format == GST_FORMAT_TIME) { 51,713 363,405 input = 182,156 29,903,118 => ???:gst_audio_buffer_clip (51915x) . . gst_audio_buffer_clip (input, &base->segment, this->in_info.rate, . . this->in_info.bpf); . . . 103,830 if (!input) . . return GST_FLOW_OK; . . } . . 77,937 467,235 return GST_BASE_TRANSFORM_CLASS (parent_class)->submit_input_buffer (base, 8 14,328,618 => ???:0x0003f070'2 (51915x) . . is_discont, input); . . } and in gst_capsfilter_prepare_buf(): DLmr Ir -- line 391 ---------------------------------------- . . * This ensures that caps event is sent if we can, so that pipelines like: . . * gst-launch filesrc location=rawsamples.raw ! . . * audio/x-raw,format=S16LE,rate=48000,channels=2 ! alsasink . . * will work. . . */ . . static GstFlowReturn . . gst_capsfilter_prepare_buf (GstBaseTransform * trans, GstBuffer * input, . . GstBuffer ** buf) 21,037 519,150 { . . GstFlowReturn ret = GST_FLOW_OK; . . GstCapsFilter *filter = GST_CAPS_FILTER (trans); . . . . /* always return the input as output buffer */ . 103,830 *buf = input; . . 18 415,320 if (GST_PAD_MODE (trans->srcpad) == GST_PAD_MODE_PUSH 4 1,661,280 && !gst_pad_has_current_caps (trans->sinkpad)) { 357,815 29,280,131 => ???:0x048c2088 (103830x)
So there's this problem, and there is also still the problem with "channels=2" vs "channels=2,channel-mask=0x3" not doing passthrough?
It is only the problem in comment 17. "channels=2" vs. "channels=2,channel-mask=0x3" both correctly do passthrough (tested now on v1.9.90)
-- GitLab Migration Automatic Message -- This bug has been migrated to freedesktop.org's GitLab instance and has been closed from further activity. You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.freedesktop.org/gstreamer/gst-plugins-base/issues/300.