Bug 772872 – audioconvert: passthrough conversion causing high CPU load

After an evaluation, GNOME has moved from Bugzilla to GitLab. Learn more about GitLab.
No new issues can be reported in GNOME Bugzilla anymore.
To report an issue in a GNOME project, go to GNOME GitLab.
Do not go to GNOME Gitlab for: Bluefish, Doxygen, GnuCash, GStreamer, java-gnome, LDTP, NetworkManager, Tomboy.

Bug 772872 - audioconvert: passthrough conversion causing high CPU load


Summary:	audioconvert: passthrough conversion causing high CPU load


Status:	RESOLVED OBSOLETE

Product:	GStreamer
Classification:	Platform
Component:	gst-plugins-base
Version:	1.8.3
Hardware:	Other Linux

Importance:	Normal normal
Target Milestone:	git master
Assigned To:	GStreamer Maintainers
QA Contact:	GStreamer Maintainers

URL:
Whiteboard:

Depends on:
Blocks:

Reported:	2016-10-13 15:39 UTC by Petr Kulhavy
Modified:	2018-11-03 11:50 UTC

See Also:
GNOME target:	---
GNOME version:	---

Description Petr Kulhavy 2016-10-13 15:39:11 UTC

While measuring the audioconvert plugin performance I found that converting S24LE to S24LE format (i.e. no change of the format at all) causes superfluous quantization. This results in extra CPU load.

The reason for this peculiar behaviour is that the first and mandatory step in an audio conversion is to unpack the data to either 16 or 32 bits. So 24-bit data is internally expanded to 32-bits. Then in the following if statement the chain_quantize() function logic decides that a quantization is needed:

  /* we still want to run the quantization step when reducing bits to get
   * the rounding correct */
  if (out_int && out_depth < 32
      && convert->current_format == GST_AUDIO_FORMAT_S32)


Obviously, if 24-bits are expanded to 32-bits and then again compacted to 24-bits, no quantization is needed.

Not sure how exactly the if statement should be altered, but shouldn't it also take the in_depth into account (like: ... && in_depth != out_depth)?


My test pipeline:

gst-launch-1.0 audiotestsrc samplesperbuffer=48 ! "audio/x-raw,format=(string)S24LE,rate=(int)48000,channels=(int)2" ! audioconvert ! "audio/x-raw,format=(string)S24LE,rate=(int)48000,channels=(int)2" ! alsasink can-activate-pull=true provide-clock=false buffer-time=8000 slave-method=none blocksize=192

Comment 1 Nicolas Dufresne (ndufresne) 2016-10-13 15:50:11 UTC

So you are saying that the data is still passed through the converter even though the format is the same ? I would have expected passthrough to be enabled, the rest is not relevant, since it should not have happened.

Comment 2 Sebastian Dröge (slomo) 2016-10-13 15:51:53 UTC

I think this is a typo: he probably means S24LE to S24BE? For the endianness-only conversions we can definitely do better and not go through the quantization parts of the algorithm. It just has to be implemented.

Comment 3 Tim-Philipp Müller 2016-10-13 15:58:50 UTC

Mmh, I'm sure I've seen a patch that fixes this somewhere! :) (Not useful, I know)

Comment 4 Petr Kulhavy 2016-10-13 16:06:55 UTC

(In reply to Nicolas Dufresne (stormer) from comment #1)
> So you are saying that the data is still passed through the converter even
> though the format is the same ? I would have expected passthrough to be
> enabled, the rest is not relevant, since it should not have happened.

Yes that's what I mean and it's not a typo :-)
One would expect no conversion and no extra CPU at all, 
but the data is passed through conversion even if in-format==out-format.

Here is a bit more from my investigation, all using the same test pipeline described above, just changing the input and output format string:

* S16LE -> S16LE: 21% CPU
* S24LE -> S24LE: 25% CPU
* S32LE -> S32LE: 21% CPU
* no conversion (i.e. remove the audioconvert and the second format description): 19% CPU

And for comparison BE->LE conversion: in all three cases 27% CPU.
That is IMHO also too much and I will come to that once I understand more what's going on there ...

Comment 5 Sebastian Dröge (slomo) 2016-10-13 16:13:16 UTC

So for "same format" we should never ever do any conversion.

For endianness-change-only, we should have a fast-path that does not first blow up the samples to 32/64 bits and then does quantization again just to swap the bytes.

Comment 6 Petr Kulhavy 2016-10-13 16:20:11 UTC

(In reply to Sebastian Dröge (slomo) from comment #5)
> So for "same format" we should never ever do any conversion.
> 
> For endianness-change-only, we should have a fast-path that does not first
> blow up the samples to 32/64 bits and then does quantization again just to
> swap the bytes.

Exactly, that's what I would expect. Plus the endian conversion should use the __builtin_bswap() functions or similar to take the advantage of the hardware (not sure if the there are instructions for 24-bit conversion though).

I will investigate more and propose a patch.

Comment 7 Sebastian Dröge (slomo) 2016-10-13 16:28:58 UTC

(In reply to Petr Kulhavy from comment #6)
> (In reply to Sebastian Dröge (slomo) from comment #5)
> > So for "same format" we should never ever do any conversion.
> > 
> > For endianness-change-only, we should have a fast-path that does not first
> > blow up the samples to 32/64 bits and then does quantization again just to
> > swap the bytes.
> 
> Exactly, that's what I would expect. Plus the endian conversion should use
> the __builtin_bswap() functions or similar to take the advantage of the
> hardware (not sure if the there are instructions for 24-bit conversion
> though).

We use orc for such things

Comment 8 Nicolas Dufresne (ndufresne) 2016-10-13 20:08:43 UTC

Ok, I think I found why this example does not go in passthrough, it's missing channel-mask field. Adding channel-mask=0x03 make it go in passthrough. Now, 0x03 should be the default mask anyway no ?

Comment 9 Petr Kulhavy 2016-10-13 21:00:09 UTC

(In reply to Nicolas Dufresne (stormer) from comment #8)
> Ok, I think I found why this example does not go in passthrough, it's
> missing channel-mask field. Adding channel-mask=0x03 make it go in
> passthrough. Now, 0x03 should be the default mask anyway no ?

I do see the message 
"audio-converter audio-converter.c:881:gst_audio_converter_new: same formats and passthrough mixing -> passthrough"

even without adding the channel-mask=0x03. However the CPU usage is still higher for S24LE than for S32LE, which I don't really understand why...

Comment 10 Petr Kulhavy 2016-10-13 22:57:09 UTC

(In reply to Petr Kulhavy from comment #9)
> (In reply to Nicolas Dufresne (stormer) from comment #8)
> > Ok, I think I found why this example does not go in passthrough, it's
> > missing channel-mask field. Adding channel-mask=0x03 make it go in
> > passthrough. Now, 0x03 should be the default mask anyway no ?
> 
> I do see the message 
> "audio-converter audio-converter.c:881:gst_audio_converter_new: same formats
> and passthrough mixing -> passthrough"
> 
> even without adding the channel-mask=0x03. However the CPU usage is still
> higher for S24LE than for S32LE, which I don't really understand why...

In the call map I do see an extra function for S24LE in audiotestsrc, which is not present in S32LE or S16LE. This could be the pack_func because the sine generator is implemented only for 16 and 32 bits, not for 24 bits. That would explain the extra CPU load.

Hmmm, does that mean, that I'm chasing a red herring here? ...

Summary:
--------
1) It seems the pass-through is correctly enabled in all cases as Nicolas said in comment 1. I was just trusting the debug messages too much - sorry for causing the confusion. 

2) The different CPU profile @ 24-bits could be explained by extra conversion 32->24bits in the SINE GENERATOR (and not in the conversion). So my measurement was disturbed - sorry for that too.

So before this gets closed as invalid, my question though is: do you see as justifiable the +2% CPU usage for doing virtually nothing? I mean there might be some overhead caused by the extra element in the pipeline, but 2% of 500MHz Cortex A7 for doing nothing is a lot.

The +8% CPU usage for BE->LE conversion would be the next thing to look at...

Comment 11 Nicolas Dufresne (ndufresne) 2016-10-14 13:52:29 UTC

S24 is always an inefficient format. That's because you need to read per byte since data is not aligned. Other then that, audiotestsrc is not really fast generator, it's just a test tool. Though, feel free to propose a patch, and rename the title accordingly (e.g. by mentioning this is about audiotestsrc instead of audioconvert).

p.s. You'll notice that sine waves have repetitive pattern, it's kind of silly not to cache the generated pattern ;-P

Comment 12 Jan Schmidt 2016-10-17 05:17:19 UTC

Sort of silly, except that the samplerate and generated frequency can be relatively prime, or have few factors in common - so to output the signal with the same accuracy you might need to pre-cache quite a few samples. 440Hz @ 48khz repeats every 11 seconds.

Comment 13 Sebastian Dröge (slomo) 2016-10-17 09:19:48 UTC

That's why we have the sine-table setting in audiotestsrc.

Comment 14 Sebastian Dröge (slomo) 2016-10-17 09:25:50 UTC

Is there anything left to be done here? Should this be closed and the actual problems be handled in specific bugs?

AFAIU there is at least the problem here that same-caps-conversion is not passthrough if one side is stereo with channel-mask, and the other side has no channel-mask. This should be fixed and seems to be the essence of this bug report, correct?


Optimizations for endianess conversion and other things should be discussed in other bugs.

Comment 15 Petr Kulhavy 2016-10-17 21:00:52 UTC

(In reply to Sebastian Dröge (slomo) from comment #14)
> Is there anything left to be done here? Should this be closed and the actual
> problems be handled in specific bugs?

The last thing for me that is still not justified is why pass-through conversion requires 2% CPU and why it needs to do memcpy. Ideally it would not add any CPU usage and simply pass the buffer down the pipe as is.

Comment 16 Sebastian Dröge (slomo) 2016-10-18 04:40:23 UTC

It shouldn't copy, correct. Is this with exactly the same caps on both sides, or the case with the stereo caps where one side is missing the channel-mask?

Comment 17 Petr Kulhavy 2016-10-18 19:35:37 UTC

With exactly the same caps on both sides audio-converter.c:converter_passthrough() is not called at all, as oppose to what I originally thought. So this is OK.

But there is still a 2% CPU difference between:

gst-launch-1.0 audiotestsrc is-live=true samplesperbuffer=48 ! "audio/x-raw,format=(string)S32LE,rate=(int)48000,channels=(int)2,channel-mask=(int)0x3" ! audioconvert ! "audio/x-raw,format=(string)S32LE,rate=(int)48000,channels=(int)2,channel-mask=(int) 0x3" ! fakesink

and 

gst-launch-1.0 audiotestsrc is-live=true samplesperbuffer=48 ! "audio/x-raw,format=(string)S32LE,rate=(int)48000,channels=(int)2,channel-mask=(int)0x3" ! fakesink

which I don't see a particular reason for.

After some deeper profiling I'm starting to believe that this is due to heavy cache misses and lot of processing in gst_audio_convert_submit_input_buffer():

   DLmr       Ir
-- line 790 ----------------------------------------
     .         .      return TRUE;
     .         .  
     .         .    return FALSE;
     .         .  }
     .         .  
     .         .  static GstFlowReturn
     .         .  gst_audio_convert_submit_input_buffer (GstBaseTransform * base,
     .         .      gboolean is_discont, GstBuffer * input)
     .   207,660  {
     .         .    GstAudioConvert *this = GST_AUDIO_CONVERT (base);
     .         .  
     2   155,745    if (base->segment.format == GST_FORMAT_TIME) {
51,713   363,405      input =
182,156 29,903,118  => ???:gst_audio_buffer_clip (51915x)
     .         .          gst_audio_buffer_clip (input, &base->segment, this->in_info.rate,
     .         .          this->in_info.bpf);
     .         .  
     .   103,830      if (!input)
     .         .        return GST_FLOW_OK;
     .         .    }
     .         .  
77,937   467,235    return GST_BASE_TRANSFORM_CLASS (parent_class)->submit_input_buffer (base,
     8 14,328,618  => ???:0x0003f070'2 (51915x)
     .         .        is_discont, input);
     .         .  }



and in gst_capsfilter_prepare_buf():

   DLmr       Ir
-- line 391 ----------------------------------------
     .         .   * This ensures that caps event is sent if we can, so that pipelines like:
     .         .   *   gst-launch filesrc location=rawsamples.raw !
     .         .   *       audio/x-raw,format=S16LE,rate=48000,channels=2 ! alsasink
     .         .   * will work.
     .         .   */
     .         .  static GstFlowReturn
     .         .  gst_capsfilter_prepare_buf (GstBaseTransform * trans, GstBuffer * input,
     .         .      GstBuffer ** buf)
21,037   519,150  {
     .         .    GstFlowReturn ret = GST_FLOW_OK;
     .         .    GstCapsFilter *filter = GST_CAPS_FILTER (trans);
     .         .  
     .         .    /* always return the input as output buffer */
     .   103,830    *buf = input;
     .         .  
    18   415,320    if (GST_PAD_MODE (trans->srcpad) == GST_PAD_MODE_PUSH
     4 1,661,280        && !gst_pad_has_current_caps (trans->sinkpad)) {
357,815 29,280,131  => ???:0x048c2088 (103830x)

Comment 18 Sebastian Dröge (slomo) 2016-10-20 11:28:46 UTC

So there's this problem, and there is also still the problem with "channels=2" vs "channels=2,channel-mask=0x3" not doing passthrough?

Comment 19 Petr Kulhavy 2016-10-20 12:29:39 UTC

It is only the problem in comment 17.

"channels=2" vs. "channels=2,channel-mask=0x3" both correctly do passthrough (tested now on v1.9.90)

Comment 20 GStreamer system administrator 2018-11-03 11:50:47 UTC

-- GitLab Migration Automatic Message --

This bug has been migrated to freedesktop.org's GitLab instance and has been closed from further activity.

You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.freedesktop.org/gstreamer/gst-plugins-base/issues/300.