GNOME Bugzilla – Bug 773073
audioconvert: endian conversion optimization
Last modified: 2016-11-29 22:14:59 UTC
Endian conversion in the audioconvert plugin is highly inefficient. On Cortex A7 @ 500MHz it takes additional 7-8% CPU to convert S24LE to S24BE, similar results can be observed for 16 and 32 bits. This is unacceptably high. The current implementation performs 24->32 unpack, 24-bit quantization and 32->24 bit pack. It requires to allocate additional buffers as it cannot run in place: 0:00:00.264383167 406 0x304320 LOG audio-converter audio-converter.c:322:gst_audio_converter_update_config: new rate 0 -> 0 0:00:00.265950834 406 0x304320 INFO audio-converter audio-converter.c:863:gst_audio_converter_new: unitsizes: 6 -> 6 0:00:00.266781125 406 0x304320 INFO audio-converter audio-converter.c:566:chain_unpack: unpack format S24LE to S32LE 0:00:00.267911417 406 0x304320 INFO audio-converter audio-converter.c:623:chain_mix: mix format S32LE, passthrough 1, in_channels 2, out_channels 2 0:00:00.268835292 406 0x304320 INFO audio-converter audio-converter.c:674:chain_quantize: depth in 32, out 24 0:00:00.269497709 406 0x304320 INFO audio-converter audio-converter.c:686:chain_quantize: using no dither and noise shaping 0:00:00.270300750 406 0x304320 INFO audio-converter audio-converter.c:698:chain_quantize: quantize to 24 bits, dither 0, ns 0 0:00:00.271145542 406 0x304320 INFO audio-converter audio-converter.c:721:chain_pack: pack format S32LE to S24BE 0:00:00.271963250 406 0x304320 INFO audio-converter audio-converter.c:884:gst_audio_converter_new: do full conversion 0:00:00.272734250 406 0x304320 LOG audio-converter audio-converter.c:747:setup_allocators: chain 0x3016c8: 1 1 0:00:00.273480709 406 0x304320 LOG audio-converter audio-converter.c:747:setup_allocators: chain 0x301688: 1 0 0:00:00.277083625 406 0x304320 INFO audioconvert gstaudioconvert.c:224:gst_audio_convert_get_unit_size:<audioconvert0> unit_size = 6 0:00:00.278164584 406 0x304320 INFO audioconvert gstaudioconvert.c:224:gst_audio_convert_get_unit_size:<audioconvert0> unit_size = 6 Similar behaviour can be observed with 16 and 32 bits. On the other hand with SIMD instructions the conversion can be done very efficiently. E.g. only 3 NEON instructions are needed to convert eight 24-bit values in place: VLD3.8 VSWP d0,d2 VST3.8 or to convert sixteen 16-bit values in place: VLD2.16 VSWP d0,d1 VST2.16 The attempt here is to discuss the best way how to integrate these instructions into the audioconvert module, still keeping the portability. Also without SIMD instructions the code needs to be optimized (e.g. to perform the conversion in place). Eventually the discussion should result into a patch.
There should be endianness conversion fast-paths, implemented in C. A useful compiler should already optimize them quite well, possibly even using SIMD instructions. Once that is there, we can consider adding ORC optimizations for those if it actually makes a performance difference over the C code with an optimizing compiler. Adding actual platform specific assembly should be only the very last resort, if nothing else works and it brings a big performance improvement.
(In reply to Sebastian Dröge (slomo) from comment #1) > There should be endianness conversion fast-paths, implemented in C. A useful > compiler should already optimize them quite well, possibly even using SIMD > instructions. Are there such functions already available?
Check the audioconvert code, there are functions for more or less that but that involves packing from S32 too. Basically you write a simple loop for e.g. 16 bit samples: > for (i = 0; i < nsamples; i++) > dst[i] = GUINT16_SWAP_LE_BE(src[i]) and that's it. Same thing for 24, 32 and 64 bits. Any compiler should be able to optimize that, even with SIMD instructions if it supports auto-vectorization.
Created attachment 338099 [details] [review] Patch for endian conversion fast-path - version 1 Here is the first attempt, any feedback is appreciated. There are still two issues I'm trying to resolve: 1) 24-bit does not compile in very efficient code. With -O3 it is 25% slower than 32-bit and 45% slower than 16-bit. With -O2 it is 60% slower than both 16-bit and 32-bit. So I'm trying to find an algorithm that gcc can optimize better. 2) The base transform class still allocates buffers for audioconvert, which is no longer needed. How can I tell it the conversion can be done in place and no output buffer is needed?
(In reply to Petr Kulhavy from comment #4) > Created attachment 338099 [details] [review] [review] > Patch for endian conversion fast-path - version 1 > > Here is the first attempt, any feedback is appreciated. > > There are still two issues I'm trying to resolve: > 1) 24-bit does not compile in very efficient code. With -O3 it is 25% slower > than 32-bit and 45% slower than 16-bit. With -O2 it is 60% slower than both > 16-bit and 32-bit. So I'm trying to find an algorithm that gcc can optimize > better. You might be able to get something more optimal if you unroll the loop. 4 * 24 bits are 3 * 32 bits, so the pattern of operations repeats every 4 samples if you handle them as if they were 32 bit integers. Worth experimenting with that. > 2) The base transform class still allocates buffers for audioconvert, which > is no longer needed. How can I tell it the conversion can be done in place > and no output buffer is needed? You need to enable in place transformation for that: gst_base_transform_set_in_place().
Review of attachment 338099 [details] [review]: ::: gst-libs/gst/audio/audio-converter.c @@ +879,3 @@ + :"r" (in) + :"memory", "d0", "d1"); + } Don't add assembly code here, but use orc instead. See the existing orc code for the conversion. The swap* opcodes are what you want here @@ +1117,3 @@ + convert->convert = converter_endian; + + switch (in_info->bpf / in_info->channels) { in_info->bps :)
(In reply to Sebastian Dröge (slomo) from comment #5) > You might be able to get something more optimal if you unroll the loop. 4 * > 24 bits are 3 * 32 bits, so the pattern of operations repeats every 4 > samples if you handle them as if they were 32 bit integers. Worth > experimenting with that. Thanks, I was thinking of that as well. Have it already more or less on paper. Just the gcc assembly analysing takes time... ;-) > > 2) The base transform class still allocates buffers for audioconvert, which > > is no longer needed. How can I tell it the conversion can be done in place > > and no output buffer is needed? > > You need to enable in place transformation for that: > gst_base_transform_set_in_place(). Thanks for the hint!
Created attachment 338186 [details] [review] Simplification the chain free process While writing the endian conversion I've simplified the chain free process.
Review of attachment 338186 [details] [review]: I'm generally not very convinced that any changes here are making anything simpler :) ::: gst-libs/gst/audio/audio-converter.c @@ +210,1 @@ audio_chain_free (AudioChain * chain) That's not nice API IMHO. Freeing something should just free it, not return some "arbitrary" other thing It also does not seem to really simplify anything @@ +1165,3 @@ + /* walk the chain backwards and free all elements */ + for (chain = convert->pack_chain; chain; chain = audio_chain_free (chain)) { + } Instead you could also do chain = convert->pack_chain; while (chain) { next = chain->next; audio_chain_free (chain); chain = next; } If doing that, it would also be nice to not call it pack_chain as it's the whole chain actually.
Created attachment 338196 [details] [review] Simplification of the chain free process - version 2 Agree, that makes more sense. Here is the corrected version of the patch.
(In reply to Sebastian Dröge (slomo) from comment #5) > > 2) The base transform class still allocates buffers for audioconvert, which > > is no longer needed. How can I tell it the conversion can be done in place > > and no output buffer is needed? > > You need to enable in place transformation for that: > gst_base_transform_set_in_place(). This is a bit more complex than just that. GstAudioConvert implements only _transform and no _transform_ip. So first _transform_ip needs to be implemented. What is a good example of _transform_ip ? It would improve also signed-unsigned conversion.
The volume element maybe. And yes, signedness conversion is basically the same thing as endianness conversion in this regard.
(In reply to Sebastian Dröge (slomo) from comment #6) > Review of attachment 338099 [details] [review] [review]: > @@ +1117,3 @@ > + convert->convert = converter_endian; > + > + switch (in_info->bpf / in_info->channels) { > > in_info->bps :) There is no such thing as in_info->bps :(
(In reply to Petr Kulhavy from comment #13) > (In reply to Sebastian Dröge (slomo) from comment #6) > > Review of attachment 338099 [details] [review] [review] [review]: > > @@ +1117,3 @@ > > + convert->convert = converter_endian; > > + > > + switch (in_info->bpf / in_info->channels) { > > > > in_info->bps :) > > There is no such thing as in_info->bps :( GST_AUDIO_INFO_BPS(info) I meant, sorry
(In reply to Sebastian Dröge (slomo) from comment #5) > > There are still two issues I'm trying to resolve: > > 1) 24-bit does not compile in very efficient code. With -O3 it is 25% slower > > than 32-bit and 45% slower than 16-bit. With -O2 it is 60% slower than both > > 16-bit and 32-bit. So I'm trying to find an algorithm that gcc can optimize > > better. > > You might be able to get something more optimal if you unroll the loop. 4 * > 24 bits are 3 * 32 bits, so the pattern of operations repeats every 4 > samples if you handle them as if they were 32 bit integers. Worth > experimenting with that. The 3*32 bits algorithm for the 24-bit conversion is better at -O2, but worse at -O3. With -O2 it is 10% slower than the 32-bit conversion, which is OK. With -O3 it is 60% slower than the 32-bit conversion - that's bad :-( So I will stay with the naive algorithm.
That sounds like a compiler bug though
(In reply to Sebastian Dröge (slomo) from comment #16) > That sounds like a compiler bug though The compiler is not able to find out it's a 24-bit endian conversion. So it just optimizes the code. And it seems it is able to better optimize the simple byte swapping than the complex shifts, adds and ands.
Created attachment 338215 [details] [review] Patch for endian conversion fast-path - version 2 Here is a modified version of the endian conversion patch. It contains: * 64-bit support * assembly code removed * added "clever algorithm" for 24-bits, which processes 3x 32-bit in one cycle. Due to worse performance with -O3 I've put it into #if 0 section * using GST_AUDIO_INFO_BPS(info) I'm too unfamiliar with ORC to do the the ORC implementations. They need to write dedicated conversion functions for 24 bits, which I don't know how to write. The next thing, which I believe would actually bringing more performance gain, is to allow endian (and later also signed) conversion to run in place. I will do this in a separate patch.
Review of attachment 338215 [details] [review]: Generally looks good. Do you have any questions about the in-place implementation? For the ORC code, for 32 bit you would basically do: > .function audio_orc_swap_endianness_s32 > .dest 4 d1 gint32 > .source 4 s1 gint32 > > swapl d1, s1 and then could just call that function. The autogenerated C code should be equivalent to what you do, but it would end up with optimized assembly for SSE and NEON for example. Which hopefully is faster than what the compiler optimizes your code too. Should probably be measured also. ::: gst-libs/gst/audio/audio-converter.c @@ +923,3 @@ + z = GUINT32_SWAP_LE_BE (in[2]); + +#ifdef WORDS_BIGENDIAN G_BYTE_ORDER == G_BIG_ENDIAN ::: gst-libs/gst/audio/audio-format.h @@ +267,3 @@ #define GST_AUDIO_FORMAT_INFO_DEPTH(info) ((info)->depth) +#define GST_AUDIO_FORMAT_IS_ENDIAN_CONVERSION(info1, info2) \ This should probably be internal, not public API
Thank you for the review and the extra info. How is the ORC integrated into the code? Does it have to be enclosed in something like #if ORC_ENABLED and otherwise use the C implementation? Or does it take care of everything automatically? And then where to store the ORC code? In ORC I have not found anything for 24-bit handling other than 24->32 bit expansion. 24 bits is the actual difficult part that needs optimization. How would that be written? Regarding the in-place implementation my idea is the following: * gst_audio_converter_new() gets an extra boolean * parameter, where it returns if the conversion can be done in-place * gst_audio_convert_transform() calls gst_base_transform_set_in_place() if the above returned value is true * new function gst_audio_convert_transform_ip() needs to be added into gstaudioconvert.c - here I'm not sure if it has to be completely separate implementation or it can just call gst_audio_convert_transform(base, buf, buf); The convert_endian() function can already handle in==out
(In reply to Petr Kulhavy from comment #20) > Thank you for the review and the extra info. > > How is the ORC integrated into the code? Does it have to be enclosed in > something like #if ORC_ENABLED and otherwise use the C implementation? Or > does it take care of everything automatically? > > And then where to store the ORC code? Check the .orc file in gst-libs/gst/audio and just add your stuff there :) > In ORC I have not found anything for 24-bit handling other than 24->32 bit > expansion. 24 bits is the actual difficult part that needs optimization. How > would that be written? That's actually a good question and I don't know. ORC works on power-of-two sized types, 24 bits doesn't fit in there. It might make sense to include assembly for that specific conversion, but that's a big maintainence burden in the long run. > Regarding the in-place implementation my idea is the following: > * gst_audio_converter_new() gets an extra boolean * parameter, where it > returns if the conversion can be done in-place > > * gst_audio_convert_transform() calls gst_base_transform_set_in_place() if > the above returned value is true It would be better to call set_in_place() in set_caps(), which is the point where you have all informations about which formats to convert between. > * new function gst_audio_convert_transform_ip() needs to be added into > gstaudioconvert.c - here I'm not sure if it has to be completely separate > implementation or it can just call gst_audio_convert_transform(base, buf, > buf); > The convert_endian() function can already handle in==out That can probably be the same function, just ensure to add an assertion that in==out is only allowed if the conversion can be done in-place
Created attachment 338283 [details] [review] Patch for endian conversion fast-path - version 3 The third version of the endian fast-path patch. Modifications: * implemented transform_ip() method for GstAudioConvert object and gst_base_transform_set_in_place() * use G_BYTE_ORDER == G_BIG_ENDIAN * GST_AUDIO_FORMAT_IS_ENDIAN_CONVERSION is now private * gst_audio_converter_new now has an additional parameter in_place Now the endian conversion is done in place and performs far better. No extra buffers are allocated and in most cases the buffer handling fits into the data cache. In the remaining cases the CPU usage is 4% lower than without this patch. Still without ORC as I need to find out how to generate the gstaudiopack-dist.c.
That's done automatically on release, or with "make orc-update" inside the directory with the .orc file.
Finally managed to compile ORC :-) Are there ORC equivalents for NEON VLD3.x and VST3.x to do the 24-bit conversion?
No, see comment 21 related to this :)
Comment on attachment 338283 [details] [review] Patch for endian conversion fast-path - version 3 >--- a/gst-libs/gst/audio/audio-converter.h >+++ b/gst-libs/gst/audio/audio-converter.h >@@ -84,7 +84,8 @@ typedef enum { > GstAudioConverter * gst_audio_converter_new (GstAudioConverterFlags flags, > GstAudioInfo *in_info, > GstAudioInfo *out_info, >- GstStructure *config); >+ GstStructure *config, >+ gboolean *in_place); Can't do this, this is public API and adding a new parameter breaks API and ABI. >diff --git a/gst-libs/gst/audio/audio-format.h b/gst-libs/gst/audio/audio-format.h >index 0fbc415..5a9c06f 100644 >--- a/gst-libs/gst/audio/audio-format.h >+++ b/gst-libs/gst/audio/audio-format.h >@@ -266,7 +266,6 @@ GType gst_audio_format_info_get_type (void); > ... >- > ... Superfluous line/whitespace change, please remove :)
Could just have a gst_audio_convert_supports_passthrough() or similar
(In reply to Sebastian Dröge (slomo) from comment #25) > No, see comment 21 related to this :) That's a pity. 24-bit format is used in professional audio and 24-bit operations could be also useful for RGB video transformations (channel swapping). Shall I use inline assembly then? If yes is there a way to detect the target platform and in particular NEON with preprocessor?
Created attachment 338301 [details] [review] Patch for endian conversion fast-path - version 4 Corrected version 3: * removed extra whitespace in audio-formats.h * no API change in gst_audio_converter_new(), use new "method" gst_audio_converter_supports_inplace() instead
Created attachment 338303 [details] [review] Patch for ORC endian conversion This patch implements ORC for 16-, 32- and 64-bit endian conversion.
Cool! Out of curiosity, did you benchmark the ORC variants? Are they faster than what the compiler does by default with the C functions? (I guess the compiler won't do anything fancy here if it's not told to do so for a specific extension?)
(In reply to Petr Kulhavy from comment #28) > (In reply to Sebastian Dröge (slomo) from comment #25) > > No, see comment 21 related to this :) > > That's a pity. 24-bit format is used in professional audio and 24-bit > operations could be also useful for RGB video transformations (channel > swapping). Usually people use xRGB or variants of that exactly because of that reason. It's generally faster to handle as there are instructions for doing various things on 32 bit values, but not many for 24 bits. > Shall I use inline assembly then? If yes is there a way to detect the target > platform and in particular NEON with preprocessor? You can ask ORC if NEON is supported at runtime, and for the compile-time part take a look at what the resampler does. It IIRC has some NEON specific assembly too already. (In reply to Tim-Philipp Müller from comment #31) > Cool! Out of curiosity, did you benchmark the ORC variants? Are they faster > than what the compiler does by default with the C functions? (I guess the > compiler won't do anything fancy here if it's not told to do so for a > specific extension?) Also if the ORC backup C code compiles to something as fast as the manual implementation in C.
Side note - in case you're feeling adventurous :) - There's an unfinished orc branch here https://cgit.freedesktop.org/~wtay/orc/log/?h=orc-0.5 that would allow much better optimisations, also see https://gstconf.ubicast.tv/videos/language-and-tools-for-describing-and-executing-low-ievel-computations-on-modem-cpus/ for more details.
(In reply to Tim-Philipp Müller from comment #31) > Cool! Out of curiosity, did you benchmark the ORC variants? Are they faster > than what the compiler does by default with the C functions? (I guess the > compiler won't do anything fancy here if it's not told to do so for a > specific extension?) Not yet, but on the CPU % the difference is not visible. In fact the convert loop brings just a marginal gain. The main optimization potential/gain in my test environment is data cache usage, allocating and copying buffers. The endian fast-path brought 4-5% CPU gain because of that.
Created attachment 338401 [details] Performance measurement results Here are the results of my performance measurement of various algorithms. The measurement is based on an artificial test, which runs the endian-swap function in a loop on a buffer of a given size (in samples). The execution time was measured on Cortex A7, 528MHz. The test program was compiled in several variants: O2, O3, with naive algorithm, with unrolled loop for the 24-bit conversion and with NEON. Unfortunately I wasn't able to include ORC since the test uses static C code. The NEON version worked only with -O0, otherwise it clashed with the optimization and segmentation fault happened. This probably due to my ARM assembly ignorance. In the table you can see the average duration of one cycle for different algorithms/optimizations. The naive algorithm with -O3 scores the best, then comes the unrolled -O3 (24-bit only), NEON -O0, unrolled -O2 and as last the naive -O2. In any case we are talking about 4-17 nanoseconds. That means with 1% CPU usage (1s being 100%, i.e. 1% being 10ms) the worst algorithm can process over 500'000 samples, i.e. 192kHz / 3 channels. The naive -O3 algorithm can process 192kHz/ 8 channels with 1% CPU usage. That is pretty good. As previously explained the major CPU % loss is due to buffer allocation and not-in-place conversion, which I have eliminated. So I'm going to close the work here, sticking with the naive algorithm. Any further optimization is not worth more effort at the moment.
Ok so you'd say we do the naive C implementations and ignore the ORC parts, as well as the assembly implementations? Fine with me, simplifies things :) Can you provide an overall patch with all that, and with the assembly parts removed to keep the noise lower?
I'd say let's leave the ORC parts in for 16, 32 and 64 bits as they are trivial ORC functions, they don't harm and if disabled they compile anyway into the naive algorithm. I don't think the generated ORC performs significantly worse than the naive implementation, actually I would expect the opposite. The only remaining thing is 24-bits and there the best performing and simplest approach (in terms of code legibility and maintenance) seems to be the naive algorithm. If you agree to that I will create and upload just one patch that includes all changes.
Let's do that then :)
Created attachment 338444 [details] [review] Patch for endian conversion fast-path - version 5 Here is the (hopefully) final version with ORC merged into one patch.
Review of attachment 338444 [details] [review]: Generally looks good, thanks :) Just some minor remarks below ::: gst-libs/gst/audio/audio-converter.c @@ +863,3 @@ + */ +static void +converter_swap_endian_24 (gpointer dst, const gpointer src, gint count) Worries me a bit that dst and src can alias (in place transform), a compiler might optimize the code below to something broken in theory: it could get "x" after writing to in[i] for optimization reasons under the assumption that they don't alias Maybe we need a version for in-place and not? @@ +1002,3 @@ } +#define GST_AUDIO_FORMAT_IS_ENDIAN_CONVERSION(info1, info2) \ And next step, sign conversion \o/ :) @@ +1080,3 @@ + ("same formats, no resampler and passthrough mixing -> passthrough"); + convert->convert = converter_passthrough; + /* TODO: implement also in-place conversion */ Isn't this (converter_passthrough) never called anyway as basetransform will work in passthrough mode then? And converter_passthrough could just check for in==out and return then @@ +1116,3 @@ + default: + GST_ERROR ("unsupported sample width for endian conversion"); + g_assert (0); g_assert_not_reached() ::: gst/audioconvert/gstaudioconvert.c @@ +791,3 @@ +gst_audio_convert_transform_ip (GstBaseTransform * base, GstBuffer * buf) +{ + return gst_audio_convert_transform (base, buf, buf); Maybe needs a g_assert() for gst_audio_converter_supports_inplace() to return TRUE :)
(In reply to Sebastian Dröge (slomo) from comment #40) > Review of attachment 338444 [details] [review] [review]: > > Generally looks good, thanks :) Just some minor remarks below > > ::: gst-libs/gst/audio/audio-converter.c > @@ +863,3 @@ > + */ > +static void > +converter_swap_endian_24 (gpointer dst, const gpointer src, gint count) > > Worries me a bit that dst and src can alias (in place transform), a compiler > might optimize the code below to something broken in theory: it could get > "x" after writing to in[i] for optimization reasons under the assumption > that they don't alias > > Maybe we need a version for in-place and not? That's a good point. I thought that introducing the 'x' would solve it but haven't thought of it being optimized out, there you're absolutely right. Would the XOR algorithm solve it? for (i = 0; i < count; i += 3) { guint8 a = in[i + 0]; guint8 c = in[i + 2]; a ^= c; c ^= a; a ^= c; out[i + 0] = a; out[i + 1] = in[i + 1]; out[i + 2] = c; } And isn't this also an issue for the ORC versions? I'm also thinking that the function header (src being const) might be formally wrong because it is called with src==dst. I'm trying to avoid another function if possible because it makes either the code less legible and more complex (another function pointer, all functions doubled, and another test in convert_endian() ) or inefficient (if the test is done inside the swap_endian() itself). > @@ +1002,3 @@ > } > > +#define GST_AUDIO_FORMAT_IS_ENDIAN_CONVERSION(info1, info2) \ > > And next step, sign conversion \o/ :) Yes, but let's finish this one first, please :-) > @@ +1080,3 @@ > + ("same formats, no resampler and passthrough mixing -> > passthrough"); > + convert->convert = converter_passthrough; > + /* TODO: implement also in-place conversion */ > > Isn't this (converter_passthrough) never called anyway as basetransform will > work in passthrough mode then? And converter_passthrough could just check > for in==out and return then This code has already been already there I haven't touched it. But in my tests I haven't seen the converter_passthrough to be called at all. This might be due to "basetransform_class->passthrough_on_same_caps = TRUE;" being set in gst_audio_convert_class_init() But on the other hand that is outside of this class... So formally the converter_passthrough is IMHO correct. It just makes no sense ;-) > @@ +1116,3 @@ > + default: > + GST_ERROR ("unsupported sample width for endian conversion"); > + g_assert (0); > > g_assert_not_reached() Thanks. > ::: gst/audioconvert/gstaudioconvert.c > @@ +791,3 @@ > +gst_audio_convert_transform_ip (GstBaseTransform * base, GstBuffer * buf) > +{ > + return gst_audio_convert_transform (base, buf, buf); > > Maybe needs a g_assert() for gst_audio_converter_supports_inplace() to > return TRUE :) Isn't this one in _transform() enough? /* allow inbuf==outbuf (i.e. entry from transform_ip) * only if the buffer is writable */ g_assert (inbuf != outbuf || inbuf_writable);
(In reply to Petr Kulhavy from comment #41) > (In reply to Sebastian Dröge (slomo) from comment #40) > > Review of attachment 338444 [details] [review] [review] [review]: > > > > Generally looks good, thanks :) Just some minor remarks below > > > > ::: gst-libs/gst/audio/audio-converter.c > > @@ +863,3 @@ > > + */ > > +static void > > +converter_swap_endian_24 (gpointer dst, const gpointer src, gint count) > > > > Worries me a bit that dst and src can alias (in place transform), a compiler > > might optimize the code below to something broken in theory: it could get > > "x" after writing to in[i] for optimization reasons under the assumption > > that they don't alias > > > > Maybe we need a version for in-place and not? > > That's a good point. I thought that introducing the 'x' would solve it but > haven't thought of it being optimized out, there you're absolutely right. > Would the XOR algorithm solve it? > [...] Maybe but that's really awful :) > And isn't this also an issue for the ORC versions? Yes, ORC explicitly requires that there is no aliasing in the function arguments (it even uses the C99 restrict keyword). > I'm also thinking that the function header (src being const) might be > formally wrong because it is called with src==dst. > > I'm trying to avoid another function if possible because it makes either the > code less legible and more complex (another function pointer, all functions > doubled, and another test in convert_endian() ) or inefficient (if the test > is done inside the swap_endian() itself). Could generate those duplicated functions with a macro, I think that's my preferred solution right now. Let's do any fixes for this as a commit on top of the existing one though. > > @@ +1080,3 @@ > > + ("same formats, no resampler and passthrough mixing -> > > passthrough"); > > + convert->convert = converter_passthrough; > > + /* TODO: implement also in-place conversion */ > > > > Isn't this (converter_passthrough) never called anyway as basetransform will > > work in passthrough mode then? And converter_passthrough could just check > > for in==out and return then > > This code has already been already there I haven't touched it. But in my > tests I haven't seen the converter_passthrough to be called at all. This > might be due to "basetransform_class->passthrough_on_same_caps = TRUE;" > being set in gst_audio_convert_class_init() But on the other hand that is > outside of this class... So formally the converter_passthrough is IMHO > correct. It just makes no sense ;-) Well, something else might use the API and call it :) I guess that TODO comment just makes no sense. > > ::: gst/audioconvert/gstaudioconvert.c > > @@ +791,3 @@ > > +gst_audio_convert_transform_ip (GstBaseTransform * base, GstBuffer * buf) > > +{ > > + return gst_audio_convert_transform (base, buf, buf); > > > > Maybe needs a g_assert() for gst_audio_converter_supports_inplace() to > > return TRUE :) > > Isn't this one in _transform() enough? > > /* allow inbuf==outbuf (i.e. entry from transform_ip) > * only if the buffer is writable */ > g_assert (inbuf != outbuf || inbuf_writable); Yes, but it would be more explicit and understandable to have (also) this assertion in transform_ip()
(In reply to Sebastian Dröge (slomo) from comment #42) > > That's a good point. I thought that introducing the 'x' would solve it but > > haven't thought of it being optimized out, there you're absolutely right. > > Would the XOR algorithm solve it? > > [...] > > Maybe but that's really awful :) Well, I call this smart and elegant ;-) > > And isn't this also an issue for the ORC versions? > > Yes, ORC explicitly requires that there is no aliasing in the function > arguments (it even uses the C99 restrict keyword). Hmmm, that moves us slightly backwards... What is then the way to do ORC endian swap in place? > > I'm trying to avoid another function if possible because it makes either the > > code less legible and more complex (another function pointer, all functions > > doubled, and another test in convert_endian() ) or inefficient (if the test > > is done inside the swap_endian() itself). > > Could generate those duplicated functions with a macro, I think that's my > preferred solution right now. Let's do any fixes for this as a commit on top > of the existing one though. Not sure if I understand what you mean. Could you give an example, please. > > > @@ +1080,3 @@ > > > + ("same formats, no resampler and passthrough mixing -> > > > passthrough"); > > > + convert->convert = converter_passthrough; > > > + /* TODO: implement also in-place conversion */ > > > > > > Isn't this (converter_passthrough) never called anyway as basetransform will > > > work in passthrough mode then? And converter_passthrough could just check > > > for in==out and return then > > > > This code has already been already there I haven't touched it. But in my > > tests I haven't seen the converter_passthrough to be called at all. This > > might be due to "basetransform_class->passthrough_on_same_caps = TRUE;" > > being set in gst_audio_convert_class_init() But on the other hand that is > > outside of this class... So formally the converter_passthrough is IMHO > > correct. It just makes no sense ;-) > > Well, something else might use the API and call it :) I guess that TODO > comment just makes no sense. The point is that the introduction of transform_ip() semantically slightly changes the API of audio-converter. So far the conversion has never been thought to be in place (even though it wasn't very clear), but now by implementing transform_ip() we are de-facto allowing it. So if for whatever reason someone decides to call convert() in passthrough (just because they can) it would be logical and consistent that it indicates in-place=TRUE and convert() doesn't do anything if it detects src==dest. Effectively it's a duplication of what's been already done in base-transform, but that's the price for having audioconvert as a public class. > > > ::: gst/audioconvert/gstaudioconvert.c > > > @@ +791,3 @@ > > > +gst_audio_convert_transform_ip (GstBaseTransform * base, GstBuffer * buf) > > > +{ > > > + return gst_audio_convert_transform (base, buf, buf); > > > > > > Maybe needs a g_assert() for gst_audio_converter_supports_inplace() to > > > return TRUE :) > > > > Isn't this one in _transform() enough? > > > > /* allow inbuf==outbuf (i.e. entry from transform_ip) > > * only if the buffer is writable */ > > g_assert (inbuf != outbuf || inbuf_writable); > > Yes, but it would be more explicit and understandable to have (also) this > assertion in transform_ip() Agree, but probably also slower as it is an additional function call.
(In reply to Petr Kulhavy from comment #43) > > > And isn't this also an issue for the ORC versions? > > > > Yes, ORC explicitly requires that there is no aliasing in the function > > arguments (it even uses the C99 restrict keyword). > > Hmmm, that moves us slightly backwards... What is then the way to do ORC > endian swap in place? Have an orc function with only one argument, the destination array. > > > > @@ +1080,3 @@ > > > > + ("same formats, no resampler and passthrough mixing -> > > > > passthrough"); > > > > + convert->convert = converter_passthrough; > > > > + /* TODO: implement also in-place conversion */ > > > > > > > > Isn't this (converter_passthrough) never called anyway as basetransform will > > > > work in passthrough mode then? And converter_passthrough could just check > > > > for in==out and return then > > > > > > This code has already been already there I haven't touched it. But in my > > > tests I haven't seen the converter_passthrough to be called at all. This > > > might be due to "basetransform_class->passthrough_on_same_caps = TRUE;" > > > being set in gst_audio_convert_class_init() But on the other hand that is > > > outside of this class... So formally the converter_passthrough is IMHO > > > correct. It just makes no sense ;-) > > > > Well, something else might use the API and call it :) I guess that TODO > > comment just makes no sense. > > The point is that the introduction of transform_ip() semantically slightly > changes the API of audio-converter. So far the conversion has never been > thought to be in place (even though it wasn't very clear), but now by > implementing transform_ip() we are de-facto allowing it. So if for whatever > reason someone decides to call convert() in passthrough (just because they > can) it would be logical and consistent that it indicates in-place=TRUE and > convert() doesn't do anything if it detects src==dest. > > Effectively it's a duplication of what's been already done in > base-transform, but that's the price for having audioconvert as a public > class. Yes, also trivial to handle that correctly so lets just do that :) > > > > ::: gst/audioconvert/gstaudioconvert.c > > > > @@ +791,3 @@ > > > > +gst_audio_convert_transform_ip (GstBaseTransform * base, GstBuffer * buf) > > > > +{ > > > > + return gst_audio_convert_transform (base, buf, buf); > > > > > > > > Maybe needs a g_assert() for gst_audio_converter_supports_inplace() to > > > > return TRUE :) > > > > > > Isn't this one in _transform() enough? > > > > > > /* allow inbuf==outbuf (i.e. entry from transform_ip) > > > * only if the buffer is writable */ > > > g_assert (inbuf != outbuf || inbuf_writable); > > > > Yes, but it would be more explicit and understandable to have (also) this > > assertion in transform_ip() > > Agree, but probably also slower as it is an additional function call. It's a macro, almost no cost here For the aliasing, actually things are better than expected. There's only a problem with ORC here because it uses "restrict" for the arrays (which hints at the compiler that they will not alias). For the normal C code there is no problem as the compiler is only allowed to assume no aliasing happens if there is "restrict", or there are two pointers of a different type (which is not the case here). I would suggest that for simplicity we get rid of the ORC parts here and just do it all in C. Then we only need one function and not a separate one for in-place. And the speed difference is probably not big (right?).
(In reply to Sebastian Dröge (slomo) from comment #44) > For the aliasing, actually things are better than expected. There's only a > problem with ORC here because it uses "restrict" for the arrays (which hints > at the compiler that they will not alias). For the normal C code there is no > problem as the compiler is only allowed to assume no aliasing happens if > there is "restrict", or there are two pointers of a different type (which is > not the case here). > > I would suggest that for simplicity we get rid of the ORC parts here and > just do it all in C. Then we only need one function and not a separate one > for in-place. And the speed difference is probably not big (right?). Back to the roots? ;-) I agree with you. The performance gain is likely minimal and it's a good price for the simplicity. Let me prepare the patch (on top of this one) then.
Yes, or make a combined patch. It's mostly about removing code then anyway and the combined patch will be simpler :)
Created attachment 338519 [details] [review] Patch for endian conversion fast-path - version 6 Here are the amendments as discussed.
(In reply to Sebastian Dröge (slomo) from comment #44) > > > > > Maybe needs a g_assert() for gst_audio_converter_supports_inplace() to > > > > > return TRUE :) > > > > > > > > Isn't this one in _transform() enough? > > > > > > > > /* allow inbuf==outbuf (i.e. entry from transform_ip) > > > > * only if the buffer is writable */ > > > > g_assert (inbuf != outbuf || inbuf_writable); > > > > > > Yes, but it would be more explicit and understandable to have (also) this > > > assertion in transform_ip() > > > > Agree, but probably also slower as it is an additional function call. > > It's a macro, almost no cost here > Since glib debug is recommended to be always enabled (otherwise destabilizes even mostly bug-free code as the config script warns) this is not just a macro. And valgrind confirms the performance impact: DLmr Ir 3,034 1,122,550 g_assert (gst_audio_converter_supports_inplace (this->convert)); ...after 10205x calls of gst_audio_convert_transform_ip()
Ah that function call. I was considering to store that value somewhere, but anyway, it should be fine to just omit that assertion too.
Review of attachment 338519 [details] [review]: Good except for (mostly) cosmetics, thanks :) ::: gst-libs/gst/audio/audio-converter.c @@ +972,3 @@ +{ + guint64 *out = dst; + const guint64 *in = src; I wonder if "const guint64*" and "guint64*" are considered different types and aliasing bites us again here... should be ok to get rid of the const here and in the function signature ::: gst-libs/gst/audio/gstaudiopack.orc @@ -423,3 @@ muld t1, s1, 2147483648.0L convdl d1, t1 - Spurious whitespace change here ::: gst/audioconvert/gstaudioconvert.c @@ +791,3 @@ + g_assert (gst_audio_converter_supports_inplace (this->convert)); + + return gst_audio_convert_transform (base, buf, buf); Maybe add a comment to transform() that it can be called with in==out from here
(In reply to Sebastian Dröge (slomo) from comment #50) > Review of attachment 338519 [details] [review] [review]: > > Good except for (mostly) cosmetics, thanks :) > > ::: gst-libs/gst/audio/audio-converter.c > @@ +972,3 @@ > +{ > + guint64 *out = dst; > + const guint64 *in = src; > > I wonder if "const guint64*" and "guint64*" are considered different types > and aliasing bites us again here... should be ok to get rid of the const > here and in the function signature In my understanding it should be ok. const int *x and int *y pointing to the same chunk of memory are absolutely legal in C. The const just means the value cannot be changed through x, but it doesn't mean that the value in memory doesn't change. See also e.g. the memmove(), where the src pointer is also const. At the same time memmove() can be called with overlapping buffers. > > ::: gst-libs/gst/audio/gstaudiopack.orc > @@ -423,3 @@ > muld t1, s1, 2147483648.0L > convdl d1, t1 > - > > Spurious whitespace change here That one slipped through, thanks.
I guess that is indeed safe, yes
Created attachment 338531 [details] [review] Patch for endian conversion fast-path - version 7
Review of attachment 338531 [details] [review]: \o/
*** Bug 773207 has been marked as a duplicate of this bug. ***
Comment on attachment 338196 [details] [review] Simplification of the chain free process - version 2 This does not apply cleanly to git master, please rebase (same for the other patch) :) Thanks!
Created attachment 340898 [details] [review] Simplification of the chain free process - version 2 rebased to master
Created attachment 340899 [details] [review] Patch for endian conversion fast-path - version 7 rebased to master I must have missed the email notifications, sorry. First now I see there was some problem applying the patches. Here both rebased to the current master.
commit 010b9547d3832a5c3d71f73726f149b3b023882d Author: Petr Kulhavy <brain@jikos.cz> Date: Wed Oct 19 12:21:37 2016 +0200 audio-converter: optimize endian conversion Optimize LE<->BE conversion by adding a dedicated fast path instead of using the generic converter. Implement transform_ip function in order to do the endian swap in place. This saves buffer allocation for the intermediate format, can be done in place and also performs the conversion in one step instead of unpack-convert-pack. For all bit widths the naive algorithm is implemented, which provides the best performance when compiled with -O3. ORC was considered but eventually removed as it requires a dedicated function for in-place conversion (due to the "restrict" parameters). A more complex algorithm for the 24-bit conversion with unrolled loop and 32-bit processing is implemented in the #if 0 section. It performs better if compiled with -O2. With -O3 however the naive algorithm performs better. https://bugzilla.gnome.org/show_bug.cgi?id=773073 commit 640c54d8f8cf1a035044fa4f034b866883925a4d Author: Petr Kulhavy <brain@jikos.cz> Date: Fri Oct 21 14:30:31 2016 +0200 audio-convert: simplify the chain free process It is not needed to store a pointer to every single chain element to free it. Instead walk the channel list backwards and free the chain elements one by one. Rename GstAudioConverter->chain_pack to chain_end. https://bugzilla.gnome.org/show_bug.cgi?id=773073
Thanks for updating the patch :) Are you also going to work on in-place sign conversion?
(In reply to Sebastian Dröge (slomo) from comment #60) > Thanks for updating the patch :) Are you also going to work on in-place sign > conversion? Not planning it in the nearest future.
Ok, I'll take a look at that then. Seems simple enough :)
physical audio CD produces crispy sound ..high pitch noise. Just like last year (december 2015) since this commit: https://cgit.freedesktop.org/gstreamer/gst-plugins-base/commit/gst-libs/gst/audio/audio-converter.c?id=08734e7598ced4feae55529cb21609c1654a1dc0 It was solved with this commit: https://cgit.freedesktop.org/gstreamer/gst-plugins-base/commit/gst-libs/gst/audio/audio-converter.c?id=7bbfa39ada905b44743170f6c29d13e5fefb5888 OpenPli dm8000.
Sorry that was just history of audio-convert.c https://cgit.freedesktop.org/gstreamer/gst-plugins-base/commit/?id=08734e7598ced4feae55529cb21609c1654a1dc0 And https://cgit.freedesktop.org/gstreamer/gst-plugins-base/commit/?id=7bbfa39ada905b44743170f6c29d13e5fefb5888
Please file a new bug about that, and ideally also provide a testcase to reproduce the problem (or two pipelines that output different things to e.g. a wav file with/without this change).