GNOME Bugzilla – Bug 741015
videoconvert: Tune quality setting to not degrade performance compared to 1.4
Last modified: 2015-03-15 14:36:05 UTC
See summary. E.g. currently RGBA->I420 is much slower on ARM compared to 1.4.
On my desktop for this line:

gst-launch-1.0 videotestsrc num-buffers=2000 ! video/x-raw,format=RGBA ! videoconvert ! video/x-raw,format=I420 ! fakesink

1.4: Execution ended after 0:00:01.206045132
git: Execution ended after 0:00:00.863105612

Both do exactly the same:

splat and pack RGBA (videotestsrc)
 - both ORC
unpack RGBA
 - both ORC
matrix to YUV
 - 1.4 uses software
 - git uses ORC
chroma downsample lines horizontal and vertical
 - 1.4 uses software for both
 - git uses ORC for both
pack I420
 - both use ORC
 - git has faster path for odd lines

So either the ORC versions are slower than the software ones on arm, or there is a different code path on arm for 1.4 and git. Could you say what is the case?
On pandaboard for the same line:

1.4: Execution ended after 0:00:01.150434705
git: Execution ended after 0:00:02.721955917

The reason is the new matrix8 function that fails to compile (and the orc backup function is generally slower than a plain C version we used to have)

ORC: WARNING: orcrules-neon.c(805): neon_rule_loadpX(): 64-bit parameters not implemented
ORC: WARNING: orcrules-neon.c(805): neon_rule_loadpX(): 64-bit parameters not implemented
ORC: WARNING: orcrules-neon.c(805): neon_rule_loadpX(): 64-bit parameters not implemented
ORC: WARNING: orcrules-neon.c(790): neon_rule_loadpX(): 64-bit constants not implemented
ORC: WARNING: orcrules-neon.c(1220): orc_neon_emit_loadiq(): unimplemented load of constant 255
ORC: WARNING: orccompiler.c(396): orc_program_compile_full(): program video_orc_matrix8 failed to compile, reason 256
ORC: INFO: orccompiler.c(416): orc_program_compile_full(): finished compiling (fail)
Maybe we should be able to provide a custom, optimized backup function to orc. And of course also implement support for 64-bit parameters ;) But I'm expecting this to be much slower on other architectures too, especially when ORC can't be used at all.
Which would basically make videoconvert (for this case) 2.5 times slower in 1.6 compared to 1.4 on these platforms
Setting a custom backup function seems doable. We would need to disable compilation of the orc backup function and instead make it call our own version. In this case, though, it uses a different algorithm that uses fewer registers. It should be possible to get the same performance from the backup function.
with a custom backup function we get this:

1.4: Execution ended after 0:00:01.142854517
git: Execution ended after 0:00:00.886781410

commit f1cfa5bba9824374d769e312381d8f5d85a417bc
Author: Wim Taymans <wtaymans@redhat.com>
Date:   Fri Dec 5 12:01:21 2014 +0100

    orcc: allow setting custom backup function

    Add a new .backup keyword that instructs the orc compiler to call our
    custom backup function instead of generating one. This is interesting
    if the generated backup function is slower than a plain C
    implementation.
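
For readers unfamiliar with the idea, the fallback pattern can be sketched in plain C roughly as below. This is only an illustration under assumptions, not the code from the commit: the names matrix8_fn, compiled_matrix8, matrix8_c and matrix8, the 4-bytes-per-pixel layout (alpha followed by three components) and the 8.8 fixed-point coefficient format are all hypothetical, and no real ORC API is used. The point is just that when the JIT-compiled kernel is unavailable, a hand-written C loop runs instead of a generated emulation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical kernel type: transforms n 4-byte pixels (alpha + 3 components)
 * with a 3x4 matrix: 3 coefficients in 8.8 fixed point plus an integer offset. */
typedef void (*matrix8_fn) (uint8_t *dst, const uint8_t *src,
    const int16_t m[3][4], size_t n);

static uint8_t
clamp_u8 (int v)
{
  return (uint8_t) (v < 0 ? 0 : (v > 255 ? 255 : v));
}

/* Hand-written C fallback (the "custom backup" in this sketch). */
static void
matrix8_c (uint8_t *dst, const uint8_t *src, const int16_t m[3][4], size_t n)
{
  assert (dst != NULL && src != NULL);

  for (size_t i = 0; i < n; i++) {
    /* Read the whole pixel first so that dst == src also works. */
    int a = src[4 * i + 0];
    int r = src[4 * i + 1];
    int g = src[4 * i + 2];
    int b = src[4 * i + 3];

    dst[4 * i + 0] = (uint8_t) a;       /* alpha is copied unchanged */
    for (int c = 0; c < 3; c++) {
      /* +128 rounds to nearest for non-negative sums before dividing by 256. */
      int sum = m[c][0] * r + m[c][1] * g + m[c][2] * b + m[c][3] * 256 + 128;
      dst[4 * i + 1 + c] = clamp_u8 (sum / 256);
    }
  }
}

/* Hypothetical slot for a JIT-compiled kernel; stays NULL when compilation
 * fails, as in the NEON log above. */
static matrix8_fn compiled_matrix8 = NULL;

/* Entry point: prefer the compiled kernel, otherwise use the C fallback. */
void
matrix8 (uint8_t *dst, const uint8_t *src, const int16_t m[3][4], size_t n)
{
  matrix8_fn fn = compiled_matrix8 != NULL ? compiled_matrix8 : matrix8_c;

  fn (dst, src, m, n);
}
```

With an identity matrix ({256,0,0,0}, {0,256,0,0}, {0,0,256,0}) the output pixels equal the input pixels, which makes a convenient sanity check for a fallback like this.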
Everything seems to be fast enough here now. The main bottleneck back then was the chroma subsampling, which is basically the same code as videoscale... which is now equivalently fast to before or faster. And in general we have far more fastpaths now.
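
As a rough picture of what the chroma subsampling step mentioned above has to do for 4:2:0 output, here is a minimal C sketch that averages each 2x2 block of a full-resolution chroma plane. It is an assumption-laden illustration, not videoconvert's actual resampler: the function name chroma_downsample_2x2 is hypothetical, a plain box filter is used, chroma siting is ignored, and width and height are assumed to be even.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: halve a chroma plane in both directions by averaging
 * each 2x2 block (box filter). src is width x height, dst is
 * (width / 2) x (height / 2); both are tightly packed (stride == width). */
static void
chroma_downsample_2x2 (uint8_t *dst, const uint8_t *src,
    size_t width, size_t height)
{
  assert (dst != NULL && src != NULL);
  assert (width % 2 == 0 && height % 2 == 0);

  for (size_t y = 0; y < height; y += 2) {
    const uint8_t *top = src + y * width;       /* even line */
    const uint8_t *bot = top + width;           /* odd line below it */
    uint8_t *out = dst + (y / 2) * (width / 2);

    for (size_t x = 0; x < width; x += 2) {
      unsigned sum = top[x] + top[x + 1] + bot[x] + bot[x + 1];

      out[x / 2] = (uint8_t) ((sum + 2) / 4);   /* rounded average */
    }
  }
}
```

A loop like this touches every chroma sample of every frame, so whether it runs as SIMD or as scalar code has a visible effect on timings such as the ones quoted in this bug.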