Bug 741015 - videoconvert: Tune quality setting to not degrade performance compared to 1.4
Status: RESOLVED FIXED
Product: GStreamer
Classification: Platform
Component: gst-plugins-base
Version: git master
Hardware: Other Linux
Importance: Normal blocker
Target Milestone: 1.5.1
Assigned To: GStreamer Maintainers
QA Contact: GStreamer Maintainers
Reported: 2014-12-02 12:39 UTC by Sebastian Dröge (slomo)
Modified: 2015-03-15 14:36 UTC

Description Sebastian Dröge (slomo) 2014-12-02 12:39:01 UTC
See summary.

For example, RGBA->I420 conversion is currently much slower on ARM than it was in 1.4.
Comment 1 Wim Taymans 2014-12-03 09:13:26 UTC
On my desktop, for this line:

  gst-launch-1.0 videotestsrc num-buffers=2000 ! video/x-raw,format=RGBA ! videoconvert ! video/x-raw,format=I420 ! fakesink

1.4:  Execution ended after 0:00:01.206045132
git:  Execution ended after 0:00:00.863105612

Both perform exactly the same steps:

splat and pack RGBA (videotestsrc)
   - both use ORC
unpack RGBA
   - both use ORC
matrix to YUV
   - 1.4 uses software
   - git uses ORC
chroma downsample lines horizontal and vertical
   - 1.4 uses software for both
   - git uses ORC for both
pack I420
   - both use ORC
   - git has a faster path for odd lines

So either the ORC versions are slower than the software ones on ARM, or 1.4 and git take different code paths on ARM. Could you check which is the case?
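For readers unfamiliar with the "matrix to YUV" step listed above, here is a minimal plain C sketch of an 8-bit colour matrix applied to a line of packed ARGB pixels. It is an illustration only, not the videoconvert code: the Matrix8 type, the matrix8_line name and the coefficients (a common integer approximation of BT.601 limited-range RGB to YCbCr) are all assumptions made for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical 8-bit colour matrix: coefficients are scaled by 256, and the
 * rounding term plus the output offset are pre-folded into `bias`. */
typedef struct {
  int32_t coeff[3][3];
  int32_t bias[3];
} Matrix8;

/* Common integer approximation of BT.601 limited-range RGB -> YCbCr.
 * This is an assumption for illustration, not taken from GStreamer. */
const Matrix8 rgb_to_yuv_bt601 = {
  { {  66, 129,  25 },
    { -38, -74, 112 },
    { 112, -94, -18 } },
  { 128 + (16 << 8), 128 + (128 << 8), 128 + (128 << 8) }
};

static uint8_t clamp_u8 (int32_t v)
{
  return (uint8_t) (v < 0 ? 0 : (v > 255 ? 255 : v));
}

/* Convert n packed ARGB pixels to AYUV in place; alpha is left untouched. */
void matrix8_line (uint8_t *pixels, size_t n, const Matrix8 *m)
{
  assert (pixels != NULL && m != NULL);

  for (size_t i = 0; i < n; i++) {
    uint8_t *p = pixels + 4 * i;
    const int32_t r = p[1], g = p[2], b = p[3];

    for (int c = 0; c < 3; c++) {
      int32_t sum = m->coeff[c][0] * r + m->coeff[c][1] * g
          + m->coeff[c][2] * b + m->bias[c];
      /* Avoid shifting a negative value, then scale back down by 256. */
      p[1 + c] = clamp_u8 (sum < 0 ? 0 : sum >> 8);
    }
  }
}
```

Each output component needs three multiplies and a shift per pixel, which is why this step benefits from SIMD when the ORC program compiles, and why a slow fallback shows up clearly in the timings.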
Comment 2 Wim Taymans 2014-12-04 12:20:17 UTC
On a PandaBoard, for the same line:

1.4: Execution ended after 0:00:01.150434705
git: Execution ended after 0:00:02.721955917

The reason is that the new matrix8 function fails to compile for NEON, so the ORC backup function is used instead, and that is generally slower than the plain C version we used to have:

ORC: WARNING: orcrules-neon.c(805): neon_rule_loadpX(): 64-bit parameters not implemented
ORC: WARNING: orcrules-neon.c(805): neon_rule_loadpX(): 64-bit parameters not implemented
ORC: WARNING: orcrules-neon.c(805): neon_rule_loadpX(): 64-bit parameters not implemented
ORC: WARNING: orcrules-neon.c(790): neon_rule_loadpX(): 64-bit constants not implemented
ORC: WARNING: orcrules-neon.c(1220): orc_neon_emit_loadiq(): unimplemented load of constant 255
ORC: WARNING: orccompiler.c(396): orc_program_compile_full(): program video_orc_matrix8 failed to compile, reason 256
ORC: INFO: orccompiler.c(416): orc_program_compile_full(): finished compiling (fail)
Comment 3 Sebastian Dröge (slomo) 2014-12-04 12:40:49 UTC
Maybe we should be able to provide a custom, optimized backup function to ORC. And of course also implement support for 64-bit parameters ;)

But I expect this to be much slower on other architectures too, especially where ORC can't be used at all.
Comment 4 Sebastian Dröge (slomo) 2014-12-04 12:41:34 UTC
Which would basically make videoconvert (for this case) about 2.5 times slower in 1.6 compared to 1.4 on these platforms.
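As a quick check of that figure against the PandaBoard timings from comment 2, the following C sketch parses the "Execution ended after" values and computes the ratio. The helper names parse_elapsed and slowdown are made up for this example. For the numbers above it gives a ratio of roughly 2.4, which is in line with the approximate figure quoted.

```c
#include <assert.h>
#include <stdio.h>

/* Parse an elapsed time of the form "H:MM:SS.NNNNNNNNN" into seconds.
 * Returns a negative value if the string does not match. */
double parse_elapsed (const char *s)
{
  int hours, minutes;
  double seconds;

  assert (s != NULL);
  if (sscanf (s, "%d:%d:%lf", &hours, &minutes, &seconds) != 3)
    return -1.0;
  return hours * 3600.0 + minutes * 60.0 + seconds;
}

/* How many times slower `candidate` is than `baseline`. */
double slowdown (const char *baseline, const char *candidate)
{
  const double b = parse_elapsed (baseline);
  const double c = parse_elapsed (candidate);

  assert (b > 0.0 && c > 0.0);
  return c / b;
}
```

Calling slowdown ("0:00:01.150434705", "0:00:02.721955917") reproduces the 1.4 versus git comparison on the PandaBoard; a value below 1.0, as with the desktop timings from comment 1, means git is faster.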
Comment 5 Wim Taymans 2014-12-04 14:03:51 UTC
Setting a custom backup function seems doable. We would need to disable compilation of the ORC backup function and instead make it call our own version.

In this case, though, it uses a different algorithm that uses fewer registers. It should be possible to get the same performance from the backup function.
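The idea of calling our own version when no compiled code is available can be sketched in C as a simple function-pointer fallback. This is only an illustration of the dispatch, not ORC's real data structures or API: Kernel, kernel_run and custom_invert_line are hypothetical names, and the byte-inverting kernel is a stand-in for a real conversion function.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef void (*LineFunc) (uint8_t *dst, const uint8_t *src, size_t n);

/* Hypothetical kernel descriptor; not ORC's real data structure. */
typedef struct {
  LineFunc compiled;   /* JIT-compiled code, or NULL if compilation failed */
  LineFunc backup;     /* function to call when `compiled` is NULL */
} Kernel;

/* Stand-in for a hand-written plain C implementation. */
void custom_invert_line (uint8_t *dst, const uint8_t *src, size_t n)
{
  for (size_t i = 0; i < n; i++)
    dst[i] = (uint8_t) (255 - src[i]);
}

/* Run the kernel: prefer compiled code, otherwise use the backup. */
void kernel_run (const Kernel *k, uint8_t *dst, const uint8_t *src, size_t n)
{
  assert (k != NULL && k->backup != NULL);

  LineFunc fn = k->compiled != NULL ? k->compiled : k->backup;
  fn (dst, src, n);
}
```

With this shape, the only change needed to use a hand-tuned fallback is to store it in the backup slot instead of the generated one; callers of kernel_run are unaffected.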
Comment 6 Wim Taymans 2014-12-05 14:14:26 UTC
With a custom backup function we get this:

1.4: Execution ended after 0:00:01.142854517
git: Execution ended after 0:00:00.886781410

commit f1cfa5bba9824374d769e312381d8f5d85a417bc
Author: Wim Taymans <wtaymans@redhat.com>
Date:   Fri Dec 5 12:01:21 2014 +0100

    orcc: allow setting custom backup function
    
    Add a new .backup keyword that instructs the orc compiler to call our
    custom backup function instead of generating one. This is interesting if
    the generated backup function is slower than a plain C implementation.
Comment 7 Sebastian Dröge (slomo) 2015-03-15 14:36:05 UTC
Everything seems to be fast enough here now. The main bottleneck back then was the chroma subsampling, which is basically the same code as videoscale, and that is now as fast as before or faster. In general we also have far more fast paths now.
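For reference, the kind of chroma subsampling needed for I420 output can be sketched in plain C as a 2x2 box filter over a full-resolution chroma plane. This is a simplified illustration under stated assumptions (even dimensions, plain averaging, no chroma siting), not the videoconvert or videoscale implementation; the function name chroma_downsample_2x2 is made up for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Average each 2x2 block of a full-resolution chroma plane into a single
 * sample (4:4:4 -> 4:2:0). Width and height must be even. Strides are in
 * bytes. */
void chroma_downsample_2x2 (const uint8_t *src, size_t src_stride,
    uint8_t *dst, size_t dst_stride, size_t width, size_t height)
{
  assert (src != NULL && dst != NULL);
  assert (width % 2 == 0 && height % 2 == 0);

  for (size_t y = 0; y < height; y += 2) {
    const uint8_t *top = src + y * src_stride;
    const uint8_t *bottom = top + src_stride;
    uint8_t *out = dst + (y / 2) * dst_stride;

    for (size_t x = 0; x < width; x += 2) {
      unsigned sum = top[x] + top[x + 1] + bottom[x] + bottom[x + 1];
      out[x / 2] = (uint8_t) ((sum + 2) >> 2);   /* rounded average */
    }
  }
}
```

The function would be called once for the U plane and once for the V plane. The inner loop touches four input samples per output sample, so it is a natural candidate for the fast paths mentioned above.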