GNOME Bugzilla – Bug 734043
videoconvert: add Orc optimization for I420 to BGRA for x86 [32 bit]
Last modified: 2018-05-01 09:50:31 UTC
When testing conversion of I420->BGRA, I see fast performance with gstreamer x86_64. However, gstreamer x86 performs much slower and posts the warning:

ORC: WARNING: orccompiler.c(382): orc_program_compile_full(): program video_convert_orc_convert_I420_BGRA failed to compile, reason: register overflow for vector reg

Is it possible to get the Orc optimization on the 32 bit build?

Test pipeline:
gst-launch-1.0 videotestsrc ! video/x-raw,format=I420,width=1280,height=1024 ! videoconvert ! video/x-raw,format=BGRA ! fakesink sync=true
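For anyone who wants to check their own logs for the same failure, here is a minimal sketch that extracts the program name and reason from warnings of this shape. The regular expression is inferred only from the single warning line quoted in this report, and `find_orc_compile_failures` is a hypothetical helper name, not part of Orc or GStreamer.

```python
import re

# Pattern inferred from the one warning line quoted in this report;
# other Orc versions may format their log lines differently.
ORC_COMPILE_FAILURE = re.compile(
    r"ORC: WARNING: (?P<file>[\w.]+)\((?P<line>\d+)\): "
    r"(?P<func>\w+)\(\): program (?P<program>\w+) failed to compile, "
    r"reason: (?P<reason>.+)"
)


def find_orc_compile_failures(log_text):
    """Return (program, reason) pairs for each compile-failure warning."""
    return [
        (m.group("program"), m.group("reason").strip())
        for m in ORC_COMPILE_FAILURE.finditer(log_text)
    ]


# The warning text from this report, used as sample input.
sample = (
    "ORC: WARNING: orccompiler.c(382): orc_program_compile_full(): "
    "program video_convert_orc_convert_I420_BGRA failed to compile, "
    "reason: register overflow for vector reg\n"
)

for program, reason in find_orc_compile_failures(sample):
    print(program, "->", reason)
```

Any program reported this way presumably runs on a slower non-compiled path, which would match the slowdown described above.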
It would be useful to also provide your CPU capabilities (e.g. just attach /proc/cpuinfo). Intel CPUs are not all equal in their ability to do SIMD (and hence in how much they gain from ORC). Just being 64-bit already allows for faster operation, around 2x for most pixel operations.
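On a Linux host, the relevant part of /proc/cpuinfo is the `flags` line. A minimal sketch for pulling out the SIMD-related entries follows; `simd_flags` is a hypothetical helper, the sample text is made up for illustration, and the flag list is only a selection (Linux reports SSE3 as `pni`).

```python
def simd_flags(cpuinfo_text,
               interesting=("mmx", "sse", "sse2", "pni", "ssse3",
                            "sse4_1", "sse4_2", "avx", "avx2")):
    """Return the SIMD flags from `interesting` found on the first 'flags' line."""
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "flags":
            present = set(value.split())
            return [flag for flag in interesting if flag in present]
    return []  # no 'flags' line (e.g. not an x86 Linux cpuinfo dump)


# Made-up sample; on a real machine, read the file instead:
#   with open("/proc/cpuinfo") as f: text = f.read()
sample_cpuinfo = (
    "processor\t: 0\n"
    "flags\t\t: fpu mmx sse sse2 pni ssse3 sse4_1 sse4_2 avx\n"
)
print(simd_flags(sample_cpuinfo))
```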
We've been running these tests on a Windows 7 laptop and creating a browser plugin that restricts us to using a 32-bit build of gstreamer. Below is the processor information.

Processor 1             ID = 0
Number of cores         4 (max 8)
Number of threads       8 (max 16)
Name                    Intel Core i7 2860QM
Codename                Sandy Bridge
Specification           Intel(R) Core(TM) i7-2860QM CPU @ 2.50GHz
Package (platform ID)   Socket 988B rPGA (0x4)
CPUID                   6.A.7
Extended CPUID          6.2A
Core Stepping           D2
Technology              32 nm
TDP Limit               45 Watts
Tjmax                   100.0 °C
Core Speed              2394.3 MHz
Multiplier x Bus Speed  24.0 x 99.8 MHz
Stock frequency         2500 MHz
Instructions sets       MMX, SSE, SSE2, SSE3, SSSE3, SSE4.1, SSE4.2, EM64T, VT-x, AES, AVX
L1 Data cache           4 x 32 KBytes, 8-way set associative, 64-byte line size
L1 Instruction cache    4 x 32 KBytes, 8-way set associative, 64-byte line size
L2 cache                4 x 256 KBytes, 8-way set associative, 64-byte line size
L3 cache                8 MBytes, 16-way set associative, 64-byte line size
FID/VID Control         yes
Turbo Mode              supported, enabled
Max non-turbo ratio     25x
Max turbo ratio         36x
Max efficiency ratio    8x
Max Power               72 Watts
Min Power               36 Watts
O/C bins                none
Ratio 1 core            36x
Ratio 2 cores           35x
Ratio 3 cores           33x
Ratio 4 cores           33x
TSC                     2494.6 MHz
APERF                   3293.1 MHz
MPERF                   2494.5 MHz
This fails on 32-bit with orc git, orc 0.4.22 and 0.4.18. I have a feeling the issue was introduced by the following commit:

commit 14b5999bca16d9ac18bdcd5905c472bec2fe247e
Author: Wim Taymans <wtaymans@redhat.com>
Date:   Thu Jan 9 18:12:00 2014 +0100

    videoconvert: rework YUV->RGB fastpaths

    Rework the orc code to be around 10% faster and support arbitrary
    matrices. Pass the matrix parameters to the YUV->RGB functions to make
    them work for all matrices. This enables more and faster fastpath
    conversions.

    See https://bugzilla.gnome.org/show_bug.cgi?id=721701
Created attachment 288906: Orc logs for this bug
orc version 0.4.18, gst version 1.4
Hi, Any update on this? Regards, Eric T
So there are two ways to fix this:
1) Implement Orc code that doesn't use too many registers (so it works on architectures with fewer registers than x86-64).
2) Implement register spilling in Orc (i.e. use memory when we exceed the number of available registers).
I'm not 100% sure we can achieve the same quality/speed results with 1) across all platforms. Maybe Wim has some feedback on this. 2) isn't as trivial as it sounds (how do you figure out which is the *right* register to spill to main memory?).
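To make the "which register to spill" question concrete, here is a toy model of one common heuristic: when no register is free, evict the value whose next use is furthest away. This is only an illustrative sketch and not Orc code; Orc has no spilling, per the comment above. The function name `allocate_with_spills` and the single-letter temporaries are hypothetical, and a real implementation would also have to weigh load/store cost and loop structure.

```python
def allocate_with_spills(uses, num_regs):
    """Simulate register allocation over a straight-line sequence of uses.

    uses:     variable names in the order the instructions touch them
    num_regs: number of available registers
    Returns a list of (position, spilled_variable) events.
    """
    in_regs = set()
    spills = []
    for pos, var in enumerate(uses):
        if var in in_regs:
            continue  # already resident, nothing to do
        if len(in_regs) >= num_regs:
            def next_use(v):
                # Index of the next use of v after pos, or infinity if none.
                for later in range(pos + 1, len(uses)):
                    if uses[later] == v:
                        return later
                return float("inf")

            # Evict the resident value that is needed furthest in the future
            # (sorted() only makes tie-breaking deterministic).
            victim = max(sorted(in_regs), key=next_use)
            in_regs.remove(victim)
            spills.append((pos, victim))
        in_regs.add(var)
    return spills


sequence = list("abcabda")  # four distinct temporaries
print(allocate_with_spills(sequence, 2))  # too few registers: spills occur
print(allocate_with_spills(sequence, 4))  # enough registers: no spills
```

The same sequence needs no spills once the register count covers all live temporaries, which is the situation on x86-64 (16 XMM registers) versus 32-bit x86 (8 XMM registers).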
No activity for 4 years. Only applies to 32-bit x86 machines. Closing. Re-open if a patch can be provided to fix this issue.