GNOME Bugzilla – Bug 721701
videoconvert: I420 to BGRA conversion is slower than in 0.10
Last modified: 2014-04-12 09:45:09 UTC
It appear that version 1.2.0-1.2.2 and perhaps previous versions introduce a dramatic performance penalty for video scaling and video format conversion of I420 to BGRA when compared to version 0.10.36. This performance penalty is present for common Ubuntu installations ranging from 12.04 until 13.10 as well as with a newly build version of GStreamer 1.2.2 with orc-0.4.18. It is checked that Orc is enabled when building gst-plugins-base. The following examples are all executed on Ubuntu 13.10 with GStreamer 1.2.2 for version 1.0 and GStreamer 0.10.36 for version 0.10. The system is a dual Xeon Quad core configuration with a total of 8 cores. First we create two test samples of 60 seconds each (just hit ctrl-c after running 60 seconds: $ gst-launch-1.0 -e -v videotestsrc pattern=18 is-live=true ! 'video/x-raw, width=1024, height=576, format=I420' ! videoconvert ! x264enc ! 'video/x-h264, profile=main' ! avimux ! filesink location=video1024x576.mp4 $ gst-launch-1.0 -e -v videotestsrc pattern=18 is-live=true ! 'video/x-raw, width=1280, height=720, format=I420' ! videoconvert ! x264enc ! 'video/x-h264, profile=main' ! avimux ! filesink location=video1280x720.mp4 Now we start top in another window to check cpu load and try decode and convert the latter file with 0.10 and 1.0 $ /usr/bin/gst-launch-0.10 -v filesrc location=./video1280x720.mp4 do-timestamp=true ! decodebin2 name=decoder ! ffmpegcolorspace ! 'video/x-raw-rgb, bpp=(int)32, depth=(int)32, endianness=(int)4321, red_mask=(int)65280, green_mask=(int)16711680, blue_mask=(int)-16777216, width=(int)1280, height=(int)720' ! queue ! fakesink silent=true sync=true and $ /usr/local/bin/gst-launch-1.0 -v filesrc location=./video1280x720.mp4 do-timestamp=true ! decodebin name=decoder ! videoconvert ! 'video/x-raw, format=(string)BGRA, width=(int)1280, height=(int)720' ! queue ! fakesink silent=true sync=true On my system, the 0.10 version will happily run at roughly 24-25 % CPU load while the 1.0 version runs at roughly 51% Now we add videoscaling by using the first file we created and scale it before converting it: $ /usr/bin/gst-launch-0.10 -v filesrc location=./video1024x576.mp4 do-timestamp=true ! decodebin2 name=decoder ! videoscale ! ffmpegcolorspace ! 'video/x-raw-rgb, bpp=(int)32, depth=(int)32, endianness=(int)4321, red_mask=(int)65280, green_mask=(int)16711680, blue_mask=(int)-16777216, width=(int)1280, height=(int)720' ! queue ! fakesink silent=true sync=true and $ /usr/local/bin/gst-launch-1.0 -v filesrc location=./video1024x576.mp4 do-timestamp=true ! decodebin name=decoder ! videoscale ! videoconvert ! 'video/x-raw, format=(string)BGRA, width=(int)1280, height=(int)720' ! queue ! fakesink silent=true sync=true On my system, the 0.10 version will happily run at roughly 33-34 % CPU load while the 1.0 version runs at roughly 101% Here is what I see when building the plugin base: configure: *** Plug-ins without external dependencies that will be built: adder app audioconvert audiorate audioresample audiotestsrc encoding gio playback subparse tcp typefind videoconvert videorate videoscale videotestsrc volume configure: *** Plug-ins without external dependencies that will NOT be built: configure: *** Plug-ins that have NOT been ported: configure: *** Plug-ins with dependencies that will be built: alsa cdparanoia ivorbisdec libvisual ogg pango theora vorbis ximagesink xvimagesink configure: *** Plug-ins with dependencies that will NOT be built: configure: *** Orc acceleration enabled. So I assume Orc acceleration for video scale and videoconvert is enabled.
To complete a bit more the report, I was able to reproduce the performance issue with an Intel i5 x86_64 using Ubuntu's GStreamer, but I was unable to reproduce it with the developers ppa.
I420 -> BGRA is not implemented correctly in 0.10, it uses the fast path but with the wrong color matrix and wrong chroma upsampling. In 1.0, it is implemented correctly but uses a slow path. Implementing a fast path with wrong upsampling would easy to add.
(In reply to comment #2) > I420 -> BGRA is not implemented correctly in 0.10, it uses the fast path but > with the wrong color matrix and wrong chroma upsampling. > > In 1.0, it is implemented correctly but uses a slow path. Implementing a fast > path with wrong upsampling would easy to add. But how do you explain that using the developers ppa instead of the default version provided with Ubuntu was on par with 0.10? I did run some tests with cachegrind comparing 0.10, 1.0 from Ubuntu and 1.2 from the developers ppa. 0.10 and 1.2 from the ppa was more or less the same number of CPU cycles (around 20M for the colorspace conversion function) while the version from Ubuntu was using 50M cycles more the colorspace conversion.
In 1.2.2 there was some changes that may have has an impact on the fast path. It is most likely related. Note that all this depends on proper build of ORC support. commit ed8460519f1571bc4a1ca079847a67f077d1efaf Author: Sebastian Dröge <sebastian@centricular.com> Date: Mon Dec 23 14:54:02 2013 +0100 videoconvert: Fix I420 to BGRA fast-path alpha setting This fast-path was adding 128 to every component including alpha while it should only be done for all components except alpha. This caused wrong alpha values to be generated. Also remove the high-quality I420 to BGRA fast-path as it needs the same fix, which causes an additional instruction, which causes orc to emit more than 96 variables, which then just crashes. This can only be fixed in orc by breaking ABI and allowing more variables.
The original reporter says that it also happens with 1.2.0. And that fix there adds exactly one ORC opcode per pixel, and as a result sets a non-random alpha value ;)
It appear i was a bit hasty in specifying 1.2.0 as introducing a performance penalty. Looking into it, version 1.2.0 only do so partly. The following pipeline for 1.2.0 consumes about 53%, which is roughly half of 1.2.2, but still nearly twice as much as for 0.10.36: $ /usr/local/bin/gst-launch-1.0 -v filesrc location=./video1280x720.mp4 do-timestamp=true ! decodebin name=decoder ! videoconvert ! 'video/x-raw, format=(string)BGRA, width=(int)1280, height=(int)720' ! queue ! fakesink silent=true sync=true However, the following pipeline only consumes roughly 25-29% which is only a little bit more than for 0.10.36 $ /usr/local/bin/gst-launch-1.0 -v filesrc location=./video1024x576.mp4 do-timestamp=true ! decodebin name=decoder ! videoscale ! videoconvert ! 'video/x-raw, format=(string)BGRA, width=(int)1280, height=(int)720' ! queue ! fakesink silent=true sync=true Best regards
(In reply to comment #6) The interesting part in previous comment is, the following: 1) Decoding 1280x720 and converting to BGRA is consuming twice as much CPU compared to 0.10.36 and half as much as compared to 1.2.2 2) Decoding 1024x576, then scaling to 1280x720 and then converting to BGRA is using only a little bit more CPU compared to 0.10.36 For version 1.0.10 the similar tests reveals: 1) Decoding 1280x720 and converting to BGRA is consuming 20% less CPU compared to 1.2.0, but still 80% more than 0.10.36 2) Decoding 1024x576, then scaling to 1280x720 and then converting to BGRA is using only a little bit more CPU compared to 0.10.36 similar to version 1.2.0 best regards Peter
This should bring it comparable to 0.10 commit 14b5999bca16d9ac18bdcd5905c472bec2fe247e Author: Wim Taymans <wtaymans@redhat.com> Date: Thu Jan 9 18:12:00 2014 +0100 videoconvert: rework YUV->RGB fastpaths Rework the orc code to be around 10% faster and support arbitrary matrices. Pass the matrix parameters to the YUV->RGB functions to make them work for all matrices. This enables more and faster fastpath conversions. See https://bugzilla.gnome.org/show_bug.cgi?id=721701 commit 5ca04cb79836ce817ce88189e97a1c41f3424686 Author: Wim Taymans <wtaymans@redhat.com> Date: Thu Jan 9 18:08:41 2014 +0100 videoconvert: fix I420 to BGRA fast-path some more Calculate alpha value differently so that we can avoid running out of registers.
This seems to have broken the build under jhbuild on fedora 20: CC libgstvideoconvert_la-videoconvert.lo videoconvert.c: In function 'convert_AYUV_ARGB': videoconvert.c:1209:7: error: too many arguments to function 'video_convert_orc_convert_AYUV_ARGB' width, height); ^ In file included from videoconvert.c:31:0: gstvideoconvertorc.h:113:6: note: declared here void video_convert_orc_convert_AYUV_ARGB (guint8 * ORC_RESTRICT d1, int d1_stride, const guint8 * ORC_RESTRICT s1, int s1_stride, int n, int m); ^ videoconvert.c: In function 'convert_AYUV_BGRA': videoconvert.c:1223:7: error: too many arguments to function 'video_convert_orc_convert_AYUV_BGRA' width, height); ^ In file included from videoconvert.c:31:0: gstvideoconvertorc.h:114:6: note: declared here void video_convert_orc_convert_AYUV_BGRA (guint8 * ORC_RESTRICT d1, int d1_stride, const guint8 * ORC_RESTRICT s1, int s1_stride, int n, int m); ^ videoconvert.c: In function 'convert_AYUV_ABGR': videoconvert.c:1237:7: error: too many arguments to function 'video_convert_orc_convert_AYUV_ABGR' width, height); ^ In file included from videoconvert.c:31:0: gstvideoconvertorc.h:115:6: note: declared here void video_convert_orc_convert_AYUV_ABGR (guint8 * ORC_RESTRICT d1, int d1_stride, const guint8 * ORC_RESTRICT s1, int s1_stride, int n, int m); ^ videoconvert.c: In function 'convert_AYUV_RGBA': videoconvert.c:1251:7: error: too many arguments to function 'video_convert_orc_convert_AYUV_RGBA' width, height); ^ In file included from videoconvert.c:31:0: gstvideoconvertorc.h:116:6: note: declared here void video_convert_orc_convert_AYUV_RGBA (guint8 * ORC_RESTRICT d1, int d1_stride, const guint8 * ORC_RESTRICT s1, int s1_stride, int n, int m); ^ videoconvert.c: In function 'convert_I420_BGRA': videoconvert.c:1268:9: error: too many arguments to function 'video_convert_orc_convert_I420_BGRA' width); ^ In file included from videoconvert.c:31:0: gstvideoconvertorc.h:117:6: note: declared here void video_convert_orc_convert_I420_BGRA (guint8 * ORC_RESTRICT d1, const guint8 * ORC_RESTRICT s1, const guint8 * ORC_RESTRICT s2, const guint8 * ORC_RESTRICT s3, int n); ^ make[1]: *** [libgstvideoconvert_la-videoconvert.lo] Error 1 make[1]: Leaving directory `/home/desrt/.cache/jhbuild/checkout/gst-plugins-base/gst/videoconvert' make: *** [all] Error 2
Builds for me in manual build, it's possible that the orc code didn't get regenerated, would make clean fix the issue ?
Seems that the problem is this: [desrt@moonpix videoconvert]$ make V=1 cp ./gstvideoconvertorc-dist.c tmp-orc.c cp ./gstvideoconvertorc-dist.h gstvideoconvertorc.h make all-am from the Makefile.am: # When 'make dist' is run at the top level, or 'make orc-update' # in a directory including this fragment, the generated source # files will be copied to $(ORC_SOURCE)-dist.[ch]. These files # should be checked in to git, since they are used if Orc is # disabled. The checked in -dist copies of these files are the old ones and I don't have ORC enabled, so I'm not rebuilding them. I guess you need to check in the updates?
Can this be closed now? Or do we miss some other fast-paths for this bug? commit 4ff4c1c10aebf0a1cf1623572d6d66864aebac60 Author: Sebastian Dröge <sebastian@centricular.com> Date: Fri Jan 10 09:21:08 2014 +0100 videoconvert: Update disted orc files
(In reply to comment #11) > I don't have ORC enabled, so I'm not rebuilding them. Note that without ORC many plugins will be quite slow, you really should build with ORC support.
gst-launch-0.10 -v videotestsrc num-buffers=1000 ! video/x-raw-yuv\,\ format\=\(fourcc\)I420\,\ width\=\(int\)1280\,\ height\=\(int\)720\,\ pixel-aspect-ratio\=\(fraction\)1/1\,\ interlace-mode\=\(string\)progressive\,framerate\=\(fraction\)30/1 ! ffmpegcolorspace ! 'video/x-raw-rgb, bpp=(int)32, depth=(int)24, endianness=(int)4321, red_mask=(int)65280, green_mask=(int)16711680, blue_mask=(int)-16777216, width=(int)1280, height=(int)720, framerate=(fraction)30/1, pixel-aspect-ratio=(fraction)1/1, interlace-mode=(string)progressive' ! fakesink silent=1 -v 0.10: Execution ended after 12328739773 ns. gst-launch-1.0 -v videotestsrc num-buffers=1000 ! video/x-raw\,\ format\=\(string\)I420\,\ width\=\(int\)1280\,\ height\=\(int\)720\,\ pixel-aspect-ratio\=\(fraction\)1/1\,\ interlace-mode\=\(string\)progressive\,\ colorimetry\=\(string\)bt709\,\ framerate\=\(fraction\)30/1 ! videoconvert ! 'video/x-raw, format=(string)BGRx,width=(int)1280, height=(int)720' ! fakesink 1.3 git: Execution ended after 0:00:11.873438183 gst-launch-1.0 -v videotestsrc num-buffers=1000 ! video/x-raw\,\ format\=\(string\)I420\,\ width\=\(int\)1280\,\ height\=\(int\)720\,\ pixel-aspect-ratio\=\(fraction\)1/1\,\ interlace-mode\=\(string\)progressive\,\ colorimetry\=\(string\)bt709\,\ framerate\=\(fraction\)30/1 ! videoconvert ! 'video/x-raw, format=(string)BGRx,width=(int)1280, height=(int)720' ! fakesink 1.2 git: Execution ended after 0:00:26.966103516
with scaling; gst-launch-0.10 -v videotestsrc num-buffers=1000 ! video/x-raw-yuv\,\ format\=\(fourcc\)I420\,\ width\=\(int\)1024\,\ height\=\(int\)576\,\ pixel-aspect-ratio\=\(fraction\)1/1\,\ interlace-mode\=\(string\)progressive\,framerate\=\(fraction\)30/1 ! videoscale ! ffmpegcolorspace ! 'video/x-raw-rgb, bpp=(int)32, depth=(int)24, endianness=(int)4321, red_mask=(int)65280, green_mask=(int)16711680, blue_mask=(int)-16777216, width=(int)1280, height=(int)720, framerate=(fraction)30/1, pixel-aspect-ratio=(fraction)1/1, interlace-mode=(string)progressive' ! fakesink silent=1 -v 0.10: Execution ended after 17131772544 ns. gst-launch-1.0 -v videotestsrc num-buffers=1000 ! video/x-raw\,\ format\=\(string\)I420\,\ width\=\(int\)1024\,\ height\=\(int\)576\,\ pixel-aspect-ratio\=\(fraction\)1/1\,\ interlace-mode\=\(string\)progressive\,\ colorimetry\=\(string\)bt709\,\ framerate\=\(fraction\)30/1 ! videoscale ! videoconvert ! 'video/x-raw, format=(string)BGRx,width=(int)1280, height=(int)720' ! fakesink 1.3 git: Execution ended after 0:00:15.854307283 gst-launch-1.0 -v videotestsrc num-buffers=1000 ! video/x-raw\,\ format\=\(string\)I420\,\ width\=\(int\)1024\,\ height\=\(int\)576\,\ pixel-aspect-ratio\=\(fraction\)1/1\,\ interlace-mode\=\(string\)progressive\,\ colorimetry\=\(string\)bt709\,\ framerate\=\(fraction\)30/1 ! videoscale ! videoconvert ! 'video/x-raw, format=(string)BGRx,width=(int)1280, height=(int)720' ! fakesink 1.2: Execution ended after 0:00:31.374607459
Let's consider this fixed then, new bug reports for other missing fast paths please
From file it's more dramatic: 0.10: video1280x720.mp4: Execution ended after 5664100050 ns. video1024x576.mp4: Execution ended after 4483866324 ns. 1.3 git: video1280x720.mp4: Execution ended after 0:00:02.899891548 video1024x576.mp4: Execution ended after 0:00:03.350581054 1.2.2: video1280x720.mp4: Execution ended after 0:00:09.539085249 video1024x576.mp4: Execution ended after 0:00:10.231766318 0.10 uses purely software for color conversion but uses orc for scaling.
This is still causing jhbuild to break. It seems that you've gone out of your way to allow the build to proceed without Orc, by allowing ./configure to succeed without it and checking generated files into git so that the build will work anyway. Those files have fallen out of date, causing the build to fail. If your intention is that Orc is hard-required to build from git then you should probably modify ./configure to require it and/or remove those generated files from git...
(In reply to comment #18) > This is still causing jhbuild to break. > > It seems that you've gone out of your way to allow the build to proceed without > Orc, by allowing ./configure to succeed without it and checking generated files > into git so that the build will work anyway. Those files have fallen out of > date, causing the build to fail. > > If your intention is that Orc is hard-required to build from git then you > should probably modify ./configure to require it and/or remove those generated > files from git... ORC is not a hard requirement, it's just generally a good idea to use it. The files in GIT are supposed to be kept up-to-date, and in tarballs they are guaranteed to be up-to-date (make dist cares for that). Is there still a build problem remaining?
Ryan, the generated files in git were updated this morning. Did you update your checkout?
I see the same orc issue, and I have slomo's commit from this morning. I tried a full autogen/make clean/make to no avail. In case it matters: $ orcc --version Orc Compiler 0.4.16.1 So it should be regenerating them I think, unless we require newer ?
*** Bug 721938 has been marked as a duplicate of this bug. ***
*sigh* There were changes to the .orc files after the version for which I updated the disted files. Next try: commit b21a67f99eae83c7ed2b40f7409f67d269d8a6f4 Author: Sebastian Dröge <sebastian@centricular.com> Date: Fri Jan 10 15:06:23 2014 +0100 videoconvert: Update disted orc files once again (also you need orc >= 0.4.18)
This last commit fixed the build here, thanks. Without even having to install some newer orc.
Indeed -- fixed with latest changes for me as well. Thanks!
Checked with GStreamer 1.2.3 and the issue of a hefty performance penalty is still there. Vincent and Ryan reported it as fixed, but apparently that did not make it into the 1.2.3? Just want to be sure it really is fixed in some future version. Regards Peter
The latter comments were about the build being broken, sorry if this was unclear. Looking at git, Wim's patches seem not to be in 1.2.3, but they are in master.
The 'target_milestone' field indicates where this is fixed. 1.3.1 means it's only in master, not the 1.2 branch. The patches were too invasive to cherry-pick IIRC.
Git Master works nicely. Thanks