Bug 753591 - vaapi: encoders: implement "direct-uploading"

Status: RESOLVED FIXED
Product: GStreamer
Classification: Platform
Component: gstreamer-vaapi
Version: unspecified
OS: Other Linux
Importance: Normal enhancement
Target Milestone: 1.11.1
Assigned To: GStreamer Maintainers
QA Contact: GStreamer Maintainers
Depends on:
Blocks:
Reported: 2015-08-13 15:16 UTC by Víctor Manuel Jáquez Leal
Modified: 2016-11-22 14:31 UTC
See Also:
GNOME target: ---
GNOME version: ---

Description Víctor Manuel Jáquez Leal 2015-08-13 15:16:28 UTC
As decoders may handle "direct-rendering" by using vaDeriveImage (not enabled by default) when processing output buffers, we should implement "direct-uploading", using vaDeriveImage to upload the raw buffers to be encoded.

This is one of the conclusions from working on bug #744042.
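
For context, "direct uploading" at the libva level boils down to roughly the following (a minimal sketch in plain VA-API calls; upload_raw_frame is an illustrative helper, not gstreamer-vaapi code, and the flat memcpy assumes the source layout already matches the derived image's pitches/offsets):

#include <string.h>
#include <glib.h>
#include <va/va.h>

/* Sketch: write one raw frame straight into a VA surface through its
 * derived image, instead of going through vaCreateImage()/vaPutImage().
 * Error handling is trimmed for brevity. */
static VAStatus
upload_raw_frame (VADisplay dpy, VASurfaceID surface,
    const guint8 * src, guint src_size)
{
  VAImage image;
  void *plane = NULL;

  /* The derived image aliases the surface memory: no extra copy in VA. */
  if (vaDeriveImage (dpy, surface, &image) != VA_STATUS_SUCCESS)
    return VA_STATUS_ERROR_OPERATION_FAILED;

  vaMapBuffer (dpy, image.buf, &plane);
  /* Real code copies plane by plane, honouring image.pitches[] and
   * image.offsets[]; a flat memcpy only works when layouts match. */
  memcpy (plane, src, MIN (src_size, image.data_size));
  vaUnmapBuffer (dpy, image.buf);

  vaDestroyImage (dpy, image.image_id);
  return VA_STATUS_SUCCESS;
}

Compared to the default upload path (vaCreateImage + vaPutImage), this saves a full-frame copy, which is where the expected CPU savings come from.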
Comment 1 sreerenj 2016-03-24 16:54:48 UTC
Moving to Product:GStreamer, Component:gstreamer-vaapi
Comment 2 Víctor Manuel Jáquez Leal 2016-10-18 17:01:00 UTC
Right now, disabling USE_NATIVE_FORMATS activates the usage of vaDeriveImage:

diff --git a/gst/vaapi/gstvaapivideomemory.c b/gst/vaapi/gstvaapivideomemory.c
index 3b2b4d8..271b738 100644
--- a/gst/vaapi/gstvaapivideomemory.c
+++ b/gst/vaapi/gstvaapivideomemory.c
@@ -37,7 +37,7 @@ GST_DEBUG_CATEGORY_STATIC (gst_debug_vaapivideomemory);
 #endif

 /* Defined if native VA surface formats are preferred over direct rendering */
-#define USE_NATIVE_FORMATS 1
+#define USE_NATIVE_FORMATS 0

 /* ------------------------------------------------------------------------ */
 /* --- GstVaapiVideoMemory                                              --- */


Nonetheless, right now it only works for some streams, and only when using vaapi<codec>dec alone (in other words, with vaapipostproc and vaapidecodebin disabled).

Some streams are not rendered correctly because of the difference between the negotiation caps and the allocation caps; in this case the allocation caps are assumed to be the same as the negotiation caps.

When vaapipostproc is enabled, everything fails. It seems that our direct-rendering logic is not valid for the vaapipostproc case.

*In the case of encoding*  (videotestsrc ! vaapih264enc) I got this error log:

0:00:51.658394439 27976       0x81b4a0 DEBUG                  vaapi gstvaapiencoder_objects.c:517:gst_vaapi_enc_picture_encode: encode picture 0x04000009
0:00:51.658423732 27976       0x81b4a0 DEBUG                  vaapi gstvaapiutils.c:53:vaapi_check_status: vaBeginPicture(): surface is in use
0:00:51.658450228 27976       0x81b4a0 ERROR                  vaapi gstvaapiencoder.c:347:gst_vaapi_encoder_put_frame: failed to encode frame (status = -1)
0:00:51.658487580 27976       0x81b4a0 ERROR            vaapiencode gstvaapiencode.c:558:gst_vaapiencode_handle_frame: failed to encode frame 0 (status -1)

This is interesting, since the upstream element could map the derived image correctly and copy the data into it. But when the encoder encodes, it cannot read the input surface.
Comment 3 Víctor Manuel Jáquez Leal 2016-10-19 11:20:24 UTC
Got a hackish version that works only for 3 frames.

We need to destroy the derived image after unmapping, otherwise the driver will assume it is busy. As the documentation says:

An image created with vaDeriveImage should be freed with vaDestroyImage. The image and image buffer structures will be destroyed; however, the underlying surface will remain unchanged until freed with vaDestroySurfaces.

Right now, I force the destruction of the derived image after unmapping and it works, as long as the buffer is not reused, since we would need to recreate the derived image when mapping again.

Still, we ought to redesign our gstvaapimeta a bit to handle this use case. The previous design was only for downstream and decoding, where the image is not processed by VA again because it is going to be rendered. The workflow here is different.
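
In code, the ordering described above is roughly this (a sketch using a hypothetical helper; the actual change lives in the unmap path of gstvaapivideomemory):

#include <va/va.h>

/* Hypothetical helper called once the writer is done with the mapping:
 * unmap the derived image and destroy it right away so the driver stops
 * considering the surface busy.  The surface itself stays valid until
 * vaDestroySurfaces() is called elsewhere. */
static void
release_derived_image (VADisplay dpy, VAImage * image)
{
  vaUnmapBuffer (dpy, image->buf);         /* 1. end CPU access               */
  vaDestroyImage (dpy, image->image_id);   /* 2. free image + image buffer    */
  image->image_id = VA_INVALID_ID;         /* 3. forces re-derive on next map */
}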
Comment 4 Víctor Manuel Jáquez Leal 2016-10-21 10:07:19 UTC
I have pushed a branch, in my repository, with direct upload working:

https://github.com/ceyusa/gstreamer-vaapi/commits/753591

Using this pipeline:

gst-launch-1.0 videotestsrc ! video/x-raw, width=1920, height=1080 ! vaapih264enc ! fakesink

Without the patches, the CPU usage is ~76%
With the patches, the CPU usage is ~69%

The last patch is for enabling direct rendering, but I see a couple of problems: VPP fails to process surfaces with derived images (vaDeriveImage) and, when the allocation caps are different from the negotiation caps, the sink cannot map the frame.
Comment 5 sreerenj 2016-10-21 11:00:27 UTC
(In reply to Víctor Manuel Jáquez Leal from comment #4)
> I have pushed a branch, in my repository, with direct upload working:
> 
> https://github.com/ceyusa/gstreamer-vaapi/commits/753591

Please ask Josep Torra whether your branch gives better performance with his 4K encode test cases. If what we understood from the analysis done during GstConf is correct, it is supposed to give a significant performance improvement when encoding non-tiled surfaces.

Comment 6 Víctor Manuel Jáquez Leal 2016-10-21 15:02:00 UTC
I have updated the patches to re-enable direct rendering.

Rendering a 4K video on my Haswell (resized to full HD by vaapipostproc):

Without the patches, the CPU usage is ~17%
With the patches, the CPU usage is ~13%

Using xvimagesink as sink.
Comment 7 Josep Torra Valles 2016-10-21 16:55:21 UTC
(In reply to sreerenj from comment #5)
> Please ask Josep Torra whether your branch gives better performance with
> his 4K encode test cases. If what we understood from the analysis done
> during GstConf is correct, it is supposed to give a significant performance
> improvement when encoding non-tiled surfaces.

We are not yet there: the change introduced here causes both the dmabuf and non-dmabuf cases to go into the intel_encoder_end_picture -> intel_encoder_check_yuv_surface -> i965_image_processing path.

It's a relative improvement, but it seems the surface isn't created as a tiled surface.
Comment 8 Víctor Manuel Jáquez Leal 2016-10-21 17:07:05 UTC
(In reply to Josep Torra Valles from comment #7)
> We are not yet there: the change introduced here causes both the dmabuf
> and non-dmabuf cases to go into the intel_encoder_end_picture ->
> intel_encoder_check_yuv_surface -> i965_image_processing path.
> 
> It's a relative improvement, but it seems the surface isn't created as a
> tiled surface.

Cool.

A further test is to change alloc_flag to 0 in

https://github.com/ceyusa/gstreamer-vaapi/commit/7bb4a127cefb9273de443454dc27a4560dbc4752#diff-0f026fb085abba2cfd45eabad1f57be3R538

This way, instead of allocating linear surfaces, tiled surfaces are allocated.
Comment 9 Josep Torra Valles 2016-10-21 18:17:11 UTC
I can't reach github now but I did the following change:

diff --git a/gst/vaapi/gstvaapipluginbase.c b/gst/vaapi/gstvaapipluginbase.c
index 7b39d34..8ce5e38 100644
--- a/gst/vaapi/gstvaapipluginbase.c
+++ b/gst/vaapi/gstvaapipluginbase.c
@@ -535,7 +535,7 @@ ensure_sinkpad_allocator (GstVaapiPluginBase * plugin, GstVideoInfo * vinfo)
 
     if (gst_caps_is_video_raw (plugin->sinkpad_caps)) {
       usage_flag = GST_VAAPI_IMAGE_USAGE_FLAG_DIRECT_UPLOAD;
-      alloc_flag = GST_VAAPI_SURFACE_ALLOC_FLAG_LINEAR_STORAGE;
+      //alloc_flag = GST_VAAPI_SURFACE_ALLOC_FLAG_LINEAR_STORAGE;
       GST_INFO_OBJECT (plugin, "enabling direct upload in sink allocator");
     }


It almost doubles encoder performance.

Being able to stream 2x4K@28 and 3x4K@22 when without it I can just 1x4K@28 in my hardware without using LP mode. I can't test LP Mode yet because the gpu hangs.
Comment 10 Víctor Manuel Jáquez Leal 2016-10-21 19:46:51 UTC
(In reply to Josep Torra Valles from comment #9)
> I can't reach github now but I did the following change:

Yeah, that's what I meant.

> It almost doubles encoder performance.
> 
> Being able to stream 2x4K@28 and 3x4K@22 when without it I can just 1x4K@28
> in my hardware without using LP mode. I can't test LP Mode yet because the
> gpu hangs.

:)
Comment 11 sreerenj 2016-10-24 14:02:45 UTC
(In reply to Josep Torra Valles from comment #9)
> It almost doubles encoder performance.
> 
> Being able to stream 2x4K@28 and 3x4K@22 when without it I can just 1x4K@28
> in my hardware without using LP mode. 

Good, but not enough ;),,,
I guess you are testing without dmabuf, right?
Wondering what the performance gap is between dmabuf and non-dmabuf mode...

Comment 12 Víctor Manuel Jáquez Leal 2016-10-24 16:31:06 UTC
(In reply to sreerenj from comment #11)
> Good, but not enough ;),,,

Josep asked me if we are going to push this for 1.10. @Sree, what do you think?
IMO if we push this, we should also merge Julien's work for dmabuf to downstream.

> I guess you are testing without dmabuf, right?
> Wondering what the performance gap is between dmabuf and non-dmabuf mode...

In the case of dmabuf to upstream, I don't see more room for improvement (perhaps I'm missing something), other than disabling the linear storage flag in the allocator creation.
Comment 13 sreerenj 2016-10-28 10:44:44 UTC
(In reply to Víctor Manuel Jáquez Leal from comment #12)
> Josep asked me if we are going to push this for 1.10. @Sree, what do you
> think?
> IMO if we push this, we should also merge Julien's work for dmabuf to
> downstream.

Hm, it could be too late for 1.10... I would prefer not to push it for 1.10... Also, this requires regression testing on multiple platforms.

Comment 14 Víctor Manuel Jáquez Leal 2016-10-28 11:41:58 UTC
(In reply to sreerenj from comment #13)
> Hm, it could be too late for 1.10... I would prefer not to push it for
> 1.10... Also, this requires regression testing on multiple platforms.

I agree.
Comment 15 Víctor Manuel Jáquez Leal 2016-11-04 13:12:46 UTC
This branch has been merged into master:

* 202110b plugins: direct render when raw video
* faebca3 plugins: move src allocator error to instantiator
* 02f16e8 plugins: enable direct upload if raw video
* 1605a26 pluginutil: add gst_caps_is_video_raw()
* e0d73d6 plugins: receive caps in ensure_sinkpad_allocator()
* 2761e47 vaapivideomemory: destroy derived image at unmap
* ce8b148 vaapivideomemory: enhance logs for direct modes
* 7c69226 vaapivideomemory: add direct upload flag
* e988298 vaapivideomemory: set direct rendering at run-time
* 1bc5f00 vaapivideomemory: log in perf category when copy
* 7e90ae4 vaapivideomemory: error log is derive image fails
* dcbd411 vaapivideomemory: store surface alloc flags in qdata
* 4016388 vaapivideomemory: category init when object define

Closing
Comment 16 Florent Thiéry 2016-11-07 10:28:41 UTC
Can you explain the comment on 202110bded181192c1b571d51968d2b15cbe7478 "Pass also the caps, since they are needed to know the requested caps features." (e.g. sample pipeline).
Comment 17 Víctor Manuel Jáquez Leal 2016-11-07 10:33:58 UTC
(In reply to Florent Thiéry from comment #16)
> Can you explain the comment on 202110bded181192c1b571d51968d2b15cbe7478
> "Pass also the caps, since they are needed to know the requested caps
> features." (e.g. sample pipeline).

Sure.

I added the parameter GstCaps *caps to the function ensure_srcpad_allocator(), even though we are already passing the GstVideoInfo created from those caps, just to know whether the caps are raw video on system memory, in order to enable direct rendering.
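
Roughly, that check boils down to something like this (an approximation for illustration only; the actual gst_caps_is_video_raw() helper added in commit 1605a26 may differ in its details):

#include <gst/gst.h>

/* Approximation: TRUE when the first caps structure is video/x-raw backed
 * by plain system memory (i.e. no memory:VASurface or memory:DMABuf caps
 * feature), which is the case where direct upload/rendering of raw
 * buffers makes sense. */
static gboolean
caps_is_raw_video_in_system_memory (const GstCaps * caps)
{
  GstStructure *structure;
  GstCapsFeatures *features;

  if (!caps || gst_caps_is_empty (caps) || gst_caps_is_any (caps))
    return FALSE;

  structure = gst_caps_get_structure (caps, 0);
  if (!gst_structure_has_name (structure, "video/x-raw"))
    return FALSE;

  features = gst_caps_get_features (caps, 0);
  return features == NULL ||
      gst_caps_features_contains (features,
      GST_CAPS_FEATURE_MEMORY_SYSTEM_MEMORY);
}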

Do you have any issues with this feature?
Comment 18 Florent Thiéry 2016-11-07 10:41:49 UTC
Does that imply anything with regard to the pipeline syntax?

Regarding issues, I don't know yet; I just started testing. My first quick test seemed to show a regression, so I'm running more tests.

Are there any restrictions on the colorspace in order to take advantage of this new feature (e.g. NV12)?
Comment 19 Víctor Manuel Jáquez Leal 2016-11-07 11:08:53 UTC
(In reply to Florent Thiéry from comment #18)
> Does that imply anything with regard to the pipeline syntax?

Nope.

> Regarding issues, I don't know yet; I just started testing. My first quick
> test seemed to show a regression, so I'm running more tests.

OK. Let us know. Perhaps we'll roll back this feature, or guard it behind an envvar (roughly as sketched below).

> Are there any restrictions on the colorspace in order to take advantage of
> this new feature (e.g. NV12)?

Normally NV12 would be the best. But, depending on the backend, some color transformation can be done.
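
Regarding the envvar guard mentioned above, a minimal sketch of what it could look like (GST_VAAPI_DISABLE_DIRECT_UPLOAD is a made-up name; nothing reads it today):

#include <glib.h>

/* Hypothetical escape hatch: enable the direct-upload path only when the
 * (made-up) environment variable GST_VAAPI_DISABLE_DIRECT_UPLOAD is unset,
 * so users could fall back to the copy path if they hit regressions. */
static gboolean
direct_upload_allowed (void)
{
  return g_getenv ("GST_VAAPI_DISABLE_DIRECT_UPLOAD") == NULL;
}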
Comment 20 Florent Thiéry 2016-11-07 17:16:48 UTC
I can't rely too much on the encoding numbers because of bug #773546, which prevents me from doing many runs, but so far GT2-SKL is seeing a big improvement for the black 1080p pattern case (+40%) in low-power mode, +20% for smpte and +1% for snow. PG (non-low-power) mode also benefits from this (~10%). I think GT3e-SKL may have regressed between 0% and 5% for all cases. Is there some kind of compression involved in this patch?

The improvement is less visible for the 4K test samples (there is even a -5% regression in low-power mode for smpte and snow, but +15% for black).

If anyone is interested in running the same tests on other hardware, try:

https://github.com/UbiCastTeam/gstreamer-benchmarks

./run_tests.py -t tests/encoding-h264-nv12.py
Comment 21 Florent Thiéry 2016-11-08 16:47:30 UTC
Now that the leak seems plugged, I could run my tests reliably and I trust the numbers more.

On GT2, I see a nice performance increase for PG mode (tune=0), especially when using 2 encoders ("channels"), but regressions in low-power mode as soon as we use more encoders.

Numbers below are in Mpix/s, so divide by ~2 to get 1080p fps (e.g. 310 Mpix/s ≈ 150 fps).

+-----------+----------+------+--------+---------+
|   tune    | encoders | 1.10 | master |  delta  |
+-----------+----------+------+--------+---------+
| low-power |        1 |  310 |    344 | 10,98%  |
| low-power |        2 |  379 |    355 | -6,41%  |
| low-power |        3 |  389 |    349 | -10,12% |
| none      |        1 |  196 |    214 | 9,35%   |
| none      |        2 |  224 |    273 | 22,06%  |
| none      |        3 |  237 |    281 | 18,54%  |
+-----------+----------+------+--------+---------+

On GT3e it's much better than without the patch; GT3e is finally faster than GT2 (~200 fps for 1 encoder).

+-----------+----------+------+--------+--------+
|   tune    | encoders | 1.10 | master | delta  |
+-----------+----------+------+--------+--------+
| low-power |        1 |  274 |    413 | 50,55% |
| low-power |        2 |  321 |    399 | 24,40% |
| low-power |        3 |  343 |    382 | 11,27% |
| none      |        1 |  176 |    232 | 31,88% |
| none      |        2 |  189 |    271 | 43,39% |
| none      |        3 |  198 |    278 | 40,57% |
+-----------+----------+------+--------+--------+

I find it odd that GT2 regresses on low-power tune with multiple encoders compared to before the patch. I'll try to see if this might be related to the leak plug.

Note that these values are an average over 3 patterns (black, smpte and snow); for instance, black peaks at 200 fps on GT2 and 250 fps on GT3.

+-----------+------------------------+----------+------------+-----------+----------+------------+-----------+
|   tune    |        pattern         | GT2 1.10 | GT2 master | GT2 delta | GT3 1.10 | GT3 master | GT3 delta |
+-----------+------------------------+----------+------------+-----------+----------+------------+-----------+
| low-power | pattern=black-1920*1ch |      344 |        401 | 16,57%    |      286 |        517 | 80,77%    |
| low-power | pattern=smpte-1920*1ch |      337 |        385 | 14,24%    |      311 |        496 | 59,49%    |
| low-power | pattern=snow-1920*1ch  |      248 |        245 | -1,21%    |      226 |        226 | 0,00%     |
+-----------+------------------------+----------+------------+-----------+----------+------------+-----------+

The snow case is unaffected.
Comment 22 Florent Thiéry 2016-11-09 09:47:52 UTC
I did some tests regarding the leak plug patch (https://bugzilla.gnome.org/show_bug.cgi?id=773546), and it does not change the parallel encoder performance, so the GT2 low-power regression with multiple encoders seems related to direct uploading.
Comment 23 Florent Thiéry 2016-11-22 14:31:02 UTC
I re-ran my tests comparing 1.10.1 and today's master (using 5 passes), and the regression on GT2 using multiple encoders seems gone (or marginal). Overall, GT2 performance seems virtually unchanged.

1080p GT3 performance does improve a lot:

+-------------------------------+--------------------------+------+-------+-------+
|   Encoder                     |  Sample                  |1.10.1|master |  vs   |
+===============================+==========================+======+=======+=======+
|   *vaapih264enc tune=none*    |  pattern=black-1920*1ch  |  93  |  127  |  37%  |
+-------------------------------+--------------------------+------+-------+-------+
|   *vaapih264enc tune=none*    |  pattern=smpte-1920*1ch  |  87  |  117  |  34%  |
+-------------------------------+--------------------------+------+-------+-------+
|   *vaapih264enc tune=none*    |  pattern=snow-1920*1ch   |  88  |  118  |  34%  |
+-------------------------------+--------------------------+------+-------+-------+
| *vaapih264enc tune=low-power* |  pattern=black-1920*1ch  | 133  |  294  | 121%  |
+-------------------------------+--------------------------+------+-------+-------+
| *vaapih264enc tune=low-power* |  pattern=smpte-1920*1ch  | 163  |  258  |  58%  |
+-------------------------------+--------------------------+------+-------+-------+
| *vaapih264enc tune=low-power* |  pattern=snow-1920*1ch   | 113  |  115  |  2%   |
+-------------------------------+--------------------------+------+-------+-------+

However, GT2 is still a bit faster for 4K (tune=none), and tune=low-power seems identical:

+-------------------------------+--------------------------+------+-------+---------+
|   Encoder                     |  Sample                  |GT2   |GT3    |  vs     |
+===============================+==========================+======+=======+=========+
|   *vaapih264enc tune=none*    |        *3840*1ch*        |  35  |  31   | -11,43% |
+-------------------------------+--------------------------+------+-------+---------+
|   *vaapih264enc tune=none*    |        *3840*2ch*        |  43  |  40   | -7,75%  |
+-------------------------------+--------------------------+------+-------+---------+
|   *vaapih264enc tune=none*    |        *3840*3ch*        |  42  |  40   | -4,76%  |
+-------------------------------+--------------------------+------+-------+---------+
| *vaapih264enc tune=low-power* |        *3840*1ch*        |  45  |  44   | -1,48%  |
+-------------------------------+--------------------------+------+-------+---------+
| *vaapih264enc tune=low-power* |        *3840*2ch*        |  45  |  46   |  2,21%  |
+-------------------------------+--------------------------+------+-------+---------+
| *vaapih264enc tune=low-power* |        *3840*3ch*        |  46  |  46   |  0,00%  |
+-------------------------------+--------------------------+------+-------+---------+

Let me know if I can help further.