GNOME Bugzilla – Bug 786054
Mapping VA-API buffers is slow
Last modified: 2017-11-01 10:24:16 UTC
If the number of decoded videos increases, the time spent in gst_buffer_map() increases dramatically. With 16 H.264 full-HD videos, each mapping takes more than 16 ms. We map the buffer to copy out the video data; the copy itself takes only about 1 ms. We assume the increase in mapping time is a bug, maybe a concurrency issue, and maybe inside the Intel VA-API driver rather than in the GStreamer code.

Setup:
- Intel Core i7
- GStreamer 1.10.4
- libva 1.8.2
- libva Intel driver 1.8.2
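For reference, the access pattern in question boils down to this (a minimal sketch, not our exact product code; buffer is the decoded GstBuffer delivered by the pipeline and dest a pre-allocated, frame-sized destination):

  #include <gst/gst.h>
  #include <string.h>

  /* Map a decoded buffer coming from the VA-API decoder and copy the
   * video data out. gst_buffer_map() is the call that gets slow when
   * many decoders run in parallel; the memcpy() itself stays fast. */
  static void
  copy_out (GstBuffer * buffer, guint8 * dest)
  {
    GstMapInfo info;

    if (!gst_buffer_map (buffer, &info, GST_MAP_READ))
      return;

    memcpy (dest, info.data, info.size);

    gst_buffer_unmap (buffer, &info);
  }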
What does your pipeline look like, so we can reproduce the issue? Or, better yet, do you have a test app?
I wonder if this is related to bug 785092.
Created attachment 358757 [details]
reproducer with time measurement for gst_buffer_map()

Attached is a small reproducer with time measurement, appsink-src.c. Tested with:

  ./repro Highway_1080p60_8M.mp4

Highway_1080p60_8M.mp4 is an HD video at 60 Hz. The printout is:

  ./repro Highway_1080p60_8M.mp4
  libva info: VA-API version 0.40.0
  libva info: va_getDriverName() returns 0
  libva info: Trying to open /usr/lib/va/i965_drv_video.so
  libva info: Found init function __vaDriverInit_0_40
  libva info: va_openDriver() returns 0
  Let's run!
  (the five libva info lines repeat twice more)
  ==> map : average: 2092 us min: 1236 us max: 5076 us <==
  ==> copy: average: 480 us min: 354 us max: 1747 us <==
  ==> map : average: 3573 us min: 1236 us max: 5228 us <==
  ==> copy: average: 504 us min: 348 us max: 1747 us <==
  ==> map : average: 3843 us min: 1236 us max: 5412 us <==
  ==> copy: average: 468 us min: 335 us max: 1747 us <==
  ==> map : average: 3710 us min: 1236 us max: 6291 us <==
  ==> copy: average: 471 us min: 335 us max: 1747 us <==
  ^C

Why is the time needed for gst_buffer_map() so high? If we start this example multiple times, the time for gst_buffer_map() increases dramatically, while the time for memcpy is always lower and does not increase.

In our product we run 16 HD videos at 30 Hz and measure around 16 ms for gst_buffer_map(). Starting this example 16 times gives the values below; only the time for gst_buffer_map() is shown (the time for memcpy is always lower), and only the last line of each of the 16 jobs is listed.

16 times 1920x1080@30Hz:

   1: average: 3694 us min: 1184 us max: 18074 us
   2: average: 2426 us min: 1150 us max: 18953 us
   3: average: 2352 us min: 1173 us max: 19834 us
   4: average: 3931 us min: 1164 us max: 14376 us
   5: average: 3082 us min: 1196 us max: 15445 us
   6: average: 2982 us min: 1186 us max: 16830 us
   7: average: 2499 us min: 1159 us max: 8703 us
   8: average: 3804 us min: 1208 us max: 18760 us
   9: average: 3444 us min: 1172 us max: 14467 us
  10: average: 3363 us min: 1193 us max: 15844 us
  11: average: 4407 us min: 1177 us max: 16361 us
  12: average: 4865 us min: 1155 us max: 19826 us
  13: average: 3928 us min: 1156 us max: 12244 us
  14: average: 3911 us min: 1149 us max: 19119 us
  15: average: 4213 us min: 1174 us max: 15549 us
  16: average: 3946 us min: 1189 us max: 14073 us

Further observations: the time for gst_buffer_map() depends on the resolution of the video, and also on the framerate. The values above were measured with sync=FALSE; with sync=TRUE the values are a bit lower, but IMHO still too high. You can toggle the sync parameter (FALSE/TRUE) in line 245 of appsink-src.c.

Maybe we are completely wrong... Can anybody explain why gst_buffer_map() needs so much time? Thanks.
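For clarity, what the reproducer measures is schematically the following (a simplified sketch, not the attachment verbatim; error handling omitted, see appsink-src.c for the real code):

  #include <gst/gst.h>
  #include <gst/app/gstappsink.h>
  #include <string.h>

  static guint8 *scratch;   /* destination for the copy, one frame large */

  /* appsink "new-sample" callback (schematic): time gst_buffer_map()
   * and memcpy() separately, which yields the map/copy lines above. */
  static GstFlowReturn
  on_new_sample (GstAppSink * sink, gpointer user_data)
  {
    GstSample *sample = gst_app_sink_pull_sample (sink);
    GstBuffer *buffer = gst_sample_get_buffer (sample);
    GstMapInfo info;
    gint64 t0, t1, t2;

    t0 = g_get_monotonic_time ();
    gst_buffer_map (buffer, &info, GST_MAP_READ);
    t1 = g_get_monotonic_time ();

    if (!scratch)
      scratch = g_malloc (info.size);
    memcpy (scratch, info.data, info.size);
    t2 = g_get_monotonic_time ();

    gst_buffer_unmap (buffer, &info);
    gst_sample_unref (sample);

    g_print ("map: %" G_GINT64_FORMAT " us  copy: %" G_GINT64_FORMAT " us\n",
        t1 - t0, t2 - t1);
    return GST_FLOW_OK;
  }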
any news?
In the meantime we tested this issue with a 4K video (3840x2178@60Hz). The measured time for gst_buffer_map() ranges from 4362 us (min) to 27338 us (max).
Created attachment 360184 [details]
simplified test file

I simplified the test app a bit and used callgrind to measure the CPU consumption and the call graph. The CPU consumption bottlenecks are in memcpy and the ffi calls. But in the buffer map, which means the VA image load, what I see is a lot of mutex locking in the Intel driver.

I don't know the internals of the driver, but it might have a single list of buffers, and when there are concurrent pipelines decoding, it would take increasing time to lock it and find, for each process, the surface to dump as an image. This would mean that the driver should be improved for this use case.

So, what I would recommend is to file an issue against intel-vaapi-driver on GitHub: https://github.com/01org/intel-vaapi-driver/issues
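For anyone who wants to reproduce the profiling, it is standard callgrind usage, along these lines (the exact invocation may have differed):

  valgrind --tool=callgrind ./repro Highway_1080p60_8M.mp4
  callgrind_annotate callgrind.out.<pid>

The annotated output is where the memcpy/ffi hotspots and the driver-side mutex locking show up.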
Victor, thanks for your answer and your help. In the meantime I opened an intel-vaapi-driver ticket: https://github.com/01org/intel-vaapi-driver/issues/274

The test setup from Intel looks totally different to me... so maybe they can't see our issue. Maybe you can give them a hint. Thanks.
Hi,

in the vaapi driver ticket https://github.com/01org/intel-vaapi-driver/issues/274 it is mentioned that vaGetImage is used in this case, which seems to be slow.

How can vaDeriveImage, which should be faster, be used here?

Regards,
Thomas
(In reply to Thomas Scheuermann from comment #8)
> How can vaDeriveImage, which should be faster, be used here?

By exporting the environment variable GST_VAAPI_ENABLE_DIRECT_RENDERING=1.
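For context, here is roughly what the two paths look like at the libva level (a sketch against the public libva API, not gstreamer-vaapi's actual code; dpy, surface, format, width and height stand for an already-initialized VADisplay, the decoded VASurfaceID and the image parameters; error handling omitted):

  /* Slow path (default here): vaGetImage copies the surface
   * contents into a separately allocated VAImage. */
  VAImage image;
  void *data;
  vaCreateImage (dpy, &format, width, height, &image);
  vaGetImage (dpy, surface, 0, 0, width, height, image.image_id);
  vaMapBuffer (dpy, image.buf, &data);    /* data -> copied pixels */
  /* ... read the frame ... */
  vaUnmapBuffer (dpy, image.buf);
  vaDestroyImage (dpy, image.image_id);

  /* Fast path ("direct rendering"): vaDeriveImage aliases the
   * surface memory, so mapping it avoids the intermediate copy. */
  VAImage derived;
  vaDeriveImage (dpy, surface, &derived);
  vaMapBuffer (dpy, derived.buf, &data);  /* data -> surface memory */
  /* ... read the frame ... */
  vaUnmapBuffer (dpy, derived.buf);
  vaDestroyImage (dpy, derived.image_id);

In gstreamer-vaapi, GST_VAAPI_ENABLE_DIRECT_RENDERING=1 is what selects the derived-image path, e.g.:

  GST_VAAPI_ENABLE_DIRECT_RENDERING=1 ./repro Highway_1080p60_8M.mp4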
It is disabled by default because it is very unstable with the Mesa driver.
Created attachment 362041 [details] [review]
hack for testing

The environment variable does not seem to be enabling vaDeriveImage. Can you try this quick hack?
Comment on attachment 362041 [details] [review]
hack for testing

I see!! When memory:VASurface is negotiated, direct rendering is not set when it should be. I'll have a proper patch soon.
Hi sreerenj, I'm back from Prague and I was able to make a quick test that included your patch. It looks like the hack is working: the duration of the call has decreased by nearly 50 percent. There are some public holidays here in Germany in the next few days, but I will retest this on our original setup on Thursday or Friday. Thank you again for your help.
Created attachment 362624 [details] [review]
plugins: resurrect direct rendering

Because of the changes making the dmabuf allocator the default allocator for raw video caps, the direct rendering feature got lost. This patch brings back its configuration if the environment variable GST_VAAPI_ENABLE_DIRECT_RENDERING is defined.
Created attachment 362625 [details] [review]
plugins: resurrect direct rendering

Because of the changes making the dmabuf allocator the default allocator for raw video caps, the direct rendering feature got lost. This patch brings back its configuration if the environment variable GST_VAAPI_ENABLE_DIRECT_RENDERING is defined.
Created attachment 362662 [details] [review]
plugins: direct rendering on memory:VASurface

As buffers negotiated with the memory:VASurface caps feature can also be mapped, they can also be configured to use VA derived images, in other words "direct rendering".

Also, because of the changes making the dmabuf allocator the default allocator, the code for configuring direct rendering was not clear.

This patch cleans up the code and enables direct rendering when the environment variable GST_VAAPI_ENABLE_DIRECT_RENDERING is defined, even when the memory:VASurface caps feature is negotiated.
(In reply to Víctor Manuel Jáquez Leal from comment #16)
> Created attachment 362662 [details] [review] [review]
> plugins: direct rendering on memory:VASurface
> [...]

Push it :)
Attachment 362662 [details] pushed as 72362e1 - plugins: direct rendering on memory:VASurface