GNOME Bugzilla – Bug 786054
Mapping VA-API buffers is slow
Last modified: 2017-11-01 10:24:16 UTC
If the number of decoded videos increases, the time spent in gst_buffer_map() increases dramatically. With 16 H.264 full-HD videos, each mapping takes more than 16 ms. We map the buffer to copy out the video data; the copy itself takes only about 1 ms. We assume the increase in mapping time is a bug, maybe a concurrency issue, and maybe inside the Intel VA-API driver rather than in the GStreamer code.

Setup:
- Intel Core i7
- GStreamer 1.10.4
- libva 1.8.2
- libva Intel driver 1.8.2
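For reference, the access pattern in question boils down to this (a minimal sketch, not our exact product code; buffer is the decoded GstBuffer delivered by the pipeline and dest a pre-allocated, frame-sized destination):

  #include <gst/gst.h>
  #include <string.h>

  /* Map a decoded buffer coming from the VA-API decoder and copy the
   * video data out. gst_buffer_map() is the call that gets slow when
   * many decoders run in parallel; the memcpy() itself stays fast. */
  static void
  copy_out (GstBuffer * buffer, guint8 * dest)
  {
    GstMapInfo info;

    if (!gst_buffer_map (buffer, &info, GST_MAP_READ))
      return;

    memcpy (dest, info.data, info.size);

    gst_buffer_unmap (buffer, &info);
  }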
What does your pipeline look like, so we can reproduce the issue? Or, better yet, do you have a test app?
I wonder if this is related to bug 785092.
Created attachment 358757 [details]
reproducer with time measurement for gst_buffer_map()

Attached is a small reproducer with time measurement, appsink-src.c. Tested with:

  ./repro Highway_1080p60_8M.mp4

Highway_1080p60_8M.mp4 is an HD video at 60 Hz. The printout is:

  ./repro Highway_1080p60_8M.mp4
  libva info: VA-API version 0.40.0
  libva info: va_getDriverName() returns 0
  libva info: Trying to open /usr/lib/va/i965_drv_video.so
  libva info: Found init function __vaDriverInit_0_40
  libva info: va_openDriver() returns 0
  Let's run!
  (the five libva info lines repeat twice more)
  ==> map : average: 2092 us min: 1236 us max: 5076 us <==
  ==> copy: average: 480 us min: 354 us max: 1747 us <==
  ==> map : average: 3573 us min: 1236 us max: 5228 us <==
  ==> copy: average: 504 us min: 348 us max: 1747 us <==
  ==> map : average: 3843 us min: 1236 us max: 5412 us <==
  ==> copy: average: 468 us min: 335 us max: 1747 us <==
  ==> map : average: 3710 us min: 1236 us max: 6291 us <==
  ==> copy: average: 471 us min: 335 us max: 1747 us <==
  ^C

Why is the time needed for gst_buffer_map() so high? If we start this example multiple times, the time for gst_buffer_map() increases dramatically, while the time for memcpy is always lower and does not increase.

In our product we run 16 HD videos at 30 Hz and measure around 16 ms for gst_buffer_map(). Starting this example 16 times gives the values below; only the time for gst_buffer_map() is shown (the time for memcpy is always lower), and only the last line of each of the 16 jobs is listed.

16 times 1920x1080@30Hz:

   1: average: 3694 us min: 1184 us max: 18074 us
   2: average: 2426 us min: 1150 us max: 18953 us
   3: average: 2352 us min: 1173 us max: 19834 us
   4: average: 3931 us min: 1164 us max: 14376 us
   5: average: 3082 us min: 1196 us max: 15445 us
   6: average: 2982 us min: 1186 us max: 16830 us
   7: average: 2499 us min: 1159 us max: 8703 us
   8: average: 3804 us min: 1208 us max: 18760 us
   9: average: 3444 us min: 1172 us max: 14467 us
  10: average: 3363 us min: 1193 us max: 15844 us
  11: average: 4407 us min: 1177 us max: 16361 us
  12: average: 4865 us min: 1155 us max: 19826 us
  13: average: 3928 us min: 1156 us max: 12244 us
  14: average: 3911 us min: 1149 us max: 19119 us
  15: average: 4213 us min: 1174 us max: 15549 us
  16: average: 3946 us min: 1189 us max: 14073 us

Further observations: the time for gst_buffer_map() depends on the resolution of the video, and also on the framerate. The values above were measured with sync=FALSE; with sync=TRUE the values are a bit lower, but IMHO still too high. You can toggle the sync parameter (FALSE/TRUE) in line 245 of appsink-src.c.

Maybe we are completely wrong... Can anybody explain why gst_buffer_map() needs so much time? Thanks.
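For clarity, what the reproducer measures is schematically the following (a simplified sketch, not the attachment verbatim; error handling omitted, see appsink-src.c for the real code):

  #include <gst/gst.h>
  #include <gst/app/gstappsink.h>
  #include <string.h>

  static guint8 *scratch;   /* destination for the copy, one frame large */

  /* appsink "new-sample" callback (schematic): time gst_buffer_map()
   * and memcpy() separately, which yields the map/copy lines above. */
  static GstFlowReturn
  on_new_sample (GstAppSink * sink, gpointer user_data)
  {
    GstSample *sample = gst_app_sink_pull_sample (sink);
    GstBuffer *buffer = gst_sample_get_buffer (sample);
    GstMapInfo info;
    gint64 t0, t1, t2;

    t0 = g_get_monotonic_time ();
    gst_buffer_map (buffer, &info, GST_MAP_READ);
    t1 = g_get_monotonic_time ();

    if (!scratch)
      scratch = g_malloc (info.size);
    memcpy (scratch, info.data, info.size);
    t2 = g_get_monotonic_time ();

    gst_buffer_unmap (buffer, &info);
    gst_sample_unref (sample);

    g_print ("map: %" G_GINT64_FORMAT " us  copy: %" G_GINT64_FORMAT " us\n",
        t1 - t0, t2 - t1);
    return GST_FLOW_OK;
  }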
any news?
In the meantime we tested this issue with a 4K video (3840x2178@60Hz). The measured time for gst_buffer_map() ranges from 4362 us (min) to 27338 us (max).
Created attachment 360184 [details]
simplified test file

I simplified the test app a bit and used callgrind to measure the CPU consumption and the call graph. The CPU consumption bottlenecks are in memcpy and the ffi calls. But in the buffer map, which means the VA image load, what I see is a lot of mutex locking in the Intel driver.

I don't know the internals of the driver, but it might have a single list of buffers, and when there are concurrent pipelines decoding, it would take increasing time to lock it and find, for each process, the surface to dump as an image. This would mean that the driver should be improved for this use case.

So, what I would recommend is to file an issue against intel-vaapi-driver on GitHub: https://github.com/01org/intel-vaapi-driver/issues
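For anyone who wants to reproduce the profiling, it is standard callgrind usage, along these lines (the exact invocation may have differed):

  valgrind --tool=callgrind ./repro Highway_1080p60_8M.mp4
  callgrind_annotate callgrind.out.<pid>

The annotated output is where the memcpy/ffi hotspots and the driver-side mutex locking show up.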
Victor, thanks for your answer and your help. In the meantime I opened an intel-vaapi-driver ticket: https://github.com/01org/intel-vaapi-driver/issues/274

The test setup from Intel looks totally different to me... so maybe they can't see our issue. Maybe you can give them a hint. Thanks.
Hi,

in the vaapi driver ticket https://github.com/01org/intel-vaapi-driver/issues/274 it is mentioned that vaGetImage is used in this case, which seems to be slow.

How can vaDeriveImage, which should be faster, be used here?

Regards,
Thomas
(In reply to Thomas Scheuermann from comment #8)
> How can vaDeriveImage, which should be faster, be used here?

By exporting the environment variable GST_VAAPI_ENABLE_DIRECT_RENDERING=1.
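For context, here is roughly what the two paths look like at the libva level (a sketch against the public libva API, not gstreamer-vaapi's actual code; dpy, surface, format, width and height stand for an already-initialized VADisplay, the decoded VASurfaceID and the image parameters; error handling omitted):

  /* Slow path (default here): vaGetImage copies the surface
   * contents into a separately allocated VAImage. */
  VAImage image;
  void *data;
  vaCreateImage (dpy, &format, width, height, &image);
  vaGetImage (dpy, surface, 0, 0, width, height, image.image_id);
  vaMapBuffer (dpy, image.buf, &data);    /* data -> copied pixels */
  /* ... read the frame ... */
  vaUnmapBuffer (dpy, image.buf);
  vaDestroyImage (dpy, image.image_id);

  /* Fast path ("direct rendering"): vaDeriveImage aliases the
   * surface memory, so mapping it avoids the intermediate copy. */
  VAImage derived;
  vaDeriveImage (dpy, surface, &derived);
  vaMapBuffer (dpy, derived.buf, &data);  /* data -> surface memory */
  /* ... read the frame ... */
  vaUnmapBuffer (dpy, derived.buf);
  vaDestroyImage (dpy, derived.image_id);

In gstreamer-vaapi, GST_VAAPI_ENABLE_DIRECT_RENDERING=1 is what selects the derived-image path, e.g.:

  GST_VAAPI_ENABLE_DIRECT_RENDERING=1 ./repro Highway_1080p60_8M.mp4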
It is disabled by default because it is very unstable with the Mesa driver.
Created attachment 362041 [details] [review]
hack for testing

The environment variable does not seem to be enabling vaDeriveImage. Can you try this quick hack?
Comment on attachment 362041 [details] [review]
hack for testing

I see!! When memory:VASurface is negotiated, direct rendering is not set when it should be. I'll have a proper patch soon.
Hi sreerenj, I'm back from Prague and I was able to make a quick test that included your patch. It looks like the hack is working: the duration of the call has decreased by nearly 50 percent. There are some public holidays here in Germany in the next few days, but I will retest this on our original setup on Thursday or Friday. Thank you again for your help.
Created attachment 362624 [details] [review]
plugins: resurrect direct rendering

Because of the changes making the dmabuf allocator the default allocator for raw video caps, the direct rendering feature got lost. This patch brings back its configuration if the environment variable GST_VAAPI_ENABLE_DIRECT_RENDERING is defined.
Created attachment 362625 [details] [review]
plugins: resurrect direct rendering

Because of the changes making the dmabuf allocator the default allocator for raw video caps, the direct rendering feature got lost. This patch brings back its configuration if the environment variable GST_VAAPI_ENABLE_DIRECT_RENDERING is defined.
Created attachment 362662 [details] [review]
plugins: direct rendering on memory:VASurface

As buffers negotiated with the memory:VASurface caps feature can also be mapped, they can also be configured to use VA derived images, in other words "direct rendering".

Also, because of the changes making the dmabuf allocator the default allocator, the code for configuring direct rendering was not clear.

This patch cleans up the code and enables direct rendering when the environment variable GST_VAAPI_ENABLE_DIRECT_RENDERING is defined, even when the memory:VASurface caps feature is negotiated.
(In reply to Víctor Manuel Jáquez Leal from comment #16)
> Created attachment 362662 [details] [review] [review]
> plugins: direct rendering on memory:VASurface
> [...]

Push it :)
Attachment 362662 [details] pushed as 72362e1 - plugins: direct rendering on memory:VASurface