GNOME Bugzilla – Bug 796521
msdk: Playback not smooth when using ximagesink.
Last modified: 2018-11-03 14:25:46 UTC
Steps to reproduce:
============================================
$ gst-launch-1.0 filesrc location=/media/ts/Sally.ts '!' tsdemux '!' mpegvideoparse '!' msdkmpeg2dec '!' msdkvpp '!' video/x-raw,format=NV12 '!' videoconvert '!' ximagesink

The rendered video is not smooth.

If the video/x-raw,format=NV12 caps filter is removed:

$ gst-launch-1.0 filesrc location=/media/ts/Sally.ts '!' tsdemux '!' mpegvideoparse '!' msdkmpeg2dec '!' msdkvpp '!' videoconvert '!' ximagesink

the rendered video looks good.
A part of this issue is covered in https://bugzilla.gnome.org/show_bug.cgi?id=796699 . A vpp fix might also be needed to close this bug.
I think it is a performance issue in the driver or MediaSDK. Somehow, copying from dmabuf-backed NV12 to a regular (non-dmabuf) NV12 VA surface takes much longer: it is extremely slow and the fps drops by more than 50%. With BGRx as the output format it works without any issue.
Do you still reproduce this in master? We have introduced the processing-deadline property now to address these issues.
(In reply to sreerenj from comment #2)
> Somehow, copying from dmabuf-backed NV12 to a regular (non-dmabuf) NV12 VA
> surface takes much longer: it is extremely slow and the fps drops by more
> than 50%.

Did you check the backing memory type? It sounds like the NV12 data is non-CPU-cacheable while the BGRx data is CPU-cacheable. The performance impact on CPU processing will be huge if that is the case.
(In reply to Nicolas Dufresne (ndufresne) from comment #3)
> Do you still reproduce this in master?

Yes.
(In reply to Nicolas Dufresne (ndufresne) from comment #4)
> (In reply to sreerenj from comment #2)
> > Somehow, copying from dmabuf-backed NV12 to a regular (non-dmabuf) NV12 VA
> > surface takes much longer: it is extremely slow and the fps drops by more
> > than 50%.
>
> Did you check the backing memory type? It sounds like the NV12 data is
> non-CPU-cacheable while the BGRx data is CPU-cacheable. The performance
> impact on CPU processing will be huge if that is the case.

IIUC the video memory is USWC (uncacheable, speculative write-combining), and a regular memcpy() from USWC to system memory is always slow. But I don't know why BGRx gives better performance; it might be related to the size and alignment of the BGRx data.

This issue is reproducible with both gstreamer-vaapi and gst-msdk with any 720p video.

gstreamer-vaapi:

GST_VAAPI_ENABLE_DIRECT_RENDERING=1 gst-launch-1.0 -v filesrc location=Sally.ts ! tsdemux ! mpegvideoparse ! vaapimpeg2dec ! vaapipostproc ! video/x-raw,format=NV12 ! videoconvert ! xvimagesink

Here we explicitly enable direct rendering (which internally maps the VASurface with vaDeriveImage). In normal use cases gstreamer-vaapi won't use direct rendering.

gst-msdk:

gst-msdk always uses direct mapping, so the issue is easily reproducible with any pipeline, for example:

gst-launch-1.0 -v filesrc location=Sally.ts ! tsdemux ! mpegvideoparse ! msdkmpeg2dec ! msdkvpp ! video/x-raw,format=NV12 ! videoconvert ! xvimagesink
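For what it's worth, the usual way to read back USWC-mapped surfaces at a reasonable speed is SSE4.1 streaming loads rather than a plain memcpy(). The sketch below is only meant to illustrate why ordinary loads from write-combining memory are so slow; the helper name and alignment assumptions are mine, this is not code from videoconvert or gst-msdk.

#include <emmintrin.h>   /* SSE2: _mm_store_si128 */
#include <smmintrin.h>   /* SSE4.1: _mm_stream_load_si128 */
#include <stddef.h>

/* Hypothetical helper: copy 'size' bytes from a USWC (write-combining)
 * mapping into a cacheable system-memory buffer. A plain memcpy() issues
 * ordinary cached loads, which are very slow against uncached memory;
 * _mm_stream_load_si128() uses the streaming-load path instead.
 * Assumes both pointers are 16-byte aligned and size is a multiple of 16. */
static void
copy_from_uswc (void *dst, void *src, size_t size)
{
  __m128i *d = (__m128i *) dst;
  __m128i *s = (__m128i *) src;     /* points into the USWC mapping */
  size_t i;

  for (i = 0; i < size / sizeof (__m128i); i++) {
    __m128i v = _mm_stream_load_si128 (s + i);  /* streaming load from WC memory */
    _mm_store_si128 (d + i, v);                 /* regular store to cacheable dst */
  }
}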
Note that this issue is not new; it has been around for years already.
The issue might come from the way videoconvert reads the NV12 memory. When you output BGRx, the code that reads the memory simply does memcpy(), whereas for NV12 the converter does random pixel access, and I'm thinking this could be extremely slow without caching. What's the reason for not producing cacheable buffers on a memory-coherent system like an Intel platform?
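To make the difference between the two read patterns concrete, here is a toy NV12-to-BGRx conversion loop (standard BT.601 integer approximation, stride assumed equal to width; it is not taken from videoconvert). Every output pixel reads from both the Y plane and the interleaved UV plane, so the reads are scattered; if that source mapping is USWC, each read pays the full uncached latency, while the BGRx output path is a single linear memcpy().

#include <stdint.h>
#include <stddef.h>
#include <string.h>

static inline uint8_t clamp8 (int v) { return v < 0 ? 0 : v > 255 ? 255 : v; }

/* Toy converter: 'nv12' is assumed to be a mapping of the (uncached) VA
 * surface, laid out as the Y plane followed by the interleaved UV plane;
 * 'bgrx' is a cacheable system-memory destination, 4 bytes per pixel. */
static void
nv12_to_bgrx (const uint8_t *nv12, uint8_t *bgrx, int width, int height)
{
  const uint8_t *y_plane = nv12;
  const uint8_t *uv_plane = nv12 + (size_t) width * height;

  for (int row = 0; row < height; row++) {
    for (int col = 0; col < width; col++) {
      int y = y_plane[row * width + col] - 16;
      int u = uv_plane[(row / 2) * width + (col & ~1)] - 128;
      int v = uv_plane[(row / 2) * width + (col & ~1) + 1] - 128;
      uint8_t *out = bgrx + 4 * ((size_t) row * width + col);

      out[0] = clamp8 ((298 * y + 516 * u + 128) >> 8);            /* B */
      out[1] = clamp8 ((298 * y - 100 * u - 208 * v + 128) >> 8);  /* G */
      out[2] = clamp8 ((298 * y + 409 * v + 128) >> 8);            /* R */
      out[3] = 0;                                                  /* X */
    }
  }
}

/* The BGRx output path, by contrast, is one linear sweep over the source:
 *   memcpy (dst, mapped_bgrx_surface, 4 * (size_t) width * height);
 */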
It is possible to get both cached and uncached memory in the output. Libva allows two methods to map video memory:

Method 1: map directly with vaDeriveImage() (supported by both the iHD driver and vaapi-intel-driver). In the usual use cases the VASurface wraps tiled (Y-tiled) video memory, and the memory mapped via the vaDeriveImage() API is tiled too. In that case the driver uses drm_intel_gem_bo_map_gtt() to map the tiled memory, which yields USWC memory.

Method 2: create a VAImage structure with the vaCreateImage() API and a user-specified video format, then download the VASurface (which holds the decoded content) using shaders or vpp by invoking the vaGetImage() API. This produces linear memory, which is CPU-cacheable.

We chose method 2 as the default in gstreamer-vaapi whenever the video memory has to be mapped for software operations. But gst-msdk only supports method 1, so we are stuck with USWC memory. The iHD driver wasn't properly supporting vaGetImage(); I'm not sure about the current status, though.
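For reference, a minimal sketch of the two mapping methods using the libva API (error handling trimmed; the helper names are made up for illustration and 'dpy'/'surface' are assumed to come from an already set-up decode session):

#include <va/va.h>

/* Method 1: vaDeriveImage() wraps the surface's own (tiled) backing store;
 * the pointer returned by vaMapBuffer() is typically USWC memory. */
static void *
map_direct (VADisplay dpy, VASurfaceID surface, VAImage *image)
{
  void *data = NULL;

  if (vaDeriveImage (dpy, surface, image) != VA_STATUS_SUCCESS)
    return NULL;
  if (vaMapBuffer (dpy, image->buf, &data) != VA_STATUS_SUCCESS) {
    vaDestroyImage (dpy, image->image_id);
    return NULL;
  }
  return data;  /* release with vaUnmapBuffer() + vaDestroyImage() */
}

/* Method 2: create a linear VAImage and let the driver download the surface
 * into it with vaGetImage(); the mapped memory is CPU-cacheable. */
static void *
map_via_getimage (VADisplay dpy, VASurfaceID surface,
    int width, int height, VAImage *image)
{
  VAImageFormat fmt = { .fourcc = VA_FOURCC_NV12,
                        .byte_order = VA_LSB_FIRST,
                        .bits_per_pixel = 12 };
  void *data = NULL;

  if (vaCreateImage (dpy, &fmt, width, height, image) != VA_STATUS_SUCCESS)
    return NULL;
  if (vaGetImage (dpy, surface, 0, 0, width, height,
          image->image_id) != VA_STATUS_SUCCESS ||
      vaMapBuffer (dpy, image->buf, &data) != VA_STATUS_SUCCESS) {
    vaDestroyImage (dpy, image->image_id);
    return NULL;
  }
  return data;  /* release with vaUnmapBuffer() + vaDestroyImage() */
}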
-- GitLab Migration Automatic Message -- This bug has been migrated to freedesktop.org's GitLab instance and has been closed from further activity. You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.freedesktop.org/gstreamer/gst-plugins-bad/issues/729.