GNOME Bugzilla – Bug 785092
20x H.264 render-less video decode: CPU usage spikes up to ~70% with sync=false
Last modified: 2018-11-03 15:50:26 UTC
1. Create a simple bash script that runs the GStreamer command 20 times in parallel: gst-launch-1.0 -v filesrc location=/videos/1920x1080_10mbps_30fps.mp4 ! qtdemux ! vaapidecode ! fpsdisplaysink video-sink=fakesink text-overlay=false sync=false
2. Observe the CPU usage spike up to 70%.
3. The customer expects that CPU utilization should not be this high when sync=false. We need to debug further to find the root cause.
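The script from step 1 could look like the following sketch. The clip path and pipeline are taken verbatim from the report; the dry-run `echo` is my addition (remove it, background each command with `&`, and add a final `wait` to actually launch 20 parallel pipelines).

```shell
#!/bin/sh
# Sketch of the step-1 reproduction script (assumes POSIX sh and the
# clip path from the report). The "echo" makes this a dry run; drop it
# to really launch the pipelines.
CLIP=/videos/1920x1080_10mbps_30fps.mp4
count=0
while [ "$count" -lt 20 ]; do
  echo "gst-launch-1.0 -v filesrc location=$CLIP ! qtdemux ! vaapidecode" \
       "! fpsdisplaysink video-sink=fakesink text-overlay=false sync=false"
  count=$((count + 1))
done
```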
Setting sync=false on sink elements disables the clock synchronization that would rate-limit processing. That command is therefore decoding frames as fast as the hardware allows, which is the cause of the higher CPU usage. As a result, this is all behaving as expected.
Yes, you are right. I have a patch to improve this multi-channel use case and fix the issue, and I would appreciate your help reviewing it. In my test, CPU usage drops to 10% with nearly the same FPS for each channel. For 1 channel, however, FPS drops from 900+ to 500+ and CPU usage drops from 14% to 7% on my platform.
Created attachment 355910 [details] [review] a patch to fix this issue
This is performance tuning for a very specific use case (20x decoding pipelines), while, if I understand correctly, it degrades the "normal" use case (single-pipeline decoding). It would be great to look for another approach without the penalty for the most common use case.
If I understand correctly, this patch makes offline processing significantly slower. Why don't you configure your process with a lower priority instead?
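For reference, lowering the priority needs no GStreamer changes, just `nice`. A minimal sketch, where the harmless `sh -c` payload stands in for the real gst-launch-1.0 pipeline from the report:

```shell
# Run the heavy decode job at the lowest scheduling priority (19) so it
# only consumes otherwise-idle CPU time. The echo payload is a stand-in
# for the real gst-launch-1.0 pipeline.
out=$(nice -n 19 sh -c 'echo "pipeline ran at low priority"')
echo "$out"
```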
Also note that, in theory, fakesink will result in a frame copy (when this gets fixed), so this will be a bad perf test.
This bug has been reported internally and we did some investigation too. Without Peng's patch, the kernel does implicit syncing; Peng's patch adds explicit syncing. Ideally both should behave similarly, but there seem to be some differences in the kernel. Let me copy & paste Peng's comment on this: "the root cause of this issue is that the Linux kernel i915 driver uses spinning, not sleeping, to implement the wait for busy GEM buffer objects to become idle. If explicit sync/wait/map() calls are added during decoding, the spin is replaced with a sleep, which saves the CPU usage"
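As a generic illustration of the spin-versus-sleep difference described above (plain shell, not the i915 code; the flag file is a hypothetical stand-in for a busy GEM buffer object):

```shell
#!/bin/sh
# Illustration only: a spinning wait keeps a core busy until the
# "buffer" becomes idle, while a sleeping wait yields the CPU between
# polls. The flag file stands in for a busy GEM buffer object.
FLAG="$(mktemp -u)"
( sleep 1; : > "$FLAG" ) &          # "GPU" marks the buffer idle after ~1s

# Spinning wait (what the report blames for the ~70% CPU): poll nonstop.
while [ ! -e "$FLAG" ]; do :; done
rm -f "$FLAG"

( sleep 1; : > "$FLAG" ) &          # reset for the second waiter
# Sleeping wait (what explicit sync achieves): yield between polls.
while [ ! -e "$FLAG" ]; do sleep 1; done
rm -f "$FLAG"
result="both waits completed"
echo "$result"
```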
Thanks for all your comments. 20x decoding is mainly used for transcoding or video-wall applications, and I think this patch will help a lot for those use cases. For 1 channel, the most common use case is a video player, and ~60 fps is enough for most players; this patch doesn't have much effect on a player even if it lowers the decoding FPS.
Still, from your report ("But for 1 channel, fps drop from 900+ to 500+"), if you are transcoding 1 stream in your use case, your application will be about 44% slower. That's a massive cut. And if this is a kernel bug, why are you looking for a solution in GStreamer? You should find a solution on the driver side.
For transcoding 1 channel, the bottleneck should be encoding; it can't reach FPS as high as decoding does. I am assuming it is a mutex locking strategy in the kernel: mutex_spin_on_owner() occupies a lot of CPU time in this use case. mutex_spin_on_owner() just means that another CPU on the system is holding the lock, so the kernel decided to spin instead of sleep. So far, the only solution we know of is to add a sync or wait, either in the driver or in the middleware. We need to decide where to put this sync. It could certainly be added in the driver, but then the sync would always happen for decoding, and the middleware couldn't disable it for special use cases. Do you have a better idea?
-- GitLab Migration Automatic Message -- This bug has been migrated to freedesktop.org's GitLab instance and has been closed from further activity. You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.freedesktop.org/gstreamer/gstreamer-vaapi/issues/59.