Bug 709018 - Dynamically allocate GstVaapiSurface to reduce video memory consumption
Status: RESOLVED OBSOLETE
Product: GStreamer
Classification: Platform
Component: gstreamer-vaapi
Version: git master
OS: Other Linux
Importance: Normal normal
Target Milestone: git master
Assigned To: GStreamer Maintainers
QA Contact: GStreamer Maintainers
Depends on:
Blocks:
Reported: 2013-09-29 08:44 UTC by Zhao, Halley
Modified: 2018-11-03 15:43 UTC
See Also:
GNOME target: ---
GNOME version: ---


Attachments:
0001-GstVaapiSurfacePool-dynamic-alloc-Surface (1.96 KB, patch), 2013-09-29 08:46 UTC, Zhao, Halley
0002-connect-GstVaapiContext-and-GstVaapiSurface-which-is (4.06 KB, patch), 2013-09-29 08:46 UTC, Zhao, Halley
0003-GstVaapiSurface-Misc-fix-for-GstVaapiContext (1.41 KB, patch), 2013-09-29 08:47 UTC, Zhao, Halley

Description Zhao, Halley 2013-09-29 08:44:22 UTC
We calculate a maximum number (the 'capacity' of the surface pool) from the reference frame count. However, many streams don't use that many reference frames in reality.
This means that if we allocate 'capacity' surfaces up front, we reserve more video memory than is actually needed.

If we initialize the surface pool with a smaller number and allocate new surfaces only when they are really required, video memory consumption is reduced.

For example, with the 1080p kauai stream, only 14 surfaces are used in reality although the capacity is 17:
gst-launch-1.0 filesrc location=/home/halley/media/video/kauai_1080p_H.264_AAC_9478Kbps_30FPS_5m30s.mp4 ! qtdemux ! vaapidecode ! vaapisink

However, playbin doesn't benefit, because the queue element always tries to request more surfaces.
I don't have a good solution for that; in other words, I haven't found a good way to set the length of the queue inside playbin.
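For illustration, here is a minimal sketch of the idea (hypothetical names; this is not the attached patch): the pool starts small and creates a new surface only when the free list is empty and the capacity allows it.

#include <glib.h>

/* Hypothetical sketch, not the attached patch: grow the pool on demand
 * instead of pre-allocating 'capacity' surfaces up front. */
typedef struct {
  GQueue *free_objects;   /* surfaces available for reuse */
  guint num_allocated;    /* surfaces created so far */
  guint capacity;         /* upper bound, e.g. 17 for this stream */
} SurfacePool;

/* alloc_surface() stands in for the real VA surface allocation. */
extern gpointer alloc_surface (void);

static gpointer
surface_pool_get (SurfacePool *pool)
{
  gpointer surface = g_queue_pop_head (pool->free_objects);

  if (!surface && pool->num_allocated < pool->capacity) {
    surface = alloc_surface ();
    pool->num_allocated++;
  }
  return surface;   /* NULL once the pool is exhausted */
}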
Comment 1 Zhao, Halley 2013-09-29 08:46:41 UTC
Created attachment 256001
0001-GstVaapiSurfacePool-dynamic-alloc-Surface
Comment 2 Zhao, Halley 2013-09-29 08:46:59 UTC
Created attachment 256002
0002-connect-GstVaapiContext-and-GstVaapiSurface-which-is
Comment 3 Zhao, Halley 2013-09-29 08:47:16 UTC
Created attachment 256003
0003-GstVaapiSurface-Misc-fix-for-GstVaapiContext
Comment 4 Gwenole Beauchesne 2013-09-30 10:03:37 UTC
Hi,

(In reply to comment #0)
> For example, with the 1080p kauai stream, only 14 surfaces are used in
> reality although the capacity is 17.

When you say 14 surfaces are used in reality, is this because max_num_ref_frames = 14 or what? Is the stream available on tinderbox or beee?

For H.264, we usually allocate the number of frames requested by the bitstream, plus a certain number of "scratch" surfaces (4) to account for render delays. We could probably reduce the latter to two, though three is more realistic.
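In code terms, that sizing rule amounts to something like this (a sketch with illustrative names, not the actual gst-vaapi code):

/* Illustrative sizing rule; not the actual gst-vaapi code. */
static unsigned int
pool_capacity (unsigned int max_num_ref_frames)
{
  const unsigned int scratch = 4;        /* spare surfaces for render delays */
  return max_num_ref_frames + scratch;   /* e.g. 13 + 4 = 17 here */
}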

Thanks.
Comment 5 Zhao, Halley 2013-10-10 01:24:35 UTC
here is the stream: http://halley-sandy.sh.intel.com/share/media/
max_num_ref_frames should be 13.

The root cause is that max_num_ref_frames is the possible maximum; the real maximum may be less than that.
So if we allocate a VA surface per request, some video memory can be saved.
Comment 6 Gwenole Beauchesne 2013-10-10 08:31:50 UTC
Thanks for the stream.

Actually, this is what we do for XBMC, and what we used to do for gst-vaapi. However, the main issues are that (i) you have to re-create the VA decode context each time you increase the number of VA surfaces, and (ii) for some drivers [typically PVR] this causes the underlying context to be reset as well, along with various other state (colocated buffers et al.).

So, this is not a general-purpose solution. The heuristics need to be improved, while still keeping conformance in mind. More specifically, the DPB length is constrained by max_num_ref_frames, among other variables.

BTW, on most drivers, the underlying backing store is lazily allocated, so the gain from doing this at the upper layer is thin. Do you have any actual figures? Thanks.
Comment 7 Zhao, Halley 2013-10-10 10:04:18 UTC
(In reply to comment #6)
Thanks, it is more complicated than I expected.

> (i) you have to re-create the VA decode context each time you increase the number of VA surfaces,
In fact, (for the decoder) the vaContext is re-created only when the codec changes; it isn't re-created when a new surface is created.
(It is bad to require the vaSurfaces to be prepared before creating the vaContext: the encoder can't follow that, and the VPP also uses surfaces from the decoder. The PVR driver was also changed to update the context with new surfaces.)

> (ii) for some drivers [typically PVR] this causes the underlying context to be reset as well, along with various other state (colocated buffers et al.).
I talked with the driver developer; that is not the case, at least for now.

> BTW, on most drivers, the underlying backing store is lazily allocated, so the
> gain from doing this at the upper layer is thin.
We shouldn't assume the behavior of the driver. The PVR driver developer told me it is not lazy there.
Even if it is lazy, we should change the recycle policy in the video pool: we should not push a recycled surface to the tail of free_objects.

> Do you have any actual figures?
It used to be a bug fix on the PR3 phone; I ported it to upstream gst-vaapi.
It does provide a benefit there.
Comment 8 Gwenole Beauchesne 2013-10-10 13:21:01 UTC
(In reply to comment #7)
> (In reply to comment #6)
> Thanks, it is more complicated than I expected.
> 
> > (i) you have to re-create the VA decode context each time you increase the number of VA surfaces,
> In fact, (for the decoder) the vaContext is re-created only when the codec
> changes; it isn't re-created when a new surface is created.
> (It is bad to require the vaSurfaces to be prepared before creating the
> vaContext: the encoder can't follow that, and the VPP also uses surfaces from
> the decoder. The PVR driver was also changed to update the context with new
> surfaces.)

The existing API is designed this way: for decode sessions (VA context), you have to provide the full set of VA surfaces. Assuming that you could provide VA surfaces dynamically without changing the VA context is wrong.

So, situation (i) needs to be maintained, i.e. re-create VA context when the number of VA surfaces to be used for decoding grows.
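For reference, this is the shape of the setup (a minimal sketch; 'dpy' and 'config_id' are assumed to exist already, and error handling is omitted). Note that vaCreateContext() takes the full render_targets array up front:

#include <va/va.h>

/* Minimal sketch of a decode session; error handling omitted. */
static VAContextID
create_decode_context (VADisplay dpy, VAConfigID config_id)
{
  VASurfaceID surfaces[17];
  VAContextID context;

  /* All render targets are created and handed over up front ... */
  vaCreateSurfaces (dpy, VA_RT_FORMAT_YUV420, 1920, 1088,
                    surfaces, 17, NULL, 0);
  /* ... so using more surfaces later means creating a new context. */
  vaCreateContext (dpy, config_id, 1920, 1088, VA_PROGRESSIVE,
                   surfaces, 17, &context);
  return context;
}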

> > (ii) for some drivers [typically PVR] this causes the underlying context to
> > be reset as well, along with various other state (colocated buffers et al.).
> I talked with the driver developer; that is not the case, at least for now.

The PVR driver has per-codec dependent contexts AFAICS. That context is lost when you re-create the parent VA context, and thereby any data it contained, e.g. colocated buffers.

Besides, re-creating the VA context for each newly added VA surface would increase initial latency. An intermediate mechanism could be to increase in steps of 2 surfaces. Though, I still don't believe in that scenario.

> > BTW, on most drivers, the underlying backing store is lazily allocated, so
> > the gain from doing this at the upper layer is thin.
> We shouldn't assume the behavior of the driver. The PVR driver developer told
> me it is not lazy there.

Right.

> Even if it is lazy, we should change the recycle policy in the video pool: we
> should not push a recycled surface to the tail of free_objects.

Why?

> > Do you have any actual figures?
> It used to be a bug fix on the PR3 phone; I ported it to upstream gst-vaapi.
> It does provide a benefit there.

Was that with the fix for re-creating the VA context with the new set of VA surfaces?
Comment 9 Zhao, Halley 2013-10-11 02:19:38 UTC
(In reply to comment #8)
> (In reply to comment #7)
> > (In reply to comment #6)
> 
> The existing API is designed this way: for decode sessions (VA context), you
> have to provide the full set of VA surfaces. Assuming that you could provide VA
> surfaces dynamically without changing the VA context is wrong.
There is no explicit wording for that in va.h.
The GEN driver doesn't care about the relation between context and surfaces.
The PVR driver appends a dynamically created surface to the decoding context through an internal update.
Neither requires re-creating the context.

> The PVR driver has per-codec dependent contexts AFAICS. That context is lost
> when you re-create the parent VA context, and thereby any data it contained,
> e.g. colocated buffers.
> 
the parent context isn't re-created; the newly created surface is added to the surface chain.
> Besides, re-creating the VA context for each newly added VA surface would
> increase initial latency. An intermediate mechanism could be to increase in
> steps of 2 surfaces. Though, I still don't believe in that scenario.
> 
Agreed. Steps of 2 surfaces would be better.

> > Even if it is lazy, we should change the recycle policy in the video pool:
> > we should not push a recycled surface to the tail of free_objects.
> 
> Why?
The free_objects list is initialized with 'capacity' vaSurfaces; if we always get from the head and push to the tail, the full 'capacity' number of vaSurfaces will be used one by one (even though a smaller number would be enough).
My idea, assuming the lazy mode (see the sketch below):
1. there will be 3 queues to recycle the surfaces:
  a) all_objects: with size 'capacity'
  b) used_objects: the objects currently in use
  c) active_unused_objects: the objects not in use now, but which will be used soon.
2. the policy to manage the queues:
  a) b+c is the recycle pool, which has a size smaller than 'capacity'.
  b) if the length of c is less than 2 (or 1), request a new object from a (one that hasn't been used yet).
  c) c grows dynamically until b+c == a.
Considering the above assumptions and policy, I prefer the original patch.
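A rough sketch of that policy (hypothetical names and GLib containers; this is not a real patch):

#include <glib.h>

/* Rough sketch of the three-queue recycle policy above; hypothetical names. */
typedef struct {
  GQueue *all_objects;            /* never-used surfaces, 'capacity' of them */
  GQueue *used_objects;           /* surfaces currently in use */
  GQueue *active_unused_objects;  /* recycled surfaces, reused first */
} RecyclePool;

static gpointer
recycle_pool_get (RecyclePool *pool)
{
  gpointer obj;

  /* Only touch a never-used surface (committing its backing store under
   * a lazy driver) when the recycled reserve runs low. */
  if (g_queue_get_length (pool->active_unused_objects) < 2 &&
      !g_queue_is_empty (pool->all_objects))
    obj = g_queue_pop_head (pool->all_objects);
  else
    obj = g_queue_pop_head (pool->active_unused_objects);

  if (obj)
    g_queue_push_tail (pool->used_objects, obj);
  return obj;
}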

> Was that with the fix for re-creating the VA context with the new set of VA
> surfaces?
Update the VA context with the new set of VA surfaces, not re-create it.
Comment 10 Gwenole Beauchesne 2013-10-11 04:40:23 UTC
(In reply to comment #9)
> (In reply to comment #8)
> > (In reply to comment #7)
> > > (In reply to comment #6)
> > 
> > The existing API is designed this way: for decode sessions (VA context), you
> > have to provide the full set of VA surfaces. Assuming that you could provide VA
> > surfaces dynamically without changing the VA context is wrong.
> There is no explicit wording for that in va.h.
> The GEN driver doesn't care about the relation between context and surfaces.
> The PVR driver appends a dynamically created surface to the decoding context
> through an internal update.
> Neither requires re-creating the context.

So, on one side you are saying that we should not assume how the VA driver allocates the VA surfaces' backing store (lazily or not), and here you are saying that it is OK to assume that it is not necessary to re-create the VA context?

That cannot hold. The API mandates that you create the VA context for decoding with the VA surfaces that are intended to be used.

In the case where the VA driver lazily allocates the VA surfaces' backing store, this is fully safe and API compliant, whereas with your assumption, where the VA driver updates the VA context when it sees a new VA surface, a bug is introduced through a misuse of the API.

Besides, making the VA surface allocate its underlying memory dynamically is a superior approach: not only is it compliant with VA-API, it also benefits all VA decoders at once. This is what needs to be done.
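A purely illustrative sketch of that driver-side behavior (not any real driver's code): the surface object exists from creation on, but its memory is only committed on first real use.

#include <stdlib.h>

/* Purely illustrative; not any real driver's code. */
struct surface {
  unsigned int width, height;
  void *backing;   /* NULL until first real use */
};

static void *
surface_get_backing (struct surface *s)
{
  if (!s->backing)   /* commit the memory lazily, on first use */
    s->backing = malloc ((size_t) s->width * s->height * 3 / 2);   /* NV12 */
  return s->backing;
}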

The approach of using fewer VA surfaces than requested by the bitstream through max_num_ref_frames is also a bug per the H.264 spec. max_num_ref_frames determines the size of the sliding window for the decoding process. Since a decoded surface is not output before the DPB is full, except with special machinery defined in the standard, you need that number of VA surfaces anyway. Doing anything else is not H.264 conformant either.

As I mentioned before, the only thing you can reduce is the number of spare (scratch) VA surfaces. I believe 3 is possible (instead of 4).

> > > Even if it is lazy, we should change the recycle policy in the video pool:
> > > we should not push a recycled surface to the tail of free_objects.
> > 
> > Why?
> The free_objects list is initialized with 'capacity' vaSurfaces; if we always
> get from the head and push to the tail, the full 'capacity' number of
> vaSurfaces will be used one by one (even though a smaller number would be
> enough).

Yes.

> > Was that with the fix for re-creating the VA context with the new set of VA
> > surfaces?
> Update the VA context with the new set of VA surfaces, not re-create it.

This is not how you are supposed to use the API.
Comment 11 Gwenole Beauchesne 2013-10-11 04:45:15 UTC
In summary, the proposed patches:
- introduce an error: the VA decode context has to be re-created as new VA surfaces are added, otherwise you are not VA-API compliant;
- while doing so, you may hit VA driver bugs;
- you are not H.264 compliant either if you further alter the DPB and output process to not use the full max_num_ref_frames length for the sliding window.

This means that you cannot reduce the number of VA surfaces below max_num_ref_frames, so they will all be allocated anyway.
Comment 12 Zhao, Halley 2013-10-11 09:47:28 UTC
Agreed.

The root cause is that max_dec_frame_buffering is not set in the stream, so the inferred value is too big (13) even though num_ref_frames is small (2).
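(For reference: when max_dec_frame_buffering is absent, the H.264 spec infers it from the level as MaxDpbFrames = min(MaxDpbMbs / (PicWidthInMbs * FrameHeightInMbs), 16). Assuming this 1080p stream is level 5.0, that gives min(110400 / (120 * 68), 16) = 13, matching the inferred value above; the level is my assumption, chosen because it matches.)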
Comment 13 Víctor Manuel Jáquez Leal 2015-01-20 11:50:39 UTC
Let me chime in.

Regarding the first two patches, given the thread, it looks to me that they were rejected, but the third one still looks valid. So I propose to close this bug as obsolete and create a new bug with the last patch rebased. What do you think?
Comment 14 Víctor Manuel Jáquez Leal 2015-01-20 11:51:09 UTC
Sorry, setting back to unconfirmed.
Comment 15 Gwenole Beauchesne 2015-01-26 17:37:49 UTC
Review of attachment 256003:

It is probably worth a comment that this might be a performance regression since 18031dc. Though, this needs further testing with actual subpictures.
Comment 16 GStreamer system administrator 2018-11-03 15:43:50 UTC
-- GitLab Migration Automatic Message --

This bug has been migrated to freedesktop.org's GitLab instance and has been closed from further activity.

You can subscribe and participate further in the new bug through this link to our GitLab instance: https://gitlab.freedesktop.org/gstreamer/gstreamer-vaapi/issues/2.