GNOME Bugzilla – Bug 768079
waylandsink: add support wayland presentation time interface
Last modified: 2018-11-03 13:52:31 UTC
I'm bringing over the comments about this task and wrapping each writer's name in angle brackets; sorry for the poor readability. Waylandsink was previously handled by George Kiagiadakis, who had written presentation time interface code for a demo, but the interface has since changed and settled down as a stable protocol. I started from George's work (http://cgit.collabora.com/git/user/gkiagia/gst-plugins-bad.git/log/?h=demo), removing the presentation queue and taking the display-stack delay into account. That approach predicted the display-stack latency from the wl_surface commit/damage/attach to the frame being presented, and Pekka Paalanen (pq) advised that this would not accurately estimate the delay from wl_surface_commit() to display. (Part of his comments:)

<pq> wonchul, if you are trying to estimate the delay from wl_surface_commit() to display, and you don't sync the time you call commit() to the incoming events, that's going to be a lot less accurate.
<pq> 11:11:07> no, I literally meant replacing the queueing protocol calls with a queue implementation in the sink, so you don't use the queueing protocol anymore, but rely only on the feedback protocol to trigger attach+commits from the queue.
<pq> 11:12:27> the queue being a timestamp-ordered list of frames, just like in the weston implementation.

So estimating the delay from Wayland this way is not very accurate, and I turned to adding a queue that holds buffers before render() is done in waylandsink.

<Olivier Crête> I'm a bit concerned about adding a queue in the sink that would increase the latency unnecessarily. I wonder if this could be done while queueing around 1 buffer there in normal streaming. Are we talking about queuing the actual frames or just information about the frames?

<Wonchul Lee> I've queued references to the frames and tried to render based on the Wayland presentation clock. Adding a queue in the sink can bring some delay depending on the content; it's not clear to me yet which specific factor causes the delay, but yes, it would increase the latency at the moment. The idea was to disable clock synchronization in GstBaseSink and to render (Wayland commit/damage/attach) frames based on the calibrated Wayland clock. I pushed a reference to the GstBuffer into the queue and set an async clock callback to request rendering at the right time, then rendered or dropped the frame depending on the adjusted timestamp. This change has an issue: the adjusted timestamp at which rendering is requested comes later than expected, which in some cases can drop most of the frames since the adjusted timestamp was always late. So I'm looking at audiobasesink as a reference for adjusting clock synchronization of the frames against the Wayland clock.
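A minimal sketch of the async clock callback approach described above, assuming the pipeline clock has already been calibrated against the Wayland presentation clock; schedule_render, render_frame_cb and the queue handling are illustrative names, not the actual patch:

#include <gst/base/gstbasesink.h>

/* Hypothetical callback: take the head of the sink's frame queue and do the
 * wl_surface attach/damage/commit, or drop the buffer if the adjusted
 * timestamp is already too late. */
static gboolean
render_frame_cb (GstClock * clock, GstClockTime time, GstClockID id,
    gpointer user_data)
{
  GstBaseSink *sink = GST_BASE_SINK (user_data);

  /* ... pop from the queue and render or drop ... */
  (void) sink;
  return TRUE;
}

/* Schedule an asynchronous render request on the pipeline clock for a
 * buffer with the given (already adjusted) timestamp. */
static GstClockID
schedule_render (GstBaseSink * sink, GstClockTime buffer_pts)
{
  GstClock *clock = gst_element_get_clock (GST_ELEMENT (sink));
  GstClockTime base_time = gst_element_get_base_time (GST_ELEMENT (sink));
  GstClockID id = gst_clock_new_single_shot_id (clock, base_time + buffer_pts);

  gst_clock_id_wait_async (id, render_frame_cb, sink, NULL);
  gst_object_unref (clock);

  /* keep the id so it can be unscheduled (gst_clock_id_unschedule) and
   * unreffed when the sink unlocks or shuts down */
  return id;
}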
<Olivier Crête> This work has two separate goals:

When the video has a different framerate than the display framerate, it should drop frames more or less evenly, so if you need to display 4 out of 5 frames, it should be something like 1,2,3,4,6,7,8,9,11,... or if you need to display 30 out of 60 frames it should display 1,3,5,7,9, etc. Currently, GstBaseSink is not very clever about that. And we have to be careful, as this can also be caused by the compositor not being able to keep up. Just because the display can do 60 fps doesn't mean the compositor is actually able to produce 60 new frames; it could be limited to a lower number, so we'll also have to make sure we're protected against that.

We want to guess the latency added by the display stack. The current GStreamer video sinks more or less assume that a buffer is rendered immediately when the render() vmethod returns, but this is not really how current display hardware works, especially when you have double or triple buffering. So we want to know how far in advance to submit the buffer, but not so early that it is displayed one interval too soon.

I just asked @nicolas a quick question about how he thought we should do this, then we spent two hours whiteboarding ideas about it, and we've barely been able to define the problem. Here are some ideas we bounced around: After submitting one frame (the first frame? the preroll frame?), we can get an idea of the upper bound of the latency for the live pipeline case. It should be the time between the moment a frame was submitted and when it was finally rendered, plus the "refresh". We can probably delay sending async-done until the presented event of the first frame has arrived. For the non-live case, we can probably find a way to submit the frame as early as possible before the next one. Finding that time is the tricky part, I think.

@wonchul: could you summarize the different things you tried, what the hypotheses were and what the results were? It's important to keep these kinds of records for the Tax R&D filings (and so we can keep up with your work).

@pq or @daniels: what is the logic behind the seq field, and how do you expect it can be used? Do you know any example where it is used? I'm also not sure how we can detect the case where the compositor cannot keep up, or where the compositor is gnome-shell and has a GC that makes it miss a couple of frames for no good reason. From the info in the presented event (or any other way), is there a way we can evaluate the latest point at which we can submit a buffer and have it arrive in time for a specific refresh? Or do we have to try and then do some kind of search to find what those deadlines are in practice?

<Pekka Paalanen> seq field of the wp_presentation_feedback.presented event: no examples of use, I don't think. I didn't originally consider it as needed, but it was added to allow implementing GLX_OML_sync_control on top of it. I do not think we should generally depend on seq unless you specifically care about the refresh count instead of timings. My intention with the design was that new code can work better with timestamps, while old code you don't want to port to timestamps could use seq as it always has. Timestamps are "accurate", while seq may have been estimated from a clock in the kernel and may change its rate or may not have a constant rate at all. seq comes from a time when display refresh was a known, guaranteed constant frequency, and you could use it as a clock by simply counting cycles. I believe all timing-sensitive X11 apps have been written with this assumption. But it is no longer exactly true, it has caveats (hard to maintain across video mode switches or display suspends, lacking hardware support, etc.), and with new display tech it will become even less true (variable refresh rate, self-refresh panels, ...). seq is not guaranteed to be provided; it may be zero depending on the graphics stack used by the compositor. I'm also not sure what it means if you don't have both VSYNC and HW_COMPLETION in flags. The timestamp, on the other hand, is always provided, but it may have some caveats, which should be indicated by unset bits in flags.
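For reference, the fields pq describes arrive in the presented event of the feedback object. A minimal listener sketch, assuming the wayland-scanner generated header for the stable presentation-time protocol (the presentation and surface handles are assumed to come from the registry; how the values are fed into the sink is left out):

#include <stdint.h>
#include <stdio.h>
#include "presentation-time-client-protocol.h"

static void
feedback_sync_output (void *data, struct wp_presentation_feedback *fb,
    struct wl_output *output)
{
  /* the output the frame was presented on; unused here */
}

static void
feedback_presented (void *data, struct wp_presentation_feedback *fb,
    uint32_t tv_sec_hi, uint32_t tv_sec_lo, uint32_t tv_nsec,
    uint32_t refresh, uint32_t seq_hi, uint32_t seq_lo, uint32_t flags)
{
  uint64_t sec = ((uint64_t) tv_sec_hi << 32) | tv_sec_lo;
  uint64_t present_ns = sec * 1000000000ULL + tv_nsec;
  /* candidate for the next presentation: last vblank + refresh duration */
  uint64_t next_target_ns = present_ns + refresh;

  if (!(flags & WP_PRESENTATION_FEEDBACK_KIND_VSYNC))
    fprintf (stderr, "timestamp not tied to vblank, treat with care\n");

  /* feed present_ns / next_target_ns into the sink's scheduling here */
  (void) next_target_ns;
  wp_presentation_feedback_destroy (fb);
}

static void
feedback_discarded (void *data, struct wp_presentation_feedback *fb)
{
  /* the frame never made it to the screen */
  wp_presentation_feedback_destroy (fb);
}

static const struct wp_presentation_feedback_listener feedback_listener = {
  feedback_sync_output,
  feedback_presented,
  feedback_discarded,
};

/* request feedback for the frame that is about to be committed */
static void
request_feedback (struct wp_presentation *presentation,
    struct wl_surface *surface, void *user_data)
{
  struct wp_presentation_feedback *fb =
      wp_presentation_feedback (presentation, surface);
  wp_presentation_feedback_add_listener (fb, &feedback_listener, user_data);
  /* then wl_surface_attach / wl_surface_damage / wl_surface_commit as usual */
}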
Compositor not keeping up: maybe you could use the tv + refresh from the presented event to guess when the compositor should be presenting your frame, and compare afterwards with what actually happened? I can't really think of a good way to know whether the compositor cannot keep up or why it cannot keep up. Hiccups can happen and the compositor probably won't know why either. All I can say is: collect statistics and analyze them over time. This might be a topic for further investigation, but to get more information about which steps take too much time we need some kernel support (explicit fencing) that is being developed, and the compositor has to use that information. Only hand-waving, sorry.

Finding the deadline: I don't think there is a way to really know, and the compositor might also be adjusting its own schedule, so it might be variable. The way I imagined it is that from the presented event you compute the time of the next possible presentation, and if you want to hit that, you submit a frame ASAP. This should get you just below one display-frame-cycle of latency in any case, if your rendering is already complete. If we really need the deadline, that would call for extending the protocol, so that the compositor could tell you when the deadline is. The compositor chooses the deadline based on how fast it thinks it can do a composition and hit the right vblank.

<Wonchul Lee> About the latency, I tried to measure the latency added by the display stack, from the wl commit/damage/attach to the frame being presented. It's a variable delay depending on the situation, as pq mentioned before, and it could disturb targeting the next presentation. We could assume an optimal latency by accumulating it and observing the gap via the presentation feedback, but that may not always be reliable. I tried to synchronize the GStreamer clock time with the presentation feedback to render frames on time, and added a queue in GstWaylandSink to request a render on each presentation feedback if there's a frame due, similar to what George did. It doesn't fit well with GstBaseSink though; GstWaylandSink needs to disable BaseSink's time synchronization and do the computation itself. I also ran into unexpected underflow (a consistently increasing delay) when playing an mpegts stream, so it needs proper QoS handling to prevent underflow. It would be good to get a reliable latency figure from the display stack to use when synchronizing the presentation time, whether GstWaylandSink computes it itself or not; there's a latency that we're missing anyway, though I'm not sure it's feasible.

<Pekka Paalanen> @wonchul btw. what do you mean when you say "synchronize GStreamer clock time with presentation feedback"? Does it mean something other than looking at what clock is advertised by wp_presentation.clock_id and then synchronizing the GStreamer clock with clock_gettime() using the given clock id? Or does synchronizing mean something other than being able to convert a timestamp from one clock domain to the other?
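For illustration, the kind of conversion pq asks about could look roughly like the following sketch, assuming the compositor advertises CLOCK_MONOTONIC and the pipeline uses the default monotonic GstSystemClock; the function name and the idea of sampling both clock domains back to back are illustrative, not from the actual code:

#include <gst/gst.h>
#include <time.h>

/* Convert an absolute presentation timestamp (nanoseconds on the clock
 * advertised by wp_presentation.clock_id) into pipeline running time. */
static GstClockTime
presented_to_running_time (GstElement * sink, guint64 present_ns)
{
  GstClock *clock = gst_element_get_clock (sink);
  GstClockTime base_time = gst_element_get_base_time (sink);
  struct timespec ts;
  GstClockTime gst_now, mono_now;
  GstClockTimeDiff ahead;

  /* sample both clock domains as close together as possible */
  clock_gettime (CLOCK_MONOTONIC, &ts);
  gst_now = gst_clock_get_time (clock);
  mono_now = GST_TIMESPEC_TO_TIME (ts);

  /* how far in the future (or past) the presentation timestamp is */
  ahead = GST_CLOCK_DIFF (mono_now, present_ns);

  gst_object_unref (clock);
  return gst_now - base_time + ahead;
}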
<Nicolas Dufresne> @pq I would need some clarification about submitting a frame ASAP. If we blindly do that, frames will get displayed too soon on screen (in playback, decoders are much faster than the expected render speed). In GStreamer, we have infrastructure to wait until the moment is right. The logic (simplified) is to wait for the right moment minus the "currently expected" render latency, and then submit. This is the playback case of course, and is meant to ensure the best possible A/V sync. In that case we expect the presentation information to be helpful in constantly correcting that moment.

What we're missing is some semantics: just blindly obeying the render delay computed from the last frames does not seem like the best idea. We expected to be able to calculate, or estimate, a submission window that will (most of the time) hit the screen at an estimated time. For the live case, we're still quite screwed; nothing seems to improve our situation. We need to pick a latency at start, and if we later find that this latency was too small (the latency is the window in which we are able to adapt), we end up screwing up the audio (a glitch) in order to increase that latency window. So again, some semantics we could use to calculate a pessimistic latency from the first presentation report would be nice.

<Olivier Crête> I think that in the live case you can probably keep a 1-frame queue at the sink, so when a new frame arrives, you can decide whether you want to present the queued one at the next refresh or replace it with the new one. Then the thread that talks to the compositor (and gets the events, etc.) can pick buffers from that "queue" to send to the compositor.

<Nicolas Dufresne> Ok, that makes sense for non-live. It would be nice to document the intended use; it was far from obvious. We kept thinking we needed to look at the numbers, but at first we didn't understand that the moment we get called back is important. You seem to assume that we can "pick" a frame, as if the sink were pulling whatever it wants randomly; that's unfortunately not how things work. We can though introduce a small queue (some late queue) so we only start blocking upstream when that queue is full, and it would help making decisions. For live it's much more complex. The entire story about declared latency exists because if we don't declare any latency, that queue will always be empty; worst case, the report will always tell us that we have displayed the frame late. I'm quite sure you told me that the render pipeline can have multiple steps, where submitting frames 1, 2, 3 one vblank apart will render on vblanks 3, 4, 5, with effectively a 3-vblank latency. That latency is what we need to report for proper A/V sync in a live pipeline, and changing it has to be done with care, as it breaks the audio. That's where we need some ideas, because right now we have no clue.
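A minimal sketch of the 1-frame "late queue" idea above: upstream only ever replaces the pending buffer, and the thread driven by the presentation feedback takes whatever is pending when it targets the next refresh. The type and function names are illustrative, not taken from the sink:

#include <gst/gst.h>

typedef struct {
  GMutex lock;
  GstBuffer *pending;   /* at most one frame waiting for the compositor */
} OneFrameQueue;

/* called from the streaming thread: keep only the newest frame */
static void
one_frame_queue_push (OneFrameQueue * q, GstBuffer * buf)
{
  g_mutex_lock (&q->lock);
  gst_buffer_replace (&q->pending, buf);   /* refs buf, drops the older frame if any */
  g_mutex_unlock (&q->lock);
}

/* called from the wl_display event thread on presented feedback */
static GstBuffer *
one_frame_queue_take (OneFrameQueue * q)
{
  GstBuffer *buf;

  g_mutex_lock (&q->lock);
  buf = q->pending;
  q->pending = NULL;
  g_mutex_unlock (&q->lock);
  return buf;   /* attach/damage/commit this one, or NULL if nothing new */
}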
*** Bug 768080 has been marked as a duplicate of this bug. ***
I'm gathering code for that in a branch (careful, this branch is rebased often): https://git.collabora.com/cgit/user/nicolas/gst-plugins-bad.git/log/?h=wayland-presentation

Arun and I have been tracing the behaviour without any custom synchronization. So far we got interesting results. Here's a very meaningful graph: http://imgur.com/a/7VijX

In this graph, we try to play a 30 fps video on an output at an imprecise 60 Hz. We can observe that the refresh rate is in fact slower, hence the video runs later and later until the submission hits the previous vblank. So most frames are displayed after 2 vblanks, and once in a while one renders after only 1 vblank. This zig-zag result is exactly what we expect and the behaviour is acceptable. What is not acceptable are the jumps you may notice close to the edges, or whenever the scheduler decides to kick in at the wrong moment; these show up in the graph as vertical lines that look like glitches.

For that reason, we came to the same conclusion George had previously experimented with in his demo branch: https://git.collabora.com/cgit/user/gkiagia/gst-plugins-bad.git/commit/?h=demo

To ensure smoothness, we need to control the submission time so that we actually choose a vblank instead of leaving it to luck. The simplest mechanism is to make sure the presented callback keeps triggering and to only draw from that callback; the callback represents the earliest point in time at which the client will hit the next vblank. This is not the most efficient approach though, so a step up would be to predict the appropriate presented callback for the current frame based on the last presented callback and the refresh (or a refresh calculated from the sequence number if you are precision hungry).

From there, we'll still be displaying the frame some way off from the presentation time. That's because there is a certain latency possibly introduced by the compositor and/or the driver/hardware. We don't know this latency in advance, so we have no choice but to figure it out at run time. I'll be experimenting with a few estimation methods, ideally one that gives a value out of the preroll phase and can then suggest that the app update the latency later when a more accurate value is available. I'll also try to avoid implementing our own sync for now, but try to be smarter than just replacing the GstBuffer pointer like we do. (If someone knows whether there is a standard approach for this, please share.)

As you can see in the graph, it produces a zig-zag when the video rate differs from the display rate. We can adjust the latency to center this zig-zag around 0, or we can shift it down so the smallest delay tends to zero. I'm not sure what is best; my intuition tells me that centering is more accurate even if it means that at some moments frames are displayed too soon and at others too late.
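A sketch of that prediction step, assuming we keep the last presented timestamp and refresh duration from the feedback; the struct and function names are illustrative only:

#include <glib.h>

typedef struct {
  guint64 last_present_ns;   /* from wp_presentation_feedback.presented */
  guint64 refresh_ns;        /* from the same event (0 if unknown) */
} PresentState;

/* earliest predicted vblank not before target_ns, based on the last feedback */
static guint64
predict_target_vblank (const PresentState * s, guint64 target_ns)
{
  guint64 cycles;

  if (s->refresh_ns == 0)
    return target_ns;                                /* no refresh info, cannot quantize */
  if (target_ns <= s->last_present_ns)
    return s->last_present_ns + s->refresh_ns;       /* already late: aim for the next vblank */

  /* round up to a whole number of refresh cycles after the last vblank */
  cycles = (target_ns - s->last_present_ns + s->refresh_ns - 1) / s->refresh_ns;
  return s->last_present_ns + cycles * s->refresh_ns;
}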
I am looking a bit into this. First of all, I have rebased again on top of current master (it conflicted with the dmabuf stuff - fixed): https://git.collabora.com/cgit/user/gkiagia/gst-plugins-bad.git/log/?h=wayland-presentation

Some comments:

* I don't particularly like doing the streaming-thread throttling in prepare() (with that g_cond_wait()). It doesn't make it clear where the throttling happens; I was confused for quite some time. It will probably also make it hard to implement unlock() and waiting for preroll.

* Numbers look confusing. It looks like the render time of the next buffer is always smaller than the last presentation time by about 3 seconds in my tests:

  base_time            285:57:44.954398746
  prepared_buffer_pts    0:00:00.149999999
  presentation_time    285:57:45.033256712
  refresh_duration       0:00:00.016666666
  render_time          285:57:41.282301844

  It looks to me as if all buffers are actually very late, but synchronization happens nevertheless because of the g_cond_wait() in prepare(). Maybe I am missing something, I'll continue looking...

* There is a g_cond_signal() in show_frame() which doesn't make any sense, as show_frame() runs on the same thread as prepare().
(In reply to George Kiagiadakis from comment #3)
> * Numbers look confusing. It looks like the render time of the next buffer
> is always smaller than the last presentation time by about 3 seconds in my tests:
>
> base_time 285:57:44.954398746 prepared_buffer_pts 0:00:00.149999999
> presentation_time 285:57:45.033256712 refresh_duration 0:00:00.016666666
> render_time 285:57:41.282301844

Mismatch in which clock was used perhaps?
The traces are not self-explanatory, hence indeed need work. Remember this is early work. Using prepare() is very important, since the render() call is synchronised, which would prevent you from doing any kind of adaptation. I'll probably be able to dive into this work again next week, and will probably provide initial support for preroll and unlock. If I remember correctly, the sync happens on the clock thread (which is also running the WL queue, for race-free handling). prepare() just waits for a slot to be free in the two-buffer queue; you can unlock that easily.
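For context, the general shape of that prepare()/unlock() interaction might look roughly like the following sketch; the type, field and function names are illustrative and not taken from the branch:

#include <gst/base/gstbasesink.h>

typedef struct {
  GstBaseSink parent;
  GMutex lock;
  GCond cond;
  guint queued;        /* frames currently waiting for the compositor */
  guint max_queued;    /* e.g. 2 */
  gboolean flushing;
} SketchSink;

static GstFlowReturn
sketch_sink_prepare (GstBaseSink * bsink, GstBuffer * buffer)
{
  SketchSink *self = (SketchSink *) bsink;
  GstFlowReturn ret = GST_FLOW_OK;

  g_mutex_lock (&self->lock);
  while (self->queued >= self->max_queued && !self->flushing)
    g_cond_wait (&self->cond, &self->lock);    /* throttle the streaming thread */
  if (self->flushing)
    ret = GST_FLOW_FLUSHING;
  else
    self->queued++;                            /* slot taken, render() will queue the buffer */
  g_mutex_unlock (&self->lock);
  return ret;
}

static gboolean
sketch_sink_unlock (GstBaseSink * bsink)
{
  SketchSink *self = (SketchSink *) bsink;

  g_mutex_lock (&self->lock);
  self->flushing = TRUE;
  g_cond_broadcast (&self->cond);              /* wake up prepare() */
  g_mutex_unlock (&self->lock);
  return TRUE;
}

A real implementation would also clear the flushing flag in unlock_stop() and drop queued slots on flush, but this is the basic shape of the throttling being discussed.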
Hi Nicolas,

What is the progress of this enhancement? We need this feature to fix a waylandsink hang issue caused by the patch in the ticket: https://bugzilla.gnome.org/show_bug.cgi?id=794793

Best regards,
Jared
I'm not actively working on this at the moment.
-- GitLab Migration Automatic Message -- This bug has been migrated to freedesktop.org's GitLab instance and has been closed from further activity. You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.freedesktop.org/gstreamer/gst-plugins-bad/issues/402.