GNOME Bugzilla – Bug 761312
memory leak
Last modified: 2016-02-09 01:48:14 UTC
you can pretty quickly grow the memory usage of gnome-terminal by running: while true; do echo ; done and then giving the window focus over and over again. It seems to be leaking shared memory buffers.
i'm going to attach two patches. neither is quite ready for prime time. The first was my first attempt at fixing the leak. it's a minimal change the makes the leak go away, but ->busy stuff in the code is sort of half baked so this isn't really a full solution. The second attempts to drop the ->busy flag and instead keep track of two cairo surfaces at the same time, one in flight and one where changes get staged. It's not ready for prime time, because during testing i've noticed the buffer release callback never gets called. Still investigating.
Created attachment 320042 [details] [review] wayland: Don't call set_busy twice on the same surface If the compositor is using a shared memory buffer allocated by a client, then it's the client's responsibility to refrain from destroying the buffer until the compositor releases it. This is accomplished by taking a reference to the cairo surface assocatiated with the buffer after a frame, and dropping the reference when the compositor releases the buffer. In some cases though, the compositor doesn't release the buffer until a new buffer is set, so if we have staged drawing before the frame completes we can end up taking multiple references to the buffer and keeping it alive after it's released. This commit solves the problem by ensuring we only call _gdk_wayland_shm_surface_set_busy if isn't already busy.
Created attachment 320043 [details] [review] wip! wayland: rework buffer handling to fix leak
I'll look into this more on monday
(In reply to Ray Strode [halfline] from comment #1) > The second attempts to drop the ->busy flag and instead keep track of two > cairo surfaces at the same time, one in flight and one where changes get > staged. It's not ready for prime time, because during testing i've noticed > the buffer release callback never gets called. Still investigating. Note that the buffer release callback doesn't get flushed by the server; you have to do something else which will generate an event to the client. So if you're blocking with wl_display_dispatch_queue, you'll want to use wl_display_roundtrip_queue instead, as the wl_callback::done event will cause a flush.
So I looked into this more today and it's actually more dire than a missing flush: mutter doesn't send the release when it's done with the commit! It only sends the release when the buffer is destroyed. I'll attach the mutter patch that I think should address that problem and the latest iteration of the gdk patch. It's still pretty lightly tested, but seems to work okay. I'll pound on it harder tomorrow.
Created attachment 320152 [details] [review] wayland: release buffer after processing commit When a client is ready for the compositor to read a surface's shared memory buffer, it tells the compositor via wl_surface_commit. From that point forward, the baton is given to the compositor: it knows it can read the buffer without worring about the client making changes out from under it. After the compositor has uploaded the pixel contents to the video card it is supposed to release the buffer back to the client so that the client can reuse it for future use. At the moment, mutter only releases the buffer when a new buffer is attached. This is problematic, since it means the client has to have a second buffer prepared before the compositor gives the first one back. Preparing the second buffer potentially involves copying megabytes of pixel data, so that's suboptimal, and there's no reason mutter couldn't release the buffer earlier. This commit changes mutter to release a surface's buffer as soon as it's done processing the commit request.
Created attachment 320153 [details] [review] wayland: stage uncommited changes to dedicated buffer Right now we use one buffer for both staged changes (freshly painted changes waiting for the frame clock to send to the compositor) and committed changes (changes actively being read by the compositor process). This creates a problem in the event we need to stage updates at the same time the compositor is processing commited updates: we can't change what the compositor is actively processing. The current solution for handling this contention is to allocate a temporary buffer on the spot at the time the updates are staged, and to copy that buffer back to the shared buffer later. The problem, though, is that the copy to the shared buffer currently happens as soon as the updates are finished being staged, not when the shared buffer is done being processed by the compositor. In order to address that problem, this commit changes the code to always stage changes to a dedicated staging buffer. The staging buffer is used exclusively by the client until the client is done with it, and then once that staging buffer is committed, the client never writes to that buffer again. If the client needs to stage new updates, it allocates a brand new staging buffer initialized with a copy of the contents from the committed buffer. As an optimization, the compositor has the option of releasing the commited buffer back to the client. If it does so before the client needs to stage new updates, then the client will reuse the buffer for staging future updates. This optimization prevents having to allocate a new staging buffer and the associated cost of initializing that new buffer with a readback of the committed buffer.
Review of attachment 320153 [details] [review]: ::: gdk/wayland/gdkwindow-wayland.c @@ +685,3 @@ + cairo_set_operator (cr, CAIRO_OPERATOR_SOURCE); + cairo_paint (cr); + cairo_destroy (cr); So this is costly. It won't happen much in practice now that mutter releases the buffer before the frame callback, but we could optimize this by avoiding the copy up front, only doing incremental changes into the staging buffer, and then compositing the read back buffer on top of the incremental changes using the DEST_OUT operator. Another idea would be to use memcpy instead of cairo to do the copy. We know the buffers are linear with width == rowstride, so we can get by with one memcpy call. It could be that cairo/pixman already fall back to a memcpy in this situation though.
cairo should be able to detect it's a memcpy, I'm quite sure. The much preferred way would be to do true double buffering -- keep two buffers around, and then use a buffer_age-like system so that we copy the previous swap's damage before we draw ourselves.
Review of attachment 320152 [details] [review]: Yikes, I had a lot of trouble understanding buffer management back then. Sorry for the late release!
Review of attachment 320042 [details] [review]: No.
Review of attachment 320153 [details] [review]: ::: gdk/wayland/gdkwindow-wayland.c @@ +451,3 @@ + if (impl->commited_surface == NULL) + { + /* Means the buffer wasn't distroyed when we nullified the surface "destroyed" @@ +694,2 @@ /* Unlike other backends the Cairo surface is not just a cheap wrapper * around some other backing. It is the buffer itself. This comment definitely needs rewording now that it's a bit more complex and we don't have "the Cairo surface".
Comment on attachment 320152 [details] [review] wayland: release buffer after processing commit Attachment 320152 [details] pushed as 0165cb6 - wayland: release buffer after processing commit
Created attachment 320235 [details] [review] wayland: stage uncommited changes to dedicated buffer Right now we use one buffer for both staged changes (freshly painted changes waiting for the frame clock to send to the compositor) and committed changes (changes actively being read by the compositor process). This creates a problem in the event we need to stage updates at the same time the compositor is processing commited updates: we can't change what the compositor is actively processing. The current solution for handling this contention is to allocate a temporary buffer on the spot at the time the updates are staged, and to copy that buffer back to the shared buffer later. The problem, though, is that the copy to the shared buffer currently happens as soon as the updates are finished being staged, not when the shared buffer is done being processed by the compositor. In order to address that problem, this commit changes the code to always stage changes to a dedicated staging buffer. The staging buffer is used exclusively by the client until the client is done with it, and then once that staging buffer is committed, the client never writes to that buffer again. If the client needs to stage new updates, it allocates a brand new staging buffer, draws to it, and back fills the undrawn parts of the buffer from a copy of the contents of the committed buffer. As an optimization, the compositor has the option of releasing the commited buffer back to the client. If it does so before the client needs to stage new updates, then the client will reuse the buffer for staging future updates. This optimization prevents having to allocate a new staging buffer and the associated cost of back filling that new buffer with a readback of the committed buffer.
so this iteration addresses the two comment text problems Jasper found, and also does read back after drawing to prevent overdraw. It's the same idea in spirit as the DEST_OUT change I proposed in comment 9, but uses the damage region instead (which was a suggestion from Jasper on irc).
Review of attachment 320152 [details] [review]: ::: src/wayland/meta-wayland-surface.c @@ +610,3 @@ + if (pending->buffer) + meta_wayland_buffer_release_control (pending->buffer); Is this really correct? If the buffer is an EGL buffer, then doesn't our surface actor texture directly map to the buffer allocated by the client? I.e. we cannot release the buffer until it is replaced by a future commit + next frame, if the buffer is a EGL buffer. For SHM buffers than this looks good.
Review of attachment 320152 [details] [review]: Do EGL buffers have any release semantics?
Yes. mesa listens for release events and keeps track of what buffers are in use. IIRC it'll normally keep 2 or 3 buffers, depending on whether its a synchronized subsurface or not, in which it needs 3 (active, cached, pending). Assuming we are not copying the EGL buffer content, in mutter we need to wait with sending wl_buffer.release until the buffer is replaced by some other on at least the next commit. I'm not sure if we'll need to hold the buffer even longer when we have buffer->plane optimizations.
Is it even possible to know when the GPU is done with an imported buffer in OpenGL, except until after swap?
I'd assume I can't touch a texture until after swapping. In any way, if a commit replaces a buffer before the swap, the previous buffer should be released immediately anyway, and the new buffer should be the one that is drawn the coming frame. And I also assume you can't release a buffer that is drawn using a hw-plane until after a flip. If the flow is commit -> draw -> flip -> commit, I assume we can release such buffers on the second commit, but if its commit -> draw -> commit -> flip, we'd need to wait with releasing until after the flip. But I guess that won't happen since we always flip directy after drawing (?).
Review of attachment 320235 [details] [review]: ::: gdk/wayland/gdkwindow-wayland.c @@ +124,3 @@ + cairo_surface_t *staging_surface; + cairo_surface_t *commited_surface; s/commited/committed/. The naming is a bit unfortunate. We now have "surface" which is a wl_surface, and staging_surface, and commited_surface which are cairo_surface_t. Maybe name these *_crsurface or something to make it more clear? @@ +404,3 @@ + cairo_surface_destroy (impl->commited_surface); + impl->commited_surface = NULL; + } Shouldn't we do this on begin_paint instead? We might receive a .release after .frame but before we enter the paint loop. @@ +483,3 @@ + g_warning ("Received buffer release notification for untracked buffer"); + return; + } Is this true branch block really an error? If we 1. commit a buffer A 2. receive .frame 3. commit a buffer B 4. receive .frame 5. receive A.release At 5. impl->commited will be NULL, and it wouldn't be an error. The same would happen if we create and read back on .begin_paint. 1. draw frame on A 2. commit a buffer A 3. draw frame on B 4. commit a buffer B 5. receive A.release At 5. impl->commited would also be NULL, and it wouldn't be an error. I think what we need to do is something like: cairo_surface_t *cairo_surface = wl_buffer_get_user_data (wl_buffer); /* Destroy any old released cairo surface we no longer will re-use. */ if (impl->commited_surface != cairo_surface) { cairo_surface_destroy (cairo_surface); return; } if (impl->staging_surface == NULL) impl->staging_surface = cairo_surface; else cairo_surface_destroy (cairo_surface); impl->commited_surface = NULL; @@ +759,3 @@ + * that we didn't draw over + */ + read_back_commited_surface (window); Is this needed? If we ensure we start with a filled surface on begin_paint, can't we assume it'll stay filled here as well? @@ +1951,1 @@ } Should we also not do impl->commited_surface = NULL? When, in the future, the wl_buffer.release is received, it can destroy the cairo surface if it knows its not to be reused.
(In reply to Jonas Ådahl from comment #22) > Review of attachment 320235 [details] [review] [review]: > > ::: gdk/wayland/gdkwindow-wayland.c > @@ +404,3 @@ > + cairo_surface_destroy (impl->commited_surface); > + impl->commited_surface = NULL; > + } > > Shouldn't we do this on begin_paint instead? We might receive a .release > after .frame but before we enter the paint loop. > > @@ +483,3 @@ > + g_warning ("Received buffer release notification for untracked > buffer"); > + return; > + } > > Is this true branch block really an error? If we > > 1. commit a buffer A > 2. receive .frame > 3. commit a buffer B > 4. receive .frame > 5. receive A.release > > At 5. impl->commited will be NULL, and it wouldn't be an error. > > The same would happen if we create and read back on .begin_paint. > > 1. draw frame on A > 2. commit a buffer A > 3. draw frame on B > 4. commit a buffer B > 5. receive A.release > > At 5. impl->commited would also be NULL, and it wouldn't be an error. > Actually, it wouldn't since committed would only be NULL during paint, and we wouldn't handle events during painting. We still need to handle the window tear down situation though, i.e. a .release after the window was destroyed. > I think what we need to do is something like: > > cairo_surface_t *cairo_surface = wl_buffer_get_user_data (wl_buffer); > > /* Destroy any old released cairo surface we no longer will re-use. */ > if (impl->commited_surface != cairo_surface) > { > cairo_surface_destroy (cairo_surface); > return; > } > > if (impl->staging_surface == NULL) > impl->staging_surface = cairo_surface; > else > cairo_surface_destroy (cairo_surface); > > impl->commited_surface = NULL; > And I just realized you replaced the buffer listener, so this won't work either. So I suppose what needs to be done is to add the GdkWindow the cairo surface is attached to (assuming we don't attach the same cairo surface to multiple GdkWindow's), and set the wl_buffer user data back to the associated cairo_surface_t. That seems to me the most reasonable way to handle the above described situation.
(In reply to Jasper St. Pierre from comment #20) > Is it even possible to know when the GPU is done with an imported buffer in > OpenGL, except until after swap? It's not even finished after swap; that's just when it gets scheduled. glFinish or EGL Sync objects will tell you, but in practical terms, Weston demands that the kernel synchronise rendering in FIFO receive order, and that SwapBuffers flush rendering, so it's safe to release it after having called SwapBuffers on the last frame you'll use it for. Having surfaces directly scanned out of course complicates things and demands you wait for flip-complete events from KMS, but luckily you dodged that bullet, so.
Hi, > we cannot release the buffer until it is replaced by a future commit + > next frame, if the buffer is a EGL buffer. For SHM buffers than this looks > good. Yea, that totally makes sense, will do an update to correct that.
Review of attachment 320235 [details] [review]: ::: gdk/wayland/gdkdisplay-wayland.c @@ -1020,3 @@ width*scale, height*scale, stride, WL_SHM_FORMAT_ARGB8888); - wl_buffer_add_listener (data->buffer, &buffer_listener, surface); Won't this cause buffers of wl_pointer.set_cursor surfaces to leak?
(In reply to Jonas Ådahl from comment #22) > Review of attachment 320235 [details] [review] [review]: > > ::: gdk/wayland/gdkwindow-wayland.c > @@ +124,3 @@ > > + cairo_surface_t *staging_surface; > + cairo_surface_t *commited_surface; > > s/commited/committed/. > > The naming is a bit unfortunate. We now have "surface" which is a > wl_surface, and staging_surface, and commited_surface which are > cairo_surface_t. Maybe name these *_crsurface or something to make it more > clear? I agree it's less than optimal but crsurface is hideous looking :-) How about we rename surface to wl_surface, rename subsurface to wl_subsurface? And, it's wordy but, we could call the cairo surface, cairo_staging_surface, and cairo_committed_surface. Alternatively, we stuff all the proxy objects in a server_objects substructure or something. > @@ +404,3 @@ > + cairo_surface_destroy (impl->commited_surface); > + impl->commited_surface = NULL; > + } > > Shouldn't we do this on begin_paint instead? We might receive a .release > after .frame but before we enter the paint loop. There may be more than one begin_paint before the .frame callback. We could do it in the next begin_paint after a frame callback. That would give the compositor a little bit longer to release the buffer before we resort to a copy, but doing it in every begin_paint would force a readback of the entire buffer in the case the begin_paint happened before the frame callback. Also, note we destroy the wl_buffer when we destroy the cairo surface, so if we did it in the wrong begin_paint then we could end up destroying the buffer before the frame callback. > Is this true branch block really an error? If we > > 1. commit a buffer A > 2. receive .frame > 3. commit a buffer B > 4. receive .frame > 5. receive A.release > At 5. impl->commited will be NULL, and it wouldn't be an error. well, see above, at step 2 we destroy buffer A so step 5 won't ever happen. > The same would happen if we create and read back on .begin_paint. > > 1. draw frame on A > 2. commit a buffer A > 3. draw frame on B > 4. commit a buffer B > 5. receive A.release > At 5. impl->commited would also be NULL, and it wouldn't be an error. Again, 3 might happen before we receive the .frame callback and we'd destroy the buffer before the .frame callback. Of course, this code all makes the assumption that it's okay to destroy the buffer after the frame callback, which as pointed out in the side discussion in this bug, isn't necessarily true. > I think what we need to do is something like: > > cairo_surface_t *cairo_surface = wl_buffer_get_user_data (wl_buffer); > > /* Destroy any old released cairo surface we no longer will re-use. */ > if (impl->commited_surface != cairo_surface) > { > cairo_surface_destroy (cairo_surface); > return; > } > > if (impl->staging_surface == NULL) > impl->staging_surface = cairo_surface; > else > cairo_surface_destroy (cairo_surface); > > impl->commited_surface = NULL; you mean from the buffer_release_callback. That sounds fine. It does mean that if the buffer release never comes (because of a compositor bug) then we'll continue to accumulate buffers, but I guess that's unavoidable. We can't really second guess the compositor; if it's broken the user is hosed anyway. I'll update the patch. > @@ +759,3 @@ > + * that we didn't draw over > + */ > + read_back_commited_surface (window); > > Is this needed? If we ensure we start with a filled surface on begin_paint, > can't we assume it'll stay filled here as well? We can't assume we're starting with a filled surface. begin_paint might just be painting a small part of the window, and the surface may have already been committed. (In reply to Jonas Ådahl from comment #23) > > At 5. impl->commited would also be NULL, and it wouldn't be an error. > Actually, it wouldn't since committed would only be NULL during paint, and > we wouldn't handle events during painting. committed is set from the frame clock, but not all painting is tied to the frame clock. Widgets can call gdk_window_process_updates or whatever, outside of the frame clock.
(In reply to Ray Strode [halfline] from comment #27) > Of course, this code all makes the assumption that it's okay to destroy the > buffer after the frame callback, which as pointed out in the side discussion > in this bug, isn't necessarily true. To clarify this, a little: in practice the assumption is okay, since we're only talking about shm buffers and there's no legitimate reason for a compositor to need to access to a shm buffer after processing the frame, but conceptually, I guess, it's uncouth to destroy a buffer before it's explicitly released.
(In reply to Ray Strode [halfline] from comment #27) > (In reply to Jonas Ådahl from comment #22) > > Review of attachment 320235 [details] [review] [review] [review]: > > > > ::: gdk/wayland/gdkwindow-wayland.c > > @@ +124,3 @@ > > > > + cairo_surface_t *staging_surface; > > + cairo_surface_t *commited_surface; > > > > s/commited/committed/. > > > > The naming is a bit unfortunate. We now have "surface" which is a > > wl_surface, and staging_surface, and commited_surface which are > > cairo_surface_t. Maybe name these *_crsurface or something to make it more > > clear? > I agree it's less than optimal but crsurface is hideous looking :-) > How about we rename surface to wl_surface, rename subsurface to > wl_subsurface? And, it's wordy but, we could call the cairo surface, > cairo_staging_surface, and cairo_committed_surface. Alternatively, we stuff > all the proxy objects in a server_objects substructure or something. > Either is fine by me. The only request I have is to make the difference obvious. > > @@ +404,3 @@ > > + cairo_surface_destroy (impl->commited_surface); > > + impl->commited_surface = NULL; > > + } > > > > Shouldn't we do this on begin_paint instead? We might receive a .release > > after .frame but before we enter the paint loop. > There may be more than one begin_paint before the .frame callback. We could > do it in the next begin_paint after a frame callback. That would give the > compositor a little bit longer to release the buffer before we resort to a > copy, but doing it in every begin_paint would force a readback of the entire > buffer in the case the begin_paint happened before the frame callback. > Also, note we destroy the wl_buffer when we destroy the cairo surface, so if > we did it in the wrong begin_paint then we could end up destroying the > buffer before the frame callback. I suppose begin_paint() would only create the staging buffer and read back if there was no staging buffer already created. And yea, we can only ever destroy any buffer after the .release (or if it is replaced by a second attach before a .commit). > > > Is this true branch block really an error? If we > > > > 1. commit a buffer A > > 2. receive .frame > > 3. commit a buffer B > > 4. receive .frame > > 5. receive A.release > > At 5. impl->commited will be NULL, and it wouldn't be an error. > well, see above, at step 2 we destroy buffer A so step 5 won't ever happen. I didn't realize that. We can't destroy the surface on .frame, since it may be for whatever reason still be in use. > > > The same would happen if we create and read back on .begin_paint. > > > > 1. draw frame on A > > 2. commit a buffer A > > 3. draw frame on B > > 4. commit a buffer B > > 5. receive A.release > > At 5. impl->commited would also be NULL, and it wouldn't be an error. > > Again, 3 might happen before we receive the .frame callback and we'd destroy > the buffer before the .frame callback. > > Of course, this code all makes the assumption that it's okay to destroy the > buffer after the frame callback, which as pointed out in the side discussion > in this bug, isn't necessarily true. > > > I think what we need to do is something like: > > > > cairo_surface_t *cairo_surface = wl_buffer_get_user_data (wl_buffer); > > > > /* Destroy any old released cairo surface we no longer will re-use. */ > > if (impl->commited_surface != cairo_surface) > > { > > cairo_surface_destroy (cairo_surface); > > return; > > } > > > > if (impl->staging_surface == NULL) > > impl->staging_surface = cairo_surface; > > else > > cairo_surface_destroy (cairo_surface); > > > > impl->commited_surface = NULL; > you mean from the buffer_release_callback. That sounds fine. It does mean > that if the buffer release never comes (because of a compositor bug) then > we'll continue to accumulate buffers, but I guess that's unavoidable. We > can't really second guess the compositor; if it's broken the user is hosed > anyway. I'll update the patch. Yea, if the compositor behaves that bad, we're screwed anyway :P > > > @@ +759,3 @@ > > + * that we didn't draw over > > + */ > > + read_back_commited_surface (window); > > > > Is this needed? If we ensure we start with a filled surface on begin_paint, > > can't we assume it'll stay filled here as well? > We can't assume we're starting with a filled surface. begin_paint might just > be painting a small part of the window, and the surface may have already > been committed. But won't there be scenarios where we make a almost-full frame memcopy here every frame even though the buffer already contained the correct pixels? If we ensure we have a fully drawn buffer when creating the staging buffer on-demand (possibly in begin_frame), we'd never need to complement after painting. My point being, the regular case for shm buffers is (assuming the compositor copies the shm buffer to the gpu and releases it immediately): 1. draw 2. commit 3. release 4. frame 5. goto 1 If this happens, we should never need to read back at all. If we make the default case handle this, and create the staging buffer on demand and doing a full read back when doing so, we would avoid most read backs except in rare cases. > > (In reply to Jonas Ådahl from comment #23) > > > At 5. impl->commited would also be NULL, and it wouldn't be an error. > > Actually, it wouldn't since committed would only be NULL during paint, and > > we wouldn't handle events during painting. > committed is set from the frame clock, but not all painting is tied to the > frame clock. Widgets can call gdk_window_process_updates or whatever, > outside of the frame clock. (In reply to Ray Strode [halfline] from comment #28) > (In reply to Ray Strode [halfline] from comment #27) > > Of course, this code all makes the assumption that it's okay to destroy the > > buffer after the frame callback, which as pointed out in the side discussion > > in this bug, isn't necessarily true. > To clarify this, a little: in practice the assumption is okay, since we're > only talking about shm buffers and there's no legitimate reason for a > compositor to need to access to a shm buffer after processing the frame, but > conceptually, I guess, it's uncouth to destroy a buffer before it's > explicitly released. Sure, mutter will now release them immediately, but that is not something we can rely on. We must simply never destroy or re-use the buffer before it is explicitly released, because that is how the protocol is specified.
> But won't there be scenarios where we make a almost-full frame memcopy here > every frame even though the buffer already contained the correct pixels? If > we ensure we have a fully drawn buffer when creating the staging buffer > on-demand (possibly in begin_frame), we'd never need to complement after > painting. So my initial patch from comment 8 did the copy up front, but because of comment 9 and discussion on IRC I changed it to do the copy afterward. I don't quite follow what you're saying, though. I think maybe when you say "every frame" above you mean "every paint operation in a frame". It's true we do a read back at the end of each paint, when instead, we should only do it once. > My point being, the regular case for shm buffers is (assuming the compositor > copies the shm buffer to the gpu and releases it immediately): > > 1. draw > 2. commit > 3. release > 4. frame > 5. goto 1 > > If this happens, we should never need to read back at all. If we make the > default case handle this, and create the staging buffer on demand and doing > a full read back when doing so, we would avoid most read backs except in > rare cases. Right, but we could optimize the more rare case, too, which is the whole point of doing a partial read back instead of the upfront copy I did in the first patch. I guess, though, actually the partial read back shouldn't happen until right before we commit the buffer, then the read back will only happen, at most, once per frame.
Created attachment 320270 [details] [review] wayland: don't prematurely release EGL buffers commit 0165cb697466ba0843b993416e00d4f768c00d45 changed mutter to release committed shm buffers as soon as they were uploaded to the GPU. It also inadvertently changed mutter to prematurely release EGL buffers (which never get copied, but get used directly). This commit corrects that mistake.
Review of attachment 320270 [details] [review]: You never release EGL buffers now.
explain your reasoning for thinking that please. egl buffers get released when they're unrefed as far as I can tell.
Review of attachment 320270 [details] [review]: Ah, I didn't catch that. I didn't realize we also released on destroy. Perhaps this is good now.
Comment on attachment 320270 [details] [review] wayland: don't prematurely release EGL buffers Attachment 320270 [details] pushed as 7adbb58 - wayland: don't prematurely release EGL buffers
Created attachment 320305 [details] [review] wayland: rename window->surface to window->wl_surface The name surface is really overloaded when dealing with wayland windows. To alleviate ambiguity, this commit changes the name of the "surface" and "subsurface" members to have a wl_ prefix.
Created attachment 320306 [details] [review] wayland: move server proxy objects to substructure This commit moves the server proxy objects to a substructure for clarity.
Created attachment 320307 [details] [review] wayland: use g_clear_pointer when destroying cairo surfaces There are a few places where we destroy a cairo surface and then nullify it. This commit changes those to use g_clear_pointer instead. It also drops a cairo_surface_finish call that is unnecessary
Created attachment 320308 [details] [review] wayland: always return FALSE from begin_paint The client and compositor share access to the window pixel buffers. After the client hands off (commits) the buffer to the compositor it's not supposed to write to it again until it's released by the compositor. The code tries to deal with this contention by allocating a temporary buffer and using that in the mean time. This temporary buffer is allocated by a higher layer of the code when begin_paint returns TRUE. Unfortunately, that layer of the code has no idea when the buffer is released, so it ends up blitting the temporary buffer back to the shared buffer prematurely. This commit changes begin_paint to always return FALSE. A future commit will address the contention problem in a different way.
Created attachment 320309 [details] [review] wayland: don't handle buffer release centrally Right now we handle buffer releases coming from the compositor in a central place. We add a listener when first creating the shared buffers. This is problematic because a buffer can only have one listener on it at once so users of the buffer can't get notified when it's released. This commit moves the buffer listener code from the centrally managed display code to the cursor and window code.
Created attachment 320310 [details] [review] wayland: stage uncommitted changes to dedicated buffer Right now we use one buffer for both staged changes (freshly painted changes waiting for the frame clock to send to the compositor) and committed changes (changes actively being read by the compositor process). This creates a problem in the event we need to stage updates at the same time the compositor is processing committed updates: we can't change what the compositor is actively processing. The current solution for handling this contention is to allocate a temporary buffer on the spot at the time the updates are staged, and to copy that buffer back to the shared buffer later. The problem, though, is that the copy to the shared buffer currently happens as soon as the updates are finished being staged, not when the shared buffer is done being processed by the compositor. In order to address that problem, this commit changes the code to always stage changes to a dedicated staging buffer. The staging buffer is used exclusively by the client until the client is done with it, and then once that staging buffer is committed, the client never writes to that buffer again. If the client needs to stage new updates, it allocates a brand new staging buffer, draws to it, and back fills the undrawn parts of the buffer from a copy of the contents of the committed buffer. As an optimization, the compositor has the option of releasing the committed buffer back to the client. If it does so before the client needs to stage new updates, then the client will reuse the buffer for staging future updates. This optimization prevents having to allocate a new staging buffer and the associated cost of back filling that new buffer with a readback of the committed buffer.
Review of attachment 320305 [details] [review]: Looks good to me.
Review of attachment 320306 [details] [review]: I suppose we can do that.
Review of attachment 320307 [details] [review]: Looks good to me.
Review of attachment 320310 [details] [review]: Think I've mostly managed to wrap my head around the flow now. I see a couple of issues, but for the most part it looks good. I do however think that the two patches before this would be better squashed into this one, since the intermediate state isn't really an improvement (looks like we'll just always draw on the live buffer, but maybe I'm wrong). ::: gdk/wayland/gdkwindow-wayland.c @@ +627,3 @@ } +static const cairo_user_data_key_t gdk_wayland_cairo_key; Please call this something other than the gdk_wayland_cairo_key in gdkdisplay-wayland.c. @@ +668,3 @@ + * the old committed buffer again. + */ + impl->staging_cairo_surface = g_steal_pointer (&impl->committed_cairo_surface); If we get a wl_buffer.release after a ..update_size() we'd set the staging cairo surface to the committed surface which has the wrong size. Simply unsetting impl->committed_cairo_surface on update_size() should fix that I suppose, but you'd also have to unset the user_data so it can destroy gracefully even on GdkWindow tear down. Doing that I think could just drop impl->backfill_cairo_surface and just always use impl->committed_surface (and always "detach" it i.e. unset user_data) when we've read back its content. Not sure that's better than using a separate reference ala fillback_cairo_surface though. @@ +1961,3 @@ + g_clear_pointer (&impl->staging_cairo_surface, cairo_surface_destroy); + g_clear_pointer (&impl->backfill_cairo_surface, cairo_surface_destroy); Must we not unset the cairo surface's user data, and handle a post-GdkWindow destruction wl_buffer.release event? Now, if the GdkWindow is destroyed before the wl_buffer.release of the committed buffer it'll fetch a pointer pointing to freed memory.
(In reply to Jonas Ådahl from comment #45) > I do however think that the two patches before this would be > better squashed into this one, since the intermediate state > isn't really an improvement (looks like we'll just always > draw on the live buffer, but maybe I'm wrong). Well, I split it out because the patch is kind of unwieldy. The cursor changes are really just a casualty of the other code, so i'd rather it didn't get thrown into the same patch. Cutting out the set_busy stuff up front doesn't make the code worse, we're already always drawing on a live buffer we aren't supposed to. That's because the temporary buffer it allocated in begin_paint and blitted out in end_paint, which are always bracketed together when used, so there's no chance a release will ever before the shared buffer is drawn to. So i'm just cutting out broken code, and having that rescission in the same patch as the fix makes the fix harder to follow. Let me propose a compromise, I'll do it as separate commits, but when I push them I'll first do: $ git rebase master $ git checkout master $ git merge --no-ff --no-commit mybranch $ git commit and give a merge commit with a summary message of what's changed. Then when users look at the merge commit with cgit or git they will see all the patches squashed together, alongside a nice message, but they can also look at the individual commits separately, if they need to. sound good? > +static const cairo_user_data_key_t gdk_wayland_cairo_key; > > Please call this something other than the gdk_wayland_cairo_key in > gdkdisplay-wayland.c. Sure. It doesn't matter in practice, since the keys are static, of course, but I don't have a problem tidying this up for clarity sake. > @@ +668,3 @@ > If we get a wl_buffer.release after a ..update_size() we'd set the staging > cairo surface to the committed surface which has the wrong size. > > Simply unsetting impl->committed_cairo_surface on update_size() should fix > that I suppose, Yup, will make that fix. > but you'd also have to unset the user_data so it can destroy > gracefully even on GdkWindow tear down. > > Doing that I think could just drop impl->backfill_cairo_surface and just > always use impl->committed_surface (and always "detach" it i.e. unset > user_data) when we've read back its content. Not sure that's better than > using a separate reference ala fillback_cairo_surface though. It's an interesting question. Thinking about it, I think having two different members for the two different references makes sense, if for no other reason than, it makes inspecting what's going on in gdb easier. You can infer more about the state of things at a breakpoint with the separate committed surface reference. We could go a step further and stuff all unreleased references in a hash table, but maybe that's overkill. Though, rather than unsetting user data and detaching the surface from the buffer we could just ref the impl so it sticks around until all outstanding releases are processed. Then if we do want to do the hash table down the road, we'll be positioned to be able to do it.
Created attachment 320365 [details] [review] wayland: rename cairo surface user data key to be more specific This commit renames the key name to be more specific for clarity.
Created attachment 320366 [details] [review] wayland: stage uncommitted changes to dedicated buffer Right now we use one buffer for both staged changes (freshly painted changes waiting for the frame clock to send to the compositor) and committed changes (changes actively being read by the compositor process). This creates a problem in the event we need to stage updates at the same time the compositor is processing committed updates: we can't change what the compositor is actively processing. The current solution for handling this contention is to allocate a temporary buffer on the spot at the time the updates are staged, and to copy that buffer back to the shared buffer later. The problem, though, is that the copy to the shared buffer currently happens as soon as the updates are finished being staged, not when the shared buffer is done being processed by the compositor. In order to address that problem, this commit changes the code to always stage changes to a dedicated staging buffer. The staging buffer is used exclusively by the client until the client is done with it, and then once that staging buffer is committed, the client never writes to that buffer again. If the client needs to stage new updates, it allocates a brand new staging buffer, draws to it, and back fills the undrawn parts of the buffer from a copy of the contents of the committed buffer. As an optimization, the compositor has the option of releasing the committed buffer back to the client. If it does so before the client needs to stage new updates, then the client will reuse the buffer for staging future updates. This optimization prevents having to allocate a new staging buffer and the associated cost of back filling that new buffer with a readback of the committed buffer.
(In reply to Ray Strode [halfline] from comment #46) > (In reply to Jonas Ådahl from comment #45) > > I do however think that the two patches before this would be > > better squashed into this one, since the intermediate state > > isn't really an improvement (looks like we'll just always > > draw on the live buffer, but maybe I'm wrong). > Well, I split it out because the patch is kind of unwieldy. > > The cursor changes are really just a casualty of the other > code, so i'd rather it didn't get thrown into the same patch. > > Cutting out the set_busy stuff up front doesn't make the > code worse, we're already always drawing on a live buffer we > aren't supposed to. That's because the temporary buffer it > allocated in begin_paint and blitted out in end_paint, which > are always bracketed together when used, so there's no chance > a release will ever before the shared buffer is drawn to. So > i'm just cutting out broken code, and having that rescission > in the same patch as the fix makes the fix harder to follow. The theory isn't that GTK+ will receive and process a receive, it's that it will win the race and mutter will have completed the texture upload / copy by then and queued up a release by then.
the texture upload doesn't start until commit time though, which happens after end_paint.
Heh, I think that was refactored around at some point, then. There used to be a race GTK+ could win.
oh, i guess one potentially "winnable" race would be: begin_paint -> paint -> end_paint -> commit -> begin_paint to temporary -> paint -> compositor finishes with first commit -> end_paint -> commit
Review of attachment 320366 [details] [review]: ::: gdk/wayland/gdkwindow-wayland.c @@ +671,3 @@ + * the old committed buffer again. + */ + impl->staging_cairo_surface = g_steal_pointer (&impl->committed_cairo_surface); Looks like we'll leak GdkWindowImplWayland and the released buffer here if the window was already destroyed before wl_buffer.release was received. The triggering flow: 1. draw into and attach+commit wl_buffer@123 (committed = wl_buffer@123, staging == NULL, window ref = 2) 2. destroy window (window ref = 1, last ref held by wl_buffer@123, committed = wl_buffer@123) 3. wl_buffer@123.release (committed = NULL, staging = wl_buffer@123, window_ref = 1) What is the benefit of not just detaching buffers/cairo-surfaces on window destruction? We'd leak them either way if we disconnect from the server since we won't get any release events, and to fix that we need a per display connection list of cairo surfaces that needs to be explicitly destroyed on disconnect, and having a circular reference (impl -> cairo surface -> impl) doesn't seem to help with that.
So my first reaction to comment 53 was that I thought there shouldn't be a leak. in step 3 committed = NULL, so the buffer release call back will destroy the surface and its associated user data which will call the final reference to the window to disappear. Turns out I never nullified the variable in destroy though, just update_size. I'll do one more update with that change. Regarding what the advantage is? It's not about leaks if the server goes away, i mean if the server goes away we're going down too. It's pretty much 6 of 1 and half a dozen of another to me, and I could happily go either way. The only reason (and a small one at that) that I decided to go this route was for accounting reasons to aid in future debugging. In the future, if we're trying to wrap our heads around a bug and we break with gdb we get access to the destroyed window impl and can inspect its state to help deduce what's going on. Doing it this way, also leaves the code in a position where we could track unreleased buffers separately if we wanted to with some other changes (which would also help with debugging down the line, but is otherwise unnecessary) So I went this way because it seemed marginally better at the time, and it still seems marginally better now. I don't *really* care though. I'm only reluctant to change it now because I already did it this way and I don't see a strong reason for changing it to the other way. If you give me a strong reason to change it, or if this way really bothers you, i'll do another round.
Created attachment 320458 [details] [review] wayland: stage uncommitted changes to dedicated buffer Right now we use one buffer for both staged changes (freshly painted changes waiting for the frame clock to send to the compositor) and committed changes (changes actively being read by the compositor process). This creates a problem in the event we need to stage updates at the same time the compositor is processing committed updates: we can't change what the compositor is actively processing. The current solution for handling this contention is to allocate a temporary buffer on the spot at the time the updates are staged, and to copy that buffer back to the shared buffer later. The problem, though, is that the copy to the shared buffer currently happens as soon as the updates are finished being staged, not when the shared buffer is done being processed by the compositor. In order to address that problem, this commit changes the code to always stage changes to a dedicated staging buffer. The staging buffer is used exclusively by the client until the client is done with it, and then once that staging buffer is committed, the client never writes to that buffer again. If the client needs to stage new updates, it allocates a brand new staging buffer, draws to it, and back fills the undrawn parts of the buffer from a copy of the contents of the committed buffer. As an optimization, the compositor has the option of releasing the committed buffer back to the client. If it does so before the client needs to stage new updates, then the client will reuse the buffer for staging future updates. This optimization prevents having to allocate a new staging buffer and the associated cost of back filling that new buffer with a readback of the committed buffer.
Review of attachment 320458 [details] [review]: Looks correct to me now. Marked as reviewed just because of a couple of minor notes. ::: gdk/wayland/gdkwindow-wayland.c @@ +208,3 @@ + */ + impl->committed_cairo_surface = NULL; + Stray newline. @@ +366,3 @@ + if (!impl->backfill_cairo_surface) + goto out; + Another possible early out would be possible if you calculated paint_region earlier and finished early if it was empty.
Okay I pushed the patches with those new suggested changes. As discussed I consolidated the last 3 patches with a merge commit ( You can see squashed patchset by visiting commit 2ebae407 ) Attachment 320305 [details] pushed as f90db30 - wayland: rename window->surface to window->wl_surface Attachment 320306 [details] pushed as 3ac78ea - wayland: move server proxy objects to substructure Attachment 320307 [details] pushed as 1cfa2f4 - wayland: use g_clear_pointer when destroying cairo surfaces Attachment 320308 [details] pushed as 2c30008 - wayland: always return FALSE from begin_paint Attachment 320309 [details] pushed as 40e9119 - wayland: don't handle buffer release centrally Attachment 320365 [details] pushed as e6f92df - wayland: rename cairo surface user data key to be more specific Attachment 320458 [details] pushed as c80dd54 - wayland: stage uncommitted changes to dedicated buffer Thanks for the careful review.
*** Bug 761733 has been marked as a duplicate of this bug. ***