GNOME Bugzilla – Bug 761613
crash with xwayland glamor
Last modified: 2016-02-19 14:01:06 UTC
So I haven't seen this crash yet (not sure why), but rtcm pointed out on irc that the meta_fatal in meta_wayland_buffer_take_control in commit 0165cb697466ba0843b993416e00d4f768c00d45 was triggering a crash for him because of this code in Xwayland: http://cgit.freedesktop.org/xorg/xserver/tree/hw/xwayland/xwayland.c#n367 I'll attach a patches that I think should address the issue, but I'm not seeing the problem yet so I can't be sure the patch fixes it. I'll also attach a work-in-progess patchset to try to notice when a client is doing things to the buffer when it shouldn't be. not sure if this should go in or not. definitely not without me carefully exercising the new code paths to make sure they work, which I haven't yet.
Created attachment 320516 [details] [review] wayland: survive client attaching a buffer twice w/o intermediate release Right now we abort if a client attaches and commits the same buffer to a surface twice, without waiting for a release in between. That's wrong, the compositor should be more robust to client misbehavior than that. Xwayland, for instance, reattaches the already attached buffer every time it posts damage so we need to accomodate that.
Created attachment 320517 [details] [review] wayland: use glib function for fetching timestamp The code currently implements a function, get_time, that fetches a timestamp. That duplicates code already in glib, and the glib implementation is better, anyway, since it doesn't skew backward when the system clock is changed. This commit changes the code to use g_get_monotonic_time and drop the get_time function.
Created attachment 320518 [details] [review] wayland: consolidate frame callback code At the moment frame callbacks are instantiated in the surface code and finished in the compositor code. This commit consolidates both halves of functionality into the surface code, by generalizing MetaWaylandFrameCallback to complete a specified closure instead of handling wl_callback resources specifically.
Created attachment 320519 [details] [review] wayland: send error to client if it tries to use unavailable buffer If a client tries to use a buffer before it's available for use then we should send back an error so it knows it's doing something with undefined behavior. This commit tracks when damage is posted to a buffer waits until the next frame callback before allowing damage to be posted to the buffer again. The exceptions, of course, are shm buffers which we cede control of immediately after copying the pixels out of, so they can be reused right away.
Rui, if you could give attachment 320516 [details] [review] a try, i'd appreciate it! what's your reproducer ? just running an x client (say glxgears ?) ?
Review of attachment 320518 [details] [review]: ::: src/wayland/meta-wayland-buffer.h @@ +39,3 @@ + MetaWaylandFrameCallback frame_callback; + It feels generally wrong to associate a wl_buffer with a frame callback. The same buffer may be attached to any number of surfaces, and there is nothing incorrect about that AFAIU. When a frame callback should be invoked or dropped should be a detail of the surface role, as its the role that determines how a surface is drawn.
Review of attachment 320519 [details] [review]: A buffer is not ever "unavailable", it's just "in use by the compositor". I don't think we should kill clients just because they post damage after having drawn to a live buffer. We should just let them potentially get glitches, since that is what they are asking for. This seems like it could also potentially kill Xwayland if it looses the release race, since it (at least did) commit the same buffer over and over again.
Review of attachment 320516 [details] [review]: ::: src/wayland/meta-wayland-surface.c @@ +597,3 @@ + should_attach_buffer = FALSE; + + if (should_attach_buffer) Couldn't this just be if (pending->newly_attached && pending->buffer != surface->buffer) { ... ? It'd cover the case when the client attaches the same buffer twice by ignoring the non-update. We'd also ignore attaching a second NULL buffer. Why does surface->buffer->accessible make any difference here?
Review of attachment 320517 [details] [review]: Looks good to me.
Shouldn't it be enough to just don't try to re-take control of a attached buffer if that buffer is already the active one? Adding anything related to frame callbacks to MetaWaylandBuffer seems wrong to me anyhow, since its on the wrong level. The meta_fatal() could IMHO as well be a g_assert() or g_return_if_fail() to detect bugs in mutter, but it shouldn't be triggerable by a client.
(In reply to Jonas Ådahl from comment #6) > It feels generally wrong to associate a wl_buffer with a frame callback. The > same buffer may be attached to any number of surfaces, and there is nothing > incorrect about that AFAIU. When a frame callback should be invoked or > dropped should be a detail of the surface role, as its the role that > determines how a surface is drawn. makes sense. (In reply to Jonas Ådahl from comment #7) > Review of attachment 320519 [details] [review] [review]: > > A buffer is not ever "unavailable", it's just "in use by the compositor". I > don't think we should kill clients just because they post damage after > having drawn to a live buffer. We should just let them potentially get > glitches, since that is what they are asking for. Well, one of the original premises of wayland was "every frame is perfect", so I don't entirely agree that we should be okay with clients doing things that can lead to graphical glitches, but I also see the point that killing the client is a strong reaction. Given your opinion here and the problems you've pointed out in comment 6 I think we can just drop this patch. I wasn't 100% sure the patch should go in anyway. In the future, I guess, if we want to provide better accountability to clients we can do it in the role specific code and, perhaps do it in a less fatal way. > This seems like it could also potentially kill Xwayland if it looses the > release race, since it (at least did) commit the same buffer over and over > again. I don't think so. Xwayland doesn't reuse the buffer until the frame completes, and the patch explicitly allows the buffer to be reused after the frame completes (which isn't strictly correct, but should keep xwayland working). Anyway, that's moot, let's just drop this for now!
(In reply to Jonas Ådahl from comment #8) > Why does surface->buffer->accessible make any difference here? the compositor releases control of shm buffers as soon as the pixel data is uploaded to the gpu, and so those buffers aren't accessible for posting future damage requests without first reattaching the buffer. If we don't have the accessible check there, then a shm buffer that reattaches won't actually reattach. Put another way: if the buffer isn't accessible to the compositor because the compositor ceded control of it, then the compositor needs to retake control of the buffer before it can use it again.
(In reply to Jonas Ådahl from comment #10) > Shouldn't it be enough to just don't try to re-take control of a attached > buffer if that buffer is already the active one? Not following you here. If it's an egl buffer we don't release control of the buffer until a new one is attached to the surface, so under what scenario are you thinking we should re-take control of the buffer? Are you talking about shm buffers? > Adding anything related to > frame callbacks to MetaWaylandBuffer seems wrong to me anyhow, since its on > the wrong level. Yup I think we're on the same page here. > The meta_fatal() could IMHO as well be a g_assert() or g_return_if_fail() to > detect bugs in mutter, but it shouldn't be triggerable by a client. i guess g_return_if_fail is a little bit kinder to the user if don't have the state machine quite right.
(In reply to Ray Strode [halfline] from comment #11) > (In reply to Jonas Ådahl from comment #7) > > Review of attachment 320519 [details] [review] [review] [review]: > > > > A buffer is not ever "unavailable", it's just "in use by the compositor". I > > don't think we should kill clients just because they post damage after > > having drawn to a live buffer. We should just let them potentially get > > glitches, since that is what they are asking for. > Well, one of the original premises of wayland was "every frame is perfect", > so I don't entirely agree that we should be okay with clients doing things > that can lead to graphical glitches, but I also see the point that killing > the client is a strong reaction. Given your opinion here and the problems > you've pointed out in comment 6 I think we can just drop this patch. I > wasn't 100% sure the patch should go in anyway. In the future, I guess, if > we want to provide better accountability to clients we can do it in the role > specific code and, perhaps do it in a less fatal way. Wayland enables the possibility of "every frame is perfect", but the compositor has no way to enforce it. A client can draw funky buffers as much as it likes and there is nothing we can do about it. (In reply to Ray Strode [halfline] from comment #12) > (In reply to Jonas Ådahl from comment #8) > > Why does surface->buffer->accessible make any difference here? > > the compositor releases control of shm buffers as soon as the pixel data is > uploaded to the gpu, and so those buffers aren't accessible for posting > future damage requests without first reattaching the buffer. If we don't > have the accessible check there, then a shm buffer that reattaches won't > actually reattach. Put another way: if the buffer isn't accessible to the > compositor because the compositor ceded control of it, then the compositor > needs to retake control of the buffer before it can use it again. Ah I see. Yea, that makes sense. Also, I wonder if we actually should make the "accessible" bool a ref-count style int, since there might be several surfaces that all are actively using the same buffer. Also, might I suggest that we rename "accessible" to "is_in_use" or something like that? A buffer is always accessible until destroyed, but what we are doing is tracking when the compositor is *using* the buffer.
(In reply to Jonas Ådahl from comment #14) > (In reply to Ray Strode [halfline] from comment #11) > > (In reply to Jonas Ådahl from comment #7) > > > Review of attachment 320519 [details] [review] [review] [review] [review]: > > > > > > A buffer is not ever "unavailable", it's just "in use by the compositor". I > > > don't think we should kill clients just because they post damage after > > > having drawn to a live buffer. We should just let them potentially get > > > glitches, since that is what they are asking for. > > Well, one of the original premises of wayland was "every frame is perfect", > > so I don't entirely agree that we should be okay with clients doing things > > that can lead to graphical glitches, but I also see the point that killing > > the client is a strong reaction. Given your opinion here and the problems > > you've pointed out in comment 6 I think we can just drop this patch. I > > wasn't 100% sure the patch should go in anyway. In the future, I guess, if > > we want to provide better accountability to clients we can do it in the role > > specific code and, perhaps do it in a less fatal way. > > Wayland enables the possibility of "every frame is perfect", but the > compositor has no way to enforce it. A client can draw funky buffers as much > as it likes and there is nothing we can do about it. true. I do think, at some point, we could help clients achieve "every frame is perfect" by somehow alerting the user a developer about a misbehaving client when we can detect it. maybe only if an environment variable is set, not sure. Anyway, we can do that down the road. > Ah I see. Yea, that makes sense. Also, I wonder if we actually should make > the "accessible" bool a ref-count style int, since there might be several > surfaces that all are actively using the same buffer. > > Also, might I suggest that we rename "accessible" to "is_in_use" or > something like that? A buffer is always accessible until destroyed, but what > we are doing is tracking when the compositor is *using* the buffer. Yea, I think a ref count is necessary indeed. will look into it.
Comment on attachment 320517 [details] [review] wayland: use glib function for fetching timestamp Attachment 320517 [details] pushed as 50099c4 - wayland: use glib function for fetching timestamp
Created attachment 320662 [details] [review] wayland: examine pending->buffer when prepping toplevel commit toplevel_surface_commit looks at surface->buffer to decide how to proceed. It should instead look at pending->buffer, since pending represents the new changes to be committed. This commit changes the function to look at pending->buffer instead. At this point, the change may seem a bit academic, since they both hold the same value, but it's actually important prep work for a future change dealing with buffer management.
Created attachment 320663 [details] [review] wayland: only release buffer early for toplevel surfaces It's important we don't prematurely release a surface used by the pointer, since the pointer references the surface on every mouse update. This commit moves the buffer release to toplevel_surface_commit
Created attachment 320664 [details] [review] wayland: get rid of buffer->copied_data boolean We currently track whether or not a buffer can be released early by looking at the copied_data boolean on the buffer. This boolean is, practically speaking, always set to TRUE for shm buffers and is always false otherwise. We can just as easily check if the buffer is a shm buffer to decide whether or not to do an early release. That's better from a theoretical point of view since copied_data assumes a 1-to-1 relationship between surface and buffer, which may not actually hold. This commit drops copied_data and changes the check to instead see if the buffer is shm.
Created attachment 320665 [details] [review] wayland: rework buffer handling one more time commit 0165cb6974 changed mutter to release shm buffers back to the compositor as early as possible. In order to track whether or not a buffer can be released, the commit introduced an "accessible" flag on the buffer structure (to mean "still being used, don't release yet") The problem, though, is that a buffer can theoretically be used by more than one surface at a time. If we have multiple surfaces delaying the release of the buffer, then we need to know when every surface is done with the buffer before doing the release. A single boolean is inappropriate for tracking whether or not multiple surfaces are delaying release of the buffer, a usage counter is needed instead. Furthermore, the surface objects don't need to access a MetaWaylandBuffer object that's been released early—the texture created from the buffer is referenced separately. Since we don't need the buffer after release, we can use the buffer references on the surface objects directly as our usage counter to know when to release. This commit drops meta_wayland_buffer_{take,release}_control and the associated "accessible" struct member, and instead relies on the ref count of the MetaWaylandBuffer itself to know when to do the release.
Created attachment 320666 [details] [review] wayland: handle client attaching already attached buffer better If a client attaches the same buffer that it previously attached, we end up doing a number of function calls that ultimately won't do anything. This commit adds a check up front to avoid the function calls entirely. The check makes the code clearer.
bug 761899 is likely a dupe of this one, commit 7adbb58 seems to be the cause of the fatal error and there are a couple of reproducer in bug 761899 (namely either xvidtune or vlc)
*** Bug 761899 has been marked as a duplicate of this bug. ***
attachment 320665 [details] [review] and attachment 320666 [details] [review] do not apply cleanly anymore on head of git master apparently (git am -3 doesn't help are the blobs are missing) but the rebase is trivial. Once applied, it solves the issue with vlc and xvidtune but now mutter crashes with gnome-terminal (reverting to the origin/master heads doesn't have that issue). I tried twice to be sure.
backtrace is:
+ Trace 235977
Created attachment 321442 [details] [review] wayland: return from toplevel commit early if no new buffer meta_wayland_surface_toplevel_commit has a lot of logic to handle a new buffer getting attached as part of the commit. None of that code needs to run if there is no new buffer attached. This commit short-circuits that case.
Created attachment 321443 [details] [review] wayland: examine pending->buffer when prepping toplevel commit toplevel_surface_commit looks at surface->buffer to decide how to proceed. It should instead look at pending->buffer, since pending represents the new changes from the client to be committed. At this point, the change may seem a bit academic, since surface->buffer reflects changes made to the surface by the client, but in the future we're going to nullify surface->buffer when the compositor is finished with it, so this commit is important prep work for when the invariant will no longer hold.
Created attachment 321444 [details] [review] wayland: move toplevel surface geometry update to helper function toplevel_surface_commit has some fairly complicated code to deal with syncing the window geometry with the buffer surface geometry. For clarity, this commit moves that code to a helper function.
Created attachment 321445 [details] [review] wayland: only release buffer early for toplevel surfaces It's important we don't prematurely release a surface used by the pointer, since the pointer references the surface on every mouse update. This commit moves the buffer release to toplevel_surface_commit
Created attachment 321446 [details] [review] wayland: get rid of buffer->copied_data boolean We currently track whether or not a buffer can be released early by looking at the copied_data boolean on the buffer. This boolean is, practically speaking, always set to TRUE for shm buffers and is always false otherwise. We can just as easily check if the buffer is a shm buffer to decide whether or not to do an early release. That's better from a theoretical point of view since copied_data assumes a 1-to-1 relationship between surface and buffer, which may not actually hold. This commit drops copied_data and changes the check to instead see if the buffer is shm.
Created attachment 321447 [details] [review] wayland: rework buffer handling one more time commit 0165cb6974 changed mutter to release shm buffers back to the compositor as early as possible. In order to track whether or not a buffer can be released, the commit introduced an "accessible" flag on the buffer structure (to mean "still being used, don't release yet") The problem, though, is that a buffer can theoretically be used by more than one surface at a time. If we have multiple surfaces delaying the release of the buffer, then we need to know when every surface is done with the buffer before doing the release. A single boolean is inappropriate for tracking whether or not multiple surfaces are delaying release of the buffer, a usage counter is needed instead. Furthermore, the surface objects don't need to access a MetaWaylandBuffer object that's been released early—the texture created from the buffer is referenced separately. Since we don't need the buffer after release, we can use the buffer references on the surface objects directly as our usage counter to know when to release. This commit drops meta_wayland_buffer_{take,release}_control and the associated "accessible" struct member, and instead relies on the ref count of the MetaWaylandBuffer itself to know when to do the release.
Still not good I think, now it doesn't crash anymore with gnome-terminal, vlc nor xvidtune but starting gnome-terminal, nothing happens, the window never gets mapped. It's kinda "funny", the gnome-terminal shows in the shell's window picker, you can even click on it, but it never gets mapped on screen.
oh sorry, i guess i screwed up my git-bz incantation and dropped the first patch
Created attachment 321498 [details] [review] wayland: don't depending on surface->buffer to know if a window is showable The core window code defers showing a window until there's a buffer on the surface associated with the window. A surface may release its buffer early, though, so checking for the existance of a buffer isn't a good check. This commit changes the check to use a boolean instead.
^ this one is supposed to be first on the branch
Works for me! thanks!
Review of attachment 321498 [details] [review]: This doesn't handle the case that if someone explicitly attaches NULL with wl_surface.attach on a wl_shell_surface, it should become unshowable. The issue seems to be that we're using surface->buffer for "the surface that should be shown" and some object refcounting thing. I'm not quite sure I get it -- want to chat on IRC?
Review of attachment 321442 [details] [review]: Looks OK.
Review of attachment 321443 [details] [review]: I don't like changing the meaning of surface->buffer, as discussed earlier.
Review of attachment 321444 [details] [review]: I don't see the point, I'm +0 on it.
Review of attachment 321445 [details] [review]: I'm very confused why we need this. More explanation is definitely needed.
Review of attachment 321447 [details] [review]: So we're going back to the old refcount system I had -- heh. I don't like the approach, though. I think the MetaWaylandBuffer and the CoglTexture should be associated together and have the same lifetime. Let's chat and see what's happening here. I'm definitely confused why "if SHM, copy as soon as possible and send a buffer release in ensure_cogl_texture" apparently doesn't work. For all buffers, not just toplevels.
(In reply to Jasper St. Pierre from comment #37) > Review of attachment 321498 [details] [review] [review]: > > This doesn't handle the case that if someone explicitly attaches NULL with > wl_surface.attach on a wl_shell_surface, it should become unshowable. It doesn't actually need to handle that case, since we meta_window_unmanage the window, and dissociate the surface from the window. We could set is_showable to FALSE in that case, but I don't think it matters in practice. Still I don't mind adding a line to that effect. > The issue seems to be that we're using surface->buffer for "the surface that > should be shown" and some object refcounting thing. I'm not quite sure I get > it So the problem is that, in theory, a buffer can be associated with more than one surface at once. That means the is_accessible boolean on the buffer isn't right if we care about multiple surfaces using the same buffer. Just because one surface copied the pixels from the buffer and is done with it doesn't mean they all have. So now we need to track when all surfaces are using the buffer and only send a release when all surfaces are finished. You could accomplish this with a use_count variable that gets incremented by every surface that attaches the buffer, and then only release the buffer when the use_count drops to 0 (see comment 14/comment 15), but there's a complication there: we have to be careful to not bump the use_count if a surface attaches the same buffer that it's already attached. This is because clients sometimes attach the same buffer when they don't need to, and we need to treat that as a no-op. But, of course, the attachment shouldn't be a no-op if the buffer was previously released and is getting reused. And a surface wouldn't be able to tell if the buffer is getting recyled, with just a use_count variable. So we would have to introduce more state on each surface object to detect that case. It's all doable, but a question popped into my head: "Do the surfaces actually need to keep the MetaWaylandBuffer around when it has 0 use_count ?" I think the answer is no. Once we've released the buffer, the surfaces shouldn't need to access the object anymore. So having two counts, use_count and ref_count is complexity we can trim out. If we're willing to no longer assume surface->buffer != NULL means the same thing as "surface has valid texture", and instead treat it to mean "surface has a buffer that isn't released yet, since the surface depends on the buffer", then we can go back to the original spirit of surface->buffer from before bug 761312 and simplify how it all works. That's what the patchset does, but of course, it has to weed out: 1) users of the code that assume surface->buffer != NULL means "surface has valid texture". attachment 321498 [details] [review] and attachment 321443 [details] [review] do that 2) the assumptions in the code that assume only one surface will use a buffer at once. attachment 321446 [details] [review] and attachment 321447 [details] [review] does that
(In reply to Jasper St. Pierre from comment #40) > Review of attachment 321444 [details] [review] [review]: > > I don't see the point, I'm +0 on it. Right, it's a lateral organizational change in the code. It's only important because the patch after it depends on running some code at the bottom of the toplevel_commit function, and the code getting moved has early returns. If we want to keep the code in the same function we need to change the early returns to "goto outs" or drop the returns and make them skip over the parts of the code they were trying to avoid by returning early. Moving it to a subroutine seemed like the cleanest approach to me (especially since that chunk of moved code is pretty self contained)
(In reply to Jasper St. Pierre from comment #41) > Review of attachment 321445 [details] [review] [review]: > > I'm very confused why we need this. More explanation is definitely needed. So it's confusing because the commit message is wrong. backstory is I noticed there may be a problem with cursor code, but put a pin in doing the actual investigation until later then forgot to follow up. The problem is wl_pointer::set_cursor needs to access the buffer associated with the cursor surface, which won't be available if the buffer is already released.
So Jasper and I and #wayland had a discussion about buffer semantics, and we ultimately settled on one problem with destroying the MetaWaylandBuffer at release time. It means we have to recreate the texture and do a full buffer copy if there's attach/damage after the buffer has been released. It would be good to be able to optimize that case, such that only the damaged region is copied. Given that, I think it makes sense to rework the patchset again, to use the more complicated use_count / ref_count split.
Comment on attachment 321442 [details] [review] wayland: return from toplevel commit early if no new buffer Attachment 321442 [details] pushed as acd5050 - wayland: return from toplevel commit early if no new buffer
(In reply to Ray Strode [halfline] from comment #46) > So Jasper and I and #wayland had a discussion about buffer semantics, and we > ultimately settled on one problem with destroying the MetaWaylandBuffer at > release time. It means we have to recreate the texture and do a full buffer > copy if there's attach/damage after the buffer has been released. It would > be good to be able to optimize that case, such that only the damaged region > is copied. > > Given that, I think it makes sense to rework the patchset again, to use the > more complicated use_count / ref_count split. Sorry for not looking at the patches earlier. Two separate ref counters is what I assumed was needed, yes, and that the MetaWaylandBuffer would indeed stay referenced by the surface attaches another buffer.
right, I was just hoping we could simplify the code and bring it back to the original spirit of the way Jasper wrote it originally, but it didn't pan out. Will do an update soon.
Created attachment 321611 [details] [review] wayland: get rid of buffer->copied_data boolean We currently track whether or not a buffer can be released early by looking at the copied_data boolean on the buffer. This boolean is, practically speaking, always set to TRUE for shm buffers and is always false otherwise. We can just as easily check if the buffer is a shm buffer to decide whether or not to do an early release. That's better from a theoretical point of view since copied_data assumes a 1-to-1 relationship between surface and buffer, which may not actually hold. This commit drops copied_data and changes the check to instead see if the buffer is shm.
Created attachment 321612 [details] [review] wayland: change accessible boolean to use_count counter Since a buffer can be used by multiple surfaces at once, we need to release the buffer only after all surfaces are finished with it. Currently we track whether or not to release the buffer based on the accessible boolean. This commit changes it to a counter to accomodate multiple users. Also, each surface needs to know whether not it is done with the buffer, so this commit adds a buffer_used boolean to the surface state. -- So we still have a problem with surfaces getting used for pointers potentially getting released early. It's kind of a side issue, though, and a minor one at that, so I think we can address it in a different bug.
Review of attachment 321611 [details] [review]: Looks good to me.
Review of attachment 321612 [details] [review]: ::: src/wayland/meta-wayland-surface.c @@ +178,3 @@ +surface_use_buffer (MetaWaylandSurface *surface) +{ + if (surface->using_buffer) When would these ever hit? I feel like any case this would happen would also be done by "surface->buffer == buffer", no?
Review of attachment 321612 [details] [review]: Looks good to me.
Review of attachment 321612 [details] [review]: ::: src/wayland/meta-wayland-surface.c @@ +178,3 @@ +surface_use_buffer (MetaWaylandSurface *surface) +{ + if (surface->using_buffer) The surface->buffer == buffer check is in surface_set_buffer(), while this is called from apply_pending_state(). I assume it's needed when wl_surface.attach(eglbufferA); wl_surface.damage(); wl_surface.commit(); (note: no wl_buffer.release() on eglbufferA wl_surface.attach(eglbufferA); wl_surface.damage(); wl_surface.commit(); While this means the client attached an in-use buffer with damage which is silly, we are already using it, and shouldn't add increase the use count.
Review of attachment 321612 [details] [review]: ::: src/wayland/meta-wayland-surface.c @@ +178,3 @@ +surface_use_buffer (MetaWaylandSurface *surface) +{ + if (surface->using_buffer) there's a bug here though. I shouldn't decrement the use_count by setting release_new_buffer = TRUE if I don't increment the use_count in surface_use_buffer.
pushed with that change: Attachment 321611 [details] pushed as e097bc8 - wayland: get rid of buffer->copied_data boolean Attachment 321612 [details] pushed as bed8242 - wayland: change accessible boolean to use_count counter