After an evaluation, GNOME has moved from Bugzilla to GitLab. Learn more about GitLab.
No new issues can be reported in GNOME Bugzilla anymore.
To report an issue in a GNOME project, go to GNOME GitLab.
Do not go to GNOME Gitlab for: Bluefish, Doxygen, GnuCash, GStreamer, java-gnome, LDTP, NetworkManager, Tomboy.
Bug 728464 - Screen update/flickering problems with nvidia drivers
Screen update/flickering problems with nvidia drivers
Status: RESOLVED FIXED
Product: mutter
Classification: Core
Component: general
3.12.x
Other Linux
: Normal critical
: ---
Assigned To: mutter-maint
mutter-maint
: 664858 738814 (view as bug list)
Depends on:
Blocks:
 
 
Reported: 2014-04-17 19:25 UTC by Davi
Modified: 2015-10-07 06:40 UTC
See Also:
GNOME target: ---
GNOME version: ---


Attachments
compositor: Use pre and post paint clutter repaint functions (3.93 KB, patch)
2014-05-05 20:46 UTC, Rui Matos
rejected Details | Review
compositor: Add support for GL_EXT_x11_sync_object (16.83 KB, patch)
2014-05-05 20:48 UTC, Rui Matos
needs-work Details | Review
compositor: Sync X drawing only once per frame (9.85 KB, patch)
2014-05-06 17:16 UTC, Rui Matos
none Details | Review
compositor: Add support for GL_EXT_x11_sync_object (17.21 KB, patch)
2014-05-06 17:16 UTC, Rui Matos
reviewed Details | Review
compositor: Sync X drawing only once per frame (10.78 KB, patch)
2014-05-08 18:28 UTC, Rui Matos
none Details | Review
compositor: Sync X drawing only once per frame (4.96 KB, patch)
2014-05-08 20:04 UTC, Rui Matos
reviewed Details | Review
compositor: Sync X drawing only once per frame (4.83 KB, patch)
2014-05-09 11:43 UTC, Rui Matos
accepted-commit_now Details | Review
compositor: Add support for GL_EXT_x11_sync_object (18.24 KB, patch)
2014-05-09 11:54 UTC, Rui Matos
needs-work Details | Review
Add support for an "async XSync" (4.82 KB, patch)
2014-05-09 21:05 UTC, Owen Taylor
rejected Details | Review
compositor: Sync X drawing only once per frame (4.89 KB, patch)
2014-05-12 23:19 UTC, Rui Matos
committed Details | Review
display: Add public getters for sync extension presence and event base (1.97 KB, patch)
2014-05-12 23:19 UTC, Rui Matos
accepted-commit_now Details | Review
compositor: Add support for GL_EXT_x11_sync_object (21.07 KB, patch)
2014-05-12 23:22 UTC, Rui Matos
none Details | Review
MetaSyncRing: disable after a number of reboot attempts (5.94 KB, patch)
2014-05-13 13:58 UTC, Rui Matos
accepted-commit_now Details | Review
sync-ring: Use the backend X connection to create fences (8.89 KB, patch)
2015-03-06 15:04 UTC, Rui Matos
none Details | Review
compositor: Add support for GL_EXT_x11_sync_object (23.37 KB, patch)
2015-03-06 15:12 UTC, Rui Matos
committed Details | Review
Fix race between glWaitSync and X fence reset (3.37 KB, patch)
2015-08-05 00:13 UTC, Aaron Plattner
committed Details | Review
fixup! compositor: Fix GL_EXT_x11_sync_object race condition (2.89 KB, patch)
2015-08-05 13:58 UTC, Rui Matos
committed Details | Review
Rendering problem with gnome-terminall/mutt combo (93.38 KB, image/png)
2015-08-06 06:56 UTC, İsmail Dönmez
  Details
Rendering problem with gnome-terminall/mutt combo (9.13 KB, image/png)
2015-08-06 06:58 UTC, İsmail Dönmez
  Details
x11: Move damage handling to the backend (20.33 KB, patch)
2015-08-11 17:38 UTC, Rui Matos
rejected Details | Review
compositor: Handle fences in the frontend X connection (3.72 KB, patch)
2015-08-12 14:28 UTC, Rui Matos
committed Details | Review

Description Davi 2014-04-17 19:25:00 UTC
Since the update to driver nvidia 334.21 I've been experiencing sever flickering and update problems in several GTK+ applications. As per [1] I'm reporting this here.

I'll describe some examples:

Firefox:
- The window does not update when I switch tabs, unless I move the cursor.
- Popovers cause severe flickering
- Typing in a text area does not always update the content

GNOME Terminal:
- Scrolling with shift+pg up/pg down does not update the window

Liferea:
- Selecting a different headline does not update the content panel

Some info:
Arch Linux x86_64
linux-3.14.1-1
GPUs: GeForce 310M and GeForce GT 440
Drivers nvidia-334.21-x and nvidia-337.12-1
mutter-3.12.0-1
clutter-1.18.0-1

Please, let me know if you need more info. Thanks.

[1] https://bugzilla.gnome.org/show_bug.cgi?id=724788#c7
Comment 1 jarf.barker 2014-04-18 02:28:53 UTC
I also see similar issues with Gnome 3.12 on Arch with GeForce 560ti and driver version 334 or higher.  I'll just add a few more examples:

- Corruption of window close buttons in Nautilus on mouse hover
- Nautilus will not update window contents until mouse moves
- Laggy hover animation on left-side menu in Tweak Tool
Comment 2 Joran Martinière 2014-04-19 08:59:30 UTC
Also have this bug on Arch Linux with GNOME 3.12 up to date and GeForce GTX 660 with nVidia 337.12 drivers:
- It mostly happens when scrolling in Firefox without moving mouse, the window does not update immediately. I have to move the cursor and sometimes it takes up to 2 secounds to update screen.
Comment 3 t.jp 2014-04-19 12:48:00 UTC
I reported a bug a while ago where the UI of Gnome would flicker with the newest beta driver at that time.
https://devtalk.nvidia.com/default/topic/690704/linux/-334-16-parts-of-the-ui-randomly-flicker-when-moving-the-mouse/

It turned out to be a bug in clutter.

In the meantime the patch was applyied to clutter on Arch Linux fixing the extreme flickering. Around the same time I started noticing redrawing problems.

Issue:

The UI doesn't update instantly when UI elements have changed for example a folder and its contents was loaded or a filename edited.

Example:

https://www.youtube.com/watch?v=Y8_06aJYbEY

Several applications don't redraw properly. This is especially noticable in gedit and nautilus when editing text. 

In this example notice how it doesn't display that the first character of the folder name is deleted despite hitting backspace like a maniac. This is also obvious when typing "This is a test" and the last character only is displayed when clicking inside nautilus to finish the renaming process (probably triggering a new redraw or something).

This will also happens when editing the path in nautilus leading to tons of erros, when trying to correct the path.

This will also give the impression that nautilus takes unusually long to open a folder completely, since the content isn't drawn as soon as it appears.

The PC in the example is using a GTX 580 with the 337.12 driver and Gnome 3.12 from [testing] on Arch Linux.

I have a laptop with a 6150 Go also using Gnome 3.12 from [testing] using the 304.121 driver and it doesn't have this issue. When using nouveau on this laptop everything is fine aswell.

Packages affected:

The example is using Gnome 3.12 with clutter 1.18 using nvidia 337.12. When downgrading to Gnome 3.10 with clutter 1.16.4 and nvidia 334.21 the issue is still there.

This does not happen in non-gnome applications though suggesting that it might be still Gnomes fault.

Reproducing the folder renaming example might not work on the first try. I rebooted and was able to create and rename a folder without a problem. When trying to create a second folder called "This is another test" the redrawing issue started to appear first noticed by the perceived slow reaction when hitting backspace. I repeated this a few times and the bug just seems to kick in at some point after which it is always present.
Comment 4 nekroman.sk 2014-05-03 10:25:36 UTC
Setting nVidia graphic card to Performance, solved this problem. This bug is showing only when card is underclocked.
Comment 5 t.jp 2014-05-03 16:30:33 UTC
(In reply to comment #4)
> Setting nVidia graphic card to Performance, solved this problem. This bug is
> showing only when card is underclocked.

My lowest powersaveing mode is 50 Mhz graphics clock and Gnome really starts to feel laggy in terms of input when clocked at this speed. I tried reproducing my example above and at 50 Mhz it takes almost 1s for the character to appear. This doesn't happen when the card is clocked at 782 Mhz. For some reason I can't reproduce my example anymore where it always showed only the second last input no matter how long I waited.
Comment 6 Rui Matos 2014-05-05 20:46:01 UTC
Created attachment 275914 [details] [review]
compositor: Use pre and post paint clutter repaint functions

Instead of using a stage paint callback for post paint operations,
let's make this nicely symetric and use both pre and post paint
functions.
Comment 7 Rui Matos 2014-05-05 20:48:20 UTC
Created attachment 275915 [details] [review]
compositor: Add support for GL_EXT_x11_sync_object

If the GL implementation reports this extension we should use it to
synchronize X rendering with our own GL rendering.
--

The algorithm used here is just a translation from C++ to C from a
compiz patch by NVIDIA's James Jones. See
https://launchpadlibrarian.net/162371765/add_x_to_gl_sync.patch .
Comment 8 drago01 2014-05-06 10:17:50 UTC
Review of attachment 275914 [details] [review]:

OK.
Comment 9 drago01 2014-05-06 10:33:00 UTC
Review of attachment 275915 [details] [review]:

::: src/Makefile.am
@@ +137,3 @@
 	compositor/region-utils.c		\
 	compositor/region-utils.h		\
+	compositor/xtoglsync.c			\

Should be a bit more consistent with the rest of the naming schema i.e call it
x-to-gl-sync.{c,h} or better x11-to-gl-sync.{c,h}

::: src/compositor/compositor.c
@@ +793,3 @@
+       * drawing, otherwise we would stall when we draw a frame
+       * where only the clutter scene changes. */
+      compositor->need_xtoglsync = TRUE;

That should be in meta_surface_actor_x11_pre_paint () where we have the XSync() for open source drivers.
Should also remove the comment that says that we should use EXT_x11_sync_object (because well we do with this patch). 

Also we probably should not have to call XSync when we use EXT_x11_sync_object .. that needs some getters and setters in the compositor.

::: src/compositor/xtoglsync.c
@@ +124,3 @@
+  gboolean x11_sync_object = FALSE;
+
+  _glGetIntegerv (GL_NUM_EXTENSIONS, &num_extensions);

Yeah I know you have just ported this code over from James's compiz patch but instead of reinventing what we already have can't we just use _cogl_check_extension () instead?

@@ +140,3 @@
+
+static gboolean
+load_required_symbols (void)

Same here can't we just use cogl_get_proc_address () here (which already does the rest behind the scenes) ?
Comment 10 Rui Matos 2014-05-06 15:42:38 UTC
(In reply to comment #9)
> Review of attachment 275915 [details] [review]:
> 
> ::: src/Makefile.am
> @@ +137,3 @@
>      compositor/region-utils.c        \
>      compositor/region-utils.h        \
> +    compositor/xtoglsync.c            \
> 
> Should be a bit more consistent with the rest of the naming schema i.e call it
> x-to-gl-sync.{c,h} or better x11-to-gl-sync.{c,h}

Yeah, the name sucks. I'm not fond of this either but sure I'll change to it.

> ::: src/compositor/compositor.c
> @@ +793,3 @@
> +       * drawing, otherwise we would stall when we draw a frame
> +       * where only the clutter scene changes. */
> +      compositor->need_xtoglsync = TRUE;
> 
> That should be in meta_surface_actor_x11_pre_paint () where we have the XSync()
> for open source drivers.
> Also we probably should not have to call XSync when we use EXT_x11_sync_object
> .. that needs some getters and setters in the compositor.

Why do we do the XSync() there? It actually looks wrong to me since XSync() is a global hammer and we're potentially calling it several times per frame this way. Wouldn't it be better to move it to meta_pre_paint_func() as well and there we could decide to do one or the other just once per frame?


> ::: src/compositor/xtoglsync.c
> @@ +124,3 @@
> +  gboolean x11_sync_object = FALSE;
> +
> +  _glGetIntegerv (GL_NUM_EXTENSIONS, &num_extensions);
> 
> Yeah I know you have just ported this code over from James's compiz patch but
> instead of reinventing what we already have can't we just use
> _cogl_check_extension () instead?

cogl_check_extension is deprecated and the only thing it does is strstr (ext, name) != NULL anyway. Also, I don't see a cogl method to get the raw extensions string.

And, IIUC, for GL >= 3 you are supposed to use glGetStringi() to check extensions.

> @@ +140,3 @@
> +
> +static gboolean
> +load_required_symbols (void)
> 
> Same here can't we just use cogl_get_proc_address () here (which already does
> the rest behind the scenes) ?

I could but since I'm not using cogl anywhere else, it seems a bit inconsistent and pointless for such a trivial thing.
Comment 11 drago01 2014-05-06 15:51:45 UTC
(In reply to comment #10)
> (In reply to comment #9)
> > Review of attachment 275915 [details] [review] [details]:
> > 
> > ::: src/Makefile.am
> > @@ +137,3 @@
> >      compositor/region-utils.c        \
> >      compositor/region-utils.h        \
> > +    compositor/xtoglsync.c            \
> > 
> > Should be a bit more consistent with the rest of the naming schema i.e call it
> > x-to-gl-sync.{c,h} or better x11-to-gl-sync.{c,h}
> 
> Yeah, the name sucks. I'm not fond of this either but sure I'll change to it.
>
> > ::: src/compositor/compositor.c
> > @@ +793,3 @@
> > +       * drawing, otherwise we would stall when we draw a frame
> > +       * where only the clutter scene changes. */
> > +      compositor->need_xtoglsync = TRUE;
> > 
> > That should be in meta_surface_actor_x11_pre_paint () where we have the XSync()
> > for open source drivers.
> > Also we probably should not have to call XSync when we use EXT_x11_sync_object
> > .. that needs some getters and setters in the compositor.
> 
> Why do we do the XSync() there? 

There is a big comment explaining it basically it does the same as your patches for open source drivers.

> It actually looks wrong to me since XSync() is
> a global hammer and we're potentially calling it several times per frame this
> way. Wouldn't it be better to move it to meta_pre_paint_func() as well and
> there we could decide to do one or the other just once per frame?

Yeah moving it there and either calling it when !compositor->have_xtoglsync && compositor->need_xtoglsync would make sense (move it along with the comment though).

> 
> > ::: src/compositor/xtoglsync.c
> > @@ +124,3 @@
> > +  gboolean x11_sync_object = FALSE;
> > +
> > +  _glGetIntegerv (GL_NUM_EXTENSIONS, &num_extensions);
> > 
> > Yeah I know you have just ported this code over from James's compiz patch but
> > instead of reinventing what we already have can't we just use
> > _cogl_check_extension () instead?
> 
> cogl_check_extension is deprecated and the only thing it does is strstr (ext,
> name) != NULL anyway. Also, I don't see a cogl method to get the raw extensions
> string.
> 
> And, IIUC, for GL >= 3 you are supposed to use glGetStringi() to check
> extensions.

OK.

> > @@ +140,3 @@
> > +
> > +static gboolean
> > +load_required_symbols (void)
> > 
> > Same here can't we just use cogl_get_proc_address () here (which already does
> > the rest behind the scenes) ?
> 
> I could but since I'm not using cogl anywhere else, it seems a bit inconsistent
> and pointless for such a trivial thing.

Yeah it somehow feels a bit disconnected from the rest but OK.
Comment 12 Rui Matos 2014-05-06 17:16:36 UTC
Created attachment 276008 [details] [review]
compositor: Sync X drawing only once per frame

We only need to call XSync() once per frame to synchronize X with GL
drawing.
--

This got uglier than I intended since I had to split the
XDamageSubtract() call from updating the pixmap. Hope Jasper won't
hate it too much...
Comment 13 Rui Matos 2014-05-06 17:16:49 UTC
Created attachment 276009 [details] [review]
compositor: Add support for GL_EXT_x11_sync_object

If GL advertises this extension we'll use it to synchronize X with GL
rendering instead of relying on the XSync() behavior with open source
drivers.
Comment 14 Jasper St. Pierre (not reading bugmail) 2014-05-06 17:38:33 UTC
(In reply to comment #12)
> This got uglier than I intended since I had to split the
> XDamageSubtract() call from updating the pixmap. Hope Jasper won't
> hate it too much...

It looks fine to me, but I'd like to have it go through Owen's eyes just in case.
Comment 15 Rui Matos 2014-05-08 18:28:51 UTC
Created attachment 276189 [details] [review]
compositor: Sync X drawing only once per frame

--

v2: Keep the same exact ordering of operations as before.

The previous version was making window resizing lagging the
pointer. This fixes it.
Comment 16 Rui Matos 2014-05-08 20:04:28 UTC
Created attachment 276197 [details] [review]
compositor: Sync X drawing only once per frame

--

v3: Jasper pointed out that we shouldn't need to do the XSync() before
update_pixmap() and indeed we don't. Even for the nvidia fence, doing
update_pixmap() before it seems to work correctly too.

This patch, then, becomes much simpler.
Comment 17 Jasper St. Pierre (not reading bugmail) 2014-05-08 20:10:17 UTC
Review of attachment 276197 [details] [review]:

::: src/compositor/compositor.c
@@ +778,3 @@
+          process_damage (compositor, (XDamageNotifyEvent *) event, window);
+
+          compositor->need_sync_drawing = TRUE;

Can you put this in process_damage?

@@ +1138,2 @@
   MetaWindowActor *top_window;
+  GList *l;

Can you keep the order of GList / MetaWindowActor here consistent?

@@ +1165,3 @@
+    {
+      /* We need to make sure that any X drawing that happens before
+       * the XDamageSubtract() for each MWA above is visible to

"for each window". It's not really specific to the MWA.
Comment 18 Jasper St. Pierre (not reading bugmail) 2014-05-08 20:26:25 UTC
Review of attachment 276009 [details] [review]:

::: src/compositor/compositor.c
@@ +801,3 @@
     clutter_x11_handle_event (event);
 
+  if (!meta_is_wayland_compositor () &&

have_x11_sync_object should only be set to TRUE, so there's no need for this.

@@ +1181,3 @@
     meta_window_actor_pre_paint (l->data);
 
+  if (compositor->have_x11_sync_object && compositor->need_sync_drawing)

Can this be:

  if (compositor->need_sync_drawing)
    {
       /* giant comment */
       if (compositor->have_x11_sync_object)
         xtoglsync_insert_wait ();
       else
         {
           XSync ();
           compositor->need_sync_drawing = FALSE;
         }
    }

::: src/compositor/x-to-gl-sync.c
@@ +68,3 @@
+static XSyncValue sync_value_zero;
+static XSyncValue sync_value_one;
+static Display *xdisplay = NULL;

A bit concerned about all the globals here, but I won't press on it.

@@ +151,3 @@
+  g_return_val_if_fail (dlhandle != NULL, FALSE);
+
+  if (!load_symbol (dlhandle, "glGetIntegerv", (void **) &_glGetIntegerv))

I think we can rely on glGetIntegerv, glGetStringi, and glXGetProcAddress existing.

@@ +183,3 @@
+sync_insert (XToGLSync *self)
+{
+  g_assert (self->state == XTOGLSYNC_STATE_READY);

Let's cross our fingers we never hit this, and that X11 sends us all the events on time!

@@ +202,3 @@
+    {
+    case XTOGLSYNC_STATE_DONE:
+

I don't know how much you care about retaining code style, but we don't put blank lines here.

@@ +213,3 @@
+
+      break;
+

Or here.

@@ +356,3 @@
+
+  alarm_to_sync = g_hash_table_new (NULL, NULL);
+  syncs_array = g_malloc0 (NUM_SYNCS * sizeof (XToGLSync *));

Could you allocate this statically? It's constant-size.
Comment 19 drago01 2014-05-08 20:34:33 UTC
(In reply to comment #18)
> Review of attachment 276009 [details] [review]:
> 
> ::: src/compositor/compositor.c
> @@ +801,3 @@
>      clutter_x11_handle_event (event);
> 
> +  if (!meta_is_wayland_compositor () &&
> 
> have_x11_sync_object should only be set to TRUE, so there's no need for this.
> 
> @@ +1181,3 @@
>      meta_window_actor_pre_paint (l->data);
> 
> +  if (compositor->have_x11_sync_object && compositor->need_sync_drawing)
> 
> Can this be:
> 
>   if (compositor->need_sync_drawing)
>     {
>        /* giant comment */
>        if (compositor->have_x11_sync_object)
>          xtoglsync_insert_wait ();
>        else
>          {
>            XSync ();
>            compositor->need_sync_drawing = FALSE;
>          }
>     }
> 
> ::: src/compositor/x-to-gl-sync.c
> @@ +68,3 @@
> +static XSyncValue sync_value_zero;
> +static XSyncValue sync_value_one;
> +static Display *xdisplay = NULL;
> 
> A bit concerned about all the globals here, but I won't press on it.
> 
> @@ +151,3 @@
> +  g_return_val_if_fail (dlhandle != NULL, FALSE);
> +
> +  if (!load_symbol (dlhandle, "glGetIntegerv", (void **) &_glGetIntegerv))
> 
> I think we can rely on glGetIntegerv, glGetStringi, and glXGetProcAddress
> existing.

We do not link against libGL directly at all (and this was an intentional change made by the cogl people).
Comment 20 Rui Matos 2014-05-09 11:43:43 UTC
Created attachment 276234 [details] [review]
compositor: Sync X drawing only once per frame

--
Addressed all the comments.
Comment 21 Rui Matos 2014-05-09 11:54:50 UTC
Created attachment 276236 [details] [review]
compositor: Add support for GL_EXT_x11_sync_object

--
(In reply to comment #18)
> +  if (!meta_is_wayland_compositor () &&
>
> have_x11_sync_object should only be set to TRUE, so there's no need
for this.

Removed.

> +  if (compositor->have_x11_sync_object &&
compositor->need_sync_drawing)
>
> Can this be:

Sure.

> +static XSyncValue sync_value_zero;
> +static XSyncValue sync_value_one;
> +static Display *xdisplay = NULL;
>
> A bit concerned about all the globals here, but I won't press on it.

I can change it if you want but I don't think we'll need more than one
instance of these things, unless we start using more than one gpu in
the compositor...

> +  if (!load_symbol (dlhandle, "glGetIntegerv", (void **)
&_glGetIntegerv))
>
> I think we can rely on glGetIntegerv, glGetStringi, and
glXGetProcAddress
> existing.

As drago01 says, we don't link to libGL at all. I added a comment
about it now.

We do need the GL headers to be present at compile time now so I added
a check for it in configure.ac. I just made it unconditionally require
GL headers that define GL_EXT_x11_sync_object to keep it simple. Mesa
has had the definition since version 8.0 from Feb 2012.

If you prefer we could instead copy the #defines from mesa which is
what the original nvidia patch does.

> +  g_assert (self->state == XTOGLSYNC_STATE_READY);
>
> Let's cross our fingers we never hit this, and that X11 sends us all
the events
> on time!

I thought I had replaced all of them with g_return_if_fail. Fixed

> +  alarm_to_sync = g_hash_table_new (NULL, NULL);
> +  syncs_array = g_malloc0 (NUM_SYNCS * sizeof (XToGLSync *));
>
> Could you allocate this statically? It's constant-size.

Ok.

Also fixed the excess blank lines and replace g_warnigns with
meta_warnings.
Comment 22 Jasper St. Pierre (not reading bugmail) 2014-05-09 14:58:27 UTC
Review of attachment 276234 [details] [review]:

OK.
Comment 23 Jasper St. Pierre (not reading bugmail) 2014-05-09 15:04:25 UTC
Review of attachment 276236 [details] [review]:

Looks good. A few nits, just fix them before committing.

::: src/compositor/compositor.c
@@ +1182,3 @@
       /* We need to make sure that any X drawing that happens before
+       * the XDamageSubtract() for each window above is visible to
+       * subsequent GL rendering; for drivers which don't support

; the standardized way to do this GL_EXT_X11_sync_object. Since this isn't implemented yet in mesa, we also have a path that relies on the implementation of the open source drivers.

Anything else, we just hope for the best.

::: src/compositor/x-to-gl-sync.c
@@ +99,3 @@
+  if (!*func)
+    {
+      meta_verbose ("XToGLSync: failed to resolve required symbol \"%s\"\n", name);

You don't need \n at the end of meta_verbose or meta_warning.
Comment 24 drago01 2014-05-09 20:02:02 UTC
Review of attachment 276236 [details] [review]:

Something is wrong with this .. it seems to mess up cogls clip stack and I am getting tons of warnings:

(gnome-shell:29762): Cogl-WARNING **: ./driver/gl/cogl-clip-stack-gl.c:419: GL error (1281): Invalid value
 Window manager warning: Log level 16: (compositor/x-to-gl-sync.c:220):sync_check_update_finished: runtime check failed: (status != GL_WAIT_FAILED)
Comment 25 Owen Taylor 2014-05-09 21:05:58 UTC
Created attachment 276263 [details] [review]
Add support for an "async XSync"

One of the things that I don't love about the patch is use of sync
extension alarms - the API is clunky and hard to reason about,
and all the code really wants to do is check if the XSyncResetFence()
call has been processed. It's basically completely fine to do:

 if (XLastKnownRequestProcessed (xdisplay) < serial_of_reset_fence_call)
    XSync (xdisplay, False);

this works perfectly unless there is no events or request replies
received betwen the fence reset and the point where we try to use it
again (roughly 5 frames later.) But since we only cycle through and use
fences when we are receiving damage events from the server, this is
actually impossible. So the XSync() should never be hit, and if it
was hit, it wouldn't be a big deal. 

This patch avoids relying on events/replies coincidentally happening
by reorganizing code in display.c a bit to expose an "async XSync()"
functionality. But thinking about it, I can't think of any advantages.

In the code to get serial_of_reset_fence_call, you need the
same libX11 bug workaround in this patch. (And the comment.) So:

 XSyncResetFence(...)
 reset_serial = XNextRequest(xdisplay) - 1 ;

I can't think that it would actually matter because of the huge interval,
there's basically no chance that the Reset wouldn't have taken effect,
but getting it 100% right takes this out of the equation for figuring
out what's causing lockups.

===

Add functions meta_display_generate_event() and
meta_display_wait_for_generated_event() that we can use to block until
some X server request that we made at a previous time has been
processed.
Comment 26 Owen Taylor 2014-05-09 21:13:39 UTC
Review of attachment 276263 [details] [review]:

As noted in the comment, I don't think this is actually needed. And no reason to add complexity that we don't need just on speculation that it could be useful for some later change to Mutter.
Comment 27 Owen Taylor 2014-05-09 21:19:04 UTC
Review of attachment 275914 [details] [review]:

Noooooo :-)

commit 47b21b3547775baaec51f10aa7ef3d77ea093803
Author: Owen W. Taylor <otaylor@fishsoup.net>
Date:   Wed Apr 24 16:49:06 2013 -0400

    Use new clutter_stage_set_paint_callback() function for after-paint notification
    
    Commit 4f2bb583bf8c changed things so that the compositor used
    clutter_threads_add_repaint_func_full (CLUTTER_REPAINT_FLAGS_POST_PAINT
    to get after-paint notification and send _NET_WM_FRAME_DRAWN, but this
    doesn't actually work, since Clutter will already have blocked for
    VBlank before calling post-paint functions.
    
    The result is that frame synced toolkits like GTK 3.8 will normally
    only be able to draw every other frame.
    
    Since ::paint doesn't work either, a new function
    clutter_stage_set_paint_callback() has been added to Clutter
    (and will be included in the 1.14 branch)
    
    https://bugzilla.gnome.org/show_bug.cgi?id=698794
Comment 28 Owen Taylor 2014-05-09 21:22:01 UTC
Note that with my last comment, you probably still want to do the post-frame sync stuff in POST_PAINT - it's good to get the frame out as soon as possible and not futz around with fences until after swapping buffers. You just cannot replace the current use of clutter_stage_set_paint_callback() for sending _NET_WM_FRAME_DRAWN with a POST_PAINT callback.
Comment 29 Owen Taylor 2014-05-09 21:53:35 UTC
Review of attachment 276234 [details] [review]:

Seems OK to me too.
Comment 30 Owen Taylor 2014-05-09 22:54:22 UTC
Review of attachment 276236 [details] [review]:

I didn't find any correctness problems reading the code, but I do have a fair number of stylistic suggestions - I think we want to make this read more like a part of Mutter.

::: src/compositor/compositor.c
@@ +440,3 @@
   post_paint_windows (compositor);
+
+  if (compositor->have_x11_sync_object && compositor->need_sync_drawing)

This reads as if you sometimes will leave need_sync_drawing enabled (though the logic is right.) See suggestion below.

@@ +1204,3 @@
+        {
+          XSync (compositor->display->xdisplay, False);
+          compositor->need_sync_drawing = FALSE;

clearing this conditionally reads weird - there's no sense in which the frame "needed sync drawing" and now doesn't for the XSync case but still does for the x11_sync_object frame.

Can we call this frame_has_updated_xsurfaces and always clear it in the post?

::: src/compositor/x-to-gl-sync.c
@@ +46,3 @@
+  XTOGLSYNC_STATE_WAITING,
+  XTOGLSYNC_STATE_DONE,
+  XTOGLSYNC_STATE_RESET_PENDING,

Can you describe these states in a comment?

@@ +65,3 @@
+static guint current_sync_idx;
+static XToGLSync *current_sync;
+static guint warmup_syncs;

I don't like this many file scope static variables. I'm not really worried about wanting multiple instances, or the performance penalty of adding a dozen relocations, I just think it's hard to figure out what's going on - it's not clear what are function locals, and what is state. (When programming in C++ I'd always by preference use either the m_member or member_ conventions for marking member variables.)

Can we add a MetaSyncRing structure? It can be a single instance that's entirely local to this file.

@@ +72,3 @@
+
+static int xsync_event_base;
+static int xsync_error_base;

These 5 should all drop out with other suggestions I've made.

@@ +89,3 @@
+static GLsync (*_glImportSync) (GLenum external_sync_type,
+                                GLintptr external_sync,
+                                GLbitfield flags);

I don't mind these as file-scope statics - I'm more OK with file-scope functions (or function pointers) than member variables.

But I can *EASILY* imagine GL header files that #define _glWaitSync ..... causing us much confusion and grief. '_gl' is not our namespace to play with.

My preference is to simply use get_integerv, get_proc_address, delete_sync ...

@@ +94,3 @@
+load_symbol (void        *handle,
+             const char  *name,
+             void       **func)

This can be simplified by using cogl_get_proc_address() - the docs for cogl_get_proc_address() say not to use it for core GL symbols, however, the rational is that unlike glXGetProcAddress() other backends for cogl_get_proc_address() don't support querying for core symbols. This is GLX specific code. Then you can just have a single load_gl_symbol(), and avoid introducing dlsym or a more portable g_module_symbol(). [Do we still care about non-dlysm platforms? Not sure]

@@ +315,3 @@
+        XEvent event;
+        XIfEvent (xdisplay, &event, alarm_event_predicate, (XPointer) self);
+        sync_handle_event (self, (XSyncAlarmNotifyEvent *) &event);

The fact we trigger alarms before deleting them certainly has to be commented. As far as I can tell, the thought is that in the xtogl_reboot() case we're panicking, we don't know what's wrong, but maybe if we trigger all the fences, we'll get ourself out of a stuck GPU state.

@@ +335,3 @@
+
+gboolean
+xtoglsync_init (Display *dpy)

xtoglsync is unreadable, and by GNOME conventions XToGLSync woudl become x_to_gl_sync_ which is also a horrible namespace as well. I don't think it's that valuable to keep this code close to the Compiz code - there's already significant deviation. I'm suggesting MetaSyncRing and MetaSync

 xtoglsync_init(Display *dpy) => meta_sync_ring_init() 
   [can use meta_display_get() in the code]
 xtoglsync_update() => meta_sync_ring_after_frame()
   [this has no meaning except for "this stuff you do after each frame" as far as I can see]
 xtoglsync_insert_wait => meta_sync_insert_wait()

@@ +403,3 @@
+  if (warmup_syncs >= NUM_SYNCS / 2)
+    {
+      guint reset_sync_idx = (current_sync_idx + (NUM_SYNCS / 2)) % NUM_SYNCS;

A comment is needed with the definition of NUM_SYNCS that it must be even, or, better, write this as:
 
 current_sync_index + NUM_SYNCS - NUM_SYNCS / 2

@@ +404,3 @@
+    {
+      guint reset_sync_idx = (current_sync_idx + (NUM_SYNCS / 2)) % NUM_SYNCS;
+      XToGLSync* sync_to_reset = syncs_array[reset_sync_idx];

The top of this file definitely needs a block comment explaining the scheme where we:

 XSyncTriggerFence and glWaitSync a fence
 NUM_SYNCS / 2 frames later, expect that the fence is triggered and we can XSyncResetFence
 NUM_SYNCS / 2 frames later, expect that the fence is reset and we can XSyncTriggerFence and glWaitSync

using glClientWaitSync() and serials (or alarms if my suggestion doesn't work out) to double-check those expectations.
Comment 31 Jasper St. Pierre (not reading bugmail) 2014-05-10 00:05:56 UTC
(In reply to comment #30)
>  xtoglsync_init(Display *dpy) => meta_sync_ring_init() 
>    [can use meta_display_get() in the code]

It should be either meta_backend_x11_get_xdisplay, or clutter_x11_get_default_xdisplay. Yeah, this is "wrong" in the code right now, I just noticed. Yes, it doesn't really matter because this never runs as a Wayland compositor in any mode.

(We should probably add checks for if (compositor->sync_ring) ... or something and make sure it's only run in X11 compositor mode)
Comment 32 Rui Matos 2014-05-12 23:19:24 UTC
Created attachment 276413 [details] [review]
compositor: Sync X drawing only once per frame

--

rebased
Comment 33 Rui Matos 2014-05-12 23:19:35 UTC
Created attachment 276414 [details] [review]
display: Add public getters for sync extension presence and event base
Comment 34 Rui Matos 2014-05-12 23:22:42 UTC
Created attachment 276415 [details] [review]
compositor: Add support for GL_EXT_x11_sync_object

--

Thanks for the review. I think this addresses every point but please
check the comments and feel free to suggest improvements.
Comment 35 Rui Matos 2014-05-12 23:40:25 UTC
BTW, I've now tested this on two different nvidia GPUs, a NVS 135M (G86) and a GTX 675MX (GK104) both with driver version 337.19 and it does run fine and fixes the redrawing problems.

The weaker gpu sometimes (like once or twice a day) exhibits long stalls for a few while but eventually recovers. When this happens I don't see any of the warnings that you'd expect from the patch. James Jones said that this could be a bug in the driver.

(In reply to comment #24)
> Something is wrong with this .. it seems to mess up cogls clip stack and I am
> getting tons of warnings:
> 
> (gnome-shell:29762): Cogl-WARNING **: ./driver/gl/cogl-clip-stack-gl.c:419: GL
> error (1281): Invalid value
>  Window manager warning: Log level 16:
> (compositor/x-to-gl-sync.c:220):sync_check_update_finished: runtime check
> failed: (status != GL_WAIT_FAILED)

I don't get these warnings on either system I tested. James also mentioned that they had a bug report about this extension for Tesla generation GPUs. Isn't yours one of these? Perhaps it's related to that?
Comment 36 drago01 2014-05-13 06:02:08 UTC
(In reply to comment #35)
> BTW, I've now tested this on two different nvidia GPUs, a NVS 135M (G86) and a
> GTX 675MX (GK104) both with driver version 337.19 and it does run fine and
> fixes the redrawing problems.
> 
> The weaker gpu sometimes (like once or twice a day) exhibits long stalls for a
> few while but eventually recovers. When this happens I don't see any of the
> warnings that you'd expect from the patch. James Jones said that this could be
> a bug in the driver.
> 
> (In reply to comment #24)
> > Something is wrong with this .. it seems to mess up cogls clip stack and I am
> > getting tons of warnings:
> > 
> > (gnome-shell:29762): Cogl-WARNING **: ./driver/gl/cogl-clip-stack-gl.c:419: GL
> > error (1281): Invalid value
> >  Window manager warning: Log level 16:
> > (compositor/x-to-gl-sync.c:220):sync_check_update_finished: runtime check
> > failed: (status != GL_WAIT_FAILED)
> 
> I don't get these warnings on either system I tested. James also mentioned that
> they had a bug report about this extension for Tesla generation GPUs. Isn't
> yours one of these? Perhaps it's related to that?

Maybe need to check with James about that. If that is the case we should simply not do it on that GPU the bug that it tries to fix does not reproduce here it just makes things worse. Its not like it occasionally prints one warning its an endless stream of warnings. And given that support for those GPUS (anything pre Fermi) is going to move to legacy support soon I am not sure NVIDIA would ever fix that bug (legacy mostly only gets support for new X/kernel versions).
Comment 37 drago01 2014-05-13 06:11:32 UTC
(In reply to comment #36)
> (In reply to comment #35)
> > BTW, I've now tested this on two different nvidia GPUs, a NVS 135M (G86) and a
> > GTX 675MX (GK104) both with driver version 337.19 and it does run fine and
> > fixes the redrawing problems.
> > 
> > The weaker gpu sometimes (like once or twice a day) exhibits long stalls for a
> > few while but eventually recovers. When this happens I don't see any of the
> > warnings that you'd expect from the patch. James Jones said that this could be
> > a bug in the driver.
> > 
> > (In reply to comment #24)
> > > Something is wrong with this .. it seems to mess up cogls clip stack and I am
> > > getting tons of warnings:
> > > 
> > > (gnome-shell:29762): Cogl-WARNING **: ./driver/gl/cogl-clip-stack-gl.c:419: GL
> > > error (1281): Invalid value
> > >  Window manager warning: Log level 16:
> > > (compositor/x-to-gl-sync.c:220):sync_check_update_finished: runtime check
> > > failed: (status != GL_WAIT_FAILED)
> > 
> > I don't get these warnings on either system I tested. James also mentioned that
> > they had a bug report about this extension for Tesla generation GPUs. Isn't
> > yours one of these? Perhaps it's related to that?
> 
> Maybe need to check with James about that. If that is the case we should simply
> not do it on that GPU the bug that it tries to fix does not reproduce here it
> just makes things worse. Its not like it occasionally prints one warning its an
> endless stream of warnings. And given that support for those GPUS (anything pre
> Fermi) is going to move to legacy support soon I am not sure NVIDIA would ever
> fix that bug (legacy mostly only gets support for new X/kernel versions).

Alternatively we could do something like "if we hit that error condition n times in a row set x11_sync_object to FALSE".
Comment 38 Rui Matos 2014-05-13 13:58:49 UTC
Created attachment 276454 [details] [review]
MetaSyncRing: disable after a number of reboot attempts

If we have had to reboot this number of times, something is definitely
wrong and we're likely to just make things worse by continuing to try.

Let's err on the side of caution, disable ourselves and fallback to
the XSync() path in the compositor.
Comment 39 drago01 2014-05-13 16:18:02 UTC
(In reply to comment #38)
> Created an attachment (id=276454) [details] [review]
> MetaSyncRing: disable after a number of reboot attempts
> 
> If we have had to reboot this number of times, something is definitely
> wrong and we're likely to just make things worse by continuing to try.
> 
> Let's err on the side of caution, disable ourselves and fallback to
> the XSync() path in the compositor.

Yeah with this the warnings stop after the initial spam:

Window manager warning: Log level 16: (compositor/meta-sync-ring.c:233):meta_sync_check_update_finished: runtime check failed: (status != GL_WAIT_FAILED)
Window manager warning: MetaSyncRing: Timed out waiting for sync object.
Window manager warning: MetaSyncRing: Too many reboots -- disabling


And seems better than a GPU based hardcoding i.e when/if it gets fixed it should "just work".
Comment 40 drago01 2014-05-14 09:07:03 UTC
Review of attachment 276413 [details] [review]:

Looks good.
Comment 41 drago01 2014-05-14 09:07:46 UTC
Review of attachment 276414 [details] [review]:

OK.
Comment 42 drago01 2014-05-14 10:01:19 UTC
Review of attachment 276415 [details] [review]:

Code looks good cannot really verify whether it works ok or not because of the bug on my hw / cannot reproduce the glitches that it fixes.
Comment 43 drago01 2014-05-14 10:01:53 UTC
Review of attachment 276454 [details] [review]:

LG.
Comment 44 johan 2014-06-04 07:56:11 UTC
Haven't seen these patches being merged yet; will they make it into 3.14?
Comment 45 Rui Matos 2014-06-04 11:44:00 UTC
I've recently hit an issue, after upgrading to rawhide, that I believe is what drago01 reported in comment 24. 

glImportSync() errors out with GL_OUT_OF_MEMORY. I'm in touch with NVIDIA to get to the bottom of it, but we can't merge this yet.
Comment 46 drago01 2014-06-20 20:09:50 UTC
*** Bug 664858 has been marked as a duplicate of this bug. ***
Comment 47 Joran Martinière 2014-07-03 17:42:43 UTC
Any news concerning merging those patches?
Will it come with GNOME 3.14 or will there be a fix for 3.12?

Thanks in advance
Comment 48 Mihails Strasuns 2014-07-04 00:26:35 UTC
Any updates? This bug makes GNOME literally unusable on certain hardware.
Comment 49 t.jp 2014-07-04 09:01:35 UTC
(In reply to comment #48)
> Any updates? This bug makes GNOME literally unusable on certain hardware.

It helps to start nvidia-settings and set your card to "Prefer Maximum Performance" in the PowerMizer. I don't have any refreshing problems this way and it also makes the expose animation look smooth when hitting the Super-Key.
On my system Gnome only seems to act strange when my GTX 580 downclocks to 50 Mhz which is the lowest performance level.
Comment 50 Mihails Strasuns 2014-07-04 11:10:44 UTC
I have tried it and while it has greatly improved Gnome Shell animation and some of glitches in GTK applications, console ones I run in gnome-terminal still have redrawing issues - vim being my primary concern.
Comment 51 Rui Matos 2014-07-04 11:21:23 UTC
(In reply to comment #50)
> I have tried it and while it has greatly improved Gnome Shell animation and
> some of glitches in GTK applications, console ones I run in gnome-terminal
> still have redrawing issues - vim being my primary concern.

Right. Overriding the power management setting to always force the GPU to work at maximum frequency is just a work around that might improve things because it changes the timings and thus the race condition might tilt more often to hide the bug.

The patches here do fix the race properly but as I said, they are not working  in Fedora rawhide due to what seems to be a driver bug. I've sent all the info I could gather about it to NVIDIA and if/when they fix it we'll land the patches in mutter. This bug will be closed when that happens.
Comment 52 Mihails Strasuns 2014-07-04 11:24:49 UTC
Got it, thanks!
Comment 53 drago01 2014-07-04 11:59:33 UTC
Review of attachment 276415 [details] [review]:

::: src/compositor/meta-sync-ring.c
@@ +143,3 @@
+  for (i = 0; i < num_extensions; ++i)
+    {
+      const char *ext = meta_gl_get_stringi (GL_EXTENSIONS, i);

Don't do this ... Owen did the same thing for ARB_timers and then we noticed that this is the "GL3 way" we do not require OpenGL3 though so use the old glGetString to get the extension list.
Comment 54 Joran Martinière 2014-07-05 12:51:23 UTC
Hi,
I just posted on devtalk.nvidia in order to ask for a quick fix. If you will add your two cents...
Comment 56 Florian Müllner 2014-08-19 22:18:10 UTC
Comment on attachment 276413 [details] [review]
compositor: Sync X drawing only once per frame

Attachment 276413 [details] pushed as f55737e - compositor: Sync X drawing only once per frame
(Jasper pushed this a while ago, marking as such)
Comment 57 Joran Martinière 2014-10-11 12:17:50 UTC
Is there anything new about this bug?
I don't want to press anybody, it's just that nVidia's release notes are not very clear on whether they are working on this or not...
Comment 58 Rui Matos 2014-10-13 11:21:47 UTC
Driver 343.22 still errors out during initialization i.e. glImportSync() returns GL_OUT_OF_MEMORY. When that's fixed and we can test that the patch fixes the redrawing issue the bug will be closed.
Comment 59 Rui Matos 2014-10-20 11:10:33 UTC
*** Bug 738814 has been marked as a duplicate of this bug. ***
Comment 60 agv 2014-11-12 20:32:15 UTC
Hello, my question might be a little bit naive but I was wondering why exactly is mutter having this problem? Because other WMs do not seem to face similar issues. What is the big difference between "them" and mutter? 

I understand that there exist special synchronization primitive you guys are trying to work with, but are there any alternatives to that?
Comment 61 drago01 2014-11-12 20:43:22 UTC
(In reply to comment #60)
> Hello, my question might be a little bit naive but I was wondering why exactly
> is mutter having this problem? Because other WMs do not seem to face similar
> issues. What is the big difference between "them" and mutter? 

Every opengl based compositor is having this issue to some extent. NVIDIA developed the extensions when the issues first showed up with compiz. It is a race so it depends on timing conditions; for instance on my system I am not seeing any of those issues (GTX 285).

> I understand that there exist special synchronization primitive you guys are
> trying to work with, but are there any alternatives to that?

Not really no, unfortunately.
Comment 62 agv 2014-11-12 21:24:55 UTC
I see. I also wanted to point out my observation I've made once upon a time: having visible window on the screen that is updated every frame "fixes" the issue - i.e. even at the lowest performance level everything is smooth. Not sure if this is useful piece of information though - probably not.
Comment 63 Rui Matos 2015-03-06 15:04:46 UTC
Created attachment 298717 [details] [review]
sync-ring: Use the backend X connection to create fences

--

James pointed out to me that the GL_OUT_OF_MEMORY errors are just a
symptom of this patch trying to create the X fence on one X connection
(from the frontend) and using it on GL which is going through the
backend X connection, so basically a race which I didn't realize was
there when porting the 1st version of the patch which I did for a pre
frontend/backend split mutter to newer mutter versions.

This additional patch fixes that by using the backend X connection
everywhere in meta-sync-ring.c which is really the right thing to do
since this is a backend feature and, in fact, all of this code should
move to backends/x11/ at some point.

Currently compositor.c is processing events from the frontend though
so it's not easy to move this wholesale to the backend. In fact, I'm
not 100% sure that we are race free here (even in the XSync() path?)
because damage events come in from the frontend connection.

This should probably be re-worked for 3.18.

Anyway, I don't think this patch and also attachment 276454 [details] [review] make sense
in themselves since this feature was never pushed yet so I'm also
attaching a single patch with everything squashed. This one is just to
demonstrate what the GL_OUT_OF_MEMORY problem was.
Comment 64 Rui Matos 2015-03-06 15:12:07 UTC
Created attachment 298718 [details] [review]
compositor: Add support for GL_EXT_x11_sync_object

--

So, NVIDIA released driver version 346.47 recently which includes some
fixes for the hangs that I was seeing.

Unfortunately it seems those fixes are not enough at least on my
hardware and I'm still suffering hangs with this patch.

I'd like to ask anyone following this bug if you can give this a try
and report whether it works fine for you and which driver version and
GPU you're using. This patch should apply fine on either 3.14 or
master and you don't need any other patch.

For those using F21 or F22, you can just try installing one the RPMS:

http://koji.fedoraproject.org/koji/taskinfo?taskID=9148085 (F21)
http://koji.fedoraproject.org/koji/taskinfo?taskID=9156341 (F22)

Note that if your session hangs you might not be able to switch VTs so
you'll either have to hard reboot or connect through SSH and

$ killall -9 gnome-shell
Comment 65 agv 2015-03-06 16:37:46 UTC
(In reply to Rui Matos from comment #64)
> Created attachment 298718 [details] [review] [review]
> compositor: Add support for GL_EXT_x11_sync_object
> 
> --
> 
> So, NVIDIA released driver version 346.47 recently which includes some
> fixes for the hangs that I was seeing.
> 
> Unfortunately it seems those fixes are not enough at least on my
> hardware and I'm still suffering hangs with this patch.
> 
> I'd like to ask anyone following this bug if you can give this a try
> and report whether it works fine for you and which driver version and
> GPU you're using. This patch should apply fine on either 3.14 or
> master and you don't need any other patch.
> 
> For those using F21 or F22, you can just try installing one the RPMS:
> 
> http://koji.fedoraproject.org/koji/taskinfo?taskID=9148085 (F21)
> http://koji.fedoraproject.org/koji/taskinfo?taskID=9156341 (F22)
> 
> Note that if your session hangs you might not be able to switch VTs so
> you'll either have to hard reboot or connect through SSH and
> 
> $ killall -9 gnome-shell

I have just tested your patch. It seems to me that it solves the issue. Redraws are done without unnecessary stalls (However sometimes it feels like there is still some microscopic delay, but it is really *negligible*).  

I have GTX460 and 346.47 driver version, mutter 3.14.
Comment 66 t.jp 2015-03-06 18:16:22 UTC
(In reply to Rui Matos from comment #64)
> Created attachment 298718 [details] [review] [review]
> compositor: Add support for GL_EXT_x11_sync_object
> 
> --
> 
> So, NVIDIA released driver version 346.47 recently which includes some
> fixes for the hangs that I was seeing.
> 
> Unfortunately it seems those fixes are not enough at least on my
> hardware and I'm still suffering hangs with this patch.
> 
> I'd like to ask anyone following this bug if you can give this a try
> and report whether it works fine for you and which driver version and
> GPU you're using. This patch should apply fine on either 3.14 or
> master and you don't need any other patch.
> 
> For those using F21 or F22, you can just try installing one the RPMS:
> 
> http://koji.fedoraproject.org/koji/taskinfo?taskID=9148085 (F21)
> http://koji.fedoraproject.org/koji/taskinfo?taskID=9156341 (F22)
> 
> Note that if your session hangs you might not be able to switch VTs so
> you'll either have to hard reboot or connect through SSH and
> 
> $ killall -9 gnome-shell

I tested this on Arch Linux with Gnome 3.14.3 on my GTX 580 (346.47) and it works. When I first reported this bug (on Gnome 3.10/3.12) it was pretty severe, often characters would not show up at all unless I clicked somewhere. In the meantime the problem has become less pronounced. It still happened when I wait until the card downclocks to 50 Mhz and then rename a folder and type very fast. The last character would only show up after some 100-200 ms. Now I can't reproduce this anymore with this patch.
Comment 67 drago01 2015-03-06 18:27:24 UTC
Review of attachment 298718 [details] [review]:

Code looks good to me ... unfortunately I can not test it any more because my card died but tests by other seem to confirm that it works.
Comment 68 t.jp 2015-03-07 10:45:11 UTC
I think I ran into some problems with this patch. I usually lock my desktop and leave the PC running. When I came back I tried to swipe up the lockscreen but nothing happened. I could move the mouse though. The image got corrupted and I switched to a different tty (which took very long). When I switched back to tty1 I saw my password screen to unlock my session. This happened two times since I applied this patch. This morning I rebooted my PC, because the session wouldn't unlock and switching tty's didn't work but Gnome would not start up. I couldn't find anything in the journals. So I did "pacman -Suu" to automatically downgrade to the unpatched version of Mutter and everything works now.
Comment 69 Joran Martinière 2015-03-07 12:20:07 UTC
(In reply to t.jp from comment #66)
> (In reply to Rui Matos from comment #64)
> > Created attachment 298718 [details] [review] [review] [review]
> > compositor: Add support for GL_EXT_x11_sync_object
> > 
> > --
> > 
> > So, NVIDIA released driver version 346.47 recently which includes some
> > fixes for the hangs that I was seeing.
> > 
> > Unfortunately it seems those fixes are not enough at least on my
> > hardware and I'm still suffering hangs with this patch.
> > 
> > I'd like to ask anyone following this bug if you can give this a try
> > and report whether it works fine for you and which driver version and
> > GPU you're using. This patch should apply fine on either 3.14 or
> > master and you don't need any other patch.
> > 
> > For those using F21 or F22, you can just try installing one the RPMS:
> > 
> > http://koji.fedoraproject.org/koji/taskinfo?taskID=9148085 (F21)
> > http://koji.fedoraproject.org/koji/taskinfo?taskID=9156341 (F22)
> > 
> > Note that if your session hangs you might not be able to switch VTs so
> > you'll either have to hard reboot or connect through SSH and
> > 
> > $ killall -9 gnome-shell
> 
> I tested this on Arch Linux with Gnome 3.14.3 on my GTX 580 (346.47) and it
> works. When I first reported this bug (on Gnome 3.10/3.12) it was pretty
> severe, often characters would not show up at all unless I clicked
> somewhere. In the meantime the problem has become less pronounced. It still
> happened when I wait until the card downclocks to 50 Mhz and then rename a
> folder and type very fast. The last character would only show up after some
> 100-200 ms. Now I can't reproduce this anymore with this patch.

I can't get to compile it, it complains about
[code]
./.libs/libmutter.so: undefined reference to `meta_sync_ring_insert_wait'
./.libs/libmutter.so: undefined reference to `meta_sync_ring_handle_event'
./.libs/libmutter.so: undefined reference to `meta_sync_ring_after_frame'
./.libs/libmutter.so: undefined reference to `meta_sync_ring_destroy'
./.libs/libmutter.so: undefined reference to `meta_sync_ring_init'
[/code]
I absolutely don't know what's gone wrong...
Could you post your PKGBUILD?
Comment 70 Rui Matos 2015-03-07 14:03:51 UTC
(In reply to t.jp from comment #68)
> I think I ran into some problems with this patch. I usually lock my desktop
> and leave the PC running. When I came back I tried to swipe up the
> lockscreen but nothing happened. I could move the mouse though. The image
> got corrupted and I switched to a different tty (which took very long). When
> I switched back to tty1 I saw my password screen to unlock my session. This
> happened two times since I applied this patch.

Your description somewhat matches my experience except for image corruption, that's something I've never seen with this code. Can you describe how it looks? Or even better, can you reproduce and take a picture?

Also, please post your GPU and driver version.
Comment 71 Rui Matos 2015-03-07 14:06:55 UTC
(In reply to Joran Martinière from comment #69)
> I can't get to compile it, it complains about
> [code]
> ./.libs/libmutter.so: undefined reference to `meta_sync_ring_insert_wait'
> ./.libs/libmutter.so: undefined reference to `meta_sync_ring_handle_event'
> ./.libs/libmutter.so: undefined reference to `meta_sync_ring_after_frame'
> ./.libs/libmutter.so: undefined reference to `meta_sync_ring_destroy'
> ./.libs/libmutter.so: undefined reference to `meta_sync_ring_init'
> [/code]

If you're applying the patch on a tarball you need to run 'autoreconf -i -f' before 'configure'.
Comment 72 t.jp 2015-03-07 14:54:26 UTC
(In reply to Joran Martinière from comment #69)
> (In reply to t.jp from comment #66)
> > (In reply to Rui Matos from comment #64)
> > > Created attachment 298718 [details] [review] [review] [review] [review]
> > > compositor: Add support for GL_EXT_x11_sync_object
> > > 
> > > --
> > > 
> > > So, NVIDIA released driver version 346.47 recently which includes some
> > > fixes for the hangs that I was seeing.
> > > 
> > > Unfortunately it seems those fixes are not enough at least on my
> > > hardware and I'm still suffering hangs with this patch.
> > > 
> > > I'd like to ask anyone following this bug if you can give this a try
> > > and report whether it works fine for you and which driver version and
> > > GPU you're using. This patch should apply fine on either 3.14 or
> > > master and you don't need any other patch.
> > > 
> > > For those using F21 or F22, you can just try installing one the RPMS:
> > > 
> > > http://koji.fedoraproject.org/koji/taskinfo?taskID=9148085 (F21)
> > > http://koji.fedoraproject.org/koji/taskinfo?taskID=9156341 (F22)
> > > 
> > > Note that if your session hangs you might not be able to switch VTs so
> > > you'll either have to hard reboot or connect through SSH and
> > > 
> > > $ killall -9 gnome-shell
> > 
> > I tested this on Arch Linux with Gnome 3.14.3 on my GTX 580 (346.47) and it
> > works. When I first reported this bug (on Gnome 3.10/3.12) it was pretty
> > severe, often characters would not show up at all unless I clicked
> > somewhere. In the meantime the problem has become less pronounced. It still
> > happened when I wait until the card downclocks to 50 Mhz and then rename a
> > folder and type very fast. The last character would only show up after some
> > 100-200 ms. Now I can't reproduce this anymore with this patch.
> 
> I can't get to compile it, it complains about
> [code]
> ./.libs/libmutter.so: undefined reference to `meta_sync_ring_insert_wait'
> ./.libs/libmutter.so: undefined reference to `meta_sync_ring_handle_event'
> ./.libs/libmutter.so: undefined reference to `meta_sync_ring_after_frame'
> ./.libs/libmutter.so: undefined reference to `meta_sync_ring_destroy'
> ./.libs/libmutter.so: undefined reference to `meta_sync_ring_init'
> [/code]
> I absolutely don't know what's gone wrong...
> Could you post your PKGBUILD?

Download your source files for this package:
$ abs extra/mutter

cd into your ABS directory.

Replace the PKGBUILD with this one: https://gist.github.com/blackout24/28b09920805770079abd

Place this file called mutter.patch into the same directory of your PKGBUILD and mutter.install file, so it does the patching on the fly.
https://gist.github.com/blackout24/67719743a4dd80cf8290

$ makepkg -si

@Rui Matos

It looked like when the lookscreen swipes upwards to revel the screen for password entry, but parts of it stayed at the bottom. On the left side parts of the dash appeared, but I could not interact with anything until I switched to a different tty and back, which took very long. This is on my GTX 580 with 346.47 driver.
Comment 73 ebkalderon 2015-03-08 06:49:11 UTC
I would like to concur with most of the reported behavior in comment #3 (https://bugzilla.gnome.org/show_bug.cgi?id=728464#c3), especially in regards to Gedit's cursor not redrawing correctly while editing text.

I have GTK+ 3.14.9-1 on Arch Linux x86-64, and I am using version 346.47-3 of the Nvidia driver.
Comment 74 Michael Chapman 2015-03-08 08:46:47 UTC
(In reply to Rui Matos from comment #64)
> Created attachment 298718 [details] [review] [review]
> compositor: Add support for GL_EXT_x11_sync_object
> 
> --
> 
> So, NVIDIA released driver version 346.47 recently which includes some
> fixes for the hangs that I was seeing.
> 
> Unfortunately it seems those fixes are not enough at least on my
> hardware and I'm still suffering hangs with this patch.

Same here. I used your F21 package, and I'm running the 346.47 NVIDIA driver.

So far I've only experienced hangs while running the MythTV frontend, and I've found that the hang can be broken by killing mythfrontend. Sometimes things are *just* responsive enough that I can switch to another VT to do that, but more often than not I need to SSH in from another workstation.
Comment 75 t.jp 2015-03-08 11:28:57 UTC
(In reply to Rui Matos from comment #70)
> (In reply to t.jp from comment #68)
> > I think I ran into some problems with this patch. I usually lock my desktop
> > and leave the PC running. When I came back I tried to swipe up the
> > lockscreen but nothing happened. I could move the mouse though. The image
> > got corrupted and I switched to a different tty (which took very long). When
> > I switched back to tty1 I saw my password screen to unlock my session. This
> > happened two times since I applied this patch.
> 
> Your description somewhat matches my experience except for image corruption,
> that's something I've never seen with this code. Can you describe how it
> looks? Or even better, can you reproduce and take a picture?
> 
> Also, please post your GPU and driver version.

I have not found a reliable way to reproduce this. When I press Ctrl+L and wait for the monitor to go into standby and wake it up after 5 seconds everything is fine. I did the same and waited 30 minutes and everything was fine. Last night I locked my screen and just woke it up again and the image is corrupted, but not as badly as it once was. You can see that the Dropbox icon is stretchered and there seems to be a shadow of the clock stuck on the screen.  
http://imgur.com/lBF8UQl  

Again I couldn't unlock it with my mouse. The mouse cursor was fine and freely moved around. When I switched to TTY2 and back I was at my password entry screen. I entered my password, but the cursor didn't move. I swtiched ttys and after some time my desktop appeared and everything is fine.
Comment 76 Joran Martinière 2015-03-09 16:13:14 UTC
(In reply to Rui Matos from comment #71)
> If you're applying the patch on a tarball you need to run 'autoreconf -i -f'
> before 'configure'.
This did the trick, thanks.

To t.jp@gmx.de:
Thank you for the complete PKGBUILD and all, in the end I had only forgot the whole autoreconf thing in my own PKGBUILD.

About the patch:
I took some time to test it, and in most cases it gets completely rid of the bug on Arch Linux, kernel 3.18.6, GNOME 3.14.3, nvidia 346.47 and Asus GTX 660. 

However, I got the lock-screen error too, and when I switched to TTY2 and then back to my X session, GNOME gave me the "Something went wrong" screen and I had to log out.

I also noticed that some graphical glitches still happen in LibreOffice 4.4.1 on some UI elements, mostly the "Font" and "Style" tool-bar menus, as well the OK/Apply/Cancel buttons at the bottom of dialogue windows.

Looks like we're getting closer to a solution to this problem, thank you very much for your commitment.
Comment 77 Joran Martinière 2015-03-17 21:33:33 UTC
Today I experienced the lock-screen bug 3 times in a row.
I had this message on TTY2:
NVRM: Xid (PCI:0000:01:00): 31, Ch 00000003, engmask 00000101, intr 10000000

Don't know if it may prove useful, I never had this message before.
Comment 78 Joran Martinière 2015-04-10 16:52:43 UTC
With GNOME 3.16 now available in Arch repos, I could try the patch with mutter 3.16

The screen freezing problem is still present.
I found a reliable way to trigger the bug, sadly it uses proprietary software.
When I launch a game of Counter Strike: Global Offensive and then try to switch to desktop, with the patch it freezes immediately. It seems that the freeze gets stronger with time, i.e if it just starts to freeze I can switch to tty2 without waiting too much, but if I wait some more, switching will become very tedious.

I hope you'll find some helpful information and finish the patch (I'm really looking forward to it...)
Comment 79 Rui Matos 2015-04-10 17:04:53 UTC
(In reply to Joran Martinière from comment #78)
> I hope you'll find some helpful information and finish the patch (I'm really
> looking forward to it...)

There are still bugs in the driver and nvidia is aware of it.
Comment 80 Rui Matos 2015-05-25 20:26:52 UTC
*** Bug 749863 has been marked as a duplicate of this bug. ***
Comment 81 dblade 2015-05-30 08:33:41 UTC
Kernel: 4.0.4-2-ARCH
nVidia 560: Version 352.09
Gnome:  3.16.2

While I edit text my GPU goes to minimum (50MHz) and I cannot edit Text with a Linux machine :) Not good since I do not see the same issue with other GEs. Ofc I could always set my GPU to the Maximum usage to "fix" it. Then everything should be fine except my power usage and noise. Is it possible that something requires more GPU power to draw the changes, but does not request that from the Video Card?
Comment 82 Gwendal 2015-07-07 09:59:21 UTC
Same issues here: editing text becomes impossible when the GPU goes to minimum. Most of the problem disappear when I set my GPU to the Maximum usage, but some of the flickering remains.
Comment 83 t.jp 2015-07-07 15:22:34 UTC
I added "export CLUTTER_PAINT=continuous-redraw" to my /etc/profile this has the same effect for me as setting the card to performance-mode, but doesn't burn as much power and keeps the PC quiet. There do seem to be some downsides though. On the lock screen the clock numbers are a little bit cropped on the right. Another person also said that it makes the system unstable for him so use with caution.
Comment 84 Rui Matos 2015-07-17 16:42:34 UTC
I chatted with nvidia today and they are actively looking at this again. We agreed that, since the hangs apparently only happen on some GPUs, we should try and include this patch for now in master so that it gets wider real world testing.

If before the 3.18 release we hear that the problem is too widespread we'll revert the patch. Does anyone oppose this plan?
Comment 85 drago01 2015-07-18 06:41:26 UTC
(In reply to Rui Matos from comment #84)
> I chatted with nvidia today and they are actively looking at this again. We
> agreed that, since the hangs apparently only happen on some GPUs, we should
> try and include this patch for now in master so that it gets wider real
> world testing.
> 
> If before the 3.18 release we hear that the problem is too widespread we'll
> revert the patch. Does anyone oppose this plan?

Sounds good to me. We need to somehow make progress on this.
Comment 86 İsmail Dönmez 2015-07-29 10:34:45 UTC
(In reply to Rui Matos from comment #84)
> I chatted with nvidia today and they are actively looking at this again. We
> agreed that, since the hangs apparently only happen on some GPUs, we should
> try and include this patch for now in master so that it gets wider real
> world testing.
> 
> If before the 3.18 release we hear that the problem is too widespread we'll
> revert the patch. Does anyone oppose this plan?

Any news on this? Is the patch on master yet?
Comment 87 drago01 2015-07-29 11:28:06 UTC
(In reply to İsmail Dönmez from comment #86)
> (In reply to Rui Matos from comment #84)
> > I chatted with nvidia today and they are actively looking at this again. We
> > agreed that, since the hangs apparently only happen on some GPUs, we should
> > try and include this patch for now in master so that it gets wider real
> > world testing.
> > 
> > If before the 3.18 release we hear that the problem is too widespread we'll
> > revert the patch. Does anyone oppose this plan?
> 
> Any news on this? Is the patch on master yet?

Nvidia has released a driver which is supposed to fix the bug ... so it should land "soon" once its confirmed.

So should be fixed when 3.18 is out.
Comment 88 İsmail Dönmez 2015-07-29 11:34:12 UTC
(In reply to drago01 from comment #87)
> (In reply to İsmail Dönmez from comment #86)
> > (In reply to Rui Matos from comment #84)
> > > I chatted with nvidia today and they are actively looking at this again. We
> > > agreed that, since the hangs apparently only happen on some GPUs, we should
> > > try and include this patch for now in master so that it gets wider real
> > > world testing.
> > > 
> > > If before the 3.18 release we hear that the problem is too widespread we'll
> > > revert the patch. Does anyone oppose this plan?
> > 
> > Any news on this? Is the patch on master yet?
> 
> Nvidia has released a driver which is supposed to fix the bug ... so it
> should land "soon" once its confirmed.

Probably not released for the legacy gfx cards, because the latest drivers I see is from January 2015, see http://www.nvidia.com/Download/driverResults.aspx/81761/en-us

> So should be fixed when 3.18 is out.
Comment 89 t.jp 2015-07-29 12:10:38 UTC
(In reply to drago01 from comment #87)
> (In reply to İsmail Dönmez from comment #86)
> > (In reply to Rui Matos from comment #84)
> > > I chatted with nvidia today and they are actively looking at this again. We
> > > agreed that, since the hangs apparently only happen on some GPUs, we should
> > > try and include this patch for now in master so that it gets wider real
> > > world testing.
> > > 
> > > If before the 3.18 release we hear that the problem is too widespread we'll
> > > revert the patch. Does anyone oppose this plan?
> > 
> > Any news on this? Is the patch on master yet?
> 
> Nvidia has released a driver which is supposed to fix the bug ... so it
> should land "soon" once its confirmed.
> 
> So should be fixed when 3.18 is out.

So the latest driver is supposed to fix the hangs and image corruptions on a system where the patch has been applied to mutter, right? The latest driver is not supposed to get rid of the problem (the initial redrawing problems) all together on a vanilla Gnome install with the affected cards? Just so that I don't mix this up when I test it out again.
Comment 90 Rui Matos 2015-07-29 12:57:05 UTC
Here's the current status:

Version 352.30 released yesterday contains some fixes but not yet the full fix for the hang uncovered by this patch. I'm told that's still going through their QA process.

I tried that driver and can confirm that there are still hangs with it. But, since the hang fix should be out soon, I decided to not merge the patch yet and instead include it when the fix comes out while blacklisting all driver versions previous to that one.

Note that I don't know if nvidia will fix the issue on their legacy branch.

(In reply to t.jp from comment #89)
> So the latest driver is supposed to fix the hangs and image corruptions on a
> system where the patch has been applied to mutter, right?

Unfortunately, not yet this one.

> The latest driver
> is not supposed to get rid of the problem (the initial redrawing problems)
> all together on a vanilla Gnome install with the affected cards?

The redrawing problem needs this patch to get fixed. In turn, this patch needs the driver to be fixed so that we don't hang the whole desktop.
Comment 91 Aaron Plattner 2015-08-05 00:13:53 UTC
Created attachment 308756 [details] [review]
Fix race between glWaitSync and X fence reset

Hi Rui,

I spent some time trying this patch out and ran into what I hope is the same hang you were looking at and attached a mutter patch that should resolve it.

Author: Aaron Plattner <aplattner@nvidia.com>
Date:   Mon Aug 3 21:15:15 2015 -0700

    compositor: Fix GL_EXT_x11_sync_object race condition
    
    The compositor maintains a ring of shared fences with the X server in order to
    properly synchronize rendering between the X server and the compositor's GPU
    channel.  When all of the fences have been used, the compositor needs to reset
    one so that it can be reused.  It does this by first waiting on the CPU for the
    fence to become triggered, and then sending a request to the X server to reset
    the fence.
    
    If the compositor's GPU channel is busy processing other work (e.g. the desktop
    switcher animation), then the X server may process the reset request before the
    GPU has consumed the fence.  This causes the GPU channel to hang.
    
    Fix the problem by having the compositor's GPU channel trigger its own fence
    after waiting for the X server's fence.  Wait for that fence on the CPU before
    sending the reset request to the X server.  This ensures that the GPU has
    consumed the X11 fence before the server resets it.
    
    Signed-off-by: Aaron Plattner <aplattner@nvidia.com>

 src/compositor/meta-sync-ring.c | 11 ++++++++++-
 1 file changed, 10 insertions(+), 1 deletion(-)
Comment 92 İsmail Dönmez 2015-08-05 05:32:25 UTC
(In reply to Aaron Plattner from comment #91)
> Created attachment 308756 [details] [review] [review]
> Fix race between glWaitSync and X fence reset

While you are here, can you please confirm if the old drivers will get a similar fix? This is really a grave issue and I really appreciate an answer here.
Comment 93 Aaron Plattner 2015-08-05 05:53:24 UTC
Hi İsmail. My change is a Mutter patch on top of Rui's patch "compositor: Add support for GL_EXT_x11_sync_object", so I don't think any driver changes are needed, at least for this particular hang.
Comment 94 İsmail Dönmez 2015-08-05 07:42:17 UTC
(In reply to Aaron Plattner from comment #93)
> Hi İsmail. My change is a Mutter patch on top of Rui's patch "compositor:
> Add support for GL_EXT_x11_sync_object", so I don't think any driver changes
> are needed, at least for this particular hang.

Hi Aaron, Rui mentioned (comment #90) that nvidia driver update 352.30 contains a fix for the root of this issue which is needed alongside the mutter patch. So I was wondering if that particular fix on the driver side will be backported to the 340.* series of the driver.

Thanks a lot for your answer!
Comment 95 Joran Martinière 2015-08-05 10:13:30 UTC
Aaron Plattner's patch seems to fix the issue on my system.
I'll keep you informed, many thanks for this!
Comment 96 İsmail Dönmez 2015-08-05 10:18:18 UTC
(In reply to Joran Martinière from comment #95)
> Aaron Plattner's patch seems to fix the issue on my system.
> I'll keep you informed, many thanks for this!

On top of Rui's patch I guess?
Comment 97 İsmail Dönmez 2015-08-05 10:42:37 UTC
I tested the patch along with nvidia 340.76 legacy driver and two observations:

1. Overall it seems to be little bit slower
2. I experienced rendering problems with mutt on gnome-terminal, when you scroll on the list of messages a trail of grey boxes show up on the left. They are gone when you force mutt to refresh the screen.

But the original screen update problem seems to be gone.
Comment 98 Rui Matos 2015-08-05 13:52:02 UTC
Review of attachment 308756 [details] [review]:

Thanks for looking at this Aaron.

::: src/compositor/meta-sync-ring.c
@@ +198,2 @@
   meta_gl_wait_sync (self->glsync, 0, GL_TIMEOUT_IGNORED);
+  self->consumed = meta_gl_fence_sync (GL_SYNC_GPU_COMMANDS_COMPLETE, 0);

Makes sense, but just to be sure: the order of these two calls matters, right?

@@ +214,3 @@
       break;
     case META_SYNC_STATE_WAITING:
+      status = meta_gl_client_wait_sync (self->consumed, 0, timeout);

Again, to be sure I understand how this works: we're guaranteed here that if the 'consumed' fence is signaled at this point, then the 'glsync' fence is also signaled and it's ok to go ahead and XSyncResetFence() ?

@@ +217,3 @@
       if (status == GL_ALREADY_SIGNALED || status == GL_CONDITION_SATISFIED)
         self->state = META_SYNC_STATE_DONE;
+      meta_gl_delete_sync (self->consumed);

meta_sync_check_update_finished() might be called twice in a row if gl_client_wait_sync() isn't successful the first time so this delete must conditional on the if above.
Comment 99 Rui Matos 2015-08-05 13:58:40 UTC
Created attachment 308790 [details] [review]
fixup! compositor: Fix GL_EXT_x11_sync_object race condition

--

This fixes my comment above regarding the sync deletion and also
renames both the existing 'glsync' fence and the 'consumed' fence to
something that should make it easier to understand (at least IMO)
what's going on and why there was still a race with the single X/GL
fence.

Aaron, if you agree I'll squash this diff with your patch and then
push everything.
Comment 100 Rui Matos 2015-08-05 14:06:29 UTC
(In reply to İsmail Dönmez from comment #94)
> Hi Aaron, Rui mentioned (comment #90) that nvidia driver update 352.30
> contains a fix for the root of this issue which is needed alongside the
> mutter patch.

I said that because I didn't realize that there was still a race in the existing patch and that such a race could cause a hang in the driver, i.e. I assumed the driver had a bug.

In fact, I'm still not sure *what* hangs: Mutter? The X server? The hang is outside mutter's control AFAICT.

In any case, since there was still a legitimate race and fixing that race avoids the hang, we're good.

> So I was wondering if that particular fix on the driver side
> will be backported to the 340.* series of the driver.

I've stress tested this now for more than a day on the 340.76 (legacy) driver and it's working fine.

(In reply to İsmail Dönmez from comment #97)
> 2. I experienced rendering problems with mutt on gnome-terminal, when you
> scroll on the list of messages a trail of grey boxes show up on the left.
> They are gone when you force mutt to refresh the screen.

Are you able to provide us a picture of this? I haven't seen any rendering issues and I use gnome-terminal intensively.
Comment 101 Aaron Plattner 2015-08-05 17:53:46 UTC
Review of attachment 308790 [details] [review]:

Thanks Rui, this looks good to me.
Comment 102 Aaron Plattner 2015-08-05 17:58:11 UTC
Rui, yes, the order matters.  The second fence won't get signaled until the GPU advances past the WaitSync on the shared X11 fence, so you know that if the second fence is signaled, that the first fence was signaled too *and* the GPU channel advanced past the WaitSync.

The hang that occurred with this race is in Mutter's GPU channel.  It gets stuck on the WaitSync because the fence it's waiting for was already reset by the X server.  Eventually, the GPU watchdog kicks in an kills the compositor's GPU channel and OpenGL tries to recover it.  If the compositor were using GL_ARB_robustness, I think this would have been delivered as one of the *_CONTEXT_RESET_ARB events, but I didn't verify that.  It's essentially equivalent to the hang you get if you try to run an infinite loop pixel shader.
Comment 103 İsmail Dönmez 2015-08-06 06:56:39 UTC
Created attachment 308838 [details]
Rendering problem with gnome-terminall/mutt combo

Notice the rendering problem on the left, those are left-overs from scrolling. And forcing mutt to refresh with CTRL-L fixes the issue.
Comment 104 İsmail Dönmez 2015-08-06 06:58:57 UTC
Created attachment 308839 [details]
Rendering problem with gnome-terminall/mutt combo

Notice the grey boxes on the left. Forcing mutt to refresh with CTRL-L fixes the problem.
Comment 105 Rui Matos 2015-08-07 15:32:23 UTC
(In reply to İsmail Dönmez from comment #104)
> Notice the grey boxes on the left. Forcing mutt to refresh with CTRL-L fixes
> the problem.

Does this happen only with these patches applied? And does it happen with any other application?

In any case, please file a new bug for this issue. This one has grown too much already and I'd like to close it since the original issue seems finally fixed.
Comment 106 İsmail Dönmez 2015-08-07 15:39:25 UTC
(In reply to Rui Matos from comment #105)
> (In reply to İsmail Dönmez from comment #104)
> > Notice the grey boxes on the left. Forcing mutt to refresh with CTRL-L fixes
> > the problem.
> 
> Does this happen only with these patches applied? And does it happen with
> any other application?
> 
> In any case, please file a new bug for this issue. This one has grown too
> much already and I'd like to close it since the original issue seems finally
> fixed.

Just merge the final fix into mutter.git and I'll report a new bug if necessary. Thanks for your work on this.
Comment 107 Rui Matos 2015-08-07 15:43:41 UTC
Attachment 298718 [details] pushed as 39763d4 - compositor: Add support for GL_EXT_x11_sync_object
Comment 108 Rui Matos 2015-08-11 17:37:38 UTC
Seems like we still have missed redraws in some cases. Patch incoming
Comment 109 Rui Matos 2015-08-11 17:38:02 UTC
Created attachment 309084 [details] [review]
x11: Move damage handling to the backend

Since mutter has two X connections and does damage handling on the
frontend while fence triggering is done on the backend, we have a race
between XDamageSubtract() and XSyncFenceTrigger() causing missed
redraws in the GL_EXT_X11_sync_object path.

If the fence trigger gets processed first by the server, any client
drawing that happens between that and the damage subtract being
processed and is completely contained in the last damage event box
that mutter got, won't be included in the current frame nor will it
cause a new damage event.

A simple fix for this would be XSync()ing on the frontend connection
after doing all the damage subtracts but that would add a round trip
on every frame again which defeats the asynchronous design of X
fences.

By moving all our damage handling to the backend we automatically get
the right ordering between damage subtracts and fence triggers.
Comment 110 Jasper St. Pierre (not reading bugmail) 2015-08-11 18:30:01 UTC
Review of attachment 309084 [details] [review]:

er, doesn't this mean that we won't properly get damage events for Xwayland windows?

We show windows from the front-end on the backend, thus we have to listen to events for windows on the front-end...
Comment 111 Rui Matos 2015-08-12 09:58:49 UTC
(In reply to Jasper St. Pierre from comment #110)
> Review of attachment 309084 [details] [review] [review]:
> 
> er, doesn't this mean that we won't properly get damage events for Xwayland
> windows?

This is done through wl_surface.damage requests AFAICT. I did test this patch on a wayland session and it seemed to work.

> We show windows from the front-end on the backend, thus we have to listen to
> events for windows on the front-end...

In a X only compositor this distinction is unfortunately not that clear...
Comment 112 Rui Matos 2015-08-12 14:28:30 UTC
Created attachment 309156 [details] [review]
compositor: Handle fences in the frontend X connection

--

That said, moving fences to the frontend results in a much simpler
patch which seems to work equally well so perhaps this is a better
approach.

The only drawback is that we need to roundtrip for each fence creation
but that's a one time only thing at startup so not much of an issue
IMO.
Comment 113 Jasper St. Pierre (not reading bugmail) 2015-08-12 21:23:51 UTC
Review of attachment 309156 [details] [review]:

This looks good to me.
Comment 114 Rui Matos 2015-08-13 12:20:37 UTC
Pushing an amended version that does a single XSync() instead of one
per fence.

Attachment 309156 [details] pushed as 299ed42 - compositor: Handle fences in the frontend X connection
Comment 115 İsmail Dönmez 2015-08-13 12:34:12 UTC
Can we possibly get a 3.16 backport for these fixes?
Comment 116 Rui Matos 2015-08-13 12:57:03 UTC
(In reply to İsmail Dönmez from comment #115)
> Can we possibly get a 3.16 backport for these fixes?

Yeah, why not?

   8b75b41..461f193  gnome-3-14 -> gnome-3-14
   833c6e2..916070c  gnome-3-16 -> gnome-3-16

I don't know of any plans to do further formal releases from those branches though.
Comment 117 İsmail Dönmez 2015-08-13 13:00:00 UTC
(In reply to Rui Matos from comment #116)
> (In reply to İsmail Dönmez from comment #115)
> > Can we possibly get a 3.16 backport for these fixes?
> 
> Yeah, why not?
> 
>    8b75b41..461f193  gnome-3-14 -> gnome-3-14
>    833c6e2..916070c  gnome-3-16 -> gnome-3-16
> 
> I don't know of any plans to do further formal releases from those branches
> though.

Thanks a lot!
Comment 118 Justin Garrison 2015-09-06 06:20:23 UTC
> Version 352.30 released yesterday contains some fixes but not yet the full fix for the hang uncovered by this patch.

If 352.30 contains a partial fix do we know what version should have the full fix? I saw 355.11 came out this week but I didn't see any reference to the bug being fixed in this changelog nor the 355.06 beta release. http://www.nvidia.com/Download/driverResults.aspx/90393/en-us
Comment 119 t.jp 2015-09-06 11:20:39 UTC
(In reply to Justin Garrison from comment #118)
> > Version 352.30 released yesterday contains some fixes but not yet the full fix for the hang uncovered by this patch.
> 
> If 352.30 contains a partial fix do we know what version should have the
> full fix? I saw 355.11 came out this week but I didn't see any reference to
> the bug being fixed in this changelog nor the 355.06 beta release.
> http://www.nvidia.com/Download/driverResults.aspx/90393/en-us

If you have 352.30 or newer and a patched mutter this issue will be completely solved for you. I haven't had any hangs since I patched it weeks ago.

See here:
https://devtalk.nvidia.com/default/topic/729908/linux/-gt-334-21-redrawing-problems-in-gnome-3-10-3-12-gtx-580/7/
Comment 120 Rui Matos 2015-09-07 11:25:15 UTC
(In reply to Justin Garrison from comment #118)
> If 352.30 contains a partial fix do we know what version should have the
> full fix? I saw 355.11 came out this week but I didn't see any reference to
> the bug being fixed in this changelog nor the 355.06 beta release.

I'm sorry that I mislead people with that remark. I was assuming that the bug was in the driver because I couldn't see any evidence otherwise.

So, to be clear, there was *no* bug in the driver causing this. In fact, current mutter (with these patches applied) is working fine even with the older 340.xx driver branch.