Bug 745734 – Closing an app may crash gnome wayland session

After an evaluation, GNOME has moved from Bugzilla to GitLab. Learn more about GitLab.
No new issues can be reported in GNOME Bugzilla anymore.
To report an issue in a GNOME project, go to GNOME GitLab.
Do not go to GNOME Gitlab for: Bluefish, Doxygen, GnuCash, GStreamer, java-gnome, LDTP, NetworkManager, Tomboy.

Bug 745734 - Closing an app may crash gnome wayland session


Summary:	Closing an app may crash gnome wayland session


Status:	RESOLVED FIXED

Product:	mutter
Classification:	Core
Component:	wayland
Version:	git master
Hardware:	Other Linux

Importance:	Normal normal
Target Milestone:	---
Assigned To:	mutter-maint
QA Contact:	mutter-maint

URL:
Whiteboard:

Depends on:
Blocks:

Reported:	2015-03-06 12:12 UTC by Vadim Rutkovsky
Modified:	2015-03-16 17:06 UTC

See Also:
GNOME target:	3.16
GNOME version:	---

Attachments
wayland: Protect against unordered destruction of surface resources (3.24 KB, patch) 2015-03-10 12:55 UTC, Carlos Garnacho	none	Details \| Review
I tried the destructor callback approach, but the extra wl_listeners and (1.56 KB, patch) 2015-03-11 16:03 UTC, Carlos Garnacho	none	Details \| Review

Description Vadim Rutkovsky 2015-03-06 12:12:25 UTC

QXL driver, continuous 20150306.2
gnome-shell 3.15.91-12-g986b14fb1b0d70d01adef44e5358ec72407f1a93
mutter 3.15.91-4-g94c3c8f412ce997f1ff5fadb6f9a78f7ba5891c3

Start gda-browser-5.0 and click Cancel button / Quit from gapp menu

+ Trace 234792

Thread 12 (Thread 0x7f4d67fff700 (LWP 1738))

#0 syscall
at ../sysdeps/unix/sysv/linux/x86_64/syscall.S line 38
#1 g_cond_wait_until
at ../../glib/gthread-posix.c line 1437
#2 g_async_queue_pop_intern_unlocked
at ../../glib/gasyncqueue.c line 422
#3 g_async_queue_timeout_pop
at ../../glib/gasyncqueue.c line 543
#4 g_thread_pool_wait_for_new_pool
at ../../glib/gthreadpool.c line 167
#5 g_thread_pool_thread_proxy
at ../../glib/gthreadpool.c line 364
#6 g_thread_proxy
at ../../glib/gthread.c line 764
#7 start_thread
at pthread_create.c line 313
#8 clone
at ../sysdeps/unix/sysv/linux/x86_64/clone.S line 109

Thread 1 (Thread 0x7f4daaffb980 (LWP 1435))

#0 meta_wayland_compositor_destroy_frame_callbacks
at ../../src/wayland/meta-wayland.c line 216
#1 xdg_surface_destructor
at ../../src/wayland/meta-wayland-surface.c line 745
#2 destroy_resource
at ../src/wayland-server.c line 532
#3 for_each_helper
at ../src/wayland-util.c line 356
#4 wl_map_for_each
at ../src/wayland-util.c line 362
#5 wl_client_destroy
at ../src/wayland-server.c line 670
#6 wl_client_connection_data
at ../src/wayland-server.c line 245
#7 wl_event_loop_dispatch
at ../src/event-loop.c line 419
#8 wayland_event_source_dispatch
at ../../src/wayland/meta-wayland.c line 85
#9 g_main_dispatch
at ../../glib/gmain.c line 3122
#10 g_main_context_dispatch
at ../../glib/gmain.c line 3737
#11 g_main_context_iterate
at ../../glib/gmain.c line 3808
#12 g_main_loop_run
at ../../glib/gmain.c line 4002
#13 meta_run
at ../../src/core/main.c line 437
#14 main
at ../../src/main.c line 463

Comment 1 Vadim Rutkovsky 2015-03-06 12:33:41 UTC

Other affected apps: nm-connection-editor, epiphany, clocks

Unaffected: gpk-application

Comment 2 Jonas Ådahl 2015-03-06 15:24:50 UTC

Looks like it could be a regression from https://bugzilla.gnome.org/show_bug.cgi?id=745163

Comment 3 Carlos Garnacho 2015-03-10 12:54:42 UTC

Hmm, this was amusing. Bug #745163 is not the cause, but apparently introduced an almost sure crash in this situation.

What's happening here is that mutter receives wl_client_destroy(), this call attempts to destroy all client resources, although in an unordered manner. When this results in the wl_surface being destroyed before any role, the still living resources are left with a dangling pointer as user data.

I'm attaching a patch preparing mutter for that

Comment 4 Carlos Garnacho 2015-03-10 12:55:41 UTC

Created attachment 298986 [details] [review]
wayland: Protect against unordered destruction of surface resources

If the wl_surface resource happens to be destroyed before any other
role resource, the destructor for the latter will attempt to
access/modify random memory.

Fix this by ensuring the resources pending destruction are left with
NULL data after freeing the MetaWaylandSurface, and preparing the
destructors for this situation.

Comment 5 Jonas Ådahl 2015-03-10 13:07:22 UTC

Review of attachment 298986 [details] [review]:

::: src/wayland/meta-wayland-surface.c
@@ +1625,3 @@
 
+  if (!surface)
+    return;

Looks like a couple of these (subsurface and popup, at least) will leak this way. I think a better method for handling this situation would be to add destroy listeners to surface extensions and handle destruction properly even in this cases.

Comment 6 Carlos Garnacho 2015-03-11 10:07:22 UTC

(In reply to Jonas Ådahl from comment #5)
> Review of attachment 298986 [details] [review] [review]:
> 
> ::: src/wayland/meta-wayland-surface.c
> @@ +1625,3 @@
>  
> +  if (!surface)
> +    return;
> 
> Looks like a couple of these (subsurface and popup, at least) will leak this
> way. I think a better method for handling this situation would be to add
> destroy listeners to surface extensions and handle destruction properly even
> in this cases.

Oops right, I was expecting further resource destruction to eventually tear down the parent surface and free all child related data. But eg. subsurfaces indeed have additional data that would be left there. I guess that approach would also prepare for misbehaving clients, even if unintended.

Comment 7 Carlos Garnacho 2015-03-11 16:03:30 UTC

Created attachment 299111 [details] [review]
I tried the destructor callback approach, but the extra wl_listeners and

cross-checking made things somewhat complex. 

I realized then that we can just destroy the resources (and their data with
it) there, AFAICS the resources will be removed before wl_client_destroy()
gets to these.

With this, if some client destroys the wl_surface before its role, it will
surely get an "invalid object" error when destroying the role. Docs 
generally state "always destroy foo_surface before its wl_surface", so
we're doing nothing outside the book here.

--

wayland: Protect against unordered destruction of surface resources

If the wl_surface resource happens to be destroyed before any other
role resource, the destructor for the latter will attempt to
access/modify random memory.

Fix this by ensuring the associated resources are destroyed on the
wl_surface destructor, this will free all associated memory and
remove the resources ahead of their imminent destruction.

Comment 8 Carlos Garnacho 2015-03-16 16:45:25 UTC

FWIW, what triggers the bug is:

- wl_client_destroy() destroys resources, although not in any special order
- wl_surface_destructor() happens to be called before any other role destructor, the MetaWaylandSurface is freed, but the other resources outlive the wl_subsurface/data destruction
- When the role that defines this wl_surface is freed, it tries to access the MetaWaylandSurface struct (set as its user data, but not owned by it), and unset the relevant fields from the struct. As the MetaWaylandSurface is already freed, this results in invalid reads/writes, crashes...

The patch in comment #7 ensures that the server-side resource for the surface role is destroyed together in wl_surface_destructor(), this ensures xdg_surface_destructor/wl_subsurface_destructor/etc are called before the MetaWaylandSurface is freed. 

This would happen to cause "invalid object" errors if clients are just destroying things in the wrong order, but AFAICT all surface roles recommend to destroy the role before its supporting wl_surface, so it wouldn't be mutter which started the misbehavior anyway.

Comment 9 Carlos Garnacho 2015-03-16 17:06:38 UTC

The patch is now on master