Bug 788860 – gnome-shell crashed with SIGSEGV in meta_workspace_get_work_area_for_monitor()

After an evaluation, GNOME has moved from Bugzilla to GitLab. Learn more about GitLab.
No new issues can be reported in GNOME Bugzilla anymore.
To report an issue in a GNOME project, go to GNOME GitLab.
Do not go to GNOME Gitlab for: Bluefish, Doxygen, GnuCash, GStreamer, java-gnome, LDTP, NetworkManager, Tomboy.

Bug 788860 - gnome-shell crashed with SIGSEGV in meta_workspace_get_work_area_for_monitor()


Summary:	gnome-shell crashed with SIGSEGV in meta_workspace_get_work_area_for_monitor()


Status:	RESOLVED FIXED

Product:	mutter
Classification:	Core
Component:	general
Version:	3.26.x
Hardware:	Other Linux

Importance:	Normal major
Target Milestone:	---
Assigned To:	mutter-maint
QA Contact:	mutter-maint

URL:
Whiteboard:

Depends on:
Blocks:

Reported:	2017-10-12 05:45 UTC by Marco Trevisan (Treviño)
Modified:	2017-10-24 05:46 UTC

See Also:
GNOME target:	---
GNOME version:	---

Attachments
Stacktrace (5.75 KB, text/plain) 2017-10-12 05:45 UTC, Marco Trevisan (Treviño)		Details
GDB session introspecting data (5.26 KB, text/plain) 2017-10-12 05:46 UTC, Marco Trevisan (Treviño)		Details
MetaMonitorManager: update the internal monitor state before backend (1.33 KB, patch) 2017-10-13 00:15 UTC, Marco Trevisan (Treviño)	none	Details \| Review
MetaMonitorManager: update the internal monitor state before ui scale (1.92 KB, patch) 2017-10-13 00:26 UTC, Marco Trevisan (Treviño)	none	Details \| Review
MetaMonitorManager: add 'monitors-updated-internal' signal (6.21 KB, patch) 2017-10-13 01:03 UTC, Marco Trevisan (Treviño)	none	Details \| Review
MetaMonitorManager: add 'monitors-updated-internal' signal (6.21 KB, patch) 2017-10-13 01:07 UTC, Marco Trevisan (Treviño)	none	Details \| Review
MetaScreen: update internal state on 'monitors-changed-internal' (3.18 KB, patch) 2017-10-13 01:08 UTC, Marco Trevisan (Treviño)	none	Details \| Review
Workspace: ensure that workarea data is not null when fetching by monitor num (885 bytes, patch) 2017-10-13 01:08 UTC, Marco Trevisan (Treviño)	none	Details \| Review
Workspace: ensure that workarea data is not null when fetching by monitor num (1021 bytes, patch) 2017-10-13 04:13 UTC, Marco Trevisan (Treviño)	none	Details \| Review
backends: add 'monitors-updated-internal' signal to only update internal state (8.71 KB, patch) 2017-10-13 05:09 UTC, Marco Trevisan (Treviño)	none	Details \| Review
monitor-manager: use g_return_val_if_fail if trying to fetch an invalid monitor (1.08 KB, patch) 2017-10-13 05:10 UTC, Marco Trevisan (Treviño)	committed	Details \| Review
meta-backend: move the cursor render update on screen changes here (1.74 KB, patch) 2017-10-13 05:13 UTC, Marco Trevisan (Treviño)	none	Details \| Review
workspace: ensure that workarea data is valid when fetching by monitor num (1.20 KB, patch) 2017-10-13 05:16 UTC, Marco Trevisan (Treviño)	committed	Details \| Review
backends: add 'monitors-updated-internal' signal to only update internal state (8.71 KB, patch) 2017-10-13 05:46 UTC, Marco Trevisan (Treviño)	committed	Details \| Review
backend: move the cursor render update on screen changes here (1.73 KB, patch) 2017-10-13 05:47 UTC, Marco Trevisan (Treviño)	committed	Details \| Review

Description Marco Trevisan (Treviño) 2017-10-12 05:45:10 UTC

Created attachment 361399 [details]
Stacktrace

In ubuntu we're facing this crash multiple times and in various scenarios (see https://pad.lv/1722811 and duplicates): at login, after a suspend/resume cycle or changing monitors settings.

As you can see in the attached Stacktrace, the crash happens here:
 - https://github.com/GNOME/mutter/blob/gnome-3-26/src/core/workspace.c#L1109

Now, fixing the crash itself is quite easy (data is NULL there, and in any case checked since which_monitor might also point to a random number which could have no data associated for real), but going to the root cause seems a bit more tricky.

In fact, the previous ensure_work_areas_validated should ensure that monitor 0 (for which the stacktrace is), unless `workspace->work_areas_invalid` is not properly unset.

Thus I guess this is just happening because the workspace is actually invalidated when "monitors-changed" happens, if this makes sense.

Comment 1 Marco Trevisan (Treviño) 2017-10-12 05:46:48 UTC

Created attachment 361400 [details]
GDB session introspecting data

Here's a gdb session with some more infos about the state using the coredumps you can find in the ubuntu bug.

Comment 2 Jonas Ådahl 2017-10-12 05:49:41 UTC

Same as bug 788607?

Comment 3 Daniel van Vugt 2017-10-12 06:14:53 UTC

No, different I think.

Bug 788607 is asserting (signal 6) in meta_workspace_get_work_area_for_monitor whereas this bug is segfaulting (signal 11) in the same function.

Comment 4 Jonas Ådahl 2017-10-12 06:21:57 UTC

Just to summarize what me and Marco talked about on IRC:

The most likely cause is that a 'monitors-changed' signal handler in gnome-shell is handled before the one in screen.c. The handler in gnome-shell queries workspace state, but the workspace state is invalid and is not fixed until the screen.c 'monitors-changed' handler is called.

There are three alternatives that should avoid that issue:

1) Explicitly update the monitors state handled by screen.c before emitting the 'monitors-changed' signal.

2) Add a 'monitors-changed-private' signal that is emitted before 'monitors-changed' that is used by mutter internally.

3) Rely on gnome-shell always using connect_after()

1) Has the benefit that the call order will be very clear. We already partly do this; we call meta_backend_update_monitors() before emitting the signal.
2) Has the benefit of still using a signal, which is quite convenient at times. Listeners must be aware that they cannot rely on certain state however, which is bad.
3) I think is not reasonable, correct usage would be too much of an easter egg.

Comment 5 Marco Trevisan (Treviño) 2017-10-12 08:12:41 UTC

New update before my bed-time (well it should be already be the case).

Actually this seems already be the case, since "monitors-changed" from screen is the only thing that the shell listens and that happens after that reload_logical_monitors (which invalidates all the workareas) is called.

So... Still something magic hides.

Comment 6 Marco Trevisan (Treviño) 2017-10-12 08:21:17 UTC

JS stacktrace is just:

ott 12 10:15:47 ubuntu-vmware org.gnome.Shell.desktop[104232]: #0 0x7ffcd9a1af00 b   resource:///org/gnome/shell/ui/layout.js:144 (0x7fd51c02bb38 @ 250)
ott 12 10:15:47 ubuntu-vmware org.gnome.Shell.desktop[104232]: #1 0x7ffcd9a1af70 I   resource:///org/gnome/gjs/modules/_legacy.js:82 (0x7fd53c0c2c48 @ 71)
ott 12 10:15:47 ubuntu-vmware org.gnome.Shell.desktop[104232]: #2 0x7ffcd9a1c490 I   resource:///org/gnome/shell/ui/layout.js:208 (0x7fd51c02bd58 @ 62)

As expected triggered by this call https://github.com/GNOME/gnome-shell/blob/gnome-3-26/js/ui/layout.js#L144

Comment 7 Jonas Ådahl 2017-10-12 09:07:40 UTC

So it comes from an allocation stage; I guess from the clutter stage resize. That means we should probably call screen.c's reload_logical_monitors() before the backend resizes the stage. I'm not sure whether we should just put the 'on_monitors_changed()' calls earlier than that, would have to check whether it happens to rely on the stage size being up to date or not.

Comment 8 Marco Trevisan (Treviño) 2017-10-13 00:15:24 UTC

Created attachment 361466 [details] [review]
MetaMonitorManager: update the internal monitor state before backend

Updating backend might lead to updating the stage and scaling which might
make some clients to get inconsistent data from screen.

Comment 9 Marco Trevisan (Treviño) 2017-10-13 00:18:28 UTC

So this is happening because meta_backend_monitors_changed (and meta_settings_update_ui_scaling_factor exactly) gets called when the screen is not yet updated. And this might make the UI to rebuild with inconsistent data.

The attached patch just invert the order of the things, but maybe we could use different solutions which I'm proposing too.

Comment 10 Marco Trevisan (Treviño) 2017-10-13 00:26:10 UTC

Created attachment 361467 [details] [review]
MetaMonitorManager: update the internal monitor state before ui scale

Updating ui scale might lead to updating views which might
make some clients to reallocate geting inconsistent data from screen.

Comment 11 Marco Trevisan (Treviño) 2017-10-13 01:03:45 UTC

Created attachment 361472 [details] [review]
MetaMonitorManager: add 'monitors-updated-internal' signal

Adding an internal signal and use it to update the internal state before
emitting "monitors-changed" which will be emitted by the screen too.

Comment 12 Marco Trevisan (Treviño) 2017-10-13 01:07:56 UTC

Created attachment 361473 [details] [review]
MetaMonitorManager: add 'monitors-updated-internal' signal

Adding an internal signal and use it to update the internal state before
emitting "monitors-changed" which will be emitted by the screen too.

Comment 13 Marco Trevisan (Treviño) 2017-10-13 01:08:11 UTC

Created attachment 361474 [details] [review]
MetaScreen: update internal state on 'monitors-changed-internal'

Use MetaMonitorManager 'monitors-changed' signal only as indication that
everything has been updated internally and that thus we can notify
our clients too.

Comment 14 Marco Trevisan (Treviño) 2017-10-13 01:08:21 UTC

Created attachment 361475 [details] [review]
Workspace: ensure that workarea data is not null when fetching by monitor num

Comment 15 Jonas Ådahl 2017-10-13 03:34:26 UTC

Review of attachment 361473 [details] [review]:

I think this and the next commit should be squashed, since this actual change of this is closer to the description of the next patch (update internal state on the new signal).

Commit message subject scope is a bit off as well, should probably be just scoped with just "backend: ..."

::: src/backends/meta-monitor-manager.c
@@ +2630,3 @@
 meta_monitor_manager_notify_monitors_changed (MetaMonitorManager *manager)
 {
+  MetaSettings *settings = meta_backend_get_settings (manager->backend);

style nit: newline after declarations.

@@ +2635,3 @@
+  /* Order is important here, we need to emit this signal first to update
+   * MetaScreen (and workspaces) internal state first, then we update the scale
+   * factor to rebuild the views or we might experience crashes like bug #788860  */

I prefer leaving these types of documentation blurbs in the commit message. Otherwise we'll end up with motivation for too many things everywhere.

@@ +2640,3 @@
+  g_signal_emit (manager, signals[MONITORS_CHANGED_INTERNAL], 0);
+
+  meta_settings_update_ui_scaling_factor (settings);

Maybe just make connect this function to the 'monitors-changed-internal' signal (using connect_after)?

Comment 16 Jonas Ådahl 2017-10-13 03:35:37 UTC

Review of attachment 361474 [details] [review]:

As mentioned on the other patch, makes more sense to squash these two, as they do more or less the same change to every "internal state" thing (screen, wayland-output, constraints, ...).

::: src/backends/meta-backend.c
@@ +198,3 @@
     }
+
+  meta_cursor_renderer_force_update (priv->cursor_renderer);

Calling this here makes a lot more sense indeed. This could be a separate commit though, as it isn't really related to the commit message.

::: src/core/screen.c
@@ +2275,2 @@
 {
+  /* Update the internal state */

nit: comment not needed, the fact that this is the internal states in the function name

Comment 17 Jonas Ådahl 2017-10-13 03:36:06 UTC

Review of attachment 361475 [details] [review]:

::: src/core/workspace.c
@@ +1107,3 @@
   data = meta_workspace_get_logical_monitor_data (workspace, logical_monitor);
 
+  g_return_if_fail (data != NULL);

This will likely cause the caller to play end up with uninitialized data, which is probably harder to debug. Maybe just make this an g_assert()? Alternatively do

if (!data)
  {
    g_warning ("...");
    *area = (MetaRectangle) { 0 };
  }

Comment 18 Jonas Ådahl 2017-10-13 03:38:07 UTC

Review of attachment 361467 [details] [review]:

Marking as rejected, as a better solution was posted.

::: src/backends/meta-monitor-manager.c
@@ +2631,3 @@
   g_signal_emit_by_name (manager, "monitors-changed");
+
+  meta_settings_update_ui_scaling_factor (settings);

I don't like the fact that we'd first tell the world, then update the scaling factor. The 'monitors-changed' listener outside of mutter should only be invoked with that signal when the mutter state is up to date.

Comment 19 Jonas Ådahl 2017-10-13 03:38:48 UTC

Review of attachment 361466 [details] [review]:

Marking as rejected as a better solution was also posted.

::: src/backends/meta-monitor-manager.c
@@ +2628,3 @@
   g_signal_emit_by_name (manager, "monitors-changed");
+
+  meta_backend_monitors_changed (manager->backend);

This also seems off; we shouldn't tell the world about the change before actually having updated our own state.

Comment 20 Marco Trevisan (Treviño) 2017-10-13 03:45:17 UTC

Review of attachment 361473 [details] [review]:

::: src/backends/meta-monitor-manager.c
@@ +2635,3 @@
+  /* Order is important here, we need to emit this signal first to update
+   * MetaScreen (and workspaces) internal state first, then we update the scale
+   * factor to rebuild the views or we might experience crashes like bug #788860  */

I don't put many comments around normally, I thought it was just needed here because someone could be tempted to reorder things as normally we emit the signals as last thing, but it made more sense in the previous implementation, so I can do that.

@@ +2640,3 @@
+  g_signal_emit (manager, signals[MONITORS_CHANGED_INTERNAL], 0);
+
+  meta_settings_update_ui_scaling_factor (settings);

It was exaclty my idea... But I avoided to do this after our discussion about the fragility of connect-after, but in terms of code clean-less I'd preferred it too.

Comment 21 Marco Trevisan (Treviño) 2017-10-13 03:46:40 UTC

Review of attachment 361474 [details] [review]:

::: src/backends/meta-backend.c
@@ +198,3 @@
     }
+
+  meta_cursor_renderer_force_update (priv->cursor_renderer);

So it was my idea too, then... I preferred not to add even more patches to this bug until you didn't choose the preferred method.

But indeed it's something to split.

Comment 22 Jonas Ådahl 2017-10-13 03:50:45 UTC

(In reply to Marco Trevisan (Treviño) from comment #20)
> Review of attachment 361473 [details] [review] [review]:
> 
> ::: src/backends/meta-monitor-manager.c
> @@ +2635,3 @@
> +  /* Order is important here, we need to emit this signal first to update
> +   * MetaScreen (and workspaces) internal state first, then we update the
> scale
> +   * factor to rebuild the views or we might experience crashes like bug
> #788860  */
> 
> I don't put many comments around normally, I thought it was just needed here
> because someone could be tempted to reorder things as normally we emit the
> signals as last thing, but it made more sense in the previous
> implementation, so I can do that.

I guess here it is relatively useful, but FWIW, the comment will be wrong once we land the no-xwayland patches (since there will be no more MetaScreen), so in this case I'd say the fact that we have two signals is documentation enough in the code (while blurbing about the reason about all of this in the commit message).

> 
> @@ +2640,3 @@
> +  g_signal_emit (manager, signals[MONITORS_CHANGED_INTERNAL], 0);
> +
> +  meta_settings_update_ui_scaling_factor (settings);
> 
> It was exaclty my idea... But I avoided to do this after our discussion
> about the fragility of connect-after, but in terms of code clean-less I'd
> preferred it too.

My concern about connect-after fragility was on using it to determine order on the external signal.

Comment 23 Marco Trevisan (Treviño) 2017-10-13 03:52:25 UTC

Review of attachment 361475 [details] [review]:

::: src/core/workspace.c
@@ +1107,3 @@
   data = meta_workspace_get_logical_monitor_data (workspace, logical_monitor);
 
+  g_return_if_fail (data != NULL);

That's how I did initially too... Then I went simpler.

What about just a g_return_if_reached instead or you prefer a more explaininig g_warning?

Comment 24 Jonas Ådahl 2017-10-13 04:00:04 UTC

(In reply to Marco Trevisan (Treviño) from comment #23)
> Review of attachment 361475 [details] [review] [review]:
> 
> ::: src/core/workspace.c
> @@ +1107,3 @@
>    data = meta_workspace_get_logical_monitor_data (workspace,
> logical_monitor);
>  
> +  g_return_if_fail (data != NULL);
> 
> That's how I did initially too... Then I went simpler.
> 
> What about just a g_return_if_reached instead or you prefer a more
> explaininig g_warning?

Note that g_return_if_reached() will print and return only if debug macros are enabled, so if somenone builds without, we'd continue and then segfault. We need an actual "return" if we want to actually handle it.

Comment 25 Marco Trevisan (Treviño) 2017-10-13 04:13:36 UTC

Created attachment 361477 [details] [review]
Workspace: ensure that workarea data is not null when fetching by monitor num

Comment 26 Jonas Ådahl 2017-10-13 04:24:03 UTC

Review of attachment 361477 [details] [review]:

commit subject nit: don't capitalize the prefix (Should be "workspace: Blah blah")

Also, the change is rather "Warn when trying to fetch out of date state" as we now warn instead of crash (which I guess is reasonable because this might be called from Javascript).

With the commit message fixed this looks good to me.

Comment 27 Marco Trevisan (Treviño) 2017-10-13 04:27:40 UTC

(In reply to Jonas Ådahl from comment #26)
> Review of attachment 361477 [details] [review] [review]:
> 
> commit subject nit: don't capitalize the prefix (Should be "workspace: Blah
> blah")
> 
> Also, the change is rather "Warn when trying to fetch out of date state" as
> we now warn instead of crash (which I guess is reasonable because this might
> be called from Javascript).
> 
> With the commit message fixed this looks good to me.

Ok, but this could be also happen on invalid monitor number, which... Still could be an invalid state. But it will also cause a g_warning from meta_monitor_manager_get_logical_monitor_from_number

Comment 28 Marco Trevisan (Treviño) 2017-10-13 05:09:05 UTC

Created attachment 361478 [details] [review]
backends: add 'monitors-updated-internal' signal to only update internal state

Adding an internal signal and use it to update the internal state before
emitting "monitors-changed" which will be repeated by the screen to the world.

Comment 29 Marco Trevisan (Treviño) 2017-10-13 05:10:45 UTC

Created attachment 361479 [details] [review]
monitor-manager: use g_return_val_if_fail if trying to fetch an invalid monitor

Comment 30 Marco Trevisan (Treviño) 2017-10-13 05:13:16 UTC

Created attachment 361480 [details] [review]
meta-backend: move the cursor render update on screen changes here

Comment 31 Marco Trevisan (Treviño) 2017-10-13 05:16:29 UTC

Created attachment 361481 [details] [review]
workspace: ensure that workarea data is valid when fetching by monitor num

Comment 32 Jonas Ådahl 2017-10-13 05:34:47 UTC

Review of attachment 361478 [details] [review]:

Apart from a variable naming nit, looks good to me.

::: src/backends/meta-settings.c
@@ +345,3 @@
 
+static void
+on_monitors_changed (MetaMonitorManager *monitors,

nit: s/monitors/monitor_manager/ ("monitors" implies its a list of monitors)

Comment 33 Jonas Ådahl 2017-10-13 05:35:27 UTC

Review of attachment 361479 [details] [review]:

Yea, lets do this.

Comment 34 Jonas Ådahl 2017-10-13 05:39:04 UTC

Review of attachment 361480 [details] [review]:

commit subject nit: We usually use either of these two styles of commit subjects:

"scope: Short description" where scope can be e.g. "monitor-manager", or "backend" or "wayland" etc.

or

"MetaObject: Short description" if it's relevant to a specific type only

the first one is mostly better because its shorter.

With that fixed, this lgtm.

Comment 35 Jonas Ådahl 2017-10-13 05:39:41 UTC

Review of attachment 361481 [details] [review]:

lgtm.

Comment 36 Marco Trevisan (Treviño) 2017-10-13 05:46:32 UTC

Created attachment 361482 [details] [review]
backends: add 'monitors-updated-internal' signal to only update internal state

Adding an internal signal and use it to update the internal state before
emitting "monitors-changed" which will be repeated by the screen to the world.

Comment 37 Marco Trevisan (Treviño) 2017-10-13 05:47:19 UTC

Created attachment 361483 [details] [review]
backend: move the cursor render update on screen changes here

Comment 38 Marco Trevisan (Treviño) 2017-10-13 05:50:32 UTC

Attachment 361479 [details] pushed as f044511 - monitor-manager: use g_return_val_if_fail if trying to fetch an invalid monitor
Attachment 361481 [details] pushed as 556136d - workspace: ensure that workarea data is valid when fetching by monitor num
Attachment 361482 [details] pushed as b31e545 - backends: add 'monitors-updated-internal' signal to only update internal state
Attachment 361483 [details] pushed as a8c80cc - backend: move the cursor render update on screen changes here

Comment 39 Anatol Pomozov 2017-10-24 05:46:20 UTC

I've got similar issue with monitor is not waking up. I see following messages in systemd log:

Oct 23 22:34:24 argo kernel: [drm:intel_dp_start_link_train [i915]] *ERROR* failed to get link status
Oct 23 22:34:24 argo gnome-shell[747]: meta_monitor_manager_get_logical_monitor_from_number: assertion '(unsigned int) number < g_list_length (manager->logical_monitors)' failed
Oct 23 22:34:24 argo gnome-shell[747]: meta_workspace_get_work_area_for_monitor: assertion 'logical_monitor != NULL' failed
Oct 23 22:34:24 argo gnome-shell[747]: meta_monitor_manager_get_logical_monitor_from_number: assertion '(unsigned int) number < g_list_length (manager->logical_monitors)' failed
Oct 23 22:34:24 argo gnome-shell[747]: meta_workspace_get_work_area_for_monitor: assertion 'logical_monitor != NULL' failed
Oct 23 22:34:24 argo gnome-shell[747]: meta_monitor_manager_get_logical_monitor_from_number: assertion '(unsigned int) number < g_list_length (manager->logical_monitors)' failed
Oct 23 22:34:24 argo gnome-shell[747]: meta_workspace_get_work_area_for_monitor: assertion 'logical_monitor != NULL' failed
Oct 23 22:34:24 argo gnome-shell[747]: meta_monitor_manager_get_logical_monitor_from_number: assertion '(unsigned int) number < g_list_length (manager->logical_monitors)' failed
Oct 23 22:34:24 argo gnome-shell[747]: meta_workspace_get_work_area_for_monitor: assertion 'logical_monitor != NULL' failed
Oct 23 22:34:24 argo gnome-shell[747]: meta_monitor_manager_get_logical_monitor_from_number: assertion '(unsigned int) number < g_list_length (manager->logical_monitors)' failed
Oct 23 22:34:24 argo gnome-shell[747]: meta_workspace_get_work_area_for_monitor: assertion 'logical_monitor != NULL' failed
Oct 23 22:34:24 argo gnome-shell[747]: meta_monitor_manager_get_logical_monitor_from_number: assertion '(unsigned int) number < g_list_length (manager->logical_monitors)' failed
Oct 23 22:34:24 argo gnome-shell[747]: meta_workspace_get_work_area_for_monitor: assertion 'logical_monitor != NULL' failed