GNOME Bugzilla – Bug 788860
gnome-shell crashed with SIGSEGV in meta_workspace_get_work_area_for_monitor()
Last modified: 2017-10-24 05:46:20 UTC
Created attachment 361399 [details] Stacktrace In ubuntu we're facing this crash multiple times and in various scenarios (see https://pad.lv/1722811 and duplicates): at login, after a suspend/resume cycle or changing monitors settings. As you can see in the attached Stacktrace, the crash happens here: - https://github.com/GNOME/mutter/blob/gnome-3-26/src/core/workspace.c#L1109 Now, fixing the crash itself is quite easy (data is NULL there, and in any case checked since which_monitor might also point to a random number which could have no data associated for real), but going to the root cause seems a bit more tricky. In fact, the previous ensure_work_areas_validated should ensure that monitor 0 (for which the stacktrace is), unless `workspace->work_areas_invalid` is not properly unset. Thus I guess this is just happening because the workspace is actually invalidated when "monitors-changed" happens, if this makes sense.
Created attachment 361400 [details] GDB session introspecting data Here's a gdb session with some more infos about the state using the coredumps you can find in the ubuntu bug.
Same as bug 788607?
No, different I think. Bug 788607 is asserting (signal 6) in meta_workspace_get_work_area_for_monitor whereas this bug is segfaulting (signal 11) in the same function.
Just to summarize what me and Marco talked about on IRC: The most likely cause is that a 'monitors-changed' signal handler in gnome-shell is handled before the one in screen.c. The handler in gnome-shell queries workspace state, but the workspace state is invalid and is not fixed until the screen.c 'monitors-changed' handler is called. There are three alternatives that should avoid that issue: 1) Explicitly update the monitors state handled by screen.c before emitting the 'monitors-changed' signal. 2) Add a 'monitors-changed-private' signal that is emitted before 'monitors-changed' that is used by mutter internally. 3) Rely on gnome-shell always using connect_after() 1) Has the benefit that the call order will be very clear. We already partly do this; we call meta_backend_update_monitors() before emitting the signal. 2) Has the benefit of still using a signal, which is quite convenient at times. Listeners must be aware that they cannot rely on certain state however, which is bad. 3) I think is not reasonable, correct usage would be too much of an easter egg.
New update before my bed-time (well it should be already be the case). Actually this seems already be the case, since "monitors-changed" from screen is the only thing that the shell listens and that happens after that reload_logical_monitors (which invalidates all the workareas) is called. So... Still something magic hides.
JS stacktrace is just: ott 12 10:15:47 ubuntu-vmware org.gnome.Shell.desktop[104232]: #0 0x7ffcd9a1af00 b resource:///org/gnome/shell/ui/layout.js:144 (0x7fd51c02bb38 @ 250) ott 12 10:15:47 ubuntu-vmware org.gnome.Shell.desktop[104232]: #1 0x7ffcd9a1af70 I resource:///org/gnome/gjs/modules/_legacy.js:82 (0x7fd53c0c2c48 @ 71) ott 12 10:15:47 ubuntu-vmware org.gnome.Shell.desktop[104232]: #2 0x7ffcd9a1c490 I resource:///org/gnome/shell/ui/layout.js:208 (0x7fd51c02bd58 @ 62) As expected triggered by this call https://github.com/GNOME/gnome-shell/blob/gnome-3-26/js/ui/layout.js#L144
So it comes from an allocation stage; I guess from the clutter stage resize. That means we should probably call screen.c's reload_logical_monitors() before the backend resizes the stage. I'm not sure whether we should just put the 'on_monitors_changed()' calls earlier than that, would have to check whether it happens to rely on the stage size being up to date or not.
Created attachment 361466 [details] [review] MetaMonitorManager: update the internal monitor state before backend Updating backend might lead to updating the stage and scaling which might make some clients to get inconsistent data from screen.
So this is happening because meta_backend_monitors_changed (and meta_settings_update_ui_scaling_factor exactly) gets called when the screen is not yet updated. And this might make the UI to rebuild with inconsistent data. The attached patch just invert the order of the things, but maybe we could use different solutions which I'm proposing too.
Created attachment 361467 [details] [review] MetaMonitorManager: update the internal monitor state before ui scale Updating ui scale might lead to updating views which might make some clients to reallocate geting inconsistent data from screen.
Created attachment 361472 [details] [review] MetaMonitorManager: add 'monitors-updated-internal' signal Adding an internal signal and use it to update the internal state before emitting "monitors-changed" which will be emitted by the screen too.
Created attachment 361473 [details] [review] MetaMonitorManager: add 'monitors-updated-internal' signal Adding an internal signal and use it to update the internal state before emitting "monitors-changed" which will be emitted by the screen too.
Created attachment 361474 [details] [review] MetaScreen: update internal state on 'monitors-changed-internal' Use MetaMonitorManager 'monitors-changed' signal only as indication that everything has been updated internally and that thus we can notify our clients too.
Created attachment 361475 [details] [review] Workspace: ensure that workarea data is not null when fetching by monitor num
Review of attachment 361473 [details] [review]: I think this and the next commit should be squashed, since this actual change of this is closer to the description of the next patch (update internal state on the new signal). Commit message subject scope is a bit off as well, should probably be just scoped with just "backend: ..." ::: src/backends/meta-monitor-manager.c @@ +2630,3 @@ meta_monitor_manager_notify_monitors_changed (MetaMonitorManager *manager) { + MetaSettings *settings = meta_backend_get_settings (manager->backend); style nit: newline after declarations. @@ +2635,3 @@ + /* Order is important here, we need to emit this signal first to update + * MetaScreen (and workspaces) internal state first, then we update the scale + * factor to rebuild the views or we might experience crashes like bug #788860 */ I prefer leaving these types of documentation blurbs in the commit message. Otherwise we'll end up with motivation for too many things everywhere. @@ +2640,3 @@ + g_signal_emit (manager, signals[MONITORS_CHANGED_INTERNAL], 0); + + meta_settings_update_ui_scaling_factor (settings); Maybe just make connect this function to the 'monitors-changed-internal' signal (using connect_after)?
Review of attachment 361474 [details] [review]: As mentioned on the other patch, makes more sense to squash these two, as they do more or less the same change to every "internal state" thing (screen, wayland-output, constraints, ...). ::: src/backends/meta-backend.c @@ +198,3 @@ } + + meta_cursor_renderer_force_update (priv->cursor_renderer); Calling this here makes a lot more sense indeed. This could be a separate commit though, as it isn't really related to the commit message. ::: src/core/screen.c @@ +2275,2 @@ { + /* Update the internal state */ nit: comment not needed, the fact that this is the internal states in the function name
Review of attachment 361475 [details] [review]: ::: src/core/workspace.c @@ +1107,3 @@ data = meta_workspace_get_logical_monitor_data (workspace, logical_monitor); + g_return_if_fail (data != NULL); This will likely cause the caller to play end up with uninitialized data, which is probably harder to debug. Maybe just make this an g_assert()? Alternatively do if (!data) { g_warning ("..."); *area = (MetaRectangle) { 0 }; }
Review of attachment 361467 [details] [review]: Marking as rejected, as a better solution was posted. ::: src/backends/meta-monitor-manager.c @@ +2631,3 @@ g_signal_emit_by_name (manager, "monitors-changed"); + + meta_settings_update_ui_scaling_factor (settings); I don't like the fact that we'd first tell the world, then update the scaling factor. The 'monitors-changed' listener outside of mutter should only be invoked with that signal when the mutter state is up to date.
Review of attachment 361466 [details] [review]: Marking as rejected as a better solution was also posted. ::: src/backends/meta-monitor-manager.c @@ +2628,3 @@ g_signal_emit_by_name (manager, "monitors-changed"); + + meta_backend_monitors_changed (manager->backend); This also seems off; we shouldn't tell the world about the change before actually having updated our own state.
Review of attachment 361473 [details] [review]: ::: src/backends/meta-monitor-manager.c @@ +2635,3 @@ + /* Order is important here, we need to emit this signal first to update + * MetaScreen (and workspaces) internal state first, then we update the scale + * factor to rebuild the views or we might experience crashes like bug #788860 */ I don't put many comments around normally, I thought it was just needed here because someone could be tempted to reorder things as normally we emit the signals as last thing, but it made more sense in the previous implementation, so I can do that. @@ +2640,3 @@ + g_signal_emit (manager, signals[MONITORS_CHANGED_INTERNAL], 0); + + meta_settings_update_ui_scaling_factor (settings); It was exaclty my idea... But I avoided to do this after our discussion about the fragility of connect-after, but in terms of code clean-less I'd preferred it too.
Review of attachment 361474 [details] [review]: ::: src/backends/meta-backend.c @@ +198,3 @@ } + + meta_cursor_renderer_force_update (priv->cursor_renderer); So it was my idea too, then... I preferred not to add even more patches to this bug until you didn't choose the preferred method. But indeed it's something to split.
(In reply to Marco Trevisan (Treviño) from comment #20) > Review of attachment 361473 [details] [review] [review]: > > ::: src/backends/meta-monitor-manager.c > @@ +2635,3 @@ > + /* Order is important here, we need to emit this signal first to update > + * MetaScreen (and workspaces) internal state first, then we update the > scale > + * factor to rebuild the views or we might experience crashes like bug > #788860 */ > > I don't put many comments around normally, I thought it was just needed here > because someone could be tempted to reorder things as normally we emit the > signals as last thing, but it made more sense in the previous > implementation, so I can do that. I guess here it is relatively useful, but FWIW, the comment will be wrong once we land the no-xwayland patches (since there will be no more MetaScreen), so in this case I'd say the fact that we have two signals is documentation enough in the code (while blurbing about the reason about all of this in the commit message). > > @@ +2640,3 @@ > + g_signal_emit (manager, signals[MONITORS_CHANGED_INTERNAL], 0); > + > + meta_settings_update_ui_scaling_factor (settings); > > It was exaclty my idea... But I avoided to do this after our discussion > about the fragility of connect-after, but in terms of code clean-less I'd > preferred it too. My concern about connect-after fragility was on using it to determine order on the external signal.
Review of attachment 361475 [details] [review]: ::: src/core/workspace.c @@ +1107,3 @@ data = meta_workspace_get_logical_monitor_data (workspace, logical_monitor); + g_return_if_fail (data != NULL); That's how I did initially too... Then I went simpler. What about just a g_return_if_reached instead or you prefer a more explaininig g_warning?
(In reply to Marco Trevisan (Treviño) from comment #23) > Review of attachment 361475 [details] [review] [review]: > > ::: src/core/workspace.c > @@ +1107,3 @@ > data = meta_workspace_get_logical_monitor_data (workspace, > logical_monitor); > > + g_return_if_fail (data != NULL); > > That's how I did initially too... Then I went simpler. > > What about just a g_return_if_reached instead or you prefer a more > explaininig g_warning? Note that g_return_if_reached() will print and return only if debug macros are enabled, so if somenone builds without, we'd continue and then segfault. We need an actual "return" if we want to actually handle it.
Created attachment 361477 [details] [review] Workspace: ensure that workarea data is not null when fetching by monitor num
Review of attachment 361477 [details] [review]: commit subject nit: don't capitalize the prefix (Should be "workspace: Blah blah") Also, the change is rather "Warn when trying to fetch out of date state" as we now warn instead of crash (which I guess is reasonable because this might be called from Javascript). With the commit message fixed this looks good to me.
(In reply to Jonas Ådahl from comment #26) > Review of attachment 361477 [details] [review] [review]: > > commit subject nit: don't capitalize the prefix (Should be "workspace: Blah > blah") > > Also, the change is rather "Warn when trying to fetch out of date state" as > we now warn instead of crash (which I guess is reasonable because this might > be called from Javascript). > > With the commit message fixed this looks good to me. Ok, but this could be also happen on invalid monitor number, which... Still could be an invalid state. But it will also cause a g_warning from meta_monitor_manager_get_logical_monitor_from_number
Created attachment 361478 [details] [review] backends: add 'monitors-updated-internal' signal to only update internal state Adding an internal signal and use it to update the internal state before emitting "monitors-changed" which will be repeated by the screen to the world.
Created attachment 361479 [details] [review] monitor-manager: use g_return_val_if_fail if trying to fetch an invalid monitor
Created attachment 361480 [details] [review] meta-backend: move the cursor render update on screen changes here
Created attachment 361481 [details] [review] workspace: ensure that workarea data is valid when fetching by monitor num
Review of attachment 361478 [details] [review]: Apart from a variable naming nit, looks good to me. ::: src/backends/meta-settings.c @@ +345,3 @@ +static void +on_monitors_changed (MetaMonitorManager *monitors, nit: s/monitors/monitor_manager/ ("monitors" implies its a list of monitors)
Review of attachment 361479 [details] [review]: Yea, lets do this.
Review of attachment 361480 [details] [review]: commit subject nit: We usually use either of these two styles of commit subjects: "scope: Short description" where scope can be e.g. "monitor-manager", or "backend" or "wayland" etc. or "MetaObject: Short description" if it's relevant to a specific type only the first one is mostly better because its shorter. With that fixed, this lgtm.
Review of attachment 361481 [details] [review]: lgtm.
Created attachment 361482 [details] [review] backends: add 'monitors-updated-internal' signal to only update internal state Adding an internal signal and use it to update the internal state before emitting "monitors-changed" which will be repeated by the screen to the world.
Created attachment 361483 [details] [review] backend: move the cursor render update on screen changes here
Attachment 361479 [details] pushed as f044511 - monitor-manager: use g_return_val_if_fail if trying to fetch an invalid monitor Attachment 361481 [details] pushed as 556136d - workspace: ensure that workarea data is valid when fetching by monitor num Attachment 361482 [details] pushed as b31e545 - backends: add 'monitors-updated-internal' signal to only update internal state Attachment 361483 [details] pushed as a8c80cc - backend: move the cursor render update on screen changes here
I've got similar issue with monitor is not waking up. I see following messages in systemd log: Oct 23 22:34:24 argo kernel: [drm:intel_dp_start_link_train [i915]] *ERROR* failed to get link status Oct 23 22:34:24 argo gnome-shell[747]: meta_monitor_manager_get_logical_monitor_from_number: assertion '(unsigned int) number < g_list_length (manager->logical_monitors)' failed Oct 23 22:34:24 argo gnome-shell[747]: meta_workspace_get_work_area_for_monitor: assertion 'logical_monitor != NULL' failed Oct 23 22:34:24 argo gnome-shell[747]: meta_monitor_manager_get_logical_monitor_from_number: assertion '(unsigned int) number < g_list_length (manager->logical_monitors)' failed Oct 23 22:34:24 argo gnome-shell[747]: meta_workspace_get_work_area_for_monitor: assertion 'logical_monitor != NULL' failed Oct 23 22:34:24 argo gnome-shell[747]: meta_monitor_manager_get_logical_monitor_from_number: assertion '(unsigned int) number < g_list_length (manager->logical_monitors)' failed Oct 23 22:34:24 argo gnome-shell[747]: meta_workspace_get_work_area_for_monitor: assertion 'logical_monitor != NULL' failed Oct 23 22:34:24 argo gnome-shell[747]: meta_monitor_manager_get_logical_monitor_from_number: assertion '(unsigned int) number < g_list_length (manager->logical_monitors)' failed Oct 23 22:34:24 argo gnome-shell[747]: meta_workspace_get_work_area_for_monitor: assertion 'logical_monitor != NULL' failed Oct 23 22:34:24 argo gnome-shell[747]: meta_monitor_manager_get_logical_monitor_from_number: assertion '(unsigned int) number < g_list_length (manager->logical_monitors)' failed Oct 23 22:34:24 argo gnome-shell[747]: meta_workspace_get_work_area_for_monitor: assertion 'logical_monitor != NULL' failed Oct 23 22:34:24 argo gnome-shell[747]: meta_monitor_manager_get_logical_monitor_from_number: assertion '(unsigned int) number < g_list_length (manager->logical_monitors)' failed Oct 23 22:34:24 argo gnome-shell[747]: meta_workspace_get_work_area_for_monitor: assertion 'logical_monitor != NULL' failed