GNOME Bugzilla – Bug 768204
gdm segfault in libmutter.so.0.0.0 with nvidia-blob
Last modified: 2021-07-05 13:51:00 UTC
Created attachment 330607 [details] journal stacktrace

I can start the GNOME desktop with the startx command, but gdm fails to start. Running X. Laptop with NVIDIA Optimus. Running Arch Linux with:

mutter 3.20.3-1
nvidia 367.27-1
gdm 3.20.1-1
gnome-shell 3.20.3-1
glibc 2.23-5
xorg-server 1.18.3-2

Previously reported a bug against gdm: https://bugzilla.gnome.org/show_bug.cgi?id=765197
There is also a thread on the Arch Linux forum: https://bbs.archlinux.org/viewtopic.php?id=211304
Stack trace of thread 2679:
#0  0x00007fd581bfd147 n/a (libmutter.so.0)
#1  0x00007fd581c09745 n/a (libmutter.so.0)

That's not very helpful I'm afraid - can you get a stacktrace with debug symbols?
Thanks for taking the time to report this. Unfortunately, that stack trace is missing some elements that would help a lot in solving the problem, so it will be hard for the developers to fix this crash. Can you get us a stack trace with debugging symbols? Please see https://wiki.gnome.org/Community/GettingInTouch/Bugzilla/GettingTraces for more information on how to do so, then reopen this bug report. Thanks in advance!
Created attachment 330609 [details] journal stacktrace

Thanks for the quick responses. I rebuilt mutter with debug symbols (I think). I see center_pointer, which makes me mention that I have a laptop with an external monitor attached to the NVIDIA HDMI output. I'll post my /var/lib/gdm/.config/monitors.xml as well.
Created attachment 330610 [details] monitor.xml from /var/lib/gdm/.config

Could it be that this file is the underlying reason for the segfault?
Created attachment 330617 [details] coredump

Figured out how to produce a better coredump. I also tried deleting the monitors.xml from the gdm user's home folder, but it did not change anything.
That's better, thanks. The coredump points to center_pointer() [0], but it's still not quite clear what is going wrong - I'd say either meta_monitor_manager_get_primary_index() is out of range for the returned array, or the "primary" variable ends up as NULL/invalid memory. Can you find the actual line that is crashing by running "coredumpctl gcc" on the coredump?

[0] https://git.gnome.org/browse/mutter/tree/src/backends/meta-backend.c#n103
He meant "coredumpctl gdb".
Created attachment 330673 [details] [review] Silly patch backends/meta-backend.c:113

I'm not a C developer, but I tried this little patch. I think I made things much worse though.
(In reply to sveinelo from comment #8)
> I think I made things much worse though.

Yup :-)

If you don't have at least two monitors, you'll always return an invalid monitor index. And if (1,1) is not located on any monitor, there could be issues as well.
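For anyone following along, the kind of validation being discussed would look roughly like the sketch below. This is not an upstream patch: the helper names are taken from the meta-backend.c linked in the earlier comment, and their exact signatures here are assumptions.

/* Sketch only: a defensive center_pointer() that logs and bails out
 * instead of dereferencing an invalid monitor. Helper names follow the
 * linked meta-backend.c; exact signatures are assumed. */
static void
center_pointer (MetaBackend *backend)
{
  MetaMonitorManager *monitor_manager = meta_monitor_manager_get ();
  MetaMonitorInfo *monitors, *primary;
  unsigned int n_monitors;
  int primary_index;

  monitors = meta_monitor_manager_get_monitor_infos (monitor_manager,
                                                     &n_monitors);
  primary_index = meta_monitor_manager_get_primary_index (monitor_manager);

  /* The two suspected failure modes: no monitors at all, or a primary
   * index out of range for the returned array. */
  if (monitors == NULL ||
      primary_index < 0 ||
      (unsigned int) primary_index >= n_monitors)
    {
      g_warning ("No valid primary monitor, not centering the pointer");
      return;
    }

  primary = &monitors[primary_index];
  meta_backend_warp_pointer (backend,
                             primary->rect.x + primary->rect.width / 2,
                             primary->rect.y + primary->rect.height / 2);
}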
*** Bug 770495 has been marked as a duplicate of this bug. ***
I see this as well, but with a completely different configuration:
- desktop machine
- i5 6500 with a GTX 750 Ti video card
- single monitor
For me, this turned out to be a timing problem. On an NVIDIA Optimus system, after X starts you have to issue

xrandr --setprovideroutputsource modesetting NVIDIA-0
xrandr --auto

With gdm, this was first done with the /etc/gdm/Init/Default script, which was then disabled, so one had to use an autostart .desktop file in /usr/share/gdm/greeter/autostart. With gdm 3.20, autostarts are now executed much later in the session setup, so there is no usable X screen when mutter starts, and it crashes. For now I worked around this by hacking gdm-x-session to start a script right after X server startup. Like this:

--- gdm-3.20.1/daemon/gdm-x-session.c.orig	2016-04-19 07:00:04.000000000 +0200
+++ gdm-3.20.1/daemon/gdm-x-session.c	2016-10-05 13:03:43.435007698 +0200
@@ -567,6 +567,12 @@
                 g_subprocess_launcher_setenv (launcher, "WINDOWPATH", vt, TRUE);
         }
 
+        subprocess = g_subprocess_launcher_spawn (launcher,
+                                                  &error,
+                                                  "/etc/X11/optimus.sh",
+                                                  state->session_command,
+                                                  NULL);
+
         if (run_script) {
                 subprocess = g_subprocess_launcher_spawn (launcher,
                                                           &error,
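As an aside, the added hunk above discards both the returned GSubprocess and the GError, so a failing script dies silently. A slightly more careful version of those added lines (a sketch only, for the same spot in gdm-x-session.c, reusing the site-local /etc/X11/optimus.sh path from the workaround) would be:

/* Sketch only: same spawn as in the hack above, but keeping the error
 * so failures show up in the journal. /etc/X11/optimus.sh is the local
 * helper script from this workaround, not something gdm ships. */
g_autoptr (GError) script_error = NULL;
GSubprocess *script;

script = g_subprocess_launcher_spawn (launcher,
                                      &script_error,
                                      "/etc/X11/optimus.sh",
                                      state->session_command,
                                      NULL);
if (script == NULL)
        g_warning ("could not run output setup script: %s",
                   script_error->message);
else
        g_object_unref (script);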
I tested the patch on my Arch Linux installation and I can confirm that it works on my machine too. I finally have a login screen and a lock screen on my laptop again. So the issue should be fixed in gdm and not in mutter, I assume, and then it's this whole thing of not running scripts as root, or not having them at all during gdm startup? I picked that up somewhere when I tried to investigate this issue. Thanks for the patch, Maik! I'll enjoy it (-:
(In reply to sveinelo from comment #13)
> So the issue should be fixed in gdm and not in mutter, I assume, and then
> it's this whole thing of not running scripts as root, or not having them at
> all during gdm startup?

Yes, the problem is with gdm and the lost feature that it's no longer possible to execute scripts *early* enough (i.e. before gnome-shell/mutter starts) to prepare the output offloading. So we should poke the gdm folks.

Robert Munteanu's problem seems to be a different thing, since he doesn't use PRIME offload (at least I think so); it's just the same backtrace. Maybe also because mutter can't detect any outputs, but for a different reason.
New bug opened against gdm: https://bugzilla.gnome.org/show_bug.cgi?id=772470
(In reply to Maik Freudenberg from comment #14)
> Robert Munteanu's problem seems to be a different thing, since he doesn't
> use PRIME offload (at least I think so); it's just the same backtrace.
> Maybe also because mutter can't detect any outputs, but for a different
> reason.

Right, I don't use PRIME offload. I only have an NVIDIA video card, and the integrated GPU on my i5-6500 is disabled in the BIOS. If there's any debugging I can do to help distinguish the two bugs (patches to gdm/mutter included), please let me know.
Normally, your original bug report would have to be un-duped and reopened. While we're here, you can post your Xorg logs and xorg.conf (if you have one) for starters. Maybe there are some oddities visible concerning outputs.
Digging into this a bit further revealed that the patch is unneeded. Three years ago a new autostart-phase keyword, 'DisplayServer', was introduced for gdm autostart but never documented. So all it needs is to add the line

X-GNOME-Autostart-Phase=DisplayServer

to the autostart .desktop file:

[Desktop Entry]
Type=Application
Name=Optimus
Exec=/etc/X11/optimus.sh
NoDisplay=true
X-GNOME-Autostart-Phase=DisplayServer

Things like this just suck.
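For reference, the phase is just a string key in the .desktop file, which gdm's session machinery presumably reads via GLib's GKeyFile API like any other key. A minimal standalone illustration of parsing it (a sketch, not gdm's actual code; the default path below is the greeter autostart file from this thread):

/* keyfile-phase.c - print the autostart phase of a .desktop file.
 * Sketch only; not gdm code. Build (assumed):
 *   gcc keyfile-phase.c `pkg-config --cflags --libs glib-2.0` -o keyfile-phase */
#include <glib.h>

int
main (int argc, char **argv)
{
  const char *path = argc > 1 ? argv[1]
    : "/usr/share/gdm/greeter/autostart/optimus.desktop";
  g_autoptr (GKeyFile) key_file = g_key_file_new ();
  g_autoptr (GError) error = NULL;

  if (!g_key_file_load_from_file (key_file, path, G_KEY_FILE_NONE, &error))
    {
      g_printerr ("Failed to load %s: %s\n", path, error->message);
      return 1;
    }

  /* The undocumented key discussed above; absent means the default phase. */
  g_autofree char *phase =
    g_key_file_get_string (key_file, "Desktop Entry",
                           "X-GNOME-Autostart-Phase", NULL);
  g_print ("Autostart phase: %s\n", phase ? phase : "(default)");
  return 0;
}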
For the record, I moved my whole /etc/xdg/autostart directory out of the way and it made no difference.
Created attachment 337659 [details] Various logs related to nvidia and Xorg

Here's the debug information gathered by nvidia-bug-report.sh. It's pretty comprehensive, but in case you need any more information, including coredumps, please let me know. I'm anxious to get this solved and would be happy to help in any way that I can.
Created attachment 337688 [details] dmesg and journal logs
X-GNOME-Autostart-Phase=DisplayServer - did not help. This line exists in this article: https://wiki.archlinux.org/index.php/NVIDIA_Optimus

/usr/share/gdm/greeter/autostart/optimus.desktop:

[Desktop Entry]
Type=Application
Name=Optimus
Exec=sh -c "xrandr --setprovideroutputsource modesetting NVIDIA-0; xrandr --auto"
NoDisplay=true
X-GNOME-Autostart-Phase=DisplayServer

And after entering the login and password I have a black screen. I attached log files.
@dima-gr: the autostart .desktop file is only for preparing the X server for the gdm login session. Since you can enter your credentials, that part works. The user session runs its own X server, which has to be prepared for offloading as well. This is (currently) best done by adding the xrandr commands to /etc/gdm/Xsession. Alternatively, you could copy the autostart .desktop file into (every) user's autostart directory.

@Robert Munteanu: you're currently running the X server autoconfigured. This loads a lot of unnecessary/unwanted drivers. Please use nvidia-xconfig according to http://us.download.nvidia.com/XFree86/Linux-x86/367.57/README/editxconfig.html to generate an xorg.conf file. It probably won't help, but it will produce better logfiles.
Adding the xrandr commands to /etc/gdm/Xsession did not help, but maybe I was not adding the commands in the right place. Copying the autostart .desktop file into the user's autostart directory solved my problem. Thanks for your help, Maik.
Don't know about Robert's final decision, but to address the mutter developers: people will come back reporting the same crap every release you do, since whatever goes wrong in monitor-manager, it will always crash at center_pointer. As it has been ever since. So the _real_ bug concerning mutter is to check for sensible return values and tell the user via log instead of doing stupid things with NULL. And if somebody offers a patch that does nothing but check return values, why does it take you half a year to commit it and only two weeks to do some 'refactoring' and nullify it?

Additionally, the two generic GNOME bugs (should write a script to add them to every component's bug reports):
1. Lack of documentation
2. Lack of communication

Sincerely, LastV8
Created attachment 338271 [details] Various logs related to nvidia and Xorg
(In reply to Maik Freudenberg from comment #23)
> @Robert Munteanu: you're currently running the X server autoconfigured.
> This loads a lot of unnecessary/unwanted drivers. Please use nvidia-xconfig
> according to
> http://us.download.nvidia.com/XFree86/Linux-x86/367.57/README/editxconfig.html
> to generate an xorg.conf file. It probably won't help, but it will produce
> better logfiles.

Uploaded a new nvidia-bugreport.log.gz file, maybe that helps someone.

(In reply to Maik Freudenberg from comment #25)
> So the _real_ bug concerning mutter is to check for sensible return values
> and tell the user via log instead of doing stupid things with NULL. And if
> somebody offers a patch that does nothing but check return values, why does
> it take you half a year to commit it and only two weeks to do some
> 'refactoring' and nullify it?

Is there such a patch floating around that would add more error reporting? I'd actually try to apply it, if only to get better information for myself.
I'll think of something, as this has been annoying me to no end for a long time. Might take a week or two.
Created attachment 339031 [details] [review] Debug patch for output detection

For debug purposes only. Does not catch or fix anything, but prints out some info about output detection on the xrandr path. Get the info as root:

journalctl -b --no-pager | grep Mutter
The problem has nothing to do with nvidia, as it occurs on my small laptop without any NVIDIA graphics card.
What happens on the laptop is a total loss of the session while using video players (vlc and totem, probably gnome-mplayer and mplayer).
journalctl -b --no-pager | grep libmutter instead
meta-monitor-manager-private.h, line 310 (in the struct):

  int primary_monitor_index;

meta-monitor-manager.c, lines 69-72:

static void
meta_monitor_manager_init (MetaMonitorManager *manager)
{
}

lines 1329-1332:

meta_monitor_manager_get_primary_index (MetaMonitorManager *manager)
{
  return manager->primary_monitor_index;
}

It seems that primary_monitor_index is used but not set? Or where is it set? Is that true?

lines 278-280:

  if (info->is_primary)
    manager->primary_monitor_index = info->number;
}

This is the only place where primary_monitor_index is set. An "else" in case info->is_NOT_primary? What about info->number?

lines 84... 100, in construct_tile_monitor:

  info.number = monitor_infos->len;

lines 174... 219, in make_logical_config:

  info.number = monitor_infos->len;

meta-monitor-manager-private.h, line 210:

struct MetaMonitorInfo has no "len"(gth) named member. After that I am lost!
(In reply to Ralph Mytown from comment #33)
> It seems that primary_monitor_index is used but not set? Or where is it
> set? Is that true?

The struct is initialized to 0 on object construction:
https://git.gnome.org//browse/glib/tree/gobject/gtype.c#n1848

> lines 278-280:
>
>   if (info->is_primary)
>     manager->primary_monitor_index = info->number;
> }
>
> This is the only place where primary_monitor_index is set. An "else" in
> case info->is_NOT_primary?

No, otherwise the value would only be correct if the primary monitor index were the last one in the loop, or if the random value assigned in the else part (0?) accidentally matched the correct index.

> What about info->number?
>
> [...]
> meta-monitor-manager-private.h, line 210:
>
> struct MetaMonitorInfo has no "len"(gth) named member.

That's because monitor_info*s* is a GArray (of MetaMonitorInfo).
Note that in my particular scenario the problem only manifests itself when using the openSUSE-packaged RPM drivers, but not when installing the driver manually via the .run file: https://bugzilla.opensuse.org/show_bug.cgi?id=995924

So in my case the problem is probably not related to gdm/mutter but to the RPM packaging, and any information that I've contributed is probably not meaningful to the actual root cause.
monitor_infos->len;

line 56:

typedef struct _MetaMonitorInfo MetaMonitorInfo;

line 210:

struct _MetaMonitorInfo
{
  int number;
  int xinerama_index;
  MetaRectangle rect;
  /* for tiled monitors these are calculated, from untiled just copied */
  float refresh_rate;
  int width_mm;
  int height_mm;
  gboolean is_primary;
  gboolean is_presentation; /* XXX: not yet used */
  gboolean in_fullscreen;
  int scale;
  ...
  glong winsys_id;
  guint32 tile_group_id;
  int monitor_winsys_xid;
  int n_outputs;
  MetaOutput *outputs[META_MAX_OUTPUTS_PER_MONITOR];
};

line 276, in struct _MetaMonitorManager... (line 308):

  MetaMonitorInfo *monitor_infos;

So I maintain what I said: the struct(ure) MetaMonitorInfo has no member named "len"(gth), and that is probably why it crashes! And I do not see any GArray, only a pointer, on line 308! Is all of mutter programmed like this?

Thank you
(In reply to Ralph Mytown from comment #36)
> So I maintain what I said: the struct(ure) MetaMonitorInfo has no member
> named "len"(gth), and that is probably why it crashes!

You can maintain what you like, but monitor_infos is declared here:
https://git.gnome.org//browse/mutter/tree/src/backends/meta-monitor-manager.c#n177

It is not a MetaMonitorInfo, but a GArray, and that's where the len member comes from.
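To make the distinction concrete, here is a minimal standalone example of that pattern: a GArray of plain structs, where len belongs to the array and not to the element type. (A sketch with a stand-in struct, not mutter code.)

#include <glib.h>

typedef struct {
  int number;
  gboolean is_primary;
} MonitorInfo; /* stand-in for MetaMonitorInfo */

int
main (void)
{
  /* monitor_infos is a GArray, so 'len' is a member of the array... */
  GArray *monitor_infos = g_array_new (FALSE, TRUE, sizeof (MonitorInfo));

  MonitorInfo info = { 0, };
  /* ...which is why "info.number = monitor_infos->len" works above. */
  info.number = monitor_infos->len;
  info.is_primary = TRUE;
  g_array_append_val (monitor_infos, info);

  /* The elements themselves have no 'len' field at all. */
  MonitorInfo *first = &g_array_index (monitor_infos, MonitorInfo, 0);
  g_print ("monitor %d, primary: %d, array length: %u\n",
           first->number, first->is_primary, monitor_infos->len);

  g_array_free (monitor_infos, TRUE);
  return 0;
}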
Sorry for not getting back to you earlier. Monitor configuration changed substantially in 3.26; any chance you can try with that version?
Rui, no, it is not fixed in 3.26. Please read my comment #25 about what I think is the real bug here: https://bugzilla.gnome.org/show_bug.cgi?id=768204#c25

See this for 3.26: https://devtalk.nvidia.com/default/topic/1024318/linux/-solved-nvidia-prime-on-dual-gpu-configuration-giving-a-blank-screen/

This time, it was triggered by a subtle change in glib's(?) .desktop file parsing. In effect, the X screen wasn't set up, so again libmutter crashed at center_pointer because the monitor was NULL. The hell, will you ever learn to check for NULL and put out a useful debug message like 'Sorry, no screen found, stopping here.' instead of just 'boah, crash'?
(In reply to Florian Müllner from comment #9)
> (In reply to sveinelo from comment #8)
> > I think I made things much worse though.
>
> Yup :-)
>
> If you don't have at least two monitors, you'll always return an invalid
> monitor index. And if (1,1) is not located on any monitor, there could be
> issues as well.

According to my coredump using mutter-3.24.4, it's crashing on line 133:

121 static void
122 center_pointer (MetaBackend *backend)
123 {
124   MetaBackendPrivate *priv = meta_backend_get_instance_private (backend);
125   MetaMonitorManager *monitor_manager = priv->monitor_manager;
126   MetaLogicalMonitor *primary;
127
128   primary =
129     meta_monitor_manager_get_primary_logical_monitor (monitor_manager);
130
131   meta_backend_warp_pointer (backend,
132                              primary->rect.x + primary->rect.width / 2,
133                              primary->rect.y + primary->rect.height / 2);
134 }

which is because primary is NULL:

Core was generated by `/usr/bin/gnome-shell'.
Program terminated with signal SIGSEGV, Segmentation fault.
$1 = (MetaLogicalMonitor *) 0x0

I am not using the NVIDIA card in this system, so I do not think that the bug is related to that. FWIW, I am on a Lenovo P50 with Intel graphics.
I should also mention that this occurs when waking up from DPMS. I know that it's when it is waking up rather than sleeping because of the times in journalctl when gnome-shell crashes:

Oct 20 12:07:10 p50 kernel: gnome-shell[640]: segfault at 0 ip 00007f24a805e29e sp 00007fff2df2ffa0 error 4 in libmutter-0.so.0.0.0[7f24a7f7a000+13e000]
(In reply to Matt Turner from comment #40)
> (In reply to Florian Müllner from comment #9)
> > (In reply to sveinelo from comment #8)
> > > I think I made things much worse though.
> >
> > Yup :-)
> >
> > If you don't have at least two monitors, you'll always return an invalid
> > monitor index. And if (1,1) is not located on any monitor, there could be
> > issues as well.
>
> According to my coredump using mutter-3.24.4, it's crashing on line 133:
>
> 121 static void
> 122 center_pointer (MetaBackend *backend)
> 123 {
> 124   MetaBackendPrivate *priv = meta_backend_get_instance_private (backend);
> 125   MetaMonitorManager *monitor_manager = priv->monitor_manager;
> 126   MetaLogicalMonitor *primary;
> 127
> 128   primary =
> 129     meta_monitor_manager_get_primary_logical_monitor (monitor_manager);
> 130
> 131   meta_backend_warp_pointer (backend,
> 132                              primary->rect.x + primary->rect.width / 2,
> 133                              primary->rect.y + primary->rect.height / 2);
> 134 }

This crash (handling the headless case, i.e. no monitor at all) has been fixed in 3.26.
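For context, the fix referred to here would amount to an early bail-out in the code quoted above, along these lines (a sketch of the shape of such a fix, not the actual 3.26 commit):

/* Sketch: a headless guard applied to the 3.24-era code quoted in
 * comment #40; not the actual 3.26 commit. */
primary =
  meta_monitor_manager_get_primary_logical_monitor (monitor_manager);
if (!primary)
  return;

meta_backend_warp_pointer (backend,
                           primary->rect.x + primary->rect.width / 2,
                           primary->rect.y + primary->rect.height / 2);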
Jonas, as I mentioned in comment #39, this is still happening with 3.26. No offense, but IIRC you're a Wayland guy and we're talking about the xrandr path. Could you please point me to the commit where the check was implemented?
I'm also seeing a bug which meets this description, and I'm using the "binary blob" nvidia-340 drivers. I have added details to this Ubuntu bug (potential duplicate): https://bugs.launchpad.net/ubuntu/+source/gnome-shell/+bug/911591
I am on Ubuntu 17.10 using the following package versions:

mutter 3.26.2-0ubuntu0.1
mutter-common 3.26.2-0ubuntu0.1
libmutter-1-0 3.26.2-0ubuntu0.1
gdm3 3.26.1-3ubuntu3
libgdm1 3.26.1-3ubuntu3
gnome-shell 3.26.1-0ubuntu5
gnome-shell-common 3.26.1-0ubuntu5
nvidia-340 340.104-0ubuntu2
libxrandr2 2:1.5.1-1
xorg 1:7.7+19ubuntu3
xserver-xorg 1:7.7+19ubuntu3

This is with a single HDMI display connected to the NVIDIA card. After reading a bit of the conversation here, it looks like maybe the NVIDIA + gnome-shell + mutter combination isn't to blame after all? As mentioned in the Launchpad bug (https://bugs.launchpad.net/ubuntu/+source/gnome-shell/+bug/911591), I see errors in dmesg that mention libmutter:

[15040.472843] gnome-shell[32127]: segfault at 8 ip 00007f327f29d5d0 sp 00007ffeccf94a38 error 4 in libmutter-1.so.0.0.0[7f327f243000+142000]

The crash usually appears related to OpenGL window resize events. I have also noticed that xrandr display resolution changes can trigger the crash. Reproduction steps that work for me are to use xrandr to resize a couple of times, one resize immediately after another. Some OpenGL games and applications may be able to reproduce it due to their resizing behavior.

Reproduction command:

# Use an output and modes that are valid for you...
xrandr --output HDMI-0 --mode 1920x1080 ; xrandr --output HDMI-0 --mode 1920x1080 ; xrandr --output HDMI-0 --mode 800x600 ; xrandr --output HDMI-0 --mode 1920x1080

I've tested this a couple of times, and it looks like I am able to trigger the crash just with fast xrandr commands.
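A programmatic equivalent of that reproduction, for anyone who wants to hammer the server harder, could use the legacy XRandR screen-configuration API. The following is a sketch under stated assumptions: it changes the whole screen rather than a named output like HDMI-0, and it assumes the server reports at least two sizes.

/* repro-resize.c: rapidly flip the screen between two sizes via the
 * legacy XRandR API, mimicking the chained xrandr commands above.
 * Sketch only. Build (assumed):
 *   gcc repro-resize.c -o repro-resize -lX11 -lXrandr */
#include <stdio.h>
#include <X11/Xlib.h>
#include <X11/extensions/Xrandr.h>

int
main (void)
{
  Display *dpy = XOpenDisplay (NULL);
  if (!dpy)
    {
      fprintf (stderr, "cannot open display\n");
      return 1;
    }

  Window root = DefaultRootWindow (dpy);
  XRRScreenConfiguration *conf = XRRGetScreenInfo (dpy, root);
  int n_sizes = 0;
  XRRScreenSize *sizes = XRRConfigSizes (conf, &n_sizes);
  if (n_sizes < 2)
    {
      fprintf (stderr, "need at least two screen sizes\n");
      return 1;
    }

  Rotation rotation;
  SizeID current = XRRConfigCurrentConfiguration (conf, &rotation);

  /* Alternate between size index 0 and the current size as fast as the
   * round trips allow, like the back-to-back xrandr calls. */
  for (int i = 0; i < 4; i++)
    {
      SizeID next = (i % 2 == 0) ? 0 : current;
      XRRSetScreenConfig (dpy, conf, root, next, rotation, CurrentTime);
      XSync (dpy, False);
      printf ("switched to %dx%d\n", sizes[next].width, sizes[next].height);
    }

  XRRFreeScreenConfigInfo (conf);
  XCloseDisplay (dpy);
  return 0;
}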
GNOME is going to shut down bugzilla.gnome.org in favor of gitlab.gnome.org. As part of that, we are mass-closing older open tickets in bugzilla.gnome.org which have not seen updates for a longer time (resources are unfortunately quite limited so not every ticket can get handled). If you can still reproduce the situation described in this ticket in a recent and supported software version, then please follow https://wiki.gnome.org/GettingInTouch/BugReportingGuidelines and create a new ticket at https://gitlab.gnome.org/GNOME/mutter/-/issues/ Thank you for your understanding and your help.