Bug 782688 – Crashes trying to set keyboard map when logging in

Summary:	Crashes trying to set keyboard map when logging in


Status:	RESOLVED OBSOLETE

Product:	mutter
Classification:	Core
Component:	wayland
Version:	3.28.x
Hardware:	Other Linux

Importance:	Normal blocker
Target Milestone:	---
Assigned To:	mutter-maint
QA Contact:	mutter-maint

URL:
Whiteboard:

Duplicates:	787422 792284 (view as bug list)
Depends on:
Blocks:

Reported:	2017-05-16 11:51 UTC by Maël Lavault
Modified:	2019-10-07 14:35 UTC

See Also:
GNOME target:	---
GNOME version:	---

Attachments
Full backtrace (6.27 KB, text/plain) 2017-05-22 07:44 UTC, Maël Lavault
libxkbcommon-warn-if-not-enoent.patch (595 bytes, patch) 2017-05-22 09:10 UTC, Jonas Ådahl	none
libxkbcommon-list-open-files-on-error.patch (1.87 KB, patch) 2017-05-22 13:00 UTC, Jonas Ådahl	none
Too many open files (171.82 KB, text/plain) 2017-05-30 08:29 UTC, Maël Lavault
background: free WallClock explicitly when it is not needed (911 bytes, patch) 2017-06-11 02:15 UTC, Hyungwon Hwang	none
[DEBUG] wallclock: Move the removal of GSource to dispose() (3.37 KB, patch) 2017-06-11 02:16 UTC, Hyungwon Hwang	rejected
wallclock: Move the removal of GSource to dispose() (1.62 KB, patch) 2017-06-11 08:11 UTC, Hyungwon Hwang	rejected

Description Maël Lavault 2017-05-16 11:51:11 UTC

I just got a gnome-shell crash, trying to get my 3 displays to work (one does not display anything anymore).

Unrecoverable failure in required component org.gnome.Shell.desktop

Process 1968 (gnome-shell) crashed in xkb_keymap_ref()

Process 1968 (gnome-shell) of user 1000 dumped core.

Stack trace of thread 1968:
#0  0x00007fdcee1e6b53 xkb_keymap_ref (libxkbcommon.so.0)
#1  0x00007fdceed8fe1a clutter_evdev_set_keyboard_map (libmutter-clutter-0.so)
#2  0x00007fdcefd7ff63 meta_backend_native_set_keymap (libmutter-0.so.0)
#3  0x00007fdce83dbbde ffi_call_unix64 (libffi.so.6)
#4  0x00007fdce83db54f ffi_call (libffi.so.6)
#5  0x00007fdcf2ad42ec n/a (libgjs.so.0)
#6  0x00007fdcf2ad5a96 n/a (libgjs.so.0)
#7  0x00007fdcf40c4f85 n/a (n/a)
#8  0x000055a83c5db7b0 n/a (n/a)
#9  0x00007fdcc1021a7d n/a (n/a)

Comment 1 Maël Lavault 2017-05-16 11:56:00 UTC

I got another trace a few moments later:

Process 3230 (gnome-shell) of user 1000 dumped core.

Stack trace of thread 3230:
#0  0x00007feadd108e81 _g_log_abort (libglib-2.0.so.0)
#1  0x00007feadd109ebc g_log_default_handler (libglib-2.0.so.0)
#2  0x0000562b573ca945 default_log_handler (gnome-shell)
#3  0x00007feadd10a14d g_logv (libglib-2.0.so.0)
#4  0x00007feadd10a2bf g_log (libglib-2.0.so.0)
#5  0x00007feae1b56a5e x_io_error (libmutter-0.so.0)
#6  0x00007feadbb52a5e _XIOError (libX11.so.6)
#7  0x00007feadbb50422 _XReadEvents (libX11.so.6)
#8  0x00007feadbb37b54 XIfEvent (libX11.so.6)
#9  0x00007feae1b1e7cb meta_display_get_current_time_roundtrip (libmutter-0.so.0)
#10 0x00007feae1b6a4f1 meta_wayland_surface_destroy_window (libmutter-0.so.0)
#11 0x00007feae1b6ad10 wl_surface_destructor (libmutter-0.so.0)
#12 0x00007feadca8ff80 destroy_resource (libwayland-server.so.0)
#13 0x00007feade756b89 wl_map_for_each (libwayland-client.so.0)
#14 0x00007feadca9006d wl_client_destroy (libwayland-server.so.0)
#15 0x00007feadca90128 wl_client_connection_data (libwayland-server.so.0)
#16 0x00007feadca91c52 wl_event_loop_dispatch (libwayland-server.so.0)
#17 0x00007feae1b56317 wayland_event_source_dispatch (libmutter-0.so.0)
#18 0x00007feadd103277 g_main_context_dispatch (libglib-2.0.so.0)
#19 0x00007feadd103618 g_main_context_iterate.isra.25 (libglib-2.0.so.0)
#20 0x00007feadd103932 g_main_loop_run (libglib-2.0.so.0)
#21 0x00007feae1b28bbc meta_run (libmutter-0.so.0)
#22 0x0000562b573ca4a7 main (gnome-shell)
#23 0x00007feadb53f5fe __libc_start_main (libc.so.6)
#24 0x0000562b573ca5ba _start (gnome-shell)

Stack trace of thread 3234:
#0  0x00007feadb623ced poll (libc.so.6)
#1  0x00007feadd103599 g_main_context_iterate.isra.25 (libglib-2.0.so.0)
#2  0x00007feadd1036ac g_main_context_iteration (libglib-2.0.so.0)
#3  0x00007feab6632f3d dconf_gdbus_worker_thread (libdconfsettings.so)
#4  0x00007feadd12a586 g_thread_proxy (libglib-2.0.so.0)
#5  0x00007feadb8f736d start_thread (libpthread.so.0)
#6  0x00007feadb62fe0f __clone (libc.so.6)

Stack trace of thread 3267:
#0  0x00007feadb8fd7db pthread_cond_wait@@GLIBC_2.3.2 (libpthread.so.0)
#1  0x00007fead5ff9580 PR_WaitCondVar (libnspr4.so)
#2  0x00007fead9cc80b1 _ZN2js12HelperThread10threadLoopEv (libmozjs-38.so)
#3  0x00007fead5ffeecb _pt_root (libnspr4.so)
#4  0x00007feadb8f736d start_thread (libpthread.so.0)
#5  0x00007feadb62fe0f __clone (libc.so.6)

Stack trace of thread 3268:
#0  0x00007feadb8fd7db pthread_cond_wait@@GLIBC_2.3.2 (libpthread.so.0)
#1  0x00007fead5ff9580 PR_WaitCondVar (libnspr4.so)
#2  0x00007fead9cc80b1 _ZN2js12HelperThread10threadLoopEv (libmozjs-38.so)
#3  0x00007fead5ffeecb _pt_root (libnspr4.so)
#4  0x00007feadb8f736d start_thread (libpthread.so.0)
#5  0x00007feadb62fe0f __clone (libc.so.6)

Stack trace of thread 3270:
#0  0x00007feadb8fd7db pthread_cond_wait@@GLIBC_2.3.2 (libpthread.so.0)
#1  0x00007fead5ff9580 PR_WaitCondVar (libnspr4.so)
#2  0x00007fead9cc80b1 _ZN2js12HelperThread10threadLoopEv (libmozjs-38.so)
#3  0x00007fead5ffeecb _pt_root (libnspr4.so)
#4  0x00007feadb8f736d start_thread (libpthread.so.0)
#5  0x00007feadb62fe0f __clone (libc.so.6)

Stack trace of thread 3271:
#0  0x00007feadb8fd7db pthread_cond_wait@@GLIBC_2.3.2 (libpthread.so.0)
#1  0x00007fead5ff9580 PR_WaitCondVar (libnspr4.so)
#2  0x00007fead9cc80b1 _ZN2js12HelperThread10threadLoopEv (libmozjs-38.so)
#3  0x00007fead5ffeecb _pt_root (libnspr4.so)
#4  0x00007feadb8f736d start_thread (libpthread.so.0)
#5  0x00007feadb62fe0f __clone (libc.so.6)

Stack trace of thread 3272:
#0  0x00007feadb8fd7db pthread_cond_wait@@GLIBC_2.3.2 (libpthread.so.0)
#1  0x00007fead5ff9580 PR_WaitCondVar (libnspr4.so)
#2  0x00007fead9cc80b1 _ZN2js12HelperThread10threadLoopEv (libmozjs-38.so)
#3  0x00007fead5ffeecb _pt_root (libnspr4.so)
#4  0x00007feadb8f736d start_thread (libpthread.so.0)
#5  0x00007feadb62fe0f __clone (libc.so.6)

Stack trace of thread 3273:
#0  0x00007feadb8fd7db pthread_cond_wait@@GLIBC_2.3.2 (libpthread.so.0)
#1  0x00007fead5ff9580 PR_WaitCondVar (libnspr4.so)
#2  0x00007fead9cc80b1 _ZN2js12HelperThread10threadLoopEv (libmozjs-38.so)
#3  0x00007fead5ffeecb _pt_root (libnspr4.so)
#4  0x00007feadb8f736d start_thread (libpthread.so.0)
#5  0x00007feadb62fe0f __clone (libc.so.6)

Stack trace of thread 8274:
#0  0x00007feadb62a7a9 syscall (libc.so.6)
#1  0x00007feadd1487da g_cond_wait_until (libglib-2.0.so.0)
#2  0x00007feadd0d7b31 g_async_queue_pop_intern_unlocked (libglib-2.0.so.0)
#3  0x00007feadd12af24 g_thread_pool_thread_proxy (libglib-2.0.so.0)
#4  0x00007feadd12a586 g_thread_proxy (libglib-2.0.so.0)
#5  0x00007feadb8f736d start_thread (libpthread.so.0)
#6  0x00007feadb62fe0f __clone (libc.so.6)

Stack trace of thread 3231:
#0  0x00007feadb623ced poll (libc.so.6)
#1  0x00007feadd103599 g_main_context_iterate.isra.25 (libglib-2.0.so.0)
#2  0x00007feadd1036ac g_main_context_iteration (libglib-2.0.so.0)
#3  0x00007feadd1036f1 glib_worker_main (libglib-2.0.so.0)
#4  0x00007feadd12a586 g_thread_proxy (libglib-2.0.so.0)
#5  0x00007feadb8f736d start_thread (libpthread.so.0)
#6  0x00007feadb62fe0f __clone (libc.so.6)

Stack trace of thread 3269:
#0  0x00007feadb8fd7db pthread_cond_wait@@GLIBC_2.3.2 (libpthread.so.0)
#1  0x00007fead5ff9580 PR_WaitCondVar (libnspr4.so)
#2  0x00007fead9cc80b1 _ZN2js12HelperThread10threadLoopEv (libmozjs-38.so)
#3  0x00007fead5ffeecb _pt_root (libnspr4.so)
#4  0x00007feadb8f736d start_thread (libpthread.so.0)
#5  0x00007feadb62fe0f __clone (libc.so.6)

Stack trace of thread 8215:
#0  0x00007feadb62a7a9 syscall (libc.so.6)
#1  0x00007feadd1487da g_cond_wait_until (libglib-2.0.so.0)
#2  0x00007feadd0d7b31 g_async_queue_pop_intern_unlocked (libglib-2.0.so.0)
#3  0x00007feadd12af24 g_thread_pool_thread_proxy (libglib-2.0.so.0)
#4  0x00007feadd12a586 g_thread_proxy (libglib-2.0.so.0)
#5  0x00007feadb8f736d start_thread (libpthread.so.0)
#6  0x00007feadb62fe0f __clone (libc.so.6)

Stack trace of thread 3232:
#0  0x00007feadb623ced poll (libc.so.6)
#1  0x00007feadd103599 g_main_context_iterate.isra.25 (libglib-2.0.so.0)
#2  0x00007feadd103932 g_main_loop_run (libglib-2.0.so.0)
#3  0x00007feadec28b16 gdbus_shared_thread_func (libgio-2.0.so.0)
#4  0x00007feadd12a586 g_thread_proxy (libglib-2.0.so.0)
#5  0x00007feadb8f736d start_thread (libpthread.so.0)
#6  0x00007feadb62fe0f __clone (libc.so.6)

Stack trace of thread 3266:
#0  0x00007feadb623ced poll (libc.so.6)
#1  0x00007feae5124b71 poll_func (libpulse.so.0)
#2  0x00007feae5116530 pa_mainloop_poll (libpulse.so.0)
#3  0x00007feae5116bc0 pa_mainloop_iterate (libpulse.so.0)
#4  0x00007feae5116c50 pa_mainloop_run (libpulse.so.0)
#5  0x00007feae5124ab9 thread (libpulse.so.0)
#6  0x00007feadb2f1078 internal_thread_func (libpulsecommon-10.0.so)
#7  0x00007feadb8f736d start_thread (libpthread.so.0)
#8  0x00007feadb62fe0f __clone (libc.so.6)

Stack trace of thread 3274:
#0  0x00007feadb8fd7db pthread_cond_wait@@GLIBC_2.3.2 (libpthread.so.0)
#1  0x00007fead5ff9580 PR_WaitCondVar (libnspr4.so)
#2  0x00007fead9cc80b1 _ZN2js12HelperThread10threadLoopEv (libmozjs-38.so)
#3  0x00007fead5ffeecb _pt_root (libnspr4.so)
#4  0x00007feadb8f736d start_thread (libpthread.so.0)
#5  0x00007feadb62fe0f __clone (libc.so.6)

Comment 2 Jonas Ådahl 2017-05-16 12:06:43 UTC

Any messages in the journal from around the time of the first crash? It looks like you tried to set a keymap that xkbcommon did not understand.

The second one looks like an Xwayland crash.

Comment 3 Maël Lavault 2017-05-16 12:12:18 UTC

I didn't do anything related to keymap.

I have this message which might be related to my apple keyboard:

apple 0005:05AC:0256.0009: unknown main item tag 0x0

Also some thunderbolt errors:

thunderbolt 0000:07:00.0: resetting error on 0:c.

Comment 4 Maël Lavault 2017-05-22 07:12:20 UTC

I get the first one (the xkb_keymap_ref issue) a lot these days, mostly on login, right after entering my password. My session tries to load, then crashes and goes back to the login screen. Usually I can log in the second time without issues. I get the same stack trace each time.

Comment 5 Jonas Ådahl 2017-05-22 07:27:54 UTC

Could you get a full backtrace next time? If coredumpctl is used, get it by:

coredumpctl gdb <pid-of-coredump-entry>
then type: "backtrace full"

What are your input settings? I.e. what input method, keyboard layouts etc. have you configured?

Comment 6 Maël Lavault 2017-05-22 07:44:50 UTC

Created attachment 352343 [details]
Full backtrace

Here is the full backtrace, fresh from this morning :)

Comment 7 Maël Lavault 2017-05-22 07:46:15 UTC

My keyboard layout is Swedish, I have only one input source. For the input method I'm not sure, I didn't change anything so it should be the default one in Fedora 26.

Comment 8 Jonas Ådahl 2017-05-22 08:15:02 UTC

Can't reproduce this, even with identical input to what seems to be passed according to that backtrace.

Also, by looking at all the different code paths leading to the crash, you should have gotten something in the log prefixed with

"xkbcommon: ERROR: "

Did you check "journalctl --system" too?

FWIW, the crash can easily be avoided with a simple NULL check, and probably should, as the call might come from the outside (outside of libmutter), but still want to find out the cause to fix it properly.

Comment 9 Maël Lavault 2017-05-22 08:21:25 UTC

Yes you are right, here are the errors:

xkbcommon: ERROR: Couldn't look up rules 'evdev', model 'pc105+inet', layout 'se,us', variant ',', options ''
xkbcommon: ERROR: 	/home/mlavault/.xkb
xkbcommon: ERROR: 1 include paths could not be added:
xkbcommon: ERROR: 	/usr/share/X11/xkb
xkbcommon: ERROR: 1 include paths searched:
xkbcommon: ERROR: Couldn't find file "rules/evdev" in include paths

Comment 10 Jonas Ådahl 2017-05-22 08:47:34 UTC

I assume you used journalctl -r here? (It looks reversed. In the future, it might be a good idea to paste from journalctl -e instead, to get the lines in the correct order.)

Anyway, does the /usr/share/X11/xkb/rules/evdev file exist? What are the file's permissions?

Comment 11 Maël Lavault 2017-05-22 08:50:16 UTC

Yes sorry about that, journalctl is still pretty new to me.

Yes the file exists, here are the permissions:

-rw-r--r-- 1 root root 42941 12 mai   15:06 /usr/share/X11/xkb/rules/evdev

Comment 12 Jonas Ådahl 2017-05-22 09:10:00 UTC

Created attachment 352346 [details] [review]
libxkbcommon-warn-if-not-enoent.patch

Would you be able to test the attached libxkbcommon patch? It'll print any error "fopen()" returns when trying to open the keymap, as long as it's not ENOENT (i.e. file not found). As the file exists and has adequate permissions, it must be some other error.
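
The idea behind the patch can be illustrated outside of libxkbcommon. This is only a minimal Python sketch, not the attached C patch: `find_in_include_paths` and its parameters are hypothetical names. It tries each include path in turn, stays quiet when a file is simply missing (ENOENT), and reports any other failure together with its reason.

```python
import errno
import os


def find_in_include_paths(include_paths, name, log=print):
    """Open `name` from the first include path where it can be opened.

    Hypothetical helper: "file not found" (ENOENT) is an expected miss and
    stays silent; any other failure is logged with its reason.
    """
    for base in include_paths:
        path = os.path.join(base, name)
        try:
            return open(path, "rb")
        except OSError as err:
            if err.errno != errno.ENOENT:
                log(f'Couldn\'t open file "{path}": {os.strerror(err.errno)}')
    return None
```

With a check like this, a failure such as "Too many open files" would show up in the log instead of being folded into the generic "Couldn't find file" message.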

Comment 13 Maël Lavault 2017-05-22 09:19:12 UTC

I can try, I can compile xkbcommon with the patch applied, but then how would I install it ?

Comment 14 Jonas Ådahl 2017-05-22 09:25:39 UTC

A risky method would be to override the system one by setting the prefix to /usr. You can also build an RPM with the patch and install that. Otherwise I can look into creating a copr repo with that patch that you can temporarily enable.

Comment 15 Maël Lavault 2017-05-22 09:29:26 UTC

If you could create a copr repo that would be fantastic ! I'm afraid to break something if I do it myself. Thanks a lot for your time and reactiveness on this issue, it is hugely appreciated :)

Comment 16 Jonas Ådahl 2017-05-22 10:01:42 UTC

Here it is: https://copr.fedorainfracloud.org/coprs/jadahl/gnomebug-782688/

You can either download the RPM manually and rpm -i it from here:
https://copr-be.cloud.fedoraproject.org/results/jadahl/gnomebug-782688/fedora-26-x86_64/00555099-libxkbcommon/

Or add the copr repo and AFAIK just update:
dnf config-manager --add-repo https://copr.fedorainfracloud.org/coprs/jadahl/gnomebug-782688/repo/fedora-26/jadahl-gnomebug-782688-fedora-26.repo

Using the second method, just don't forget to remove it later as I won't update it!

Comment 17 Jonas Ådahl 2017-05-22 10:02:28 UTC

You'll also have to restart gdm for this. Easiest way to do that is to reboot.

Comment 18 Maël Lavault 2017-05-22 10:10:23 UTC

Thanks ! So the real error is this one:

xkbcommon: ERROR: Couldn't open file "/usr/share/X11/xkb/rules/evdev": Too many open files

Which seems weird to me. I don't see why it would load such a huge amount of files.

Comment 19 Jonas Ådahl 2017-05-22 10:16:07 UTC

That was my suspicion, as we have a similar issue in bug 782690, except with no way to reproduce it. If you don't mind, I'll create a new libxkbcommon package that'll print the content of /proc/<pid>/fd (open files) to the log when it fails to open rules/evdev due to too many open files.

Comment 20 Maël Lavault 2017-05-22 10:17:20 UTC

Of course, happy to help !

Comment 21 Jonas Ådahl 2017-05-22 11:02:32 UTC

A new build is ready with libxkbcommon-0.7.1-5.fc26. It should print lots of stuff (what it does is run "ls -l /proc/<pid>/fd") into the journal when it happens. Copy all of it and attach it as a text file to this bug.

Comment 22 Maël Lavault 2017-05-22 11:54:17 UTC

I tried the new build and cannot reproduce anymore. I don't see any messages and don't seem to have the crash. Did you add some safety checks?

Comment 23 Jonas Ådahl 2017-05-22 13:00:27 UTC

Created attachment 352351 [details] [review]
libxkbcommon-list-open-files-on-error.patch

(In reply to Maël Lavault from comment #22)
> I tried the new build and cannot reproduce anymore. I don't see any messages
> and don't seems to have the crash. Did you add some safety checks ?

No more checks. The only difference is that, if fopen fails with "too many open files", "ls -l /proc/..." will be run and the output printed to stdout (i.e. the journal). It should crash the same way it did before.
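
For anyone wanting to look at the same information from a running process, here is a rough Python sketch of the listing step. It is not the patch itself (which is C and runs `ls -l`); `list_open_fds` is a hypothetical name, and it assumes a Linux `/proc` filesystem.

```python
import os


def list_open_fds(pid="self"):
    """Map each open fd number of a process to what it points at (Linux only)."""
    fd_dir = f"/proc/{pid}/fd"
    result = {}
    for entry in os.listdir(fd_dir):
        try:
            result[int(entry)] = os.readlink(os.path.join(fd_dir, entry))
        except OSError:
            # The fd was closed between listdir() and readlink(); skip it.
            continue
    return result
```

Note that listing the directory needs a free descriptor of its own, so a helper like this is easier to run from a separate process (passing the PID) than from inside the process that has already hit the limit.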

Comment 24 Maël Lavault 2017-05-22 13:03:23 UTC

Ok, I'll monitor this in the coming day and keep you updated as soon as I get another crash

Comment 25 Maël Lavault 2017-05-26 11:40:52 UTC

For some reason I haven't been able to reproduce the crash lately...

Comment 26 Maël Lavault 2017-05-30 08:29:15 UTC

Created attachment 352854 [details]
Too many open files

Ok I finally managed to reproduce it. See attached file.

Comment 27 Maël Lavault 2017-05-30 08:29:55 UTC

I have applications that use autostart. It might be related to Atom starting up.

Comment 28 Jonas Ådahl 2017-05-31 01:55:11 UTC

(In reply to Maël Lavault from comment #26)
> Created attachment 352854 [details]
> Too many open files
> 
> Ok I finally managed to reproduce it. See attached file.

Ok, so there seems to be an abnormal amount of timerfds (847 open timerfd instances). Now we "just" need to find out who is opening all these timerfds.
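
Counting like this can be done mechanically on the attached listing. The following Python sketch tallies link targets in `ls -l /proc/<pid>/fd` style output; `tally_fd_targets` is a hypothetical name, and the sample lines are made up for illustration, not copied from the attachment.

```python
import re
from collections import Counter


def tally_fd_targets(ls_output):
    """Count fd targets in `ls -l /proc/<pid>/fd` output, grouped by kind."""
    counts = Counter()
    for line in ls_output.splitlines():
        _, arrow, target = line.partition(" -> ")
        if not arrow:
            continue  # header such as "total 0", or an unrelated log line
        # Collapse "socket:[12345]" and "pipe:[678]" to "socket" and "pipe";
        # keep "anon_inode:[timerfd]" and regular paths as they are.
        kind = re.sub(r":\[\d+\]$", "", target.strip())
        counts[kind] += 1
    return counts


# Made-up sample lines, not taken from the attachment.
SAMPLE = """\
total 0
lrwx------ 1 user user 64 May 30 10:29 0 -> /dev/null
lrwx------ 1 user user 64 May 30 10:29 5 -> socket:[30412]
lrwx------ 1 user user 64 May 30 10:29 6 -> anon_inode:[timerfd]
lrwx------ 1 user user 64 May 30 10:29 7 -> anon_inode:[timerfd]
lr-x------ 1 user user 64 May 30 10:29 8 -> pipe:[30977]
"""

if __name__ == "__main__":
    for kind, count in tally_fd_targets(SAMPLE).most_common():
        print(f"{count:5d}  {kind}")
```

Sorting by count makes an outlier such as hundreds of `anon_inode:[timerfd]` entries stand out immediately.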

Comment 29 Maël Lavault 2017-06-05 09:56:19 UTC

How can I find this information ?

Comment 30 Jonas Ådahl 2017-06-05 09:59:03 UTC

(In reply to Maël Lavault from comment #29)
> How can I find this information ?

This is a problem with gnome-shell or one of its dependencies. I haven't managed to find time to look further than that there are timerfds in libgnome-desktop (related to time and date), libinput and some place more IIRC.

Comment 31 Hyungwon Hwang 2017-06-11 02:15:53 UTC

Created attachment 353550 [details] [review]
background: free WallClock explicitly when it is not needed

Comment 32 Hyungwon Hwang 2017-06-11 02:16:59 UTC

Created attachment 353551 [details] [review]
[DEBUG] wallclock: Move the removal of GSource to dispose()

This patch includes some line for printing debug messages.

Comment 33 Hyungwon Hwang 2017-06-11 02:29:04 UTC

I am not sure that Maël hit the same use case as I did, but it can be reproduced by switching VTs (Ctrl+Alt+F5, then Ctrl+Alt+F1: 2 more timerfds created, and so on).

This symptom has gone after applying these patches.

But there is one thing that is strange to me. Even after I explicitly called <run_dispose() in gnome-shell> to call <dispose() of wallclock in gnome-desktop>, <destroy() of wallclock in gnome-desktop> is not called. Isn't the GObject's destroy() called automatically after dispose() is called?

You might check it easily with the debug messages I inserted.

Comment 34 Jonas Ådahl 2017-06-11 05:22:16 UTC

(In reply to Hyungwon Hwang from comment #33)
> I am not sure that Maël experienced the same use-case of me. But it could be
> reproduced by changing the VTs (Ctrl+Alt+F5 - Ctrl+Alt+F1 (2 more timerfd
> created) ...)
> 
> This symptom has gone after applying these patches.
> 
> But there is one thing strange for me. Even after I explicitly called
> <run_dispose() in gnome-shell> to call <dispose() of wallclock in
> gnome-desktop>, <destroy() of wallclock in gnome-desktop> is not called.
> Isn't the Gobject's destroy() called automatically after dispose() is called?
> 
> You might check it easily with the debug messages I inserted.

Both dispose and finalize are called when the GObject is destroyed. I assume this is done by the JavaScript garbage collector, when the JS object doesn't have any references left to it.

Is the finalize function of the wall clock never called at all for you? If so, it sounds like the Background JS object references are never properly unset and thus never destroyed by the GC.

Comment 35 Hyungwon Hwang 2017-06-11 08:11:20 UTC

Yes. At that time, it wasn't called at all.

This time, I tried to trigger the GC explicitly by running imports.system.gc() in Looking Glass. I could see that destroy() was called when the GC ran.

I guess the original problem happened because a lot of timerfds were created but the GC didn't run, since a timerfd uses very little memory.

In my opinion, freeing the timerfd by calling run_dispose() is a safe way to avoid this kind of situation. It would be good to get advice from the gnome-desktop developers.
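
The effect can be modelled in a few lines. This is a Python sketch only: Python's collector is not SpiderMonkey's, and `FakeClock` and `churn` are hypothetical names. Each object holds one descriptor and sits in a reference cycle, so it is only freed when the cycle collector runs; with collection disabled, descriptors pile up unless they are released explicitly.

```python
import gc
import os


class FakeClock:
    """Hypothetical stand-in for an object that owns one file descriptor."""

    live_fds = 0  # descriptors currently held by all FakeClock instances

    def __init__(self):
        self._fd = os.open(os.devnull, os.O_RDONLY)
        # Reference cycle: reference counting alone cannot free this object,
        # so it lingers until the cycle collector runs.
        self._cycle = self
        FakeClock.live_fds += 1

    def dispose(self):
        """Release the descriptor now; safe to call more than once."""
        if self._fd is not None:
            os.close(self._fd)
            self._fd = None
            FakeClock.live_fds -= 1

    def __del__(self):
        self.dispose()


def churn(count, explicit_dispose):
    """Create and drop `count` clocks; return how many fds are still held."""
    gc.collect()
    gc.disable()  # model a collector that does not get around to running
    try:
        for _ in range(count):
            clock = FakeClock()
            if explicit_dispose:
                clock.dispose()
            del clock
        return FakeClock.live_fds
    finally:
        gc.enable()
        gc.collect()  # the collector finally runs and finalizes the leftovers
```

In this model, `churn(50, explicit_dispose=False)` leaves all 50 descriptors open until the collector runs, while `churn(50, explicit_dispose=True)` leaves none.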

Comment 36 Hyungwon Hwang 2017-06-11 08:11:55 UTC

Created attachment 353558 [details] [review]
wallclock: Move the removal of GSource to dispose()

Comment 37 Jonas Ådahl 2017-06-11 09:01:43 UTC

Yea, does indeed make sense. I can see another place where the wall clock has an explicit run_dispose(), so I guess it was originally intended as a counter measure for delayed GC.

However, it seems that in bug 780861, the content of dispose was moved to finalize as part of a crash fix, so need to make sure we don't introduce any crash that that bug fixed.

Comment 38 Bastien Nocera 2017-06-11 12:19:39 UTC

Review of attachment 353558 [details] [review]:

This would revert 1329895396bae1999a9a90d0b27fe260e4a0d693. See https://bugzilla.gnome.org/show_bug.cgi?id=780861

I don't think this makes any sense. If the problem is that GnomeWallClock uses too much resources internally, maybe you'd want to share instances of it, rather than create them and try to find a way to dispose of them.

This is a problem to be solved in JS, not in the C code...

Comment 39 Bastien Nocera 2017-06-11 12:20:37 UTC

Review of attachment 353551 [details] [review]:

Rejecting the debug patch.

Comment 40 Bastien Nocera 2017-06-11 12:26:25 UTC

(In reply to Bastien Nocera from comment #38)
> Review of attachment 353558 [details] [review] [review]:
> 
> This would revert 1329895396bae1999a9a90d0b27fe260e4a0d693. See
> https://bugzilla.gnome.org/show_bug.cgi?id=780861
> 
> I don't think this makes any sense. If the problem is that GnomeWallClock
> uses too much resources internally, maybe you'd want to share instances of
> it, rather than create them and try to find a way to dispose of them.
> 
> This is a problem to be solved in JS, not in the C code...

There's 4 instances of GnomeWallClock, only two different types. I don't think that's what's leaking the timerfds. Might be worth making those objects singletons if it's going to be a resource problem.

Comment 41 Jonas Ådahl 2017-06-11 13:25:35 UTC

Ignoring the other places in gnome-shell, there is 1 instance of GnomeWallClock per background, and one background per logical monitor, and they are destroyed and regenerated each time the monitor configuration changes (e.g. on VT switches). It doesn't explain why 847 timerfds leak though, unless we somehow GC extremely rarely.

Anyhow, using a single wall clock for all of gnome-shell would at least avoid any chance of the wall clock being the cause of the timerfd leak.
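
As a sketch of what sharing would look like (in Python rather than gnome-shell's JS, with hypothetical names `WallClockStub`, `get_shared_clock` and `Background`): every background borrows one lazily created clock instead of constructing its own, so recreating backgrounds never creates more clocks.

```python
class WallClockStub:
    """Hypothetical stand-in for an object that is costly to duplicate."""

    instances_created = 0

    def __init__(self):
        WallClockStub.instances_created += 1


_shared_clock = None


def get_shared_clock():
    """Return the one process-wide clock, creating it on first use."""
    global _shared_clock
    if _shared_clock is None:
        _shared_clock = WallClockStub()
    return _shared_clock


class Background:
    """Each background borrows the shared clock instead of owning one."""

    def __init__(self):
        self.clock = get_shared_clock()
```

The trade-off is that the shared clock lives for the whole session, which is acceptable when there is exactly one of it.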

Comment 42 Hyungwon Hwang 2017-06-13 15:50:14 UTC

Yes. That does not seem to be the right way to get at the root cause. Even if the 847 timerfds were leaked from GnomeWallClock, I would still have to find out why the code was called that often and why the GC didn't work well.

What about gathering more info about it?

Maël, do you still experience it these days? If so, could you tell me more about the situation when it happens?

Comment 43 Maël Lavault 2017-06-13 15:57:18 UTC

So it only seems to happen after boot, right when I try to log in for the first time. I cannot reproduce it 100% of the time; it doesn't happen a lot, and when it does, gnome-shell crashes and goes back to the login screen, where I can log in again. It usually works well the second time.

I suspect it might have something to do with DisplayPort support; I get a lot of bugs from it (crashes, flickering, screens that go black, ...). It might crash silently and recreate the display a lot, which would lead to timerfd leaks (this is just a hypothesis). I have 3 screens on 2 mini DisplayPort ports (with daisy chaining) and the internal screen of my MacBook Pro, which is deactivated (but still shows a grey background somehow).

Comment 44 Jonas Ådahl 2017-06-14 06:36:19 UTC

We can check whether that is the issue. What you'd need to do is to run

udevadm monitor -s drm > udev-drm.log

from another VT, then try to log in from GDM and see if it reproduces. You should get log entries for every 'hotplug' event in the same way as mutter would see them.

Comment 45 Daniel Playfair Cal 2017-07-24 13:26:37 UTC

Hey, I just found this thread after trying to debug the same problem here: https://bugzilla.redhat.com/show_bug.cgi?id=1441490

If it's useful, I can reproduce it consistently by running gnome-shell inside valgrind on Wayland.

Also, I found that commenting out all references to this._clock in js/ui/background.js from gnome-desktop prevented the problem from happening.

Comment 46 Daniel Playfair Cal 2017-09-25 23:25:38 UTC

I think I found the underlying cause - see bug 788110.

The issue is that sometimes the "changed" event is emitted by Gio.Settings objects even though a change has not occurred. The logic in background.js assumes that this is not the case so sometimes when too many of these events occur it enters an infinite loop which happens to instantiate a WallClock along the way.
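
One way to make a handler robust against such spurious notifications is to compare against the last value it acted on. This is a Python sketch of that guard, not the actual background.js logic; `ChangeFilter` and its callbacks are hypothetical names.

```python
class ChangeFilter:
    """Forward a "changed" notification only when the value really differs."""

    def __init__(self, read_value, on_real_change):
        self._read_value = read_value          # callable returning the current value
        self._on_real_change = on_real_change  # callable invoked on real changes
        self._last_seen = read_value()         # baseline taken at construction

    def handle_changed(self):
        """Return True if the notification was forwarded, False if ignored."""
        value = self._read_value()
        if value == self._last_seen:
            return False  # spurious notification: nothing to do
        self._last_seen = value
        self._on_real_change(value)
        return True
```

With a guard like this, repeated notifications for an unchanged value do no work, so they cannot keep a reload loop going.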

Comment 47 Christian Stadelmann 2017-11-29 22:22:07 UTC

I'm also seeing this bug, see https://bugzilla.redhat.com/show_bug.cgi?id=1507656 for details. This issue has been present for a few releases, at least since 3.22.

I get this crash sporadically on login, maybe every 1 or 2 out of 10 tries.

(In reply to Jonas Ådahl from comment #2)
> Any messages in the journal from around the time of the first crash? It
> looks like you tried to set a keymap that xkbcommon did not understand.

Usually, logging in works fine with exactly the same configuration. My active keyboard layout is:

$ localectl
   System Locale: LANG=de_DE.UTF-8
       VC Keymap: de-neo
      X11 Layout: de,de
     X11 Variant: neo,nodeadkeys

furthermore, there is another German (de nodeadkeys) and en-us layout. Still I get this:

Trace 238203

#2 meta_backend_native_set_keymap
at backends/native/meta-backend-native.c line 496


so it looks like something in the API makes 4 layouts out of 3. This looks wrong to me. Also, the fact that "keymap" is 0x0 here looks wrong too.

> The second one looks like an Xwayland crash.

I'm seeing the Xwayland crash too, see https://bugzilla.redhat.com/show_bug.cgi?id=1510078.

(In reply to Maël Lavault from comment #4)
> I get the first one (xkb_keymap_ref issue) a lot theses days, mostly on
> login, right after entering my password. My session try to load and then
> crashes and go back to the login screen. Usually I can login the second time
> without issues. I have the same stacktrace each time.

Same here.

(In reply to Maël Lavault from comment #43)
> So it only seems to happen after boot, right when I try to login for the
> first time. I cannot reproduce it 100% of the time, it is doesn't happens a
> lot and when it does, gnome shell crashes and go back to login screen where
> i can login again. It usually works well the second time.

Same here.

> I suspect it might have something to do with displayport support, I get a
> log of bugs from it (crashes, flickering, screen that goes black, ...). It
> might crash silently and recreate display a lot, which would lead to timerfd
> leaks (this is just an hypothesis). I have 3 screens on 2 minidp port (with
> daisy chaining) and the internal screen of my macbook pro which is
> deactivated (but still shows a grey background somehow).

I doubt this bug is related to your monitor setup. I have one single fullHD (1920x1080p) HDMI monitor connected to an Intel iGPU.

(In reply to Daniel Playfair Cal from comment #46)
> I think I found the underlying cause - see bug 788110.
> 
> The issue is that sometimes the "changed" event is emitted by Gio.Settings
> objects even though a change has not occurred. The logic in background.js
> assumes that this is not the case so sometimes when too many of these events
> occur it enters an infinite loop which happens to instantiate a WallClock
> along the way.

That makes sense. My log files also look like gnome-shell is looping extension disabling/enabling. Most notably, I'm getting thousands of messages like this, but with different signal ID and sometimes different instance pointer:
> gnome-shell[2656]: gsignal.c:2641: instance '0x55c51f11b0d0' has no handler with id

Comment 48 Daniel Playfair Cal 2017-11-29 22:40:09 UTC

Hmm,

I think there are two reasons that spurious changed events are emitted.

One is that when a setting is changed (in the same process, e.g. from JS), dconf does not check if the value is different before emitting a changed signal. There is a patch to address that behaviour here: https://bugzilla.gnome.org/show_bug.cgi?id=789639. Perhaps that will solve this problem?

The other is that when another process sets a setting (whether or not the value changes) and a watch request for a key is in progress, dconf emits a changed signal for all keys. There is a bug here: https://bugzilla.gnome.org/show_bug.cgi?id=790640 and I am experimenting with different ways to patch it. It's a slow process though, since I'm new to dconf/gsettings/DBus etc.

I don't recognise that gsignal message, but I often get lots of the same warning from some piece of C code that is part of whatever infinite loop I've ended up in.
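
To illustrate the first cause, here is a toy Python model of a settings store that compares values before emitting. It is not dconf's code, and `SettingsStore` and its methods are hypothetical names. The listener below writes the same value back on every notification, which would recurse without end if `set()` emitted unconditionally.

```python
class SettingsStore:
    """Toy key/value store that emits "changed" only for real changes."""

    def __init__(self):
        self._values = {}
        self._listeners = []

    def connect_changed(self, callback):
        self._listeners.append(callback)

    def get(self, key):
        return self._values.get(key)

    def set(self, key, value):
        """Store `value`; return True if a change was emitted."""
        if key in self._values and self._values[key] == value:
            return False  # same value: stay silent
        self._values[key] = value
        for callback in list(self._listeners):
            callback(key)
        return True


store = SettingsStore()
notifications = []


def write_back(key):
    """Listener that re-sets the key it was notified about."""
    notifications.append(key)
    store.set(key, store.get(key))  # same value, so no further emission


store.connect_changed(write_back)
store.set("picture-uri", "file:///example.png")
```

After the last line, `notifications` holds a single entry, because the write-back is recognised as a no-op.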

Comment 49 Peter 2018-01-06 23:15:23 UTC

*** Bug 792284 has been marked as a duplicate of this bug. ***

Comment 50 Peter 2018-01-06 23:17:55 UTC

I'm sorry. I already had this bug open some days ago and forgot about the tab. In addition to what I have written in #792284, I can add that I'm using the internal display of my X220 (LVDS) and an external monitor (VGA).

Thanks

Comment 51 Daniel Playfair Cal 2018-01-06 23:43:57 UTC

Peter, could you please try the two patches here: https://bugzilla.gnome.org/show_bug.cgi?id=789639

Attachments 365361 and 365362 - I think they will fix it.

Other interesting info:
 - Are you using the BTRFS filesystem? specifically for "~/.config/dconf"?
 - If you run the shell in valgrind, does the crash occur consistently?

Comment 52 Peter 2018-01-06 23:56:55 UTC

Thanks! Sounds interesting. I cannot tell you when I will have time to test, but I will report back then. I'm using EXT4 with data=ordered and noatime.

Comment 53 Peter 2018-02-04 16:38:24 UTC

After this happened once again, I tried to provoke the issue with valgrind as Daniel mentioned:

XDG_SESSION_TYPE=wayland valgrind --leak-check=no --log-file=gnomevalgrind.txt gnome-shell

It was slow, of course, but it didn't lead to the (expected) crash. Should I run GNOME Shell under Valgrind in some other way?

Comment 54 Daniel Playfair Cal 2018-02-12 03:09:49 UTC

Did the shell successfully start and become interactive? I had to recompile mesa with the --enable-valgrind option, otherwise the valgrind log was filled with massive quantities of spurious warnings from the graphics stack. I also needed the patch from here (https://bugzilla.gnome.org/show_bug.cgi?id=790640, attachment 366936 [details] [review]) to prevent infinite loops in valgrind when starting the shell (depends on what extensions you have installed).

Otherwise maybe the slowness caused by valgrind prevents whatever race condition causes the issue :(

There are also newer versions of those patches I mentioned before, which fixed this problem for me: https://bugzilla.gnome.org/show_bug.cgi?id=789639 (attachments 366937, 366938). If you're experiencing this crash regularly, perhaps you could try running with them and see if it still occurs?

Comment 55 Debarshi Ray 2018-08-08 06:07:22 UTC

*** Bug 787422 has been marked as a duplicate of this bug. ***

Comment 56 Debarshi Ray 2018-08-08 06:09:43 UTC

Bumped the version to something slightly more recent to avoid coming across as an obsolete bug.

Comment 57 André Klapper 2018-08-30 20:11:57 UTC

Still happens to me on gnome-shell-3.28.3-1.fc28.x86_64
and mutter-3.28.3-3.fc28.x86_64. Same stacktrace as in comment 6.

This was also reported in https://gitlab.gnome.org/GNOME/gnome-shell/issues/118 which was moved to https://gitlab.gnome.org/GNOME/mutter/issues/76#note_290964 which points to https://gitlab.gnome.org/GNOME/dconf/merge_requests/1

Trace 238682

Thread 1 (Thread 0x7f00e7f84240 (LWP 4768))

#0 xkb_keymap_ref
at src/keymap.c line 59
#1 clutter_evdev_set_keyboard_map
at evdev/clutter-device-manager-evdev.c line 2397
#2 meta_backend_native_set_keymap
at backends/native/meta-backend-native.c line 427
#3 ffi_call_unix64
at ../src/x86/unix64.S line 76
#4 ffi_call
at ../src/x86/ffi64.c line 525
#5 gjs_invoke_c_function(JSContext*, Function*, JS::HandleObject, JS::HandleValueArray const&, mozilla::Maybe<JS::MutableHandle<JS::Value> >, GIArgument*)
at gi/function.cpp line 1088
#6 function_call(JSContext*, unsigned int, JS::Value*)
at /usr/include/c++/8/new line 169
#7 0x00003f9f49625810 in
#8 0x00007fffb053c418 in
#9 0x00007fffb053c3b0 in
#10 0x0000000000000000 in

Comment 58 André Klapper 2019-10-07 14:35:35 UTC

Please reopen if anyone can still reproduce this issue on dconf >= 0.29.1