GNOME Bugzilla – Bug 786186
gnome-shell hang when run in valgrind with taskbar extension
Last modified: 2017-09-26 01:47:11 UTC
When starting a session with gnome-shell running in valgrind and with gnome-shell-extension-taskbar enabled, the shell hangs. The same hang occurs when enabling this extension after starting the shell in valgrind with it disabled. This does not occur when running the shell outside of valgrind, but does make it difficult to identify bugs that occur more often when taskbar is enabled. When using vgdb, it is possible to pause execution during this hang. Running `call gjs_dumpstack()` sometimes prints nothing, and other times causes a segfault.
Created attachment 357456 [details] Valgrind log 1
Created attachment 357457 [details] GDB log 1 after waiting a few minutes, I hit Ctrl+C every few seconds and printed a backtrace and the output of `call gjs_dumpstack()`
Created attachment 357458 [details] Journal log 1
Created attachment 357459 [details] Valgrind log 2
Created attachment 357460 [details] GDB log 2
Created attachment 357461 [details] Journal log 2
Versions: - gjs 1.48.6 - js38 38.8.0 - gnome-shell-taskbar: 56 - gnome-shell: 3.24.3
Looks like an internal error in valgrind - gjs_dumpstack() may very well not work if there is no JS code currently running. Do you know for sure whether there is or not?
Hmm. Is there a way I can run the shell and make it break into a debugger whenever it starts executing javascript? Is it possible to debug like this https://wiki.gnome.org/Projects/GnomeShell/DebuggingJavaScript but inside the shell? (I would probably try setting a breakpoint in all the entry points to the taskbar source and then if it hit them see how far it gets) The C code did seem to be executing though - each time I paused execution and printed a C stacktrace it looked different.
Wow, I had no idea that wiki page existed. In any case, that only works with a specific patched GJS. You could try to build it with the same patches on your distro, but I think that would be a lot of work since the jsrdbg library hasn't been updated to work with anything beyond mozjs 24. I am trying to add a debugger to GJS which would make something like this possible out of the box. (But these crashes need to be solved first...) To break whenever JS starts executing, try breaking on JS_ExecuteScript(), JS::CloneAndExecuteScript(), JS::Evaluate(), JS_CallFunctionName(), and JS_CallFunctionValue(). Those are all the entry points GJS uses, I think. I'm not sure what to do with this bug. Under what circumstances would you consider it fixed?
Hmm Ok, thanks. On further investigation I'm pretty sure it executes some JS, since I can see JS warnings in the logs including from taskbar. After the usual things appear in the logs, I see a apparently never ending stream of entries like this (10-20/minute) (a different address each time): JS object wrapper for GObject 0x1da8fcd0 (GSettings) is being released while toggle references are still pending. This may happen on exit in Gio.Application.vfunc_dbus_unregister(). If you encounter it another situation, please report a GJS bug. Cause for concern? I guess I'd consider it fixed if I'm able to run gnome-shell in valgrind successfully with taskbar enabled. I think this happenned to me the other day without valgrind running (after adding debug prints in the taskbar code) - I will confirm this, its possible I was confused.
I tried adding global.log() statements at the beginning of each function in the taskbar code, and found that when gnome-shell is running in valgrind disable/enable is called repeatedly in what seems to be an infinite loop. The only place enable/disable its called in the taskbar code is in onParamChanged, and I know it does not get called more than twice from there during login because I put debug prints at the start and end of that function also. So something else outside the taskbar extension is calling the taskbar extension's disable-enable-disable-enable...etc in an infinite loop. When valgrind is not enabled, it also does this (4 times just now) but usually still logs in after that. Perhaps its some kind of race condition that is triggered when valgrind makes everything slower than usual?
Note that gjs_dumpstack() will also not work if a garbage collection is in progress - this is a limitation of the JS engine.
Hmm I wonder if bug 782688 is a symptom of the same problem... When I get some time I will try setting a breakpoint in JS_CallFunctionName()/JS_CallFunctionValue() if the function being called is disable() from taskbar. Hopefully I can identify why it is being called so many times.
Created attachment 359381 [details] [review] stack: Use js::DumpBacktrace() for gjs_dumpstack() It would be nice to be able to print a JS stack even if the JS engine is in the middle of a garbage collection. I have tested and js::DumpBacktrace() can do this. The output format is slightly less nice but that's OK because gjs_dumpstack() is only for debugging.
This patch may help with gjs_dumpstack(), apparently it is possible to print a stack trace during GC, just not the way we were doing it before.
Cool, thanks. Looks like that patch is made on the 1.49 branch? I tried to backport it to 1.48 but I ran into: gjs/stack.cpp: In function ‘void gjs_context_print_stack_stderr(GjsContext*)’: gjs/stack.cpp:92:9: error: ‘DumpBacktrace’ is not a member of ‘js’ js::DumpBacktrace(cx, stderr); ^~~~~~~~~~~~~ Which I'm guessing is because its a new feature added between mozjs38 and mozjs52. Looks like gnome 3.26 is coming out soon so I might wait until the packages appear in the arch gnome-unstable repository before I break my system too deeply. In the meantime, is there a simple way I can make the shell crash synchronously from javascript? If I can patch the taskbar code so that it causes a crash say the 4th time disable() is called, I would be able to catch the segfault in GDB, call gjs_dumpstack() see which JS is calling it. Potentially then I could comment that call out so that I can run valgrind to collect a log for bug 785657.
(In reply to Daniel Playfair Cal from comment #17) > Cool, thanks. > > Looks like that patch is made on the 1.49 branch? I tried to backport it to > 1.48 but I ran into: > > gjs/stack.cpp: In function ‘void > gjs_context_print_stack_stderr(GjsContext*)’: > gjs/stack.cpp:92:9: error: ‘DumpBacktrace’ is not a member of ‘js’ > js::DumpBacktrace(cx, stderr); > ^~~~~~~~~~~~~ > > Which I'm guessing is because its a new feature added between mozjs38 and > mozjs52. Looks like gnome 3.26 is coming out soon so I might wait until the > packages appear in the arch gnome-unstable repository before I break my > system too deeply. Looks like you could replace "js::DumpBacktrace(cx, stderr)" with "js_DumpBacktrace(cx)" (note the underscore) in mozjs38. It would print to stdout instead of stderr but hopefully that wouldn't make too much of a difference. > In the meantime, is there a simple way I can make the shell crash > synchronously from javascript? If I can patch the taskbar code so that it > causes a crash say the 4th time disable() is called, I would be able to > catch the segfault in GDB, call gjs_dumpstack() see which JS is calling it. > Potentially then I could comment that call out so that I can run valgrind to > collect a log for bug 785657. You might be able to take advantage of bug 730287 and manufacture a crash with "Gtk.init([])" or "Clutter.init([])". If that doesn't work, I can write a patch that adds a System.crash() method so you can compile it in to your copy of GJS and use that.
Oh, I completely forgot about System.breakpoint(), that will drop you into GDB!
Review of attachment 359381 [details] [review]: Nice!
Comment on attachment 359381 [details] [review] stack: Use js::DumpBacktrace() for gjs_dumpstack() Attachment 359381 [details] pushed as e0ae1d5 - stack: Use js::DumpBacktrace() for gjs_dumpstack() I'm going to release this change in 1.50.1. I'll leave the bug open as NEEDINFO until you get a chance to try out the other suggestions, but if applying this patch without "stderr" works for you then I'll backport it to another 1.48 release for GNOME 3.24.
Thanks! I did get it to compile, but when the crash occurred, although it didn't hang, it also didn't seem to print anything to the gdb console or the systemd log. Hopefully I've just gotten confused about where the streams are going. I will try on gnome 3.26/js52 when I get a chance in case that is an issue. Thanks also for the system.breakpoint tip, I will try to make taskbar call that when it's disabled and hopefully work out why that is happening repeatedly so I can avoid this hang and get a Valgrind log for the crash.
Progress! I tried using system.breakpoint()... Turns out that for some reason, _onVersionValidationChanged() is being called repeatedly, and it disables and re-enables all shell extensions. I don't understand why, but I will try commenting out this line global.settings.connect('changed::' + EXTENSION_DISABLE_VERSION_CHECK_KEY, _onVersionValidationChanged); since I imagine as long as I don't change that setting it isn't necessary, and it looks like that's the only place _onVersionValidationChanged() is called from. https://github.com/GNOME/gnome-shell/commit/1af40b1345441c925f444ca19fe7914893e12863
Very odd. I'd suggest reporting a separate bug for gnome-shell for that. The documentation for "changed" [1] says "emitted if a key has _potentially_ changed" (emphasis mine) so it's probably not correct to reload the extensions unconditionally like that. [1] https://developer.gnome.org/gio/stable/GSettings.html#GSettings-changed
Aha, I think that was the root cause of the hang. There were very many places in gnome-shell and in extensions that used connect() to attach a handler to a Gio.Settings object, and when running in valgrind many of those handlers seem to get called very often when the settings have not in fact changed. In some cases this leads to an infinite loop of sorts, as was also the case on bug 782688. I've written patches to wrap Gio.Settings objects for the purpose of listening to changes in settings, and always check whether the setting has actually changed before running any handlers. This solved the hang in valgrind (And I suspect also the hang which I occasionally experience without using valgrind). This seems to be a general problem - I wonder if there is a good reason why the changed event needs to be emitted when the relevant settings have not changed? It seems like wrapping every handler for all changed events is much more confusing and error prone than just making Gio.Settings do it's own checking. With any luck, I can get that valgrind log now... unless these changes also avoid the situation where the crash occurs.
see bug 788110 for a patch. I had to do something similar for TaskBar (which listens to many settings "changed" events) and also to freon to make the shell start. I'm confused as to how the Gio object is handled in GJS. I can see a 'var Gio;' but I don't see anything defined on it? Is it feasible to patch GJS such that Gio.Settings does not emit "changed" events unless the settings have really changed?
Great work. I look forward to comments on the gnome-shell patch. I don't think the right answer is to patch GJS, since that would mean that Gio.Settings would have different (unexpected) behaviour in different languages. Better would be to patch gnome-shell and other users of Gio.Settings as you have done, if the GIO behaviour is really as intended; or else fix GIO itself so that signals are emitted only when necessary. (I have to say it's suspicious that the spurious signals seem to be emitted only when running under Valgrind, but that's a question for the GLib/GIO mainatainers...) Most of the Gio.* API is defined automatically from the GIO library in C. The Gio object that you see in modules/overrides/Gio.js is just extra stuff defined on top of that. Shall we close this bug then?
Sure, lets close this bug, I agree that this is either a problem with gnome-shell and extensions or with GIO. Thanks for your help :) btw, the spurious signals do occur even without running in valgrind (before patching there were a few callbacks in TaskBar that always occurred a few times on login, whereas after the patch they never occur). I think valgrind just makes it much more likely, to the extent that re-entrancy and infinite loops become common.
*** This bug has been marked as a duplicate of bug 788110 ***