Bug 786186 – gnome-shell hang when run in valgrind with taskbar extension

After an evaluation, GNOME has moved from Bugzilla to GitLab. Learn more about GitLab.
No new issues can be reported in GNOME Bugzilla anymore.
To report an issue in a GNOME project, go to GNOME GitLab.
Do not go to GNOME Gitlab for: Bluefish, Doxygen, GnuCash, GStreamer, java-gnome, LDTP, NetworkManager, Tomboy.

Bug 786186 - gnome-shell hang when run in valgrind with taskbar extension


Summary:	gnome-shell hang when run in valgrind with taskbar extension


Status:	RESOLVED DUPLICATE of bug 788110

Product:	gjs
Classification:	Bindings
Component:	general
Version:	1.48.x
Hardware:	Other Linux

Importance:	Normal minor
Target Milestone:	---
Assigned To:	gjs-maint
QA Contact:	gjs-maint

URL:
Whiteboard:

Depends on:
Blocks:

Reported:	2017-08-12 03:38 UTC by Daniel Playfair Cal
Modified:	2017-09-26 01:47 UTC

See Also:
GNOME target:	---
GNOME version:	---

Attachments
Valgrind log 1 (12.19 KB, text/plain) 2017-08-12 03:39 UTC, Daniel Playfair Cal		Details
GDB log 1 (40.13 KB, text/plain) 2017-08-12 03:41 UTC, Daniel Playfair Cal		Details
Journal log 1 (39.67 KB, text/x-vhdl) 2017-08-12 03:47 UTC, Daniel Playfair Cal		Details
Valgrind log 2 (11.98 KB, text/plain) 2017-08-12 03:48 UTC, Daniel Playfair Cal		Details
GDB log 2 (26.51 KB, text/plain) 2017-08-12 03:48 UTC, Daniel Playfair Cal		Details
Journal log 2 (38.89 KB, text/x-vhdl) 2017-08-12 03:49 UTC, Daniel Playfair Cal		Details
stack: Use js::DumpBacktrace() for gjs_dumpstack() (2.70 KB, patch) 2017-09-08 04:13 UTC, Philip Chimento	committed	Details \| Review

Description Daniel Playfair Cal 2017-08-12 03:38:39 UTC

When starting a session with gnome-shell running in valgrind and with gnome-shell-extension-taskbar enabled, the shell hangs. The same hang occurs when enabling this extension after starting the shell in valgrind with it disabled.

This does not occur when running the shell outside of valgrind, but does make it difficult to identify bugs that occur more often when taskbar is enabled.

When using vgdb, it is possible to pause execution during this hang. Running `call gjs_dumpstack()` sometimes prints nothing, and other times causes a segfault.

Comment 1 Daniel Playfair Cal 2017-08-12 03:39:55 UTC

Created attachment 357456 [details]
Valgrind log 1

Comment 2 Daniel Playfair Cal 2017-08-12 03:41:43 UTC

Created attachment 357457 [details]
GDB log 1

after waiting a few minutes, I hit Ctrl+C every few seconds and printed a backtrace and the output of `call gjs_dumpstack()`

Comment 3 Daniel Playfair Cal 2017-08-12 03:47:34 UTC

Created attachment 357458 [details]
Journal log 1

Comment 4 Daniel Playfair Cal 2017-08-12 03:48:07 UTC

Created attachment 357459 [details]
Valgrind log 2

Comment 5 Daniel Playfair Cal 2017-08-12 03:48:35 UTC

Created attachment 357460 [details]
GDB log 2

Comment 6 Daniel Playfair Cal 2017-08-12 03:49:00 UTC

Created attachment 357461 [details]
Journal log 2

Comment 7 Daniel Playfair Cal 2017-08-12 03:53:02 UTC

Versions:
 - gjs 1.48.6
 - js38 38.8.0
 - gnome-shell-taskbar: 56
 - gnome-shell: 3.24.3

Comment 8 Philip Chimento 2017-08-18 22:02:00 UTC

Looks like an internal error in valgrind - gjs_dumpstack() may very well not work if there is no JS code currently running. Do you know for sure whether there is or not?

Comment 9 Daniel Playfair Cal 2017-08-19 04:15:55 UTC

Hmm. Is there a way I can run the shell and make it break into a debugger whenever it starts executing javascript? Is it possible to debug like this https://wiki.gnome.org/Projects/GnomeShell/DebuggingJavaScript but inside the shell? (I would probably try setting a breakpoint in all the entry points to the taskbar source and then if it hit them see how far it gets)

The C code did seem to be executing though - each time I paused execution and printed a C stacktrace it looked different.

Comment 10 Philip Chimento 2017-08-25 04:02:21 UTC

Wow, I had no idea that wiki page existed. In any case, that only works with a specific patched GJS. You could try to build it with the same patches on your distro, but I think that would be a lot of work since the jsrdbg library hasn't been updated to work with anything beyond mozjs 24.

I am trying to add a debugger to GJS which would make something like this possible out of the box. (But these crashes need to be solved first...)

To break whenever JS starts executing, try breaking on JS_ExecuteScript(), JS::CloneAndExecuteScript(), JS::Evaluate(), JS_CallFunctionName(), and JS_CallFunctionValue(). Those are all the entry points GJS uses, I think.

I'm not sure what to do with this bug. Under what circumstances would you consider it fixed?

Comment 11 Daniel Playfair Cal 2017-08-26 05:17:45 UTC

Hmm Ok, thanks.

On further investigation I'm pretty sure it executes some JS, since I can see JS warnings in the logs including from taskbar. After the usual things appear in the logs, I see a apparently never ending stream of entries like this (10-20/minute) (a different address each time):

JS object wrapper for GObject 0x1da8fcd0 (GSettings) is being released while toggle references are still pending. This may happen on exit in Gio.Application.vfunc_dbus_unregister(). If you encounter it another situation, please report a GJS bug.

Cause for concern?

I guess I'd consider it fixed if I'm able to run gnome-shell in valgrind successfully with taskbar enabled.

I think this happenned to me the other day without valgrind running (after adding debug prints in the taskbar code) - I will confirm this, its possible I was confused.

Comment 12 Daniel Playfair Cal 2017-08-27 01:02:37 UTC

I tried adding global.log() statements at the beginning of each function in the taskbar code, and found that when gnome-shell is running in valgrind disable/enable is called repeatedly in what seems to be an infinite loop.

The only place enable/disable its called in the taskbar code is in onParamChanged, and I know it does not get called more than twice from there during login because I put debug prints at the start and end of that function also.

So something else outside the taskbar extension is calling the taskbar extension's disable-enable-disable-enable...etc in an infinite loop.

When valgrind is not enabled, it also does this (4 times just now) but usually still logs in after that.

Perhaps its some kind of race condition that is triggered when valgrind makes everything slower than usual?

Comment 13 Philip Chimento 2017-08-28 04:06:49 UTC

Note that gjs_dumpstack() will also not work if a garbage collection is in progress - this is a limitation of the JS engine.

Comment 14 Daniel Playfair Cal 2017-09-05 22:53:58 UTC

Hmm I wonder if bug 782688 is a symptom of the same problem...

When I get some time I will try setting a breakpoint in JS_CallFunctionName()/JS_CallFunctionValue() if the function being called is disable() from taskbar. Hopefully I can identify why it is being called so many times.

Comment 15 Philip Chimento 2017-09-08 04:13:46 UTC

Created attachment 359381 [details] [review]
stack: Use js::DumpBacktrace() for gjs_dumpstack()

It would be nice to be able to print a JS stack even if the JS engine is
in the middle of a garbage collection. I have tested and
js::DumpBacktrace() can do this. The output format is slightly less nice
but that's OK because gjs_dumpstack() is only for debugging.

Comment 16 Philip Chimento 2017-09-08 04:14:56 UTC

This patch may help with gjs_dumpstack(), apparently it is possible to print a stack trace during GC, just not the way we were doing it before.

Comment 17 Daniel Playfair Cal 2017-09-09 05:38:14 UTC

Cool, thanks.

Looks like that patch is made on the 1.49 branch? I tried to backport it to 1.48 but I ran into:

gjs/stack.cpp: In function ‘void gjs_context_print_stack_stderr(GjsContext*)’:
gjs/stack.cpp:92:9: error: ‘DumpBacktrace’ is not a member of ‘js’
     js::DumpBacktrace(cx, stderr);
         ^~~~~~~~~~~~~

Which I'm guessing is because its a new feature added between mozjs38 and mozjs52. Looks like gnome 3.26 is coming out soon so I might wait until the packages appear in the arch gnome-unstable repository before I break my system too deeply.

In the meantime, is there a simple way I can make the shell crash synchronously from javascript? If I can patch the taskbar code so that it causes a crash say the 4th time disable() is called, I would be able to catch the segfault in GDB, call gjs_dumpstack() see which JS is calling it. Potentially then I could comment that call out so that I can run valgrind to collect a log for bug 785657.

Comment 18 Philip Chimento 2017-09-10 00:47:14 UTC

(In reply to Daniel Playfair Cal from comment #17)
> Cool, thanks.
> 
> Looks like that patch is made on the 1.49 branch? I tried to backport it to
> 1.48 but I ran into:
> 
> gjs/stack.cpp: In function ‘void
> gjs_context_print_stack_stderr(GjsContext*)’:
> gjs/stack.cpp:92:9: error: ‘DumpBacktrace’ is not a member of ‘js’
>      js::DumpBacktrace(cx, stderr);
>          ^~~~~~~~~~~~~
> 
> Which I'm guessing is because its a new feature added between mozjs38 and
> mozjs52. Looks like gnome 3.26 is coming out soon so I might wait until the
> packages appear in the arch gnome-unstable repository before I break my
> system too deeply.

Looks like you could replace "js::DumpBacktrace(cx, stderr)" with "js_DumpBacktrace(cx)" (note the underscore) in mozjs38. It would print to stdout instead of stderr but hopefully that wouldn't make too much of a difference.

> In the meantime, is there a simple way I can make the shell crash
> synchronously from javascript? If I can patch the taskbar code so that it
> causes a crash say the 4th time disable() is called, I would be able to
> catch the segfault in GDB, call gjs_dumpstack() see which JS is calling it.
> Potentially then I could comment that call out so that I can run valgrind to
> collect a log for bug 785657.

You might be able to take advantage of bug 730287 and manufacture a crash with "Gtk.init([])" or "Clutter.init([])". If that doesn't work, I can write a patch that adds a System.crash() method so you can compile it in to your copy of GJS and use that.

Comment 19 Philip Chimento 2017-09-12 17:22:00 UTC

Oh, I completely forgot about System.breakpoint(), that will drop you into GDB!

Comment 20 Cosimo Cecchi 2017-09-19 00:22:26 UTC

Review of attachment 359381 [details] [review]:

Nice!

Comment 21 Philip Chimento 2017-09-19 01:43:57 UTC

Comment on attachment 359381 [details] [review]
stack: Use js::DumpBacktrace() for gjs_dumpstack()

Attachment 359381 [details] pushed as e0ae1d5 - stack: Use js::DumpBacktrace() for gjs_dumpstack()

I'm going to release this change in 1.50.1. I'll leave the bug open as
NEEDINFO until you get a chance to try out the other suggestions, but if
applying this patch without "stderr" works for you then I'll backport it
to another 1.48 release for GNOME 3.24.

Comment 22 Daniel Playfair Cal 2017-09-19 02:50:29 UTC

Thanks!

I did get it to compile, but when the crash occurred, although it didn't hang, it also didn't seem to print anything to the gdb console or the systemd log. Hopefully I've just gotten confused about where the streams are going. I will try on gnome 3.26/js52 when I get a chance in case that is an issue.

Thanks also for the system.breakpoint tip, I will try to make taskbar call that when it's disabled and hopefully work out why that is happening repeatedly so I can avoid this hang and get a Valgrind log for the crash.

Comment 23 Daniel Playfair Cal 2017-09-19 12:57:35 UTC

Progress! I tried using system.breakpoint()...

Turns out that for some reason, _onVersionValidationChanged() is being called repeatedly, and it disables and re-enables all shell extensions. I don't understand why, but I will try commenting out this line

global.settings.connect('changed::' + EXTENSION_DISABLE_VERSION_CHECK_KEY, _onVersionValidationChanged);

since I imagine as long as I don't change that setting it isn't necessary, and it looks like that's the only place _onVersionValidationChanged() is called from.

https://github.com/GNOME/gnome-shell/commit/1af40b1345441c925f444ca19fe7914893e12863

Comment 24 Philip Chimento 2017-09-22 06:54:30 UTC

Very odd. I'd suggest reporting a separate bug for gnome-shell for that. The documentation for "changed" [1] says "emitted if a key has _potentially_ changed" (emphasis mine) so it's probably not correct to reload the extensions unconditionally like that.

[1] https://developer.gnome.org/gio/stable/GSettings.html#GSettings-changed

Comment 25 Daniel Playfair Cal 2017-09-25 12:43:49 UTC

Aha, I think that was the root cause of the hang. There were very many places in gnome-shell and in extensions that used connect() to attach a handler to a Gio.Settings object, and when running in valgrind many of those handlers seem to get called very often when the settings have not in fact changed. In some cases this leads to an infinite loop of sorts, as was also the case on bug 782688.

I've written patches to wrap Gio.Settings objects for the purpose of listening to changes in settings, and always check whether the setting has actually changed before running any handlers. This solved the hang in valgrind (And I suspect also the hang which I occasionally experience without using valgrind).

This seems to be a general problem - I wonder if there is a good reason why the changed event needs to be emitted when the relevant settings have not changed? It seems like wrapping every handler for all changed events is much more confusing and error prone than just making Gio.Settings do it's own checking.

With any luck, I can get that valgrind log now... unless these changes also avoid the situation where the crash occurs.

Comment 26 Daniel Playfair Cal 2017-09-25 23:33:07 UTC

see bug 788110 for a patch. I had to do something similar for TaskBar (which listens to many settings "changed" events) and also to freon to make the shell start.

I'm confused as to how the Gio object is handled in GJS. I can see a 'var Gio;' but I don't see anything defined on it? Is it feasible to patch GJS such that Gio.Settings does not emit "changed" events unless the settings have really changed?

Comment 27 Philip Chimento 2017-09-26 01:07:13 UTC

Great work. I look forward to comments on the gnome-shell patch.

I don't think the right answer is to patch GJS, since that would mean that Gio.Settings would have different (unexpected) behaviour in different languages. Better would be to patch gnome-shell and other users of Gio.Settings as you have done, if the GIO behaviour is really as intended; or else fix GIO itself so that signals are emitted only when necessary.

(I have to say it's suspicious that the spurious signals seem to be emitted only when running under Valgrind, but that's a question for the GLib/GIO mainatainers...)

Most of the Gio.* API is defined automatically from the GIO library in C. The Gio object that you see in modules/overrides/Gio.js is just extra stuff defined on top of that.

Shall we close this bug then?

Comment 28 Daniel Playfair Cal 2017-09-26 01:44:47 UTC

Sure, lets close this bug, I agree that this is either a problem with gnome-shell and extensions or with GIO. Thanks for your help :)

btw, the spurious signals do occur even without running in valgrind (before patching there were a few callbacks in TaskBar that always occurred a few times on login, whereas after the patch they never occur). I think valgrind just makes it much more likely, to the extent that re-entrancy and infinite loops become common.

Comment 29 Daniel Playfair Cal 2017-09-26 01:47:11 UTC


*** This bug has been marked as a duplicate of bug 788110 ***