Bug 783935 - GJS crash in needsPostBarrier, possible access from wrong thread
Status: RESOLVED FIXED
Product: gjs
Classification: Bindings
Component: general
Version: 1.48.x
Hardware: Other Linux
Importance: Normal critical
Target Milestone: ---
Assigned To: gjs-maint
QA Contact: gjs-maint
Duplicates: 783951 784873 785232 (view as bug list)
Depends on:
Blocks:
 
 
Reported: 2017-06-18 19:59 UTC by Philip Chimento
Modified: 2017-08-31 05:21 UTC
See Also:
GNOME target: ---
GNOME version: ---


Attachments
screen log of gdb stacktrace (13.02 KB, text/plain)
2017-06-20 16:47 UTC, fedor
closure: Prevent use-after-free in closures (2.59 KB, patch)
2017-06-21 02:56 UTC, Philip Chimento
committed
Valgrind log (no taskbar, no crash) (18.66 KB, text/plain)
2017-07-24 12:13 UTC, Daniel Playfair Cal
Valgrind log (no taskbar, other crash) (23.42 KB, text/plain)
2017-07-24 12:29 UTC, Daniel Playfair Cal
Valgrind log (with taskbar, no crash) (21.20 KB, text/plain)
2017-07-24 12:35 UTC, Daniel Playfair Cal
Valgrind log (no taskbar, other crash in similar situation) (13.68 KB, text/plain)
2017-07-24 13:15 UTC, Daniel Playfair Cal
object: Keep proper track of pending closure invalidations (7.84 KB, patch)
2017-07-27 00:32 UTC, Philip Chimento
committed

Description Philip Chimento 2017-06-18 19:59:52 UTC
From bug 781799:

> Arch Linux
> gjs 1.48.4-1
> gnome-shell 3.24.2-1
> wayland 1.13.0-1
> js38 38.8.0-3
>
>        Message: Process 852 (gnome-shell) of user 1000 dumped core.
>                 
>                 Stack trace of thread 852:
>                 #0  0x00007f4f051f5735 _ZN2js9GCMethodsIP8JSObjectE16needsPostBarrierES2_ (libgjs.so.0)
>                 #1  0x00007f4f0330f8b5 g_main_context_dispatch (libglib-2.0.so.0)
>                 #2  0x00007f4f0330fc78 n/a (libglib-2.0.so.0)
>                 #3  0x00007f4f0330ff92 g_main_loop_run (libglib-2.0.so.0)
>                 #4  0x00007f4f04ad2fdc meta_run (libmutter-0.so.0)
>                 #5  0x0000000000401ff7 main (gnome-shell)
>                 #6  0x00007f4f02d2243a __libc_start_main (libc.so.6)
>                 #7  0x000000000040212a n/a (gnome-shell)
>
> I got this crash approximately 5-10 minutes after start of gnome session while surfing the web in google-chrome.

Info that would be helpful:

- Backtrace with full debug symbols (source files and lines)
- Is there anything specific that triggers the crash? Any specific browser use?
Comment 1 Christian Stadelmann 2017-06-19 11:09:50 UTC
I don't think I've been running into that crash at all. Thus, I don't have any clue of how to reproduce it.
Comment 2 fedor 2017-06-20 16:47:57 UTC
Created attachment 354111 [details]
screen log of gdb stacktrace
Comment 3 Philip Chimento 2017-06-21 02:56:04 UTC
Created attachment 354133 [details] [review]
closure: Prevent use-after-free in closures

Closures trace the function object that they call on, in order to keep
the function alive during garbage collection. When the closure is
invalidated, we break that link so the function can be garbage collected,
but we must do so in an idle function, since it is illegal to stop
tracing a GC-thing in the middle of GC.

However, this caused a possible use-after-free if the closure was
scheduled to stop tracing the function object, but the last reference on
the closure was dropped before the idle function could be run.

Similar to the recent fix in gi/object.cpp [commit 2593d3d], this avoids
use-after-free by cancelling any pending idle function in the finalize
notifier, and dropping the function object immediately.
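
The pattern described in this commit message can be sketched in isolation. The following is only an illustration, not the actual gi/closure.cpp code: `IdleQueue`, `Closure`, and `callbacks_run_after_finalize` are hypothetical names, and the toy queue merely stands in for GLib idle sources (whose real counterparts would be `g_idle_add()` and `g_source_remove()`). It shows why the finalizer has to cancel the pending idle callback: otherwise the callback would later run against a destroyed closure.

```cpp
#include <cassert>
#include <functional>
#include <map>
#include <utility>

// Toy stand-in for main-loop idle sources (hypothetical, not the GLib API).
class IdleQueue {
public:
    // Schedule a callback; returns a nonzero id, similar in spirit to g_idle_add().
    unsigned add(std::function<void()> fn) {
        unsigned id = next_id_++;
        pending_[id] = std::move(fn);
        return id;
    }
    // Cancel a pending callback, similar in spirit to g_source_remove().
    bool remove(unsigned id) { return pending_.erase(id) == 1; }
    // Run every pending callback once; returns how many ran.
    int run_all() {
        auto batch = std::move(pending_);
        pending_.clear();
        for (auto& entry : batch) entry.second();
        return static_cast<int>(batch.size());
    }
private:
    unsigned next_id_ = 1;
    std::map<unsigned, std::function<void()>> pending_;
};

// Hypothetical closure that defers dropping its function to an idle callback.
struct Closure {
    IdleQueue& queue;
    bool holds_function = true;  // stands in for the traced function object
    unsigned idle_id = 0;        // 0 means no idle callback is pending

    explicit Closure(IdleQueue& q) : queue(q) {}

    // Invalidation must not stop tracing right away, so it schedules the drop.
    void invalidate() {
        if (idle_id == 0)
            idle_id = queue.add([this] { clear_idle(); });
    }

    void clear_idle() {
        assert(idle_id != 0);  // only runs while still registered
        idle_id = 0;
        holds_function = false;
    }

    // Finalization: cancel the pending callback so it can never touch a
    // destroyed closure, then drop the function immediately.
    ~Closure() {
        if (idle_id != 0) {
            queue.remove(idle_id);
            idle_id = 0;
        }
        holds_function = false;
    }
};

// Destroys a closure while its idle callback is still pending and reports
// how many callbacks run afterwards.
int callbacks_run_after_finalize() {
    IdleQueue queue;
    {
        Closure closure(queue);
        closure.invalidate();
    }  // last reference dropped before the idle callback could run
    return queue.run_all();
}
```

With the cancellation in the destructor, no callback is left to run once the closure is gone. Without it, the queued lambda would still hold a dangling `this` pointer, which is the use-after-free the patch is aiming at.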
Comment 4 Philip Chimento 2017-06-21 02:57:01 UTC
Thanks, that backtrace was a good pointer to where the problem might be. Try this patch?
Comment 5 Cosimo Cecchi 2017-06-21 03:55:05 UTC
Review of attachment 354133 [details] [review]:

Patch looks correct to me.
Comment 6 fedor 2017-06-21 17:33:26 UTC
Review of attachment 354133 [details] [review]:

Applied the patch to 2593d3d commit. Running gnome-shell for several hours without a crash. So I guess it is fixed now.
Comment 7 Philip Chimento 2017-06-21 18:03:27 UTC
Attachment 354133 [details] pushed as 41b78ae - closure: Prevent use-after-free in closures

Thanks! I will release 1.48.5 with this fix later today.
Comment 8 vitalik_p 2017-06-23 16:20:13 UTC
I think the bug is still here.

I updated gjs to version 1.48.5.

[ 3482.192097] gnome-shell[829]: segfault at 7fd4392fffe8 ip 00007fd4ae910c65 sp 00007ffe1fb573b0 error 4 in libgjs.so.0.0.0[7fd4ae8e8000+bb000]
[ 4917.142960] gnome-shell[3182]: segfault at 7f1d4d5fffe8 ip 00007f1daa689c65 sp 00007ffdebdfa2a0 error 4 in libgjs.so.0.0.0[7f1daa661000+bb000]


00007fd4ae910c65-7fd4ae8e8000=28c65
00007f1daa689c65-7f1daa661000=28c65


$ addr2line -e /usr/lib64/libgjs.so.0.0.0 0x28C65 -fCi
gjs_typecheck_boxed
/usr/include/mozjs-38/js/RootingAPI.h:663
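
The subtraction above (instruction pointer minus the library's load address) gives the offset that addr2line expects. As a rough sketch of the same arithmetic, assuming the kernel log line has exactly the shape shown in this comment, it could be automated like this. `SegfaultInfo`, `parse_segfault_line`, and `module_offset` are hypothetical helper names, not part of any existing tool.

```cpp
#include <cassert>
#include <cstdio>
#include <cstring>

// Fields pulled out of a kernel "segfault at ..." log line (hypothetical helper type).
struct SegfaultInfo {
    unsigned long long ip;    // faulting instruction pointer
    unsigned long long base;  // load address of the mapped module
    unsigned long long size;  // size of the mapping
    char module[256];         // module name, e.g. libgjs.so.0.0.0
};

// Parses a line of the form
//   "... segfault at ADDR ip IP sp SP error N in MODULE[BASE+SIZE]"
// and returns false if the line does not match or ip lies outside the mapping.
bool parse_segfault_line(const char* line, SegfaultInfo* out) {
    const char* start = std::strstr(line, "segfault at ");
    if (start == nullptr) return false;
    unsigned long long fault_addr = 0, sp = 0;
    unsigned error_code = 0;
    int fields = std::sscanf(
        start, "segfault at %llx ip %llx sp %llx error %x in %255[^[][%llx+%llx]",
        &fault_addr, &out->ip, &sp, &error_code, out->module, &out->base,
        &out->size);
    return fields == 7 && out->ip >= out->base &&
           out->ip - out->base < out->size;
}

// Offset of the faulting instruction inside the module, suitable for addr2line.
unsigned long long module_offset(const SegfaultInfo& info) {
    assert(info.ip >= info.base);
    return info.ip - info.base;
}
```

For both log lines quoted above, `module_offset` yields 0x28c65, matching the hand calculation.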
Comment 9 vitalik_p 2017-06-23 20:04:25 UTC


  • #0 js::GCMethods<JSObject*>::needsPostBarrier(JSObject*)
    at /usr/include/mozjs-38/js/RootingAPI.h line 663
  • #1 JS::Heap<JSObject*>::set(JSObject*)
    at /usr/include/mozjs-38/js/RootingAPI.h line 296
  • #2 JS::Heap<JSObject*>::operator=(JSObject* const&)
    at /usr/include/mozjs-38/js/RootingAPI.h line 266
  • #3 GjsMaybeOwned<JSObject*>::reset()
    at gjs/jsapi-util-root.h line 267
  • #4 closure_clear_idle(void*)
    at gi/closure.cpp line 133
  • #5 g_main_context_dispatch
  • #6 g_main_context_iterate.isra
  • #7 g_main_loop_run
  • #8 meta_run
  • #9 main

Comment 10 Philip Chimento 2017-06-23 20:18:34 UTC
Might be. Would it be possible to get a valgrind log like the ones on bug 783951?
Comment 11 vitalik_p 2017-06-23 20:55:29 UTC
It's hard; everything hangs when I run gnome-shell with valgrind.

The bug reproduces when I run Inkscape (0.92.1).
Maybe this will help.
Comment 12 Daniel Playfair Cal 2017-06-25 03:30:40 UTC
This bug still happens to me on 1.48.5-1 on Arch Linux

Stack trace of thread 1125:
#0  0x00007f7af7574625 n/a (libgjs.so.0)
#1  0x00007f7af569f8b5 g_main_context_dispatch (libglib-2.0.so.0)
#2  0x00007f7af569fc78 n/a (libglib-2.0.so.0)
#3  0x00007f7af569ff92 g_main_loop_run (libglib-2.0.so.0)
#4  0x00007f7af6e5208c meta_run (libmutter-0.so.0)
#5  0x0000000000401ff7 main (gnome-shell)
#6  0x00007f7af50b243a __libc_start_main (libc.so.6)
#7  0x000000000040212a n/a (gnome-shell)
Stack trace of thread 1126:
#0  0x00007f7af51752bd poll (libc.so.6)
#1  0x00007f7af569fbf9 n/a (libglib-2.0.so.0)
#2  0x00007f7af569fd0c g_main_context_iteration (libglib-2.0.so.0)
#3  0x00007f7af569fd51 n/a (libglib-2.0.so.0)
#4  0x00007f7af56c6ac5 n/a (libglib-2.0.so.0)
#5  0x00007f7af543e297 start_thread (libpthread.so.0)
#6  0x00007f7af517f25f __clone (libc.so.6)
Stack trace of thread 1217:
#0  0x00007f7af544439d pthread_cond_wait@@GLIBC_2.3.2 (libpthread.so.0)
#1  0x00007f7ae7eb7d10 PR_WaitCondVar (libnspr4.so)
#2  0x00007f7af04a2811 n/a (libmozjs-38.so)
#3  0x00007f7ae7ebd88b n/a (libnspr4.so)
#4  0x00007f7af543e297 start_thread (libpthread.so.0)
#5  0x00007f7af517f25f __clone (libc.so.6)
Stack trace of thread 1222:
#0  0x00007f7af544439d pthread_cond_wait@@GLIBC_2.3.2 (libpthread.so.0)
#1  0x00007f7ae7eb7d10 PR_WaitCondVar (libnspr4.so)
#2  0x00007f7af04a2811 n/a (libmozjs-38.so)
#3  0x00007f7ae7ebd88b n/a (libnspr4.so)
#4  0x00007f7af543e297 start_thread (libpthread.so.0)
#5  0x00007f7af517f25f __clone (libc.so.6)
Stack trace of thread 1219:
#0  0x00007f7af544439d pthread_cond_wait@@GLIBC_2.3.2 (libpthread.so.0)
#1  0x00007f7ae7eb7d10 PR_WaitCondVar (libnspr4.so)
#2  0x00007f7af04a2811 n/a (libmozjs-38.so)
#3  0x00007f7ae7ebd88b n/a (libnspr4.so)
#4  0x00007f7af543e297 start_thread (libpthread.so.0)
#5  0x00007f7af517f25f __clone (libc.so.6)

I've compiled gjs with debug flags (by adding --enable-debug to the ./configure line in https://git.archlinux.org/svntogit/packages.git/tree/trunk/PKGBUILD?h=packages/gjs#n29) and I'll post a better stack trace when I get it.

I use gnome-shell-extension-taskbar and I'm fairly sure that for me this crash only occurs when it is enabled.
Comment 13 Philip Chimento 2017-06-25 03:55:40 UTC
(In reply to Daniel Playfair Cal from comment #12)
> I've compiled gjs with debug flags (by adding --enable-debug to the
> ./configure line in
> https://git.archlinux.org/svntogit/packages.git/tree/trunk/
> PKGBUILD?h=packages/gjs#n29) And I'll post a better stacktrace when I get it.

That will not do anything for GJS; use CXXFLAGS='-g -O0'. Please compile mozjs38 with --enable-debug if you can, though.
 
> I use gnome-shell-extension-taskbar and I'm fairly sure that for me this
> crash only occurs when it is enabled.

Thanks, both of you. I'll reopen this bug for now.
Comment 14 Philip Chimento 2017-07-06 02:40:37 UTC
Can anyone manage to get a Valgrind log for the crashes that are still happening?
Comment 15 Daniel Playfair Cal 2017-07-06 05:47:32 UTC
I got confused with setting the debug flags - worked out that I had to set the variable while running configure as opposed to make. Anyway here's a stack trace with gjs compiled in debug mode:

#0  0x00007f32594708cc _ZN2js9GCMethodsIP8JSObjectE16needsPostBarrierES2_ (libgjs.so.0)
#1  0x00007f3259470fbe _ZN2JS4HeapIP8JSObjectE3setES2_ (libgjs.so.0)
#2  0x00007f3259470a00 _ZN2JS4HeapIP8JSObjectEaSERKS2_ (libgjs.so.0)
#3  0x00007f3259470ab3 _ZN13GjsMaybeOwnedIP8JSObjectE5resetEv (libgjs.so.0)
#4  0x00007f3259470293 n/a (libgjs.so.0)
#5  0x00007f32575698b5 g_main_context_dispatch (libglib-2.0.so.0)
#6  0x00007f3257569c78 n/a (libglib-2.0.so.0)
#7  0x00007f3257569f92 g_main_loop_run (libglib-2.0.so.0)
#8  0x00007f3258d1c08c meta_run (libmutter-0.so.0)
#9  0x0000000000401ff7 main (gnome-shell)
#10 0x00007f3256f7c43a __libc_start_main (libc.so.6)
#11 0x000000000040212a n/a (gnome-shell)

I will try to get valgrind working on the weekend. Is this the guide I should be following? https://wiki.gnome.org/Projects/GnomeShell/Debugging#Debugging_GNOME_Shell_with_valgrind

I managed to get it running with valgrind before but the crash didn't happen in an entire day (otherwise rare) and I noticed that some things were different about the way the shell was running (e.g. keyboard shortcuts did not work). Maybe I will try some hack like replacing the gnome-shell binary with something that runs valgrind so I can be sure it's running in the same way as gdm starts it.

I got other crashes which prevented boot when I compiled mozjs with debug flags, is that important for this bug?
Comment 16 Philip Chimento 2017-07-13 00:12:41 UTC
(In reply to Daniel Playfair Cal from comment #15)
> I got confused with setting the debug flags - worked out that I had to set
> the variable while running configure as opposed to make. Anyway here's a
> stack trace with gjs compiled in debug mode:
> 
> #0  0x00007f32594708cc _ZN2js9GCMethodsIP8JSObjectE16needsPostBarrierES2_
> (libgjs.so.0)
> #1  0x00007f3259470fbe _ZN2JS4HeapIP8JSObjectE3setES2_ (libgjs.so.0)
> #2  0x00007f3259470a00 _ZN2JS4HeapIP8JSObjectEaSERKS2_ (libgjs.so.0)
> #3  0x00007f3259470ab3 _ZN13GjsMaybeOwnedIP8JSObjectE5resetEv (libgjs.so.0)
> #4  0x00007f3259470293 n/a (libgjs.so.0)
> #5  0x00007f32575698b5 g_main_context_dispatch (libglib-2.0.so.0)
> #6  0x00007f3257569c78 n/a (libglib-2.0.so.0)
> #7  0x00007f3257569f92 g_main_loop_run (libglib-2.0.so.0)
> #8  0x00007f3258d1c08c meta_run (libmutter-0.so.0)
> #9  0x0000000000401ff7 main (gnome-shell)
> #10 0x00007f3256f7c43a __libc_start_main (libc.so.6)
> #11 0x000000000040212a n/a (gnome-shell)

Sorry for the delayed response. I think something is still going wrong with your debuginfo - that stack trace still has mangled C++ symbols.
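
For readers who only have a mangled trace like the one quoted above, the names can still be decoded by hand. A minimal sketch, assuming a GCC or Clang toolchain that provides `abi::__cxa_demangle` in `<cxxabi.h>`; the `demangle` wrapper is a hypothetical helper name:

```cpp
#include <cassert>
#include <cstdlib>
#include <cxxabi.h>
#include <string>

// Returns the demangled form of an Itanium-ABI symbol name, or the input
// unchanged if it cannot be demangled.
std::string demangle(const char* mangled) {
    assert(mangled != nullptr);
    int status = 0;
    // With a null output buffer, __cxa_demangle allocates the result with malloc().
    char* text = abi::__cxa_demangle(mangled, nullptr, nullptr, &status);
    if (status != 0 || text == nullptr) return mangled;
    std::string result(text);
    std::free(text);
    return result;
}
```

Applied to frame #0 of the quoted trace, `_ZN2js9GCMethodsIP8JSObjectE16needsPostBarrierES2_`, this gives `js::GCMethods<JSObject*>::needsPostBarrier(JSObject*)`, the same frame that appears in the symbolized backtrace in comment 9. Demangling only recovers names; source files and line numbers still require proper debuginfo.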

> I will try to get valgrind working on the weekend. Is this the guide I
> should be following?
> https://wiki.gnome.org/Projects/GnomeShell/
> Debugging#Debugging_GNOME_Shell_with_valgrind

Yes.

> I managed to get it running with valgrind before but the crash didn't happen
> in an entire day (otherwise rare) and I noticed that some things were
> different about the way the shell was running (e.g. keyboard shortcuts did
> not work). Maybe I will try some hack like replacing the gnome-shell binary
> with something that runs valgrind so I can be sure it's running in the same
> way as gdm starts it.

Not sure what to suggest here, maybe one of the gnome-shell hackers can help? Try asking in #gnome-shell on IRC.
 
> I got other crashes which prevented boot when I compiled mozjs with debug
> flags, is that important for this bug?

Can you provide an example?
Comment 17 Philip Chimento 2017-07-13 00:13:15 UTC
*** Bug 784873 has been marked as a duplicate of this bug. ***
Comment 18 Daniel Playfair Cal 2017-07-13 02:36:22 UTC
Hmm ok. That trace is just from systemd logs of coredump - maybe if I get the stack trace from gdb it will map back to the source? It does at least look different from the one compiled without -g -O0. Maybe the problem is the line with sed: https://git.archlinux.org/svntogit/packages.git/tree/trunk/PKGBUILD?h=packages/gjs

Thanks, I will ask around and keep experimenting. The biggest problem is that I can't predictably reproduce it so every experiment takes days. I've been running with gdb attached for 3 days now and no crashes...
Comment 19 Florent Thiéry 2017-07-17 13:44:35 UTC
Hi,

Same here (Arch too), rebuilt gjs and js38 (showing diff on PKGBUILD for other potential users)

$ coredumpctl gdb 27368
(gdb) bt
  • #0 js::GCMethods<JSObject*>::needsPostBarrier(JSObject*)
    at /usr/include/mozjs-38/js/RootingAPI.h line 663
  • #1 JS::Heap<JSObject*>::set(JSObject*)
    at /usr/include/mozjs-38/js/RootingAPI.h line 296
  • #2 JS::Heap<JSObject*>::operator=(JSObject* const&)
    at /usr/include/mozjs-38/js/RootingAPI.h line 266
  • #3 GjsMaybeOwned<JSObject*>::reset()
    at ./gjs/jsapi-util-root.h line 267
  • #4 closure_clear_idle(void*)
    at gi/closure.cpp line 133
  • #5 g_main_context_dispatch
  • #6 0x00007fbc79abdc88 in
  • #7 g_main_loop_run
  • #8 meta_run
  • #9 main

The crash just seems to happen randomly (i.e. not necessarily when clicking on a gnome-related function), like a race condition or a leak. Some extensions (e.g. system-monitor@paradoxxx.zero.gmail.com) will trigger the crash (or a similar one) faster.

I was only using these extensions when the crash happened:

$ gsettings get org.gnome.shell enabled-extensions (most of them from gnome-shell-extensions)

['hibernate-status@dromi', 'alternate-tab@gnome-shell-extensions.gcampax.github.com', 'drive-menu@gnome-shell-extensions.gcampax.github.com', 'topIcons@adel.gadllah@gmail.com', 'redshift@tommie-lie.de', 'launch-new-instance@gnome-shell-extensions.gcampax.github.com']


packages/gjs/trunk$ svn diff PKGBUILD 
Index: PKGBUILD
===================================================================
--- PKGBUILD	(révision 300688)
+++ PKGBUILD	(copie de travail)
@@ -26,7 +26,8 @@
 
 build() {
   cd $pkgname
-  ./configure --prefix=/usr --disable-static --libexecdir=/usr/lib
+  export CXXFLAGS='-g -O0'
+  ./configure --prefix=/usr --disable-static --libexecdir=/usr/lib --enable-debug-symbols=-gdwarf-2
   sed -i -e 's/ -shared / -Wl,-O1,--as-needed\0/g' libtool
   make
 }

$ makepkg && pacman -U gjs-1.48.5-1-x86_64.pkg.tar.xz

packages/js38/trunk$ svn diff PKGBUILD 
Index: PKGBUILD
===================================================================
--- PKGBUILD	(révision 300681)
+++ PKGBUILD	(copie de travail)
@@ -10,7 +10,7 @@
 license=(MPL)
 depends=(nspr gcc-libs readline zlib icu libffi)
 makedepends=(python2 libffi zip)
-options=(!staticlibs)
+options=(!staticlibs debug)
 source=(https://ftp.mozilla.org/pub/firefox/releases/${pkgver}esr/source/firefox-${pkgver}esr.source.tar.bz2
         mozjs38-fix-tracelogger.patch
         mozjs38-shell-version.patch

$ makepkg && pacman -U js38-debug-38.8.0-3-x86_64.pkg.tar.xz
Comment 20 Florent Thiéry 2017-07-17 14:11:20 UTC
By enabling system-monitor the crash seems a bit different:

(gdb) bt
  • #0 js::GCMethods<JSObject*>::needsPostBarrier(JSObject*)
    at /usr/include/mozjs-38/js/RootingAPI.h line 663
  • #1 JS::Heap<JSObject*>::set(JSObject*)
    at /usr/include/mozjs-38/js/RootingAPI.h line 296
  • #2 JS::Heap<JSObject*>::operator=(JSObject* const&)
    at /usr/include/mozjs-38/js/RootingAPI.h line 266
  • #3 GjsMaybeOwned<JSObject*>::reset()
    at ./gjs/jsapi-util-root.h line 267
  • #4 release_native_object(ObjectInstance*)
    at gi/object.cpp line 1258
  • #5 object_instance_finalize(JSFreeOp*, JSObject*)
    at gi/object.cpp line 1663
  • #6 0x00007f71f5f94c73 in
  • #7 0x00007f71f5fefa3c in
  • #8 0x00007f71f5f95f79 in
  • #9 0x00007f71f5fab733 in
  • #10 0x00007f71f5fac172 in
  • #11 0x00007f71f5fadf18 in
  • #12 0x00007f71f5fae8c0 in
  • #13 0x00007f71f5faeb0d in
  • #14 0x00007f71f5faeed4 in
  • #15 gjs_schedule_gc_if_needed(JSContext*)
    at gjs/jsapi-util.cpp line 844
  • #16 gjs_call_function_value(JSContext*, JS::HandleObject, JS::HandleValue, JS::HandleValueArray const&, JS::MutableHandleValue)
    at gjs/jsapi-util.cpp line 719
  • #17 gjs_closure_invoke(GClosure*, JS::HandleValueArray const&, JS::MutableHandleValue)
    at gi/closure.cpp line 239
  • #18 closure_marshal(GClosure*, GValue*, guint, GValue const*, gpointer, gpointer)
    at gi/value.cpp line 273
  • #19 g_closure_invoke
  • #20 0x00007f71fb15d4ae in
  • #21 g_signal_emit_valist
  • #22 g_signal_emit
  • #23 0x00007f71fd253b16 in
  • #24 clutter_actor_get_preferred_width
  • #25 clutter_actor_get_preferred_size
  • #26 clutter_actor_allocate_preferred_size
  • #27 0x00007f71fb9ba6aa in
  • #28 clutter_actor_set_allocation
  • #29 0x00007f71fb9e585b in
  • #30 clutter_actor_allocate
  • #31 0x00007f71fb9e30c2 in
  • #32 0x00007f71fb9e3180 in
  • #33 0x00007f71fb9ced89 in
  • #34 g_main_context_dispatch
  • #35 0x00007f71fae72c88 in
  • #36 g_main_loop_run
  • #37 meta_run
  • #38 main

Comment 21 Daniel Playfair Cal 2017-07-18 03:03:15 UTC
Thanks for the PKGBUILD patch Florent - I'll try exporting CXXFLAGS instead of setting it just for ./configure - I guess make needs it too

I don't know which way around the causation is, but I've noticed that just after this crash I usually (always?) see the following in the journal (same timestamp, immediately after the coredump and stack trace):

"kernel: [drm:intel_fbc_work_fn [i915]] *ERROR* vblank not available for FBC on pipe A"

I'm running 4.11 on Kaby Lake on a laptop with nvidia optimus graphics (but using the integrated graphics only). The crash has always occurred when I have an external screen plugged in.
Comment 22 Florent Thiéry 2017-07-18 12:56:47 UTC
I am also using dual screen on an Intel GPU, but I also had the crash at home with a single screen (also Intel GPU). I don't have Optimus graphics.
Comment 23 Daniel Playfair Cal 2017-07-20 03:22:23 UTC
New trace from today with better symbols:

  • #0 js::GCMethods<JSObject*>::needsPostBarrier(JSObject*)
  • #1 JS::Heap<JSObject*>::set(JSObject*)
  • #2 JS::Heap<JSObject*>::operator=(JSObject* const&)
  • #3 GjsMaybeOwned<JSObject*>::reset()
  • #4 0x00007f8a4d74c1db in
  • #5 g_main_context_dispatch
  • #6 0x00007f8a4b840c88 in
  • #7 g_main_loop_run
  • #8 meta_run
  • #9 main

Setting up valgrind is my next step (I don't see source files there but I'm guessing a bit of grepping would do it?)
Comment 24 Christian Stadelmann 2017-07-20 17:09:06 UTC
(In reply to Daniel Playfair Cal from comment #23)
> Setting up valgrind is my next step (I don't see source files there but I'm guessing a bit of grepping would do it?)

You could attach gdb to valgrind using valgrind's "--vgdb=full --vgdb-error=1" command line options (in case you don't know these already).
Comment 25 Daniel Playfair Cal 2017-07-20 22:48:58 UTC
Thanks, will have a look but I'm not sure how that would help

I've tried running gnome-shell directly with valgrind, like so

`G_SLICE=always-malloc G_DEBUG=gc-friendly valgrind --log-file=gnome-shell.valgrind1 gnome-shell --mode=user --wayland`

Unfortunately, this doesn't end up starting the shell (at least not within a few minutes, and it usually takes ~5 seconds). The mouse appears but otherwise the console is still visible and I'm unable to do anything except restart. Obviously this makes it impossible to trigger the bug.

Here is a valgrind log produced roughly like that: https://gist.github.com/hedgepigdaniel/9046e32be88966f8ec5fe08adcd83256

`gnome-shell --mode=user --wayland` works fine (although the status bar is not hidpi scaled like it is if gdm starts it and keyboard shortcuts don't work, and so far I haven't been able to reproduce the crash that way). Does anybody know how I can find out what command line gdm or any other display manager would use so that I can start it in the same way with valgrind?

I also tried using the commands from here: https://wiki.gnome.org/Projects/GnomeShell/Debugging#Debugging_GNOME_Shell_with_valgrind

This didn't work because gnome-shell complained that it did not recognise the options -g or --debug-command. Do I need to compile gnome-shell in a special way for this to work?

Thanks Christian - If I understand correctly those options would allow me to use valgrind and GDB at the same time, so at the time of the crash or breakpoints I could inspect values of variables and such? It seems that I still need to get Valgrind to work first.
Comment 26 Christian Stadelmann 2017-07-21 07:23:22 UTC
(In reply to Daniel Playfair Cal from comment #25)
> Here is a valgrind log produced roughly like that:
> https://gist.github.com/hedgepigdaniel/9046e32be88966f8ec5fe08adcd83256

Seems like you are missing some more debuginfo files, e.g. for mesa, mutter-clutter, i965, ….

> `gnome-shell --mode=user --wayland` works fine (although the status bar is
> not hidpi scaled like it is if gdm starts it and keyboard shortcuts don't
> work, and so far I haven't been able to reproduce the crash that way). Does
> anybody know how I can find out what command line gdm or any other display
> manager would use so that I can start it in the same way with valgrind?

In case you are using Fedora: There is a Fedora package named gnome-valgrind-session [1]. It does not work though; you will need to patch it the way I suggested in [2] and [3].

If you're not using Fedora, you can get the sources under [4] and copy those files to the folders where your display manager can find them. Or just have a look at how it is done there. In my case (Fedora, again) this is:
/usr/bin/gnome-valgrind-errors
/usr/bin/gnome-valgrind-errors-postprocess
/usr/bin/gnome-valgrind-leaks
/usr/bin/gnome-valgrind-leaks-postprocess
/usr/share/xsessions/gnome-valgrind-errors.desktop
/usr/share/xsessions/gnome-valgrind-leaks.desktop

Note: I've never got the postprocess step to work.

[1] https://apps.fedoraproject.org/packages/gnome-valgrind-session
[2] https://bugzilla.redhat.com/show_bug.cgi?id=1376444
[3] https://bugzilla.redhat.com/show_bug.cgi?id=1376440
[4] http://hp.cl.no/proj/gnome-valgrind-session/src/

You might also be interested in enabling sysrq's (at least "R" for resetting keyboard input) as Ctrl+Alt+Fn might not work any more if valgrind made gnome-shell stop and wait for you to debug it.


> This didn't work because gnome-shell complained that it did not recognise
> the options -g or --debug-command. Do I need to compile gnome-shell in a
> special way for this to work?

No.

> Thanks Christian - If I understand correctly those options would allow me to
> use valgrind and GDB at the same time, so at the time of the crash or
> breakpoints I could inspect values of variables and such? It seems that I
> still need to get Valgrind to work first.

Yes. You'll need to start valgrind (see above for details how to do that). With the "--vgdb=full --vgdb-error=0" option valgrind will stop very early and tell you to attach gdb, which you might not be able to see because it logs to syslog. Change to another tty and run

$ gdb gnome-shell
(gdb) target remote | vgdb

and you can use gdb as usual. If you continue execution in gdb, it will run until it finds a memory violation, in your case probably a bunch of "Invalid write of size 4" at the beginning.

You might need a valgrind suppression file, although I don't know where to get one for i965. You might want to use the suppression file for Gtk+/GLib, which can be found at https://git.gnome.org/browse/glib/tree/glib.supp.
Comment 27 Mikhail 2017-07-21 15:22:17 UTC
$ coredumpctl gdb
           PID: 2109 (gnome-shell)
           UID: 1000 (mikhail)
           GID: 1000 (mikhail)
        Signal: 11 (SEGV)
     Timestamp: Fri 2017-07-21 18:38:57 +05 (1h 41min ago)
  Command Line: /usr/bin/gnome-shell
    Executable: /usr/bin/gnome-shell
 Control Group: /user.slice/user-1000.slice/session-2.scope
          Unit: session-2.scope
         Slice: user-1000.slice
       Session: 2
     Owner UID: 1000 (mikhail)
       Boot ID: 1066677081e8414dab3faf8dceb6759e
    Machine ID: 75b6ee13430d4cc7923b3637b296deec
      Hostname: localhost.localdomain
       Storage: /var/lib/systemd/coredump/core.gnome-shell.1000.1066677081e8414dab3faf8dceb6759e.2109.1500644337000000.lz4
       Message: Process 2109 (gnome-shell) of user 1000 dumped core.
                
                Stack trace of thread 2109:
                #0  0x00007f8079b4524d n/a (libgjs.so.0)
                #1  0x00007f8071e9df67 g_idle_dispatch (libglib-2.0.so.0)
                #2  0x00007f8071ea1587 g_main_context_dispatch (libglib-2.0.so.0)
                #3  0x00007f8071ea1928 g_main_context_iterate.isra.25 (libglib-2.0.so.0)
                #4  0x00007f8071ea1c42 g_main_loop_run (libglib-2.0.so.0)
                #5  0x00007f8076b1c2ec meta_run (libmutter-0.so.0)
                #6  0x000000f308304407 main (gnome-shell)
                #7  0x00007f80702c600a __libc_start_main (libc.so.6)
                #8  0x000000f30830451a _start (gnome-shell)
                
                Stack trace of thread 2118:
                #0  0x00007f80703b2d1b __poll (libc.so.6)
                #1  0x00007f8071ea18a9 g_main_context_iterate.isra.25 (libglib-2.0.so.0)
                #2  0x00007f8071ea19bc g_main_context_iteration (libglib-2.0.so.0)
                #3  0x00007f8071ea1a01 glib_worker_main (libglib-2.0.so.0)
                #4  0x00007f8071ec9086 g_thread_proxy (libglib-2.0.so.0)
                #5  0x00007f80706933a9 start_thread (libpthread.so.0)
                #6  0x00007f80703bf32f __clone (libc.so.6)
                
                Stack trace of thread 2119:
                #0  0x00007f80703b2d1b __poll (libc.so.6)
                #1  0x00007f8071ea18a9 g_main_context_iterate.isra.25 (libglib-2.0.so.0)
                #2  0x00007f8071ea1c42 g_main_loop_run (libglib-2.0.so.0)
                #3  0x00007f80739f8c86 gdbus_shared_thread_func (libgio-2.0.so.0)
                #4  0x00007f8071ec9086 g_thread_proxy (libglib-2.0.so.0)
                #5  0x00007f80706933a9 start_thread (libpthread.so.0)
                #6  0x00007f80703bf32f __clone (libc.so.6)
                
                Stack trace of thread 2189:
                #0  0x00007f80703b2d1b __poll (libc.so.6)
                #1  0x00007f807a3b0b71 poll_func (libpulse.so.0)
                #2  0x00007f807a3a2530 pa_mainloop_poll (libpulse.so.0)
                #3  0x00007f807a3a2bc0 pa_mainloop_iterate (libpulse.so.0)
                #4  0x00007f807a3a2c50 pa_mainloop_run (libpulse.so.0)
                #5  0x00007f807a3b0ab9 thread (libpulse.so.0)
                #6  0x00007f8070077078 internal_thread_func (libpulsecommon-10.0.so)
                #7  0x00007f80706933a9 start_thread (libpthread.so.0)
                #8  0x00007f80703bf32f __clone (libc.so.6)
                
                Stack trace of thread 2123:
                #0  0x00007f80703b2d1b __poll (libc.so.6)
                #1  0x00007f8071ea18a9 g_main_context_iterate.isra.25 (libglib-2.0.so.0)
                #2  0x00007f8071ea19bc g_main_context_iteration (libglib-2.0.so.0)
                #3  0x00007f8059bc2fed dconf_gdbus_worker_thread (libdconfsettings.so)
                #4  0x00007f8071ec9086 g_thread_proxy (libglib-2.0.so.0)
                #5  0x00007f80706933a9 start_thread (libpthread.so.0)
                #6  0x00007f80703bf32f __clone (libc.so.6)
                
                Stack trace of thread 2190:
                #0  0x00007f80706998eb pthread_cond_wait@@GLIBC_2.3.2 (libpthread.so.0)
                #1  0x00007f806af90540 PR_WaitCondVar (libnspr4.so)
                #2  0x00007f806ea4e0b1 _ZN2js12HelperThread10threadLoopEv (libmozjs-38.so)
                #3  0x00007f806af960bb _pt_root (libnspr4.so)
                #4  0x00007f80706933a9 start_thread (libpthread.so.0)
                #5  0x00007f80703bf32f __clone (libc.so.6)
                
                Stack trace of thread 2191:
                #0  0x00007f80706998eb pthread_cond_wait@@GLIBC_2.3.2 (libpthread.so.0)
                #1  0x00007f806af90540 PR_WaitCondVar (libnspr4.so)
                #2  0x00007f806ea4e0b1 _ZN2js12HelperThread10threadLoopEv (libmozjs-38.so)
                #3  0x00007f806af960bb _pt_root (libnspr4.so)
                #4  0x00007f80706933a9 start_thread (libpthread.so.0)
                #5  0x00007f80703bf32f __clone (libc.so.6)
                
                Stack trace of thread 2193:
                #0  0x00007f80706998eb pthread_cond_wait@@GLIBC_2.3.2 (libpthread.so.0)
                #1  0x00007f806af90540 PR_WaitCondVar (libnspr4.so)
                #2  0x00007f806ea4e0b1 _ZN2js12HelperThread10threadLoopEv (libmozjs-38.so)
                #3  0x00007f806af960bb _pt_root (libnspr4.so)
                #4  0x00007f80706933a9 start_thread (libpthread.so.0)
                #5  0x00007f80703bf32f __clone (libc.so.6)
                
                Stack trace of thread 2195:
                #0  0x00007f80706998eb pthread_cond_wait@@GLIBC_2.3.2 (libpthread.so.0)
                #1  0x00007f806af90540 PR_WaitCondVar (libnspr4.so)
                #2  0x00007f806ea4e0b1 _ZN2js12HelperThread10threadLoopEv (libmozjs-38.so)
                #3  0x00007f806af960bb _pt_root (libnspr4.so)
                #4  0x00007f80706933a9 start_thread (libpthread.so.0)
                #5  0x00007f80703bf32f __clone (libc.so.6)
                
                Stack trace of thread 2196:
                #0  0x00007f80706998eb pthread_cond_wait@@GLIBC_2.3.2 (libpthread.so.0)
                #1  0x00007f806af90540 PR_WaitCondVar (libnspr4.so)
                #2  0x00007f806ea4e0b1 _ZN2js12HelperThread10threadLoopEv (libmozjs-38.so)
                #3  0x00007f806af960bb _pt_root (libnspr4.so)
                #4  0x00007f80706933a9 start_thread (libpthread.so.0)
                #5  0x00007f80703bf32f __clone (libc.so.6)
                
                Stack trace of thread 2198:
                #0  0x00007f80706998eb pthread_cond_wait@@GLIBC_2.3.2 (libpthread.so.0)
                #1  0x00007f806af90540 PR_WaitCondVar (libnspr4.so)
                #2  0x00007f806ea4e0b1 _ZN2js12HelperThread10threadLoopEv (libmozjs-38.so)
                #3  0x00007f806af960bb _pt_root (libnspr4.so)
                #4  0x00007f80706933a9 start_thread (libpthread.so.0)
                #5  0x00007f80703bf32f __clone (libc.so.6)
                
                Stack trace of thread 2199:
                #0  0x00007f80706998eb pthread_cond_wait@@GLIBC_2.3.2 (libpthread.so.0)
                #1  0x00007f806af90540 PR_WaitCondVar (libnspr4.so)
                #2  0x00007f806ea4e0b1 _ZN2js12HelperThread10threadLoopEv (libmozjs-38.so)
                #3  0x00007f806af960bb _pt_root (libnspr4.so)
                #4  0x00007f80706933a9 start_thread (libpthread.so.0)
                #5  0x00007f80703bf32f __clone (libc.so.6)
                
                Stack trace of thread 2200:
                #0  0x00007f80706998eb pthread_cond_wait@@GLIBC_2.3.2 (libpthread.so.0)
                #1  0x00007f806af90540 PR_WaitCondVar (libnspr4.so)
                #2  0x00007f806ea4e0b1 _ZN2js12HelperThread10threadLoopEv (libmozjs-38.so)
                #3  0x00007f806af960bb _pt_root (libnspr4.so)
                #4  0x00007f80706933a9 start_thread (libpthread.so.0)
                #5  0x00007f80703bf32f __clone (libc.so.6)
                
                Stack trace of thread 2194:
                #0  0x00007f80706998eb pthread_cond_wait@@GLIBC_2.3.2 (libpthread.so.0)
                #1  0x00007f806af90540 PR_WaitCondVar (libnspr4.so)
                #2  0x00007f806ea4e0b1 _ZN2js12HelperThread10threadLoopEv (libmozjs-38.so)
                #3  0x00007f806af960bb _pt_root (libnspr4.so)
                #4  0x00007f80706933a9 start_thread (libpthread.so.0)
                #5  0x00007f80703bf32f __clone (libc.so.6)
                
                Stack trace of thread 2192:
                #0  0x00007f80706998eb pthread_cond_wait@@GLIBC_2.3.2 (libpthread.so.0)
                #1  0x00007f806af90540 PR_WaitCondVar (libnspr4.so)
                #2  0x00007f806ea4e0b1 _ZN2js12HelperThread10threadLoopEv (libmozjs-38.so)
                #3  0x00007f806af960bb _pt_root (libnspr4.so)
                #4  0x00007f80706933a9 start_thread (libpthread.so.0)
                #5  0x00007f80703bf32f __clone (libc.so.6)
                
                Stack trace of thread 2197:
                #0  0x00007f80706998eb pthread_cond_wait@@GLIBC_2.3.2 (libpthread.so.0)
                #1  0x00007f806af90540 PR_WaitCondVar (libnspr4.so)
                #2  0x00007f806ea4e0b1 _ZN2js12HelperThread10threadLoopEv (libmozjs-38.so)
                #3  0x00007f806af960bb _pt_root (libnspr4.so)
                #4  0x00007f80706933a9 start_thread (libpthread.so.0)
                #5  0x00007f80703bf32f __clone (libc.so.6)
                
                Stack trace of thread 2201:
                #0  0x00007f80706998eb pthread_cond_wait@@GLIBC_2.3.2 (libpthread.so.0)
                #1  0x00007f806af90540 PR_WaitCondVar (libnspr4.so)
                #2  0x00007f806ea4e0b1 _ZN2js12HelperThread10threadLoopEv (libmozjs-38.so)
                #3  0x00007f806af960bb _pt_root (libnspr4.so)
                #4  0x00007f80706933a9 start_thread (libpthread.so.0)
                #5  0x00007f80703bf32f __clone (libc.so.6)

GNU gdb (GDB) Fedora 8.0-17.fc27
Copyright (C) 2017 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
<http://www.gnu.org/software/gdb/documentation/>.
For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from /usr/bin/gnome-shell...Reading symbols from /usr/lib/debug/usr/bin/gnome-shell-3.25.3-1.fc27.x86_64.debug...done.
done.
[New LWP 2109]
[New LWP 2118]
[New LWP 2119]
[New LWP 2189]
[New LWP 2123]
[New LWP 2190]
[New LWP 2191]
[New LWP 2193]
[New LWP 2195]
[New LWP 2196]
[New LWP 2198]
[New LWP 2199]
[New LWP 2200]
[New LWP 2194]
[New LWP 2192]
[New LWP 2197]
[New LWP 2201]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Core was generated by `/usr/bin/gnome-shell'.
Program terminated with signal SIGSEGV, Segmentation fault.
  • #0 js::GCMethods<JSObject*>::needsPostBarrier
    at /usr/include/mozjs-38/js/RootingAPI.h line 663

Comment 28 Philip Chimento 2017-07-21 22:00:47 UTC
*** Bug 785232 has been marked as a duplicate of this bug. ***
Comment 29 Philip Chimento 2017-07-21 22:03:01 UTC
There's a Python script on bug 785232 which I just closed as a duplicate; apparently if it's run for long enough it will cause the crash. This might help to make it crash faster.
Comment 30 Daniel Playfair Cal 2017-07-22 12:20:14 UTC
This Python script works for me too: it takes a minute or two but always reproduces the crash. I tried to install opencv-python with pip but ran into problems, then realised that it is already built into the Arch opencv package.

I copied the ideas from that Fedora package plus Christian's comments and created this one for Arch: https://github.com/hedgepigdaniel/gnome-shell-valgrind. Building and installing it adds an extra session type, selectable in the display manager, that runs GNOME as usual except that gnome-shell runs under Valgrind and writes its logs and core dumps to ~/.valgrind-session/

I solved the spam from i965 by compiling mesa with --enable-valgrind as an argument to ./configure in the PKGBUILD. Apparently this adds some kind of annotation for valgrind false positives.

After doing all this, it's possible to select this session; it runs, and the Valgrind logging works. After logging in from gdm, the mouse freezes, then disappears, then reappears, then the screen stays grey for quite a while, then it goes back to the login screen. A Valgrind log and core dump end up in ~/.valgrind-session/

Unfortunately it looks like different bugs are preventing it from getting far enough to trigger this one.

Here are logs from valgrind and the journal when logging in with Wayland: https://gist.github.com/hedgepigdaniel/00c2792c33c7993d17134a48bfe691d0. 

Looks similar to (https://bugzilla.redhat.com/show_bug.cgi?id=1398142, https://bugzilla.redhat.com/show_bug.cgi?id=1349265, https://bugzilla.redhat.com/show_bug.cgi?id=1441490)

And the same but with X11: https://gist.github.com/hedgepigdaniel/23d1bc18d3abb9c7ae3cc373ab94649a. XWayland crashes in this case, googling the stack trace brings up lots of stuff.

Neither of these bugs occurs (at least not commonly) when not using valgrind - I'm still able to select the normal gnome session and it usually works.
Comment 31 Daniel Playfair Cal 2017-07-23 00:26:53 UTC
Here's a complete journal log of the Wayland crash and the valgrind output with debug symbols: https://gist.github.com/hedgepigdaniel/00c2792c33c7993d17134a48bfe691d0
Comment 32 Daniel Playfair Cal 2017-07-23 00:40:57 UTC
I tried disabling all the shell extensions (including gnome-shell-extension-taskbar which I think makes the problem much more likely). The python script was still able to cause a crash but it took much longer and the trace was different:

  • #0 js::gc::IsInsideNursery(js::gc::Cell const*)
    at /usr/include/mozjs-38/js/HeapAPI.h line 317
  • #1 js::GCMethods<JSObject*>::needsPostBarrier(JSObject*)
    at /usr/include/mozjs-38/js/RootingAPI.h line 663
  • #2 JS::Heap<JSObject*>::set(JSObject*)
    at /usr/include/mozjs-38/js/RootingAPI.h line 296
  • #3 JS::Heap<JSObject*>::operator=(JSObject* const&)
    at /usr/include/mozjs-38/js/RootingAPI.h line 266
  • #4 GjsMaybeOwned<JSObject*>::reset()
    at ./gjs/jsapi-util-root.h line 267
  • #5 closure_clear_idle(void*)
    at gi/closure.cpp line 133
  • #6 g_main_context_dispatch
  • #7 0x00007fab28e4ac88 in
  • #8 g_main_loop_run
  • #9 meta_run
    at core/main.c line 648
  • #10 main

Comment 33 Daniel Playfair Cal 2017-07-23 00:46:54 UTC
Journal log of the crash without extensions: https://gist.github.com/hedgepigdaniel/f25219e398568ad1cf3b83dea5206f73
Comment 34 Daniel Playfair Cal 2017-07-24 09:54:38 UTC
I managed to work around the crash starting gnome-shell by patching ui/background.js in gnome-desktop by commenting out everything that touched Gnome.WallClock, which fixed the excessive timer file handles that were causing the crash (see https://bugzilla.redhat.com/show_bug.cgi?id=1441490).

Unfortunately it still didn't initialise correctly - instead there were lots of entries in the log like this and it slowly chewed up more and more memory without initializing the display:

failed to commit changes to dconf: Timeout was reached

JS object wrapper for GObject 0x4ae69a00 (GSettings) is being released while toggle references are still pending. This may happen on exit in Gio.Application.vfunc_dbus_unregister(). If you encounter it another situation, please report a GJS bug.

Log: https://gist.github.com/anonymous/75c456456e81f1f146e4868e2c1647d1

I tried replacing ~/.config/gconf and disabling all other shell extensions, but it was gnome-shell-taskbar that made the difference: when I disabled it, the shell started successfully in Valgrind. I am now running the Python script, but I don't think I've seen the exact trace with taskbar disabled, only the slightly different one from comment 32.
Comment 35 Daniel Playfair Cal 2017-07-24 12:13:02 UTC
Created attachment 356291 [details]
Valgrind log (no taskbar, no crash)
Comment 36 Daniel Playfair Cal 2017-07-24 12:29:56 UTC
Created attachment 356295 [details]
Valgrind log (no taskbar, other crash)
Comment 37 Daniel Playfair Cal 2017-07-24 12:35:45 UTC
Created attachment 356296 [details]
Valgrind log (with taskbar, no crash)
Comment 38 Daniel Playfair Cal 2017-07-24 13:15:33 UTC
Created attachment 356297 [details]
Valgrind log (no taskbar, other crash in similar situation)

This one still has a different stacktrace but it crashed in just the same situation with that python script running after 1000-2000 images opened
Comment 39 Philip Chimento 2017-07-25 18:37:43 UTC
*** Bug 783951 has been marked as a duplicate of this bug. ***
Comment 40 Philip Chimento 2017-07-25 21:46:43 UTC
Thanks Daniel, those logs were quite helpful.

Pretty sure I know what's going on now - I think the signals.erase() in invalidate_all_signals() is preventing the code that removes the idle function in object_instance_finalize() from running, since object_instance_finalize() has no way to get to the idle handler ID at that point. We'll have to stick the pending idle IDs somewhere else.
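
To illustrate the failure mode described above, here is a minimal self-contained C++ sketch that models it without GLib or SpiderMonkey. `IdleQueue`, `ConnectRecord`, and `Wrapper` are hypothetical stand-ins invented for this sketch and are not the actual gjs types. The assumption is only that the idle handler ID lives in a per-connection record, and that the records are erased before finalization tries to cancel the handlers.

```cpp
#include <cassert>
#include <functional>
#include <map>
#include <utility>
#include <vector>

// Hypothetical stand-in for the main loop's idle sources (not the GLib API).
class IdleQueue {
public:
    // Schedules a callback and returns an ID that can be used to cancel it.
    unsigned add(std::function<void()> callback) {
        unsigned id = next_id_++;
        pending_[id] = std::move(callback);
        return id;
    }

    // Cancels a scheduled callback; returns false if the ID is unknown.
    bool remove(unsigned id) { return pending_.erase(id) > 0; }

    // Runs every callback that is still scheduled.
    void dispatch() {
        std::map<unsigned, std::function<void()>> batch;
        batch.swap(pending_);
        for (auto& entry : batch)
            entry.second();
    }

private:
    unsigned next_id_ = 1;
    std::map<unsigned, std::function<void()>> pending_;
};

// Hypothetical per-connection record; the only place the idle ID is stored.
struct ConnectRecord {
    unsigned idle_id = 0;
};

struct Wrapper {
    std::vector<ConnectRecord> signals;

    // Models invalidation: schedule deferred cleanup, then drop the records.
    void invalidate_all(IdleQueue& queue, int& stale_runs) {
        for (auto& record : signals)
            record.idle_id = queue.add([&stale_runs] { ++stale_runs; });
        signals.clear();  // the idle IDs are discarded along with the records
    }

    // Models finalization: tries to cancel pending cleanups but finds none.
    void finalize(IdleQueue& queue) {
        for (auto& record : signals)
            queue.remove(record.idle_id);
    }
};

// Returns how many deferred cleanups still ran after finalization.
int stale_cleanups_after_finalize() {
    IdleQueue queue;
    int stale_runs = 0;
    Wrapper wrapper;
    wrapper.signals.resize(2);
    wrapper.invalidate_all(queue, stale_runs);
    wrapper.finalize(queue);  // nothing left to iterate over
    assert(wrapper.signals.empty());
    queue.dispatch();         // both cleanups run anyway
    return stale_runs;
}
```

In this model the stale callbacks only bump a counter. In the real code they would touch memory that has already been freed, which is where the use-after-free would come from.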
Comment 41 Daniel Playfair Cal 2017-07-26 00:05:01 UTC
Thanks, glad to hear that it was useful. Thanks to you and everyone here for your help... C is hard.

Is it worth filing a separate bug for gnome-shell hanging when running in valgrind with taskbar enabled (and whatever vitalik_p is running) (#32)? It doesn't come up in normal use for me, but possibly it's evidence of a different problem.
Comment 42 Philip Chimento 2017-07-27 00:32:45 UTC
Created attachment 356448 [details] [review]
object: Keep proper track of pending closure invalidations

When a closure is invalidated during garbage collection, we can't free it
immediately because you can't stop tracing JS objects in the middle of
garbage collections. Instead we defer the free to an idle handler.

Previously, we kept track of the idle handler ID inside the closure's
ConnectData structure. However, it was possible for an idle handler to be
scheduled and the closure subsequently freed when the GObject itself was
freed. That meant that when the JS wrapper object was finalized, there
was no way to access the idle handler ID to remove it, so the idle
handler would still run, which meant use-after-free and occasionally a
crash.

This patch keeps track of pending idle handler IDs inside the JS wrapper
object's private structure, instead of the ConnectData structure, so that
all pending handlers are definitely removed when the JS wrapper object is
finalized.
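
The approach in this commit message can be sketched in self-contained C++ as follows. This is not the patch itself: `IdleQueue`, `Wrapper`, and `schedule_cleanup` are hypothetical names, and the queue is a toy model rather than GLib. The sketch only assumes what the message states, namely that the owner (standing in for the JS wrapper's private structure) records every pending idle ID and cancels all of them when it is finalized.

```cpp
#include <cassert>
#include <functional>
#include <map>
#include <memory>
#include <set>
#include <utility>

// Hypothetical stand-in for the main loop's idle sources (not the GLib API).
class IdleQueue {
public:
    unsigned add(std::function<void()> callback) {
        unsigned id = next_id_++;
        pending_[id] = std::move(callback);
        return id;
    }

    bool remove(unsigned id) { return pending_.erase(id) > 0; }

    // Runs every callback that is still scheduled.
    void dispatch() {
        std::map<unsigned, std::function<void()>> batch;
        batch.swap(pending_);
        for (auto& entry : batch)
            entry.second();
    }

private:
    unsigned next_id_ = 1;
    std::map<unsigned, std::function<void()>> pending_;
};

// Hypothetical owner that tracks its own pending idle IDs.
struct Wrapper {
    std::set<unsigned> pending_idle_ids;

    // Defers a cleanup and remembers its ID in the owner, not in the closure.
    void schedule_cleanup(IdleQueue& queue, int& runs) {
        auto id_slot = std::make_shared<unsigned>(0);
        *id_slot = queue.add([this, id_slot, &runs] {
            pending_idle_ids.erase(*id_slot);  // no longer pending once it runs
            ++runs;
        });
        pending_idle_ids.insert(*id_slot);
    }

    // Cancels everything still pending so nothing runs after finalization.
    void finalize(IdleQueue& queue) {
        for (unsigned id : pending_idle_ids)
            queue.remove(id);
        pending_idle_ids.clear();
    }
};

// Normal path: cleanups run and unregister themselves.
int runs_without_finalize() {
    IdleQueue queue;
    int runs = 0;
    Wrapper wrapper;
    wrapper.schedule_cleanup(queue, runs);
    wrapper.schedule_cleanup(queue, runs);
    queue.dispatch();
    assert(wrapper.pending_idle_ids.empty());
    return runs;
}

// Finalized path: pending cleanups are cancelled before they can run.
int runs_after_finalize() {
    IdleQueue queue;
    int runs = 0;
    Wrapper wrapper;
    wrapper.schedule_cleanup(queue, runs);
    wrapper.schedule_cleanup(queue, runs);
    wrapper.finalize(queue);
    queue.dispatch();
    return runs;
}
```

Because the IDs live in the owner, finalization can still cancel them even after the individual connection records have been dropped.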
Comment 43 Philip Chimento 2017-07-27 00:36:21 UTC
OK, I think this should do it. In any case, I managed to make a standalone test; it did not crash, but it did show the same use-after-free when run under Valgrind. (I expect that it eventually would crash if run 1000 times like the python script.) And the fix eliminates the use-after-free.

Please try the patch and let me know if it works for you. (It applies both to master and the gnome-3-24 branch.) If it fixes the problem, then I'll release another 1.48.x as soon as possible.

(In reply to Daniel Playfair Cal from comment #41)
> Is it worth filing a separate bug for gnome-shell hanging when running in
> valgrind with taskbar enabled (and whatever vitalik_p is running) (#32)?
> Doesn't come up in normal use for me but possibly it's evidence of a
> different problem.

Let's check first if it still happens after this fix is applied. If you can still get it to happen, then go ahead.
Comment 44 Tomas Popela 2017-07-27 05:15:22 UTC
(In reply to Philip Chimento from comment #43)
> Please try the patch and let me know if it works for you. (It applies both
> to master and the gnome-3-24 branch.) If it fixes the problem, then I'll
> release another 1.48.x as soon as possible.

The crash is gone for me. Thank you Philip for the fix!
Comment 45 Cosimo Cecchi 2017-07-27 08:07:39 UTC
Review of attachment 356448 [details] [review]:

Looks good to me.
Comment 46 Florent Thiéry 2017-07-27 08:43:26 UTC
Many thanks for the patch, currently testing.

For other Arch users, here is how to patch gjs with the patch provided by Philip:

packages/gjs/trunk$ svn diff PKGBUILD 
Index: PKGBUILD
===================================================================
--- PKGBUILD	(revision 300688)
+++ PKGBUILD	(working copy)
@@ -11,8 +11,8 @@
 depends=(cairo gobject-introspection-runtime js38 gtk3)
 makedepends=(gobject-introspection git gnome-common)
 _commit=43c5d7839630dd166372f2c404a9a72c87fd102a  # tags/1.48.5^0
-source=("git+https://git.gnome.org/browse/gjs#commit=$_commit")
-sha256sums=('SKIP')
+source=("git+https://git.gnome.org/browse/gjs#commit=$_commit" "crash.patch::https://bug783935.bugzilla-attachments.gnome.org/attachment.cgi?id=356448&action=diff&collapsed=&context=patch&format=raw&headers=1")
+sha256sums=('SKIP' 'SKIP')
 
 pkgver() {
   cd $pkgname
@@ -21,12 +21,13 @@
 
 prepare() {
   cd $pkgname
+  patch -p1 -i "${srcdir}/crash.patch"
   NOCONFIGURE=1 ./autogen.sh
 }
 
 build() {
   cd $pkgname
-  ./configure --prefix=/usr --disable-static --libexecdir=/usr/lib
+  ./configure --prefix=/usr --disable-static --libexecdir=/usr/lib --enable-debug-symbols=-gdwarf-2
   sed -i -e 's/ -shared / -Wl,-O1,--as-needed\0/g' libtool
   make
 }

@Philip can you comment on what precisely is needed to enable gjs debugging (for the record)? I used this in the beginning, but now when the CXXFLAGS is present, compiling fails:

+  export CXXFLAGS='-g -O0'
+  ./configure --prefix=/usr --disable-static --libexecdir=/usr/lib --enable-debug-symbols=-gdwarf-2


make[1]: Entering directory '/home/fthiery/src/arch/packages/gjs/trunk/src/gjs'
  CXX      gi/libgjs_la-object.lo
In file included from /usr/include/c++/7.1.1/x86_64-pc-linux-gnu/bits/os_defines.h:39:0,
                 from /usr/include/c++/7.1.1/x86_64-pc-linux-gnu/bits/c++config.h:533,
                 from /usr/include/c++/7.1.1/bits/stl_algobase.h:59,
                 from /usr/include/c++/7.1.1/deque:60,
                 from gi/object.cpp:26:
/usr/include/features.h:373:4: error: #warning _FORTIFY_SOURCE requires compiling with optimization (-O) [-Werror=cpp]
 #  warning _FORTIFY_SOURCE requires compiling with optimization (-O)
    ^~~~~~~
cc1plus: all warnings being treated as errors
make[1]: *** [Makefile:2232: gi/libgjs_la-object.lo] Error 1
make[1]: Leaving directory '/home/fthiery/src/arch/packages/gjs/trunk/src/gjs'
make: *** [Makefile:1371: all] Error 2

I'd be happy to clarify this in the Arch wiki (https://wiki.archlinux.org/index.php/GNOME/Troubleshooting#Shell_segfaults)
Comment 47 Daniel Playfair Cal 2017-07-27 13:30:03 UTC
I also added

export CPPFLAGS='-D_FORTIFY_SOURCE=0'
Comment 48 Florent Thiéry 2017-07-27 13:41:43 UTC
Many thanks; by the way, I have had no crashes since I applied the patch.
Comment 49 François Guerraz 2017-07-27 14:58:14 UTC
I confirm the issue is fixed for me now, I'm glad it's finally fixed (I've had similar crashes for ages but I had never been able to reproduce it reliably until recently). Hopefully that's a major stability improvement for gnome!
Comment 50 Vít Ondruch 2017-07-27 17:48:25 UTC
This is a scratch build of the Fedora gjs package with the patch from comment 42 applied:

https://koji.fedoraproject.org/koji/taskinfo?taskID=20805091
Comment 51 Philip Chimento 2017-07-27 22:42:29 UTC
(In reply to Florent Thiéry from comment #46)
> @Philip can you comment what is precisely needed to enable gjs debugging
> (for the record)? I used this in the beginning, but now when the CXXFLAGS is
> present, compiling fails:
> 
> +  export CXXFLAGS='-g -O0'
> +  ./configure --prefix=/usr --disable-static --libexecdir=/usr/lib
> --enable-debug-symbols=-gdwarf-2

I've never had that problem with _FORTIFY_SOURCE, perhaps it is new in your version of GCC? (I don't have 7.x yet.)

I use CXXFLAGS='-g -Og -fdiagnostics-color=auto' but the most important thing is to configure mozjs with --enable-debug, which I see you also have described on the wiki page.
Comment 52 Philip Chimento 2017-07-27 23:02:13 UTC
Attachment 356448 [details] pushed as db3e387 - object: Keep proper track of pending closure invalidations
Comment 53 Philip Chimento 2017-07-27 23:18:43 UTC
GJS 1.48.6 is now released with this fix in.
Comment 54 Daniel Playfair Cal 2017-07-27 23:26:23 UTC
I'm on gcc 7.1.1 so yeah, maybe it's new. Wow, had no idea gcc could output in colour.

Sounds like a good idea to improve the Arch wiki. It would be good to link to this page from there as well: https://wiki.archlinux.org/index.php/Debug_-_Getting_Traces. It wasn't obvious to me that the 'debug' option in makepkg was a mostly universal way of compiling with debug symbols. Also, a quick guide to running Valgrind and suppressing false positives would probably help; perhaps that should be a new page.

Thanks for the patch - I'll try it now in normal use but I've opened a very large number of smiley faces with that script and so far all is well :)
Comment 55 Kalev Lember 2017-07-28 09:58:41 UTC
Fedora update: https://bodhi.fedoraproject.org/updates/gjs-1.48.6-1.fc26
Comment 56 Matías Zúñiga 2017-07-29 03:54:01 UTC
This still happens for me with 1.48.6 (in fedora), although the crashes are less common, and the trace is not exactly the same, like the one in comment 32. For me it happens randomly when playing in wine, and seems to be more frequent when audacious is playing music. 

my enabled extensions:
$ gsettings get org.gnome.shell enabled-extensions
['alternate-tab@gnome-shell-extensions.gcampax.github.com', 'StatusTitleBar@devpower.org', 'TaskBar@zpydr', 'mediaplayer@patapon.info', 'background-logo@fedorahosted.org', 'dash-to-dock@micxgx.gmail.com', 'drop-down-terminal@gs-extensions.zzrough.org', 'user-theme@gnome-shell-extensions.gcampax.github.com', 'remove-dropdown-arrows@mpdeimos.com', 'dynamicTopBar@gnomeshell.feildel.fr', 'impatience@gfxmonk.net', 'sound-output-device-chooser@kgshank.net']
__________

abrt-notification[19556]: Process 11137 (gnome-shell) crashed in js::GCMethods<JSObject*>::needsPostBarrier(JSObject*)()

(gdb) bt
  • #0 js::gc::IsInsideNursery(js::gc::Cell const*)
    at /usr/include/mozjs-38/js/HeapAPI.h line 317
  • #1 js::GCMethods<JSObject*>::needsPostBarrier(JSObject*)
    at /usr/include/mozjs-38/js/RootingAPI.h line 663
  • #2 JS::Heap<JSObject*>::set(JSObject*)
    at /usr/include/mozjs-38/js/RootingAPI.h line 296
  • #3 JS::Heap<JSObject*>::operator=(JSObject* const&)
    at /usr/include/mozjs-38/js/RootingAPI.h line 266
  • #4 GjsMaybeOwned<JSObject*>::reset()
    at gjs/jsapi-util-root.h line 267
  • #5 closure_clear_idle(void*)
    at gi/closure.cpp line 133
  • #6 g_idle_dispatch
  • #7 g_main_context_dispatch
  • #8 g_main_context_iterate.isra
  • #9 g_main_loop_run
  • #10 meta_run
    at core/main.c line 648
  • #11 main
    at main.c line 454
  • #0 js::gc::IsInsideNursery(js::gc::Cell const*)
    at /usr/include/mozjs-38/js/HeapAPI.h line 317
  • #1 js::GCMethods<JSObject*>::needsPostBarrier(JSObject*)
    at /usr/include/mozjs-38/js/RootingAPI.h line 663
  • #2 JS::Heap<JSObject*>::set(JSObject*)
    at /usr/include/mozjs-38/js/RootingAPI.h line 296
  • #3 JS::Heap<JSObject*>::operator=(JSObject* const&)
    at /usr/include/mozjs-38/js/RootingAPI.h line 266
  • #4 GjsMaybeOwned<JSObject*>::reset()
    at gjs/jsapi-util-root.h line 267
  • #5 release_native_object(ObjectInstance*)
    at gi/object.cpp line 1259
  • #6 object_instance_finalize(JSFreeOp*, JSObject*)
    at gi/object.cpp line 1684
  • #7 JSObject::finalize(js::FreeOp*)
    at /usr/src/debug/mozilla-esr38/js/src/jsobjinlines.h line 42
  • #8 js::gc::Arena::finalize<JSObject>(js::FreeOp*, js::gc::AllocKind, unsigned long)
    at /usr/src/debug/mozilla-esr38/js/src/jsgc.cpp line 497
  • #9 FinalizeTypedArenas<JSObject>
    at /usr/src/debug/mozilla-esr38/js/src/jsgc.cpp line 557
  • #10 FinalizeArenas(js::FreeOp *, js::gc::ArenaHeader **, js::gc::SortedArenaList &, js::gc::AllocKind, struct SliceBudget &, js::gc::ArenaLists::KeepArenasEnum)
    at /usr/src/debug/mozilla-esr38/js/src/jsgc.cpp line 600
  • #11 js::gc::ArenaLists::forceFinalizeNow(js::FreeOp*, js::gc::AllocKind, js::gc::ArenaLists::KeepArenasEnum, js::gc::ArenaHeader**)
    at /usr/src/debug/mozilla-esr38/js/src/jsgc.cpp line 2758
  • #12 js::gc::ArenaLists::finalizeNow(js::FreeOp*, js::gc::AllocKind, js::gc::ArenaLists::KeepArenasEnum, js::gc::ArenaHeader**)
    at /usr/src/debug/mozilla-esr38/js/src/jsgc.cpp line 2741
  • #13 js::gc::ArenaLists::queueForegroundObjectsForSweep(js::FreeOp*)
    at /usr/src/debug/mozilla-esr38/js/src/jsgc.cpp line 2876
  • #14 js::gc::GCRuntime::beginSweepingZoneGroup()
    at /usr/src/debug/mozilla-esr38/js/src/jsgc.cpp line 5069
  • #15 js::gc::GCRuntime::beginSweepPhase(bool)
    at /usr/src/debug/mozilla-esr38/js/src/jsgc.cpp line 5164
  • #16 js::gc::GCRuntime::incrementalCollectSlice(js::SliceBudget&, JS::gcreason::Reason)
    at /usr/src/debug/mozilla-esr38/js/src/jsgc.cpp line 5889
  • #17 js::gc::GCRuntime::gcCycle(bool, js::SliceBudget&, JS::gcreason::Reason)
    at /usr/src/debug/mozilla-esr38/js/src/jsgc.cpp line 6076
  • #18 js::gc::GCRuntime::collect(bool, js::SliceBudget, JS::gcreason::Reason)
    at /usr/src/debug/mozilla-esr38/js/src/jsgc.cpp line 6190
  • #19 gjs_schedule_gc_if_needed(JSContext*)
    at gjs/jsapi-util.cpp line 844
  • #20 gjs_call_function_value(JSContext*, JS::HandleObject, JS::HandleValue, JS::HandleValueArray const&, JS::MutableHandleValue)
    at gjs/jsapi-util.cpp line 719
  • #21 boxed_invoke_constructor(JSContext*, JS::HandleObject, JS::HandleId, JS::CallArgs&)
    at gi/boxed.cpp line 337
  • #22 boxed_new(JSContext*, JS::HandleObject, Boxed*, JS::CallArgs&)
    at gi/boxed.cpp line 393
  • #23 gjs_boxed_constructor(JSContext*, unsigned int, JS::Value*)
    at gi/boxed.cpp line 480
  • #24 0x00007f924010f10d in
  • #25 0x0000000000000000 in

Comment 57 Vít Ondruch 2017-07-29 10:48:09 UTC
I applied the patch to gjs-1.49.3 in Fedora Rawhide and now the crashes caught by ABRT reference this bug:

https://bugzilla.redhat.com/show_bug.cgi?id=1133131
Comment 58 Philip Chimento 2017-07-30 14:39:01 UTC
Thanks for reporting it! Could you please open a new bug and provide the following information, if possible:

- Stack trace with debug symbols (you have it already)
- Steps to make the crash happen, if possible
- Valgrind log, if possible
- Elimination of which shell extension causes the problem, if any

The duplicate bug marked by ABRT is almost certainly a problem in ABRT. From the comments it looks like it is marking all kinds of unrelated bugs as duplicates of that one.
Comment 59 Matías Zúñiga 2017-08-02 06:25:14 UTC
Just got this again with no extensions enabled; the trace is the same. I was just browsing in Firefox with background music (audacious).
Yesterday gnome-shell crashed while I was using Firefox and audacious was not running (but some extensions were enabled), so I don't think it's related to playing music.

I have no steps to reproduce this, as it's pretty random. It can take minutes or hours to happen.

I tried using the Valgrind session from [1], but I can only see the cursor (and a 3 GiB log file in my home directory).

[1] https://bodhi.fedoraproject.org/updates/FEDORA-2017-4b45b98198
Comment 60 vitalik_p 2017-08-05 12:00:12 UTC
It's just awful.

Try this test:

Change the last test in "installed-tests/js/testEverythingEncapsulated.js" like this:

describe('Garbage collection of introspected objects', function () {
    // This tests a regression that would very rarely crash, but
    // when run under valgrind this code would show use-after-free.
    it('collects objects properly with signals connected', function (done) {
        function orphanObject() {
            let obj = new Regress.TestObj();
            obj.destroy();
            obj = null;
        }

        orphanObject();
        System.gc();
        GLib.idle_add(GLib.PRIORITY_LOW, () => done());
    });
});

then run "make check"

FAIL: installed-tests/js/testEverythingEncapsulated.js 30 Garbage collection of introspected objects collects objects properly with signals connected
# Message: TypeError: obj.destroy is not a function in ./installed-tests/js/testEverythingEncapsulated.js (line 273)
# Stack:
#   orphanObject@./installed-tests/js/testEverythingEncapsulated.js:273:4
#   @./installed-tests/js/testEverythingEncapsulated.js:277:9
# Test script failed; see test log for assertions
ERROR: installed-tests/js/testEverythingEncapsulated.js - exited with status 1
Comment 61 Hans de Goede 2017-08-06 05:58:44 UTC
I'm still seeing gnome-shell crash in Fedora-26 with the gjs-1.48.6-1.fc26 update too. I'll attach gdb from a text console and collect a backtrace.
Comment 62 vitalik_p 2017-08-06 11:13:22 UTC
Sorry, the previous test was not correct. It should be:

>let obj = new Regress.TestBoxedD('test1', 123);
>obj.free();
>obj = null;

The object needs to be freed before the garbage collector runs.

$ cat test.js 

'use strict'

const GLib = imports.gi.GLib;
const Regress = imports.gi.Regress;

let obj = new Regress.TestObj();
let obj = new Regress.TestBoxedD('test1', 123);
obj.free();
obj = null;

log('done');

$ LD_LIBRARY_PATH=<gjs build path>/.libs/ GI_TYPELIB_PATH=<gjs build path>/ gjs-console test.js
Gjs-Message: JS LOG: done
*** Error in `gjs-console': free(): invalid pointer: 0x0000009b21c49920 ***
======= Backtrace: =========
Comment 63 vitalik_p 2017-08-06 11:17:36 UTC
> let obj = new Regress.TestObj();

Comment out or remove this line (see test.js above).
Comment 64 Philip Chimento 2017-08-07 08:34:28 UTC
All readers:
============

The problems originally described by the stack traces on this bug report have supposedly been fixed now. Please do not keep posting new comments or stack traces on this bug unless you are *sure* that they describe exactly the original problem, and that the fix in 1.48.6 was faulty.

Of course, there may be one or even several problems still in existence that cause gnome-shell to crash for you! Here's what you can do instead.

  - Check if your stack trace matches one of these bugs. These are the gnome-shell crashes currently open (or opened but later closed due to lack of information)
    * bug 782464
    * bug 782692
    * bug 783771
    * bug 785657
    Post reproducer info there, stack traces with debug symbols, and output of `call gjs_dumpstack()` from GDB. If the bug was closed as INCOMPLETE but you can provide the missing information, fantastic! Feel free to reopen it.

  - If none of the above bugs match your stack trace, and no-one else has reported a similar stack trace to yours in the meantime, then please open a new bug.

The reason I ask this is not to be bureaucratic or to deny that crashes are happening, but to keep the information manageable for myself as I fix these bugs. If all of the stack traces from unrelated problems are posted here, then I will lose track of which ones are fixed and which ones are not.

To be specific, I am certain that closure structures are still being used after free in multiple places, but I really need you to open separate bugs for separate instances. If it all gets lumped together on this bug report then I can't keep track of it.

Thank you.
Comment 65 Philip Chimento 2017-08-07 08:37:52 UTC
(In reply to vitalik_p from comment #62)
> 'use strict'
> 
> const GLib = imports.gi.GLib;
> const Regress = imports.gi.Regress;
> 
> let obj = new Regress.TestObj();
> let obj = new Regress.TestBoxedD('test1', 123);
> obj.free();
> obj = null;

You can't free objects from JavaScript code like that. This will definitely crash, but that is on purpose. (We should really get gobject-introspection to stop exposing free functions in memory-managed languages. I'll open a bug report for that.)

Do you know of any gnome-shell extension or application code that is doing this?
Comment 66 vitalik_p 2017-08-07 09:59:33 UTC
> Do you know of any gnome-shell extension or application code that is doing this?

Maybe you're right.
I think the object can be freed in another way (another thread?).
I don't have free time to check this.
At the moment you call some method, the object may already have been destroyed.
Comment 67 Florent Thiéry 2017-08-30 08:07:42 UTC
Hi

Just had it here again. gjs is now built from commit a9db649304db525ca166ec0845ee7a86cea4bf7f, which includes the patch provided in this ticket. It seems to be happening pretty quickly now; I'm not sure what changed in the meantime (I upgraded my Arch system).

Here are the related package versions:
local/gjs 1.48.6-1 (recompiled with debug)
local/js38 38.8.0-3 (recompiled with debug)
local/js38-debug 38.8.0-3 (recompiled with debug)
local/gnome-shell 3.24.3-1 (gnome)

Using host libthread_db library "/usr/lib/libthread_db.so.1".
Core was generated by `/usr/bin/gnome-shell'.
Program terminated with signal SIGSEGV, Segmentation fault.
  • #0 js::GCMethods<JSObject*>::needsPostBarrier(JSObject*)
    from /usr/lib/libgjs.so.0
  • #0 js::GCMethods<JSObject*>::needsPostBarrier(JSObject*)
  • #1 JS::Heap<JSObject*>::set(JSObject*)
  • #2 JS::Heap<JSObject*>::operator=(JSObject* const&)
  • #3 GjsMaybeOwned<JSObject*>::reset()
  • #4 0x00007f353181f2ae in
  • #5 g_main_context_dispatch
  • #6 0x00007f352f8f3c88 in
  • #7 g_main_loop_run
  • #8 meta_run
  • #9 main

Here are the packages that have been upgraded and that are more or less closely related to gnome:

2017-08-28 09:46:53 libxkbcommon
2017-08-28 09:46:53 libxkbcommon-x11
2017-08-28 09:46:57 libdrm
2017-08-28 09:46:57 wayland
2017-08-28 09:46:59 mesa
2017-08-28 09:47:03 gdk-pixbuf2
2017-08-28 09:47:03 gtk-update-icon-cache
2017-08-28 09:47:03 gtk3
2017-08-28 09:47:11 clutter-gtk
2017-08-28 09:47:18 gnome-logs
2017-08-28 09:47:18 gnome-online-miners
2017-08-28 09:47:18 gnome-settings-daemon
2017-08-28 09:47:19 gtk-doc
2017-08-28 09:47:19 gtkspell
2017-08-28 09:47:53 networkmanager
2017-08-28 09:47:53 wpa_supplicant
2017-08-28 09:48:02 xf86-video-intel
2017-08-28 09:48:02 xorg-server
2017-08-28 09:48:02 xorg-server-common
2017-08-28 09:48:02 xorg-server-xwayland
2017-08-30 09:48:54 js38-debug
2017-08-30 09:49:12 js38
2017-08-30 09:49:21 gjs

Is my stack trace sufficiently close to assume it is the same bug?
Comment 68 Philip Chimento 2017-08-31 05:21:25 UTC
Hi,

There are an unknown number of related crashes. I'm keeping one bug report open unless I can obviously tell that two are different. This one was fixed, so yours is probably bug 785657.