After an evaluation, GNOME has moved from Bugzilla to GitLab. Learn more about GitLab.
No new issues can be reported in GNOME Bugzilla anymore.
To report an issue in a GNOME project, go to GNOME GitLab.
Do not go to GNOME Gitlab for: Bluefish, Doxygen, GnuCash, GStreamer, java-gnome, LDTP, NetworkManager, Tomboy.
Bug 781194 - Intermittent crash in gnome-shell, probably in weak pointer updating code
Intermittent crash in gnome-shell, probably in weak pointer updating code
Status: RESOLVED FIXED
Product: gjs
Classification: Bindings
Component: general
1.48.x
Other Linux
: Normal critical
: ---
Assigned To: Georges Basile Stavracas Neto
gjs-maint
Depends on:
Blocks:
 
 
Reported: 2017-04-11 19:27 UTC by Philip Chimento
Modified: 2017-04-20 05:41 UTC
See Also:
GNOME target: ---
GNOME version: ---


Attachments
WIP - js: Make weak pointer lists threadsafe (5.39 KB, patch)
2017-04-19 04:23 UTC, Philip Chimento
rejected Details | Review
Fix compile issue (5.15 KB, patch)
2017-04-19 13:37 UTC, Georges Basile Stavracas Neto
rejected Details | Review
object: Avoid modifying weak_pointer_list while iterating it (2.08 KB, patch)
2017-04-19 21:27 UTC, Georges Basile Stavracas Neto
none Details | Review
object: Avoid modifying weak_pointer_list while iterating it (1.94 KB, patch)
2017-04-19 22:00 UTC, Georges Basile Stavracas Neto
committed Details | Review

Description Philip Chimento 2017-04-11 19:27:17 UTC
I have received some reports of this crash in gnome-shell with GJS 1.48.0. It seems to happen often for some people, sporadically for others, or not at all for still others (me).

Since I haven't been able to reproduce it yet, I need some help tracking it down. If you have seen this crash, please try to get debugging symbols for libmozjs and GJS, and run a session of gnome-shell under GDB, or preferably, RR [1].

It looks like the crash happens when passing a null pointer to JS_UpdateWeakPointerAfterGC. If you can reproduce the crash, here's the info that would be useful to me:

- A backtrace with debugging symbols
- Any other information about the ObjectInstance struct that owns the null pointer
- (with RR) when the pointer became null but wasn't removed from the weak_pointer_list set

[1] http://rr-project.org/
Comment 1 Georges Basile Stavracas Neto 2017-04-11 20:50:36 UTC
Backtrace below:

  • #0 update_heap_wrapper_weak_pointers(JSRuntime*, gpointer)
    at gi/object.cpp line 1328
  • #1 js::gc::GCRuntime::callWeakPointerCallbacks() const
    at /home/feaneron/Documentos/Projetos/packages/js38/trunk/src/mozilla-esr38/js/src/jsgc.cpp line 1596
  • #2 js::gc::GCRuntime::beginSweepingZoneGroup()
    at /home/feaneron/Documentos/Projetos/packages/js38/trunk/src/mozilla-esr38/js/src/jsgc.cpp line 4965
  • #3 js::gc::GCRuntime::beginSweepPhase(bool)
    at /home/feaneron/Documentos/Projetos/packages/js38/trunk/src/mozilla-esr38/js/src/jsgc.cpp line 5164
  • #4 js::gc::GCRuntime::incrementalCollectSlice(js::SliceBudget&, JS::gcreason::Reason)
    at /home/feaneron/Documentos/Projetos/packages/js38/trunk/src/mozilla-esr38/js/src/jsgc.cpp line 5889
  • #5 js::gc::GCRuntime::gcCycle(bool, js::SliceBudget&, JS::gcreason::Reason)
    at /home/feaneron/Documentos/Projetos/packages/js38/trunk/src/mozilla-esr38/js/src/jsgc.cpp line 6076
  • #6 js::gc::GCRuntime::collect(bool, js::SliceBudget, JS::gcreason::Reason)
    at /home/feaneron/Documentos/Projetos/packages/js38/trunk/src/mozilla-esr38/js/src/jsgc.cpp line 6190
  • #7 js::gc::GCRuntime::startGC(JSGCInvocationKind, JS::gcreason::Reason, long)
    at /home/feaneron/Documentos/Projetos/packages/js38/trunk/src/mozilla-esr38/js/src/jsgc.cpp line 6259
  • #8 js::gc::GCRuntime::maybePeriodicFullGC()
    at /home/feaneron/Documentos/Projetos/packages/js38/trunk/src/mozilla-esr38/js/src/jsgc.cpp line 3246
  • #9 JS_MaybeGC(JSContext*)
    at /home/feaneron/Documentos/Projetos/packages/js38/trunk/src/mozilla-esr38/js/src/jsapi.cpp line 1341
  • #10 gjs_schedule_gc_if_needed(JSContext*)
    at gjs/jsapi-util.cpp line 844
  • #11 gjs_call_function_value(JSContext*, JS::HandleObject, JS::HandleValue, JS::HandleValueArray const&, JS::MutableHandleValue)
    at gjs/jsapi-util.cpp line 719
  • #12 gjs_closure_invoke(GClosure*, JS::HandleValueArray const&, JS::MutableHandleValue)
    at gi/closure.cpp line 229
  • #13 closure_marshal(GClosure*, GValue*, guint, GValue const*, gpointer, gpointer)
    at gi/value.cpp line 273
  • #14 g_closure_invoke
  • #15 0x00007fc8757c7f82 in
  • #16 g_signal_emit_valist
  • #17 g_signal_emit
  • #18 0x00007fc8159ffc7a in
  • #19 0x00007fc875ab456c in
  • #20 g_main_context_dispatch
  • #21 0x00007fc8754dca00 in
  • #22 g_main_loop_run
  • #23 meta_run
  • #24 main

Comment 2 Georges Basile Stavracas Neto 2017-04-11 20:51:29 UTC
Actually, I forgot the header:

  • #0 update_heap_wrapper_weak_pointers
    at gi/object.cpp line 1328

Comment 3 Philip Chimento 2017-04-11 23:01:36 UTC
From Georges' investigation I see the problem is not passing a null pointer to JS_UpdateWeakPointerAfterGC(). Instead it's that there's a null pointer stored in weak_pointer_list. That is, "priv" is null in this line: https://git.gnome.org/browse/gjs/tree/gi/object.cpp?h=1.48.1&id=c418799bff7ee9617e39648b69933ba6f695dee2#n1327

What would be incredibly useful is to find out where and how that null pointer got stored in there. I would suggest doing this with RR. It would go approximately like this:

$ rr record gnome-shell --replace
...wait for crash...
$ rr replay
(rr) cont
...wait for crash...
(rr) print iter
0x1027a30 or something
(rr) watch -l iter
(rr) reverse-cont

That should land you at the place where the null ObjectInstance pointer was stored in weak_pointer_list.
Comment 4 Christophe Fergeau 2017-04-18 16:51:45 UTC
For what it's worth, I believe I'm seeing a similar crash about once a day since switching to fedora 26. I saw at least 2 different backtraces when this happened, but both were involving update_heap_wrapper_weak_pointers.

Since I'm running on wayland, I'm not sure gnome-shell --replace is an option? Or is it? (never tried I think ;)

One of the backtrace I collected:

  • #0 js::gc::GCRuntime::isHeapMinorCollecting()
    at /usr/src/debug/mozilla-esr38/js/src/gc/GCRuntime.h line 602
  • #1 JSRuntime::isHeapMinorCollecting()
    at /usr/src/debug/mozilla-esr38/js/src/vm/Runtime.h line 931
  • #2 js::gc::IsAboutToBeFinalizedFromAnyThread<JSObject>(JSObject**)
    at /usr/src/debug/mozilla-esr38/js/src/gc/Marking.cpp line 463
  • #3 js::gc::IsAboutToBeFinalized<JSObject>
    at /usr/src/debug/mozilla-esr38/js/src/gc/Marking.cpp line 444
  • #4 js::gc::IsObjectAboutToBeFinalized(JSObject**)
    at /usr/src/debug/mozilla-esr38/js/src/gc/Marking.cpp line 599
  • #5 JS_UpdateWeakPointerAfterGCUnbarriered(JSObject**)
    at /usr/src/debug/mozilla-esr38/js/src/jsapi.cpp line 1386
  • #6 JS_UpdateWeakPointerAfterGC(JS::Heap<JSObject*>*)
    at /usr/src/debug/mozilla-esr38/js/src/jsapi.cpp line 1380
  • #7 GjsHeapOperation<JSObject*>::update_after_gc(JS::Heap<JSObject*>*)
    at gjs/jsapi-util-root.h line 89
  • #8 GjsMaybeOwned<JSObject*>::update_after_gc()
    at gjs/jsapi-util-root.h line 329
  • #9 update_heap_wrapper_weak_pointers(JSRuntime*, gpointer)
    at gi/object.cpp line 1328
  • #10 js::gc::GCRuntime::callWeakPointerCallbacks() const
    at /usr/src/debug/mozilla-esr38/js/src/jsgc.cpp line 1596
  • #11 js::gc::GCRuntime::beginSweepingZoneGroup()
    at /usr/src/debug/mozilla-esr38/js/src/jsgc.cpp line 4965
  • #12 js::gc::GCRuntime::beginSweepPhase(bool)
    at /usr/src/debug/mozilla-esr38/js/src/jsgc.cpp line 5164
  • #13 js::gc::GCRuntime::incrementalCollectSlice(js::SliceBudget&, JS::gcreason::Reason)
    at /usr/src/debug/mozilla-esr38/js/src/jsgc.cpp line 5889
  • #14 js::gc::GCRuntime::gcCycle(bool, js::SliceBudget&, JS::gcreason::Reason)
    at /usr/src/debug/mozilla-esr38/js/src/jsgc.cpp line 6076
  • #15 js::gc::GCRuntime::collect(bool, js::SliceBudget, JS::gcreason::Reason)
    at /usr/src/debug/mozilla-esr38/js/src/jsgc.cpp line 6190
  • #16 js::gc::GCRuntime::startGC(JSGCInvocationKind, JS::gcreason::Reason, long)
    at /usr/src/debug/mozilla-esr38/js/src/jsgc.cpp line 6259
  • #17 js::gc::GCRuntime::maybePeriodicFullGC()
    at /usr/src/debug/mozilla-esr38/js/src/jsgc.cpp line 3246
  • #18 JS_MaybeGC(JSContext*)
    at /usr/src/debug/mozilla-esr38/js/src/jsapi.cpp line 1341
  • #19 gjs_schedule_gc_if_needed(JSContext*)
    at gjs/jsapi-util.cpp line 844
  • #20 gjs_call_function_value(JSContext*, JS::HandleObject, JS::HandleValue, JS::HandleValueArray const&, JS::MutableHandleValue)
    at gjs/jsapi-util.cpp line 719
  • #21 gjs_closure_invoke(GClosure*, JS::HandleValueArray const&, JS::MutableHandleValue)
    at gi/closure.cpp line 229
  • #22 closure_marshal(GClosure*, GValue*, guint, GValue const*, gpointer, gpointer)
    at gi/value.cpp line 273
  • #23 g_closure_invoke
  • #24 signal_emit_unlocked_R
  • #25 g_signal_emit_valist
  • #26 g_signal_emit

Comment 5 Christophe Fergeau 2017-04-18 16:52:47 UTC
The other backtrace I saw:

  • #0 std::_Rb_tree_increment(std::_Rb_tree_node_base*)
  • #1 std::_Rb_tree_const_iterator<ObjectInstance*>::operator++(int)
    at /usr/include/c++/7/bits/stl_tree.h line 373
  • #2 update_heap_wrapper_weak_pointers(JSRuntime*, gpointer)
    at gi/object.cpp line 1329
  • #3 js::gc::GCRuntime::callWeakPointerCallbacks() const
    at /usr/src/debug/mozilla-esr38/js/src/jsgc.cpp line 1596
  • #4 js::gc::GCRuntime::beginSweepingZoneGroup()
    at /usr/src/debug/mozilla-esr38/js/src/jsgc.cpp line 4965
  • #5 js::gc::GCRuntime::beginSweepPhase(bool)
    at /usr/src/debug/mozilla-esr38/js/src/jsgc.cpp line 5164
  • #6 js::gc::GCRuntime::incrementalCollectSlice(js::SliceBudget&, JS::gcreason::Reason)
    at /usr/src/debug/mozilla-esr38/js/src/jsgc.cpp line 5889
  • #7 js::gc::GCRuntime::gcCycle(bool, js::SliceBudget&, JS::gcreason::Reason)
    at /usr/src/debug/mozilla-esr38/js/src/jsgc.cpp line 6076
  • #8 js::gc::GCRuntime::collect(bool, js::SliceBudget, JS::gcreason::Reason)
    at /usr/src/debug/mozilla-esr38/js/src/jsgc.cpp line 6190
  • #9 js::gc::GCRuntime::startGC(JSGCInvocationKind, JS::gcreason::Reason, long)
    at /usr/src/debug/mozilla-esr38/js/src/jsgc.cpp line 6259
  • #10 js::gc::GCRuntime::gcIfRequested(JSContext*)
    at /usr/src/debug/mozilla-esr38/js/src/jsgc.cpp line 6454
  • #11 InvokeInterruptCallback(JSContext*)
    at /usr/src/debug/mozilla-esr38/js/src/vm/Runtime.cpp line 525
  • #12 0x00007f347ca2068a in
  • #13 0x00007f345e935320 in
  • #14 0x00007ffdd3ba9cf8 in
  • #15 0x00007ffdd3ba9d08 in
  • #16 CheckOverRecursedWithExtraInfo
  • #17 0x00007f347c976790 in
  • #18 0x00007f343db0f62b in
  • #19 0x0000000000000681 in
  • #20 0x00007ffdd3ba9d38 in
  • #21 0x0000000000000000 in

Comment 6 Cosimo Cecchi 2017-04-18 16:57:40 UTC
FWIW I remember seeing this second backtrace on Fedora as well (but not in the past couple of weeks).
Comment 7 Philip Chimento 2017-04-18 17:42:05 UTC
(In reply to Christophe Fergeau from comment #4)
> Since I'm running on wayland, I'm not sure gnome-shell --replace is an
> option? Or is it? (never tried I think ;)

Right, you have to pick an Xorg session at the GDM login screen in order to do that.

Georges, Christophe, Cosimo, and anyone else hitting this bug: could you list anything about your system's configuration that might be relevant so we can see if there is anything in common? I still have not been able to observe this crash.

---

For anyone able to do some debugging, I will summarize the information that I need.

The update_heap_wrapper_weak_pointers() function operates on a std::set (weak_pointer_list) of ObjectInstance* pointers. ("priv" is an ObjectInstance*.) The ObjectInstance struct contains a keep_alive member which can sometimes be a weak pointer (that is, not protected from garbage collection). The JS garbage collector can move objects around, so if you want to keep a weak pointer, you have to receive a notification when it is moved around or garbage collected. That is what update_heap_wrapper_weak_pointers() does.

When the weak pointer is garbage collected, it becomes NULL and the ObjectInstance* that was keeping track of it, is removed from weak_pointer_list.

From debugging with Georges there seem to be two separate crashes happening.  The information that would most help, is backtraces and locals from the points where those situations came to be, instead of from the point where they crashed. That is where the rewind functionality of RR would be useful.

1)
An ObjectInstance* pointer inside weak_pointer_list is invalid (0x1 in Georges' case.)
What I need to know: backtrace of where in the code the "priv" (ObjectInstance*) pointer became invalid. Or, if it was always invalid, backtrace of where the invalid pointer was inserted into weak_pointer_list.

2)
An ObjectInstance* with its keep_alive member wrapping an invalid pointer remains inside weak_pointer_list. I suspect a weak pointer to an object that got garbage collected was not removed from weak_pointer_list.
What I need to know: the execution path through the loop in update_heap_wrapper_weak_pointers() at the time that pointer was garbage collected. (update_after_gc() should have returned true for that pointer, and that should have caused the entry in weak_pointer_list to get removed.)

One thing that I can also try is just throwing a mutex around weak_pointer_list, and you can see if that helps; I didn't think locking was needed since the update_heap_wrapper_weak_pointers() callback should be called in the main thread, but I could be wrong about that.
Comment 8 Philip Chimento 2017-04-19 04:23:58 UTC
Created attachment 350040 [details] [review]
WIP - js: Make weak pointer lists threadsafe

The mysterious crashes might be due to the weak pointer callback getting
called from a different thread.

In case that's the reason, I'm throwing this patch into the dark. Please
try it and let me know if it works.
Comment 9 Georges Basile Stavracas Neto 2017-04-19 13:37:19 UTC
Created attachment 350073 [details] [review]
Fix compile issue

The previous patch didn't compile. I'm attaching another patch with the fix.
Comment 10 Georges Basile Stavracas Neto 2017-04-19 13:41:09 UTC
(In reply to Philip Chimento from comment #8)

> In case that's the reason, I'm throwing this patch into the dark. Please
> try it and let me know if it works.

Thanks for the patch. It indeed avoids crashing GNOME Shell, but now it introduces a full session lock.

When that happens, the mouse pointer still behaves like if everything is working (i.e. it changes when hovering entries, etc), and appearently the keybindings still work too (I can change workspaces using the keyboard and the mouse pointer), but the rendering is frozen.
Comment 11 Olav Vitters 2017-04-19 16:28:53 UTC
Did Arch follow the advice given to distributor-list?

https://mail.gnome.org/archives/distributor-list/2017-March/msg00002.html
Comment 12 Georges Basile Stavracas Neto 2017-04-19 17:30:51 UTC
With Philip's patch, the system freezes at the mutex. Here's the backtrace:


Thread 1 (Thread 0x7efdfe265e80 (LWP 856))

  • #0 __lll_lock_wait
  • #1 pthread_mutex_lock
  • #2 __gthread_mutex_lock
    at /usr/include/c++/6.3.1/x86_64-pc-linux-gnu/bits/gthr-default.h line 748
  • #3 std::mutex::lock()
    at /usr/include/c++/6.3.1/bits/std_mutex.h line 103
  • #4 std::lock_guard<std::mutex>::lock_guard(std::mutex&)
    at /usr/include/c++/6.3.1/bits/std_mutex.h line 162
  • #5 wrapped_gobj_dispose_notify(gpointer, GObject*)
    at gi/object.cpp line 949
  • #6 0x00007efdfbbfbb0f in
  • #7 g_object_run_dispose
  • #8 clutter_actor_destroy
  • #9 clutter_actor_iter_destroy
  • #10 0x00007efdfc435638 in
  • #11 g_closure_invoke
  • #12 0x00007efdfbc0a296 in
  • #13 g_signal_emit_valist
  • #14 g_signal_emit
  • #15 0x00007efdfc43a83c in
  • #16 g_object_unref
  • #17 release_native_object(ObjectInstance*)
    at gi/object.cpp line 1261
  • #18 disassociate_js_gobject
    at gi/object.cpp line 1420
  • #19 update_heap_wrapper_weak_pointers(JSRuntime*, gpointer)
    at gi/object.cpp line 1343
  • #20 0x00007efdf6a501ac in
  • #21 0x00007efdf6a51192 in
  • #22 0x00007efdf6a52f38 in
  • #23 0x00007efdf6a538e0 in
  • #24 0x00007efdf6a53b2d in
  • #25 0x00007efdf6a53ef4 in
  • #26 gjs_schedule_gc_if_needed(JSContext*)
    at gjs/jsapi-util.cpp line 844
  • #27 gjs_call_function_value(JSContext*, JS::HandleObject, JS::HandleValue, JS::HandleValueArray const&, JS::MutableHandleValue)
    at gjs/jsapi-util.cpp line 719
  • #28 gjs_closure_invoke(GClosure*, JS::HandleValueArray const&, JS::MutableHandleValue)
    at gi/closure.cpp line 229
  • #29 closure_marshal(GClosure*, GValue*, guint, GValue const*, gpointer, gpointer)
    at gi/value.cpp line 273
  • #30 g_closure_invoke
  • #31 0x00007efdfbc09f82 in
  • #32 g_signal_emit_valist
  • #33 g_signal_emit
  • #34 0x00007efd97dc6c7a in
  • #35 0x00007efdfbef677c in
  • #36 g_main_context_dispatch
  • #37 0x00007efdfb91ea20 in
  • #38 g_main_loop_run
  • #39 meta_run
  • #40 main

Comment 13 Philip Chimento 2017-04-19 20:14:39 UTC
(In reply to Olav Vitters from comment #11)
> Did Arch follow the advice given to distributor-list?
> 
> https://mail.gnome.org/archives/distributor-list/2017-March/msg00002.html

Don't know! But my reply to that message is still valid - I'd like to find out what's in those patches, so that I can ensure in GJS's test suite that a platform's broken ICU isn't causing a broken GJS.
Comment 14 Philip Chimento 2017-04-19 20:24:46 UTC
(In reply to Georges Basile Stavracas Neto from comment #12)
> With Philip's patch, the system freezes at the mutex. Here's the backtrace:

That's very helpful, thanks. Two things become clear:

1) It freezes because the gobject dispose notify callback is trying to re-enter the lock, so that's likely to be the reason for the crashes in the first place: modifying the weak_pointer_list while it is being iterated over in the _same_ thread.

2) We don't have a test in GJS's test suite that exercises this case, because the tests run with no problem with that patch applied.

One approach for solving (1) would be to remove the garbage-collected ObjectInstance* pointers from weak_pointer_list during the iteration, but store them in a separate list that's local to the callback. Then release the lock and call disassociate_js_gobject() on the pointers in the separate list.
Comment 15 Georges Basile Stavracas Neto 2017-04-19 21:27:34 UTC
Created attachment 350094 [details] [review]
object: Avoid modifying weak_pointer_list while iterating it

I'm using this patch for an hour now, and didn't see Shell crashing anymore.
Comment 16 Philip Chimento 2017-04-19 21:44:00 UTC
Review of attachment 350094 [details] [review]:

Thank you so much for the patch! I'm so happy that you managed to find the cause.

::: gi/object.cpp
@@ +1153,3 @@
                                   gpointer   data)
 {
+    static std::set<ObjectInstance *> to_be_disassociated;

I would make this non-static. Instead, create a new list on each run of the function, then you can rely on it being destructed when the function ends, and you don't have to erase the elements from it.

Another suggestion would be to use a std::vector - the global list is a std::set for fast insertion and retrieval, but here you don't need to care about that, you want fast iteration in the second loop.

@@ +1171,3 @@
     }
+
+    for (auto iter = to_be_disassociated.begin(); iter != to_be_disassociated.end(); iter++) {

Since you don't need to erase the elements, you don't need to address them by an iterator, so you can make this much simpler with a C++ range for loop:

    for (ObjectInstance *priv : to_be_disassociated)
        disassociate_js_gobject(priv->gobj);

Or even store priv->gobj in the list instead of priv, so it would be more like

    for (GObject *gobj : to_be_disassociated)
        disassociate_js_gobject(gobj);
Comment 17 Georges Basile Stavracas Neto 2017-04-19 22:00:36 UTC
Created attachment 350095 [details] [review]
object: Avoid modifying weak_pointer_list while iterating it

Thanks for the review. Indeed, it looks much better this way.

Is the commit message good enough?
Comment 18 Philip Chimento 2017-04-19 22:19:48 UTC
Review of attachment 350095 [details] [review]:

+1
Comment 19 Georges Basile Stavracas Neto 2017-04-20 00:34:07 UTC
Comment on attachment 350095 [details] [review]
object: Avoid modifying weak_pointer_list while iterating it

Thank you so much for the review. I'm leaving this bug open as it still needs to go to gnome-3-24 branch.

Attachment 350095 [details] pushed as 48997c8 -object:  Avoid modifying weak_pointer_list while iterating it
Comment 20 Philip Chimento 2017-04-20 05:39:52 UTC
Pushed to gnome-3-24.

I thought about adding a test, but I don't see how to consistently cause the situation from Georges' backtrace.
Comment 21 Philip Chimento 2017-04-20 05:40:25 UTC
Review of attachment 350040 [details] [review]:

Not needed.
Comment 22 Philip Chimento 2017-04-20 05:41:11 UTC
Review of attachment 350073 [details] [review]:

Not needed.