GNOME Bugzilla – Bug 781194
Intermittent crash in gnome-shell, probably in weak pointer updating code
Last modified: 2017-04-20 05:41:11 UTC
I have received some reports of this crash in gnome-shell with GJS 1.48.0. It seems to happen often for some people, sporadically for others, or not at all for still others (me). Since I haven't been able to reproduce it yet, I need some help tracking it down. If you have seen this crash, please try to get debugging symbols for libmozjs and GJS, and run a session of gnome-shell under GDB, or preferably, RR [1]. It looks like the crash happens when passing a null pointer to JS_UpdateWeakPointerAfterGC. If you can reproduce the crash, here's the info that would be useful to me: - A backtrace with debugging symbols - Any other information about the ObjectInstance struct that owns the null pointer - (with RR) when the pointer became null but wasn't removed from the weak_pointer_list set [1] http://rr-project.org/
Backtrace below:
+ Trace 237354
Actually, I forgot the header:
+ Trace 237355
From Georges' investigation I see the problem is not passing a null pointer to JS_UpdateWeakPointerAfterGC(). Instead it's that there's a null pointer stored in weak_pointer_list. That is, "priv" is null in this line: https://git.gnome.org/browse/gjs/tree/gi/object.cpp?h=1.48.1&id=c418799bff7ee9617e39648b69933ba6f695dee2#n1327 What would be incredibly useful is to find out where and how that null pointer got stored in there. I would suggest doing this with RR. It would go approximately like this: $ rr record gnome-shell --replace ...wait for crash... $ rr replay (rr) cont ...wait for crash... (rr) print iter 0x1027a30 or something (rr) watch -l iter (rr) reverse-cont That should land you at the place where the null ObjectInstance pointer was stored in weak_pointer_list.
For what it's worth, I believe I'm seeing a similar crash about once a day since switching to fedora 26. I saw at least 2 different backtraces when this happened, but both were involving update_heap_wrapper_weak_pointers. Since I'm running on wayland, I'm not sure gnome-shell --replace is an option? Or is it? (never tried I think ;) One of the backtrace I collected:
+ Trace 237374
The other backtrace I saw:
+ Trace 237375
FWIW I remember seeing this second backtrace on Fedora as well (but not in the past couple of weeks).
(In reply to Christophe Fergeau from comment #4) > Since I'm running on wayland, I'm not sure gnome-shell --replace is an > option? Or is it? (never tried I think ;) Right, you have to pick an Xorg session at the GDM login screen in order to do that. Georges, Christophe, Cosimo, and anyone else hitting this bug: could you list anything about your system's configuration that might be relevant so we can see if there is anything in common? I still have not been able to observe this crash. --- For anyone able to do some debugging, I will summarize the information that I need. The update_heap_wrapper_weak_pointers() function operates on a std::set (weak_pointer_list) of ObjectInstance* pointers. ("priv" is an ObjectInstance*.) The ObjectInstance struct contains a keep_alive member which can sometimes be a weak pointer (that is, not protected from garbage collection). The JS garbage collector can move objects around, so if you want to keep a weak pointer, you have to receive a notification when it is moved around or garbage collected. That is what update_heap_wrapper_weak_pointers() does. When the weak pointer is garbage collected, it becomes NULL and the ObjectInstance* that was keeping track of it, is removed from weak_pointer_list. From debugging with Georges there seem to be two separate crashes happening. The information that would most help, is backtraces and locals from the points where those situations came to be, instead of from the point where they crashed. That is where the rewind functionality of RR would be useful. 1) An ObjectInstance* pointer inside weak_pointer_list is invalid (0x1 in Georges' case.) What I need to know: backtrace of where in the code the "priv" (ObjectInstance*) pointer became invalid. Or, if it was always invalid, backtrace of where the invalid pointer was inserted into weak_pointer_list. 2) An ObjectInstance* with its keep_alive member wrapping an invalid pointer remains inside weak_pointer_list. I suspect a weak pointer to an object that got garbage collected was not removed from weak_pointer_list. What I need to know: the execution path through the loop in update_heap_wrapper_weak_pointers() at the time that pointer was garbage collected. (update_after_gc() should have returned true for that pointer, and that should have caused the entry in weak_pointer_list to get removed.) One thing that I can also try is just throwing a mutex around weak_pointer_list, and you can see if that helps; I didn't think locking was needed since the update_heap_wrapper_weak_pointers() callback should be called in the main thread, but I could be wrong about that.
Created attachment 350040 [details] [review] WIP - js: Make weak pointer lists threadsafe The mysterious crashes might be due to the weak pointer callback getting called from a different thread. In case that's the reason, I'm throwing this patch into the dark. Please try it and let me know if it works.
Created attachment 350073 [details] [review] Fix compile issue The previous patch didn't compile. I'm attaching another patch with the fix.
(In reply to Philip Chimento from comment #8) > In case that's the reason, I'm throwing this patch into the dark. Please > try it and let me know if it works. Thanks for the patch. It indeed avoids crashing GNOME Shell, but now it introduces a full session lock. When that happens, the mouse pointer still behaves like if everything is working (i.e. it changes when hovering entries, etc), and appearently the keybindings still work too (I can change workspaces using the keyboard and the mouse pointer), but the rendering is frozen.
Did Arch follow the advice given to distributor-list? https://mail.gnome.org/archives/distributor-list/2017-March/msg00002.html
With Philip's patch, the system freezes at the mutex. Here's the backtrace:
+ Trace 237379
Thread 1 (Thread 0x7efdfe265e80 (LWP 856))
(In reply to Olav Vitters from comment #11) > Did Arch follow the advice given to distributor-list? > > https://mail.gnome.org/archives/distributor-list/2017-March/msg00002.html Don't know! But my reply to that message is still valid - I'd like to find out what's in those patches, so that I can ensure in GJS's test suite that a platform's broken ICU isn't causing a broken GJS.
(In reply to Georges Basile Stavracas Neto from comment #12) > With Philip's patch, the system freezes at the mutex. Here's the backtrace: That's very helpful, thanks. Two things become clear: 1) It freezes because the gobject dispose notify callback is trying to re-enter the lock, so that's likely to be the reason for the crashes in the first place: modifying the weak_pointer_list while it is being iterated over in the _same_ thread. 2) We don't have a test in GJS's test suite that exercises this case, because the tests run with no problem with that patch applied. One approach for solving (1) would be to remove the garbage-collected ObjectInstance* pointers from weak_pointer_list during the iteration, but store them in a separate list that's local to the callback. Then release the lock and call disassociate_js_gobject() on the pointers in the separate list.
Created attachment 350094 [details] [review] object: Avoid modifying weak_pointer_list while iterating it I'm using this patch for an hour now, and didn't see Shell crashing anymore.
Review of attachment 350094 [details] [review]: Thank you so much for the patch! I'm so happy that you managed to find the cause. ::: gi/object.cpp @@ +1153,3 @@ gpointer data) { + static std::set<ObjectInstance *> to_be_disassociated; I would make this non-static. Instead, create a new list on each run of the function, then you can rely on it being destructed when the function ends, and you don't have to erase the elements from it. Another suggestion would be to use a std::vector - the global list is a std::set for fast insertion and retrieval, but here you don't need to care about that, you want fast iteration in the second loop. @@ +1171,3 @@ } + + for (auto iter = to_be_disassociated.begin(); iter != to_be_disassociated.end(); iter++) { Since you don't need to erase the elements, you don't need to address them by an iterator, so you can make this much simpler with a C++ range for loop: for (ObjectInstance *priv : to_be_disassociated) disassociate_js_gobject(priv->gobj); Or even store priv->gobj in the list instead of priv, so it would be more like for (GObject *gobj : to_be_disassociated) disassociate_js_gobject(gobj);
Created attachment 350095 [details] [review] object: Avoid modifying weak_pointer_list while iterating it Thanks for the review. Indeed, it looks much better this way. Is the commit message good enough?
Review of attachment 350095 [details] [review]: +1
Comment on attachment 350095 [details] [review] object: Avoid modifying weak_pointer_list while iterating it Thank you so much for the review. I'm leaving this bug open as it still needs to go to gnome-3-24 branch. Attachment 350095 [details] pushed as 48997c8 -object: Avoid modifying weak_pointer_list while iterating it
Pushed to gnome-3-24. I thought about adding a test, but I don't see how to consistently cause the situation from Georges' backtrace.
Review of attachment 350040 [details] [review]: Not needed.
Review of attachment 350073 [details] [review]: Not needed.