GNOME Bugzilla – Bug 670200
Hang during garbage collection
Last modified: 2013-04-17 07:05:28 UTC
Created attachment 207744 [details] debug info OS: Fedora 16, GNOME Shell 3.2.2.1, js-1.8.5-7 I had pretty much just started up my computer, interacted with the network-manager menu (connected to a VPN) and then opened the 'network settings' dialog. At this point the shell process hung during a garbage collection. I gave it several minutes to recover and nothing happened.
Are you sure that it was hung during garbage collection - that it wasn't doing stuff, and then garbage collection, and doing more stuff and then garbage collecting, ...? (Unfortunately, the two choices here are; A) Memory corruption at some earlier part of the program caused GC to go into a loop B) you were in some infinite loop which we have no data about either way, it's very hard to proceed at this point unless it happens repeatedly.)
I took several backtraces and they were always at the same point, and 'next' executed in the above trace simply hung (I guess still blocking on its mutex). It's the first time I've seen this problem and I've been using this system for a few months, so odds of recurring sadly are probably low
It happened again, this time with --gdm-mode. My impression is that the JS runtime threading is getting confused. We have one worker thread which is idle (blocking on its 'wakeup' cond). Meanwhile the main loop is inside AutoGCSession::AutoGCSession() in this block of code: /* * Discount the request on the current thread from contributing to * rt->requestCount before we wait for all other requests to finish. * JS_NOTIFY_REQUEST_DONE, which will wake us up, is only called on * rt->requestCount transitions to 0. */ size_t requestDebit = cx->thread->data.requestDepth ? 1 : 0; JS_ASSERT(requestDebit <= rt->requestCount); if (requestDebit != rt->requestCount) { rt->requestCount -= requestDebit; do { JS_AWAIT_REQUEST_DONE(rt); } while (rt->requestCount > 0); rt->requestCount += requestDebit; } I ran the following to test if we are really blocking on the cond: (gdb) break jsgc.cpp:2669 Breakpoint 1 at 0x38204afa96: file jsgc.cpp, line 2670. (2 locations) (gdb) cont Continuing. and it never woke up again, so we must really be stuck. More interestingly, this shows something definitely isn't right with the JS runtime: (gdb) print rt->requestCount $3 = 4294967295 (gdb) print requestDebit $1 = 1 (gdb) print cx->thread->data.requestDepth $3 = 4294967295 I'm not sure what a 'request' is in this context.
It's possible the bug lies in gjs - if we have a mismatched pair of JS_BeginRequest()/JS_EndRequest(), it might result in this symptom.
*potentially* a dupe of https://bugzilla.gnome.org/show_bug.cgi?id=678378 ...
oh I ran into this yesterday and discovered the same thing mentioned in comment 3 and came up with the same theory mentioned in comment 4. I did some code inspection and found cases where we were missing JS_EndRequest, but I didn't find cases were missing JS_BeginRequest (which seems like a more likely cause given how high the numbers are). I'm going to try doing a JS build today with asserts enabled to see if that will smoke out the issue. I'll also attach a patch for the missing EndRequest calls once I write one.
So I've spent a few days looking into this and it's a pretty tricky problem. First, this problem has been aggravated recently because of a bug in the background code where we don't clean up the settings object associated with a background when the background actor is destroyed. I've put a fix for that in bug 697119. It turns out the problem isn't mismatched JS_EndRequest() calls after all (though like I said earlier there are some missing JS_EndRequest calls that i'll post a patch for). It's because we're calling JS_BeginRequest/JS_EndRequest from multiple threads. spidermonkey only allows for JS calls to be made from one thread at a time. We normally keep javascript to one thread, but there is a case where we sometimes don't: toggle reference notification. Basically, GJS allocates a JSObject for every GObject. The lifetime of of the JSObject is tied to the lifetime of the GObject. As long as the JSObject is alive, we want to force the GObject to stay alive (since the javascript code might be using the GObject via the JSObject). We also want to force the JSObject to stay alive as long as the GObject is alive. That's because if javascript calls into C to regain access to an object, it should get the same JSObject back from a previous call so that any data stored on it earlier isn't lost [say actor.get_children() should always return the same JSObject wrappers for the children, so if code does this.actor._delegate=this, it can get the delegate back). We can force the JSObject to stay alive as long the GObject is alive, by "rooting" the JSObject to the JS global object. We can force the GObject to stay alive as long as the JSObject is alive by taking a reference on the GObject. Doing both things together is effectively a reference cycle, though. If the JSObject owns a reference to the GObject it will never die, and so the JSObject will never get unrooted and die, so the GObject will never get unreferenced and die. This cycle problem is addressed using a "toggle reference". GObject notifies GJS via a callback function when no one other than javascript is referencing the object. It also notifies GJS via a callback function when something other than javascript starts referencing the object. The idea is when javascript is the only thing referencing the GObject, we can "unroot" the JSObject from the global object. Then the only thing keeping the JSObject alive is the javascript code itself. When the javascript code stops referencing the object it will die the next time the garbage collector runs. Likewise, when GJS gets notified that something other javascript is referencing the gobject we can re-root the jsobject to the global jsobject and force it stay alive even if no javascript code is referencing it anymore. The problem is these callbacks can be run from any thread. unrooting and rerooting the jsobject to the global object is something that can only be done when javascript isn't being used from other threads. And it really can't be done when the garbage collector is running in another thread. It's tempting to defer toggle notification to the main thread (with an idle function or g_main_context_invoke). The problem is, if the JSObject needs to be rooted to the global object, then deferring to idle could mean it gets garbage collected before the rooting takes hold. We could turn off the garbage collector, but doing that requires using javascript functions which we can only call from the main thread. So we're sort of stuck. One thing we could do is block the toggle notification from proceeding until garbage collection finishes (if it's running). That would help somewhat, but isn't a complete answer.
okay, looking in jsgc.cpp I see this: /*• * Let the API user decide to defer a GC if it wants to (unless this• * is the last context). Invoke the callback regardless. Sample the• * callback in case we are freely racing with a JS_SetGCCallback{,RT}• * on another thread.• */• if (JSGCCallback callback = rt->gcCallback) {• if (!callback(cx, JSGC_BEGIN) && gckind != GC_LAST_CONTEXT)• return;• }• So we can suspend the GC from running by returning false from our GC callback. We can also suspend the toggle notification callback from starting until the GC finishes (if it's already running) by locking a mutex in the GC callback and trying to lock the mutex in the notification callback. So, in summary: 1) have the GC callback check a boolean to see if we want the GC suspended. If so, return FALSE 2) If not, take a lock while the GC is running 3) From the toggle notification callback try to take the same lock. 4) Suspend the GC. 5) Queue an idle to process toggle notification and return 6) When idle runs, process toggle notification and then resume GC That should be sufficient to fix this bug, I think.
It is not true (according to the documentation) that we can only use BeginRequest from the main thread. What is true is that you cannot call BeginRequest on the same context from different threads. So a simpler solution would be to have two different contexts: one for normal JS executions, and one for foreign JS toggle references. Note that refcount can go from 1 to >1 only after a JS call, so only in the main thread. Which means that a toggle notify on a secondary thread will never cause us to root an object, and therefore this flow is safe.
I talked with Giovanni about this on IRC. Upshot is I'm going to try comment 8 instead of comment 9, since that has the potential to be more js1.7 friendly (although not completely js1.7 friendly since the callback mentioned in comment 8 changes signature.
Created attachment 240426 [details] [review] jsapi-util,object: Add missing JS_EndRequest calls There are a few early return cases where a JS_BeginRequest isn't getting paired with JS_EndRequest. This commit fixes those cases.
Created attachment 240427 [details] [review] value: handle argc == 0 in closure_marshal It's possible for a closure to have 0 arguments. When this happens we end up calling alloca(0) which is undefined. This commit protects against argc == 0. (concretely, I was seeing this happen from g_closure_invoke(clsoure, &result_value, 0, NULL, NULL) in glib's gsourceclosure.c: source_closure_callback function)
Created attachment 240428 [details] [review] gi: fix compile error This commit fixes a typo that can only see when compiling gjs with extra debugging.
Review of attachment 240428 [details] [review]: OK.
Review of attachment 240427 [details] [review]: OK.
Review of attachment 240426 [details] [review]: Hm, I'm not sure I like the flow in some of them, but I think the alternatives would be worse, so OK.
Attachment 240426 [details] pushed as 27a1395 - jsapi-util,object: Add missing JS_EndRequest calls Attachment 240427 [details] pushed as a026cc5 - value: handle argc == 0 in closure_marshal Attachment 240428 [details] pushed as e2d80eb - gi: fix compile error
Created attachment 240544 [details] [review] context: always set GC callback Right now we only set the GC callback if gc-notifications are enabled. This commit makes us always set the callback as prep work toward fixing a GC deadlock.
Created attachment 240545 [details] [review] object: defer toggle notifications to idle We don't want to be mucking with javascript from two threads at once. This commit defers toggle notification processing to the main thread.
Created attachment 240546 [details] [review] object: run toggle notifies directly, when possible There are a number of cases where toggle notifies run that we can get away with handling them directly. This commit does that, to minimize the number of idle functions getting queued.
I talked with Owen and Colin today and we came up with a slightly different solution to comment 8, which i've posted in the above three patches. The basic idea is 1) defer toggle notifications to idle when necessary 2) when toggling up from 1->2 references, take a reference of our own to prevent it from toggling back down (and potentially up again) before we can deal with it 3) accept that JSObjects associated with a gobject may get garbage collected before we have a chance to root them. This can happen if an object is already on the chopping block waiting to be GC'd and some internal library code refs it (like GDBusConnection's weak reference to the system bus) 4) handle toggle notifications immediately when possible for efficiency reasons.
Created attachment 240548 [details] [review] One other strange issue There's one other strange thing i've run across. If i manually force a GC, I sometimes get this assertion failure: at jsutil.cpp:83 This patch replaces the macros with what they do under the covers and it "fixes" it. I haven't yet figured out why.
hmm, that trace didn't make it through git bz:
+ Trace 231722
Comment on attachment 240548 [details] [review] One other strange issue This ended up being because I build spidermonkey with -DDEBUG but didn't build gjs with -DDEBUG
(this is against the gnome-3-8 branch, not master since I don't have spidermonkey 17 yet)
walters gave me a shiny spidermonkey 17 rpm, so i rebased it. will post.
Created attachment 240649 [details] [review] context: always set GC callback Right now we only set the GC callback if gc-notifications are enabled. This commit makes us always set the callback as prep work toward fixing a GC deadlock.
Created attachment 240650 [details] [review] object: defer toggle notifications to idle We don't want to be mucking with javascript from two threads at once. This commit defers toggle notification processing to the main thread.
Created attachment 240651 [details] [review] object: run toggle notifies directly, when possible There are a number of cases where toggle notifies run that we can get away with handling them directly. This commit does that, to minimize the number of idle functions getting queued.
Review of attachment 240649 [details] [review]: Looks correct modulo one question: ::: gjs/context.c @@ +878,3 @@ GjsContext *gjs_context = data; gjs_context->idle_emit_gc_id = 0; (Similarly would have to grab a mutex here) @@ +896,3 @@ + if (gjs_context->gc_notifications_enabled) { + if (gjs_context->idle_emit_gc_id == 0) + gjs_context->idle_emit_gc_id = g_idle_add (gjs_context_idle_emit_gc, gjs_context); This is not a new bug, but I believe gjs_context here needs to be protected with a mutex - otherwise we face a possible race with this callback being triggered from the GC thread, and racing with the idle in the main thread?
Review of attachment 240650 [details] [review]: ::: gi/object.c @@ +906,3 @@ + idle_id = g_idle_add(idle_handle_toggle_down, operation); + g_object_set_data (operation->gobj, + "gjs-toggle-down-notify", Should use _set_qdata(), it avoids this function having to do the quark lookup each time. @@ +907,3 @@ + g_object_set_data (operation->gobj, + "gjs-toggle-down-notify", + GUINT_TO_POINTER(idle_id)); But honestly, I don't quite understand this. If we go e.g. up -> down -> up, there will be two up idles queued, right? But then at object finalize, we're only removing one of them? Won't that crash? @@ +911,2 @@ } else { + guint idle_id; Could hoist idle_id from both paths. @@ +923,2 @@ */ + g_object_ref(gobj); Is there a reason we're doing the ref here outside the lock? Were we to do it inside the lock, the locking calls could be hoisted outside the conditional. @@ +1176,3 @@ + g_mutex_lock(&toggle_notify_lock); + idle_id = GPOINTER_TO_UINT(g_object_get_data(priv->gobj, + "gjs-toggle-down-notify")); g_object_steal_qdata() will take care of access and removing it atomically. @@ +1185,3 @@ + + idle_id = GPOINTER_TO_UINT(g_object_get_data(priv->gobj, + "gjs-toggle-up-notify")); Ditto.
Review of attachment 240651 [details] [review]: Will wait to review this one until previous dependency is ready.
Review of attachment 240649 [details] [review]: ::: gjs/context.c @@ +896,3 @@ + if (gjs_context->gc_notifications_enabled) { + if (gjs_context->idle_emit_gc_id == 0) + gjs_context->idle_emit_gc_id = g_idle_add (gjs_context_idle_emit_gc, gjs_context); yea i guess you could end up in a race where idle_emit_gc_id gets zeroified from the idle right after getting set in the gc callback
Review of attachment 240650 [details] [review]: ::: gi/object.c @@ +907,3 @@ + g_object_set_data (operation->gobj, + "gjs-toggle-down-notify", + GUINT_TO_POINTER(idle_id)); we can't go from up to down. we ref the object when we go up, so we can't go back down again. @@ +923,2 @@ */ + g_object_ref(gobj); well the lock is only needed to prevent the idle handler from running before the idle id is set on the object. I think it's better to keep the locked region more compact than to be able to shrink the function, personally. But if you feel strongly, i'm not opposed to changing it. I had thought about moving the idle+set_data to a separate helper function. That would make this function shrink. I'm cool with whatever.
Created attachment 240665 [details] [review] context: lock access to GC idle There's a small race where the GC notification idle id might get zeroed before the idle runs, if the GC callback is run from a non-main thread. That means there's the potential for the GC notification idle to get dispatched after the context is finalized. This commit fixes that by protecting access to the idle id with a lock.
Created attachment 240666 [details] [review] context: always set GC callback Right now we only set the GC callback if gc-notifications are enabled. This commit makes us always set the callback as prep work toward fixing a GC deadlock.
Created attachment 240667 [details] [review] object: defer toggle notifications to idle We don't want to be mucking with javascript from two threads at once. This commit defers toggle notification processing to the main thread.
Created attachment 240668 [details] [review] object: run toggle notifies directly, when possible There are a number of cases where toggle notifies run that we can get away with handling them directly. This commit does that, to minimize the number of idle functions getting queued.
(I haven't had an opportunity to thoroughly test the latest patchset, but I have to go and wanted to post it before I left)
Review of attachment 240665 [details] [review]: Right.
Review of attachment 240666 [details] [review]: Looks right.
Review of attachment 240667 [details] [review]: A few more comments...looking closer to right to me. ::: gi/object.c @@ +778,3 @@ +{ + g_object_steal_qdata(gobj, + g_quark_from_static_string("gjs-toggle-down-notify")); Ah...you're kind of missing the point of the quark stuff =) You're supposed to cache the GQuark as a static in a good place like class_init or at worst a GOnce. "git grep g_quark_from_static_string" in e.g. GLib to see some users like gobject/gbinding.c. @@ +810,3 @@ + if (idle_id != 0) + g_source_remove (idle_id); +} You could unify these two functions and have them take the GQuark to remove. @@ +873,3 @@ + ToggleRefNotifyOperation *operation = data; + + clear_toggle_down_idle_id(operation->gobj); When we call these, don't we need to be holding the toggle notify lock? @@ +886,3 @@ + ToggleRefNotifyOperation *operation = data; + + clear_toggle_up_idle_id(operation->gobj); And here?
Review of attachment 240668 [details] [review]: This looks reasonable to me...definitely tricky code, but I can't think of anything wrong with this.
Hi, (In reply to comment #42) > Ah...you're kind of missing the point of the quark stuff =) You're supposed to > cache the GQuark as a static in a good place like class_init or at worst a > GOnce. It was my understanding that quarks created from static strings were very cheap (essentially return (GQuark) static_string;) Looking quickly, that understanding is wrong. I'll change it. > @@ +810,3 @@ > + if (idle_id != 0) > + g_source_remove (idle_id); > +} > > You could unify these two functions and have them take the GQuark to remove. sure, why not. > @@ +873,3 @@ > + ToggleRefNotifyOperation *operation = data; > + > + clear_toggle_down_idle_id(operation->gobj); > > When we call these, don't we need to be holding the toggle notify lock? I don't think so. We'll never have two toggle down idles dispatched at the same time. If the GC is running in a non-default thread and finalizes a JSObject then we will call cancel_pending_toggle_down_idle_id, but that's harmless to call if the idle is already running. Actually thinking about it more, that case can't actually happen since the JSObject is rooted to the keep-alive object until the idle dispatches. So really cancel_pending_toggle_down_idle_id should be an assert. > @@ +886,3 @@ > + ToggleRefNotifyOperation *operation = data; > + > + clear_toggle_up_idle_id(operation->gobj); > > And here? Again, we can't have two idles dispatched at the same time, so that aspect of needing the lock isn't a concern. But we do have a problem where a JSObject could be getting garbage collected on a secondary thread while being used in the main thread. hmm. The only way we could prevent that from happening is to suspend the gc from running until the idle is dispatched, and block the idle from running while the GC is running.
Review of attachment 240665 [details] [review]: ::: gjs/context.c @@ +883,2 @@ gjs_context->idle_emit_gc_id = 0; + g_mutex_lock(&gc_idle_lock); this should say unlock
Created attachment 240764 [details] [review] context: lock access to GC idle There's a small race where the GC notification idle id might get zeroed before the idle runs, if the GC callback is run from a non-main thread. That means there's the potential for the GC notification idle to get dispatched after the context is finalized. This commit fixes that by protecting access to the idle id with a lock.
Created attachment 240765 [details] [review] context: always set GC callback Right now we only set the GC callback if gc-notifications are enabled. This commit makes us always set the callback as prep work toward fixing a GC deadlock.
Created attachment 240766 [details] [review] object: defer toggle notifications to idle We don't want to be mucking with javascript from two threads at once. This commit defers toggle notification processing to the main thread, and introduces a mutex to prevent the garbage collector and toggle notifications from running at the same time.
Created attachment 240767 [details] [review] object: run toggle notifies directly, when possible There are a number of cases where toggle notifies run that we can get away with handling them directly. This commit does that, to minimize the number of idle functions getting queued.
(In reply to comment #44) > > When we call these, don't we need to be holding the toggle notify lock? > I don't think so. We'll never have two toggle down idles dispatched at the > same time. If the GC is running in a non-default thread and finalizes a > JSObject then we will call cancel_pending_toggle_down_idle_id, but that's > harmless to call if the idle is already running. Basically when I saw the mutex, I went looking for the "other half" - what's the code that races with this one? Oh, are we protecting against e.g. the GDBus thread bouncing it up at the same time another thread is decrementing it, i.e. two threads calling into g_object_set_data() at once? But that's actually a threadsafe API, so even that's not a concern. So given that, do we need the mutex here at all? > Again, we can't have two idles dispatched at the same time, so that aspect of > needing the lock isn't a concern. But we do have a problem where a JSObject > could be getting garbage collected on a secondary thread while being used in > the main thread. hmm. The only way we could prevent that from happening is to > suspend the gc from running until the idle is dispatched, and block the idle > from running while the GC is running. We discussed this one, and were going to accept the "dewrappering" right? I think the way we'd have to do this correctly is something like a "destroyed" flag in the priv object. Then when the JSObject gets destroyed, if we have an idle pending, it holds a ref on priv. When we fire, we check the destroyed flag, if it's set, do nothing. Then drop the ref on priv.
(In reply to comment #50) > Basically when I saw the mutex, I went looking for the "other half" - what's > the code that races with this one? > So given that, do we need the mutex here at all? The point of the mutex is because we do foo = g_idle_add(...) set_data(foo); and inside the idle handler unset the data. The idle handler may run before g_idle_add returns. If it unsets the data before the data gets set, then the data will stay around after the idle handler runs. > We discussed this one, and were going to accept the "dewrappering" right? Right, the dewrappering is unavoidable, and not even really that problematic in practice. But what we can't do is try to root the JSObject in the idle handler while the JSObject is getting GC'd. That means running javascript from two threads at once, and that means mucking with an already freed object. I'm saying we accept the dewrappering, but we have to know when it happens and do the right thing. > is two threads calling JS at the same time. > think the way we'd have to do this correctly is something like a "destroyed" > flag in the priv object. The priv object is off limits from the second thread (since it requires calling Javascript functions to get at) I think the patchset i've posted right before lunch should work okay, but I'm going to do read through of it and some more testing. Let me know what you think.
Review of attachment 240765 [details] [review]: Right.
Review of attachment 240764 [details] [review]: Still right.
Review of attachment 240766 [details] [review]: we talked on irc about this, and decided two things. ::: gi/object.c @@ +929,3 @@ + ToggleRefNotifyOperation *operation = data; + + if (!clear_toggle_idle_id(operation->gobj, operation->direction)) { if we go with the mutex approach we really do need to block on it here or it's not doing anything. @@ +998,3 @@ + qdata_key = get_qdata_key_for_toggle_direction(direction); + + idle_id = g_idle_add_full(G_PRIORITY_HIGH, but we can avoid the mutex all together by using g_idle_source_new
Review of attachment 240767 [details] [review]: ::: gi/object.c @@ +1049,3 @@ */ + if (gc_blocked) + handle_toggle_down(context, gobj); we could potentially end up calling this before queued idles. we need to maintain ordering.
Created attachment 240798 [details] [review] object: defer toggle notifications to idle We don't want to be mucking with javascript from two threads at once. This commit defers toggle notification processing to the main thread, and introduces a mutex to prevent the garbage collector and toggle notifications from running at the same time.
Created attachment 240799 [details] [review] object: run toggle notifies directly, when possible There are a number of cases where toggle notifies run that we can get away with handling them directly. This commit does that, to minimize the number of idle functions getting queued.
Review of attachment 240798 [details] [review]: Ok, this makes more sense to me.
Review of attachment 240799 [details] [review]: I'll admit to not running through all the cases here, but this looks on the surface right. Let's go with it.
Attachment 240764 [details] pushed as d112a71 - context: lock access to GC idle Attachment 240765 [details] pushed as 7ab5e0e - context: always set GC callback Attachment 240798 [details] pushed as 50a3d86 - object: defer toggle notifications to idle Attachment 240799 [details] pushed as 0adcb39 - object: run toggle notifies directly, when possible
*** Bug 697170 has been marked as a duplicate of this bug. ***
Follow up bug: https://bugzilla.gnome.org/show_bug.cgi?id=697436
Any news on when the patch for this bug will be pushed to the Ubuntu Quantal (12.10) PPA? It affects me ~3 times a day.
not sure. (I don't think anyone on the CC list of this bug works on ubuntu or the quantal ppa) Maybe ask in launchpad?
Can you release new version for GNOME 3.8.1 ?