GNOME Bugzilla – Bug 653833
Gnome shell hangs after fast user switch [intel i915]
Last modified: 2012-03-19 03:00:02 UTC
After swtiching users (either with Ctrl+Alt+F? or the Switch user menu item) gnome shell sometimes hangs (clicking on the User menu displays an underline, but the menu does not appear). Alt+F2 then 'r' <Enter> restarts the shell most of the time, occasionally I am forced to do a Ctrl+Alt+Backspace to restart X. This is the info about my graphics driver from lspci, 00:02.0 VGA compatible controller: Intel Corporation Core Processor Integrated Graphics Controller (rev 12) (prog-if 00 [VGA controller]) Subsystem: Acer Incorporated [ALI] Device 0297 Flags: bus master, fast devsel, latency 0, IRQ 43 Memory at fb800000 (64-bit, non-prefetchable) [size=4M] Memory at d0000000 (64-bit, prefetchable) [size=256M] I/O ports at ec00 [size=8] Expansion ROM at <unassigned> [disabled] Capabilities: <access denied> Kernel driver in use: i915 Kernel modules: i915
Might be related to bug 651378 ?
Not sure if it would be related to 651378, as it never recovers. I did see this in the .xsession-errors file, (gnome-shell:14421): Clutter-CRITICAL **: clutter_actor_queue_relayout: assertion `CLUTTER_IS_ACTOR (self)' failed ** Message: Error: Stream contains no data. gsttypefindelement.c(954): gst_type_find_element_activate (): /GstPlayBin2:play/GstURIDecodeBin:uridecodebin0/GstDecodeBin2:decodebin20/GstTypeFindElement:typefind: Can't typefind empty stream
I think we worked this out to being a 16-minute animation timeout, if you are seeing the same issue. /usr/share/gnome-shell/js/ui/tweener.js has a 1000*1000 number in it, try bumping t that up a long way and see if it reoccurs.
I experience the same issue. Can anyone confirm if the proposed solution is valid? Will this be permanently fixed in future releases?
(I forgot to mention:) And if the solutions works, then please describe it in a more elaborate way. I did not manage make the change.
Just open /usr/share/gnome-shell/js/ui/tweener.js using a text editor as root, and replace "1000*1000" with e.g. "2", and check this fixes the bug (after restarting the Shell). But Dave, why do you think that's the problem?
Because myself and Ray worked it out one day on irc. The timeout occurs while gnome-shell is vt switched and is blocked from rendering, something gets badly stuck, you want to change it to a much larger number just to make it take days instead. Ray said he'd fix it someday :-)
Ah, so my explanation is wrong, it should be "replace 1000*1000 with 10000000000". Thanks for the details.
yea, i gotta look into this...
Someone else hit this today so I took a quick look at this. I wrote a patch that I think should fix it, though, I haven't tried to reproduce the problem yet to confirm the patch works. I'll post the patch now, so I don't lose it, and test it later.
Created attachment 207426 [details] [review] tweener: clean up some dead code The tweener code maintains a _startTime property that it no long uses and also has a lingering comment about a currentTime property that has long since been removed. This commit just mops up that remnant cruft.
Created attachment 207427 [details] [review] tweener: don't blow up after 16 minutes of blocked animations Clutter.Timeline requires a duration be passed to it at construct time. Once that duration has elapsed, the timeline completes and any associated animations stop. Tweener uses a Clutter.Timeline, but doesn't really care about this duration. It has its own logic for determining when to stop the animations it's driving. The timeline object is merely being used as a source for getting regular "ticks" to know when to draw the next frame. As far as the tweener code is concerned, the only requirement on duration is that it is at least as long as the animations being tweened. Given the specific duration of the timeline isn't that important, it was arbitrarily picked to be a 1000 seconds. Under normal circumstances, animations should be less than a second, so this is a more than reasonable upper bound. There are circumstances where the timeline can be blocked longer than 1000 seconds, however. For instance, if the user's laptop is suspended, it may stay suspended indefinitely long. Likewise, if the machine is fast user switched to another X server, the DRI lock will prevent clutter from drawing until the user swiches back. In either of these cases, the clutter timeline may stop ticking before the tweener animation completes and the shell will stall forever waiting for its pending animation to finish. This commit gives the tweener code one last tick in the event the timeline expires, so that it can finish off any pending animations.
I seem to have the same issue with gnome-shell 3.2.2.1 ("Gallium 0.4 on AMD CEDAR", ATI Radeon). It does not hang forever, though; it recovers after 20-30 seconds. Applying patches #11 & #12 did not fix the issue. Am I hitting the same bug or a different one?
(Forgot to include: this happens whenever I switch VTs or suspend the system; and after returning gnome-shell *always* hangs, no exceptions.)
Sounds different. comment 13 almost sounds like bug 651378 but that's fixed I think.
Comment on attachment 207427 [details] [review] tweener: don't blow up after 16 minutes of blocked animations <airlied> halfline: just got hung gdm login on f-u-s with new package <halfline> airlied: bummer <halfline> airlied: was this after more than 16 minutes ? <airlied> halfline: yeah it was definitely > 16 mins <airlied> I switched user a few hours ago <airlied> then went to switch again <halfline> okay <halfline> i'll have to reproduce artificially with a small timeout and see why the patch isn't enough
So, it's not enough to give one last tick, because animations can always get queued up right before the timeline expires, in which case they won't know they need to complete when the timeline ends. Really ClutterTimeline isn't a good fit for tweener as it currently stands, but it's pretty important to use, since it's synchronized with the clutter paint loop. So we have a few choices that I can think of: 1) Fix up ClutterTimeline to fit the needs of tweener better 2) Fix up tweener.js in gjs to deal with ClutterTimeline's finite duration 3) Fix up tweener.js in gnome-shell to have a UINT_MAX timeout. Of the three choices 3) is by far the easiest, but it doesn't really fix the bug, just makes it only affect people who resume after leaving their laptops suspended for 2 months. I'm not sure 1) makes sense, given clutter has its own builtin tweener like api, which works fine with cluttertimeline. So that leaves 2) and what i'm currently leaning toward.
Have you considered using a looping ClutterTimeline to get called at the right time, but ignore the time parameter, and use GLib.get_monotonic.time() as the time source?
(In reply to comment #18) > Have you considered using a looping ClutterTimeline to get called at the right > time, but ignore the time parameter, and use GLib.get_monotonic.time() as the > time source? Yea, Jasper and I talked about that as an option.
(In reply to comment #13) > I seem to have the same issue with gnome-shell 3.2.2.1 ("Gallium 0.4 on AMD > CEDAR", ATI Radeon). It does not hang forever, though; it recovers after 20-30 > seconds. > > Applying patches #11 & #12 did not fix the issue. > > Am I hitting the same bug or a different one? I started running into this too. It's the X server getting stuck spawning xkbcomp a bunch of times (like thousands) when first entering the vt. I started to look into it with gdb but gave up patience after an hour or so of fighting the debugger (which got really confused by the X server's weird builtin scheduler thing)
Created attachment 207712 [details] [review] tweener: use one ticker for each tween Tweener tries to keep one global ticker for all animations, so that animations share a common clock, I guess. This isn't important in practice though, since the ticker is normally replaced (by gnome-shell at least) with one that uses a clutter timeline, and all clutter timelines are sync'd to a common master clock. Furthermore, having one ticker poses a problem. The ticker based on clutter's timeline object have a fixed duration and so, animations queued toward the end of a ticker's lifetime can overflow past that ticker's life. This can only happen in practice, if the ticker is frozen for a while because the system is suspended or the user is VT switched away. This commit changes tweener to have one ticker per tween, making the ticker shipped with the shell be more compatible with us.
Created attachment 207713 [details] [review] tweener: adapt to new gjs tweener api gjs now lets us have one ticker per animation. This fixes the impedance mismatch we currently have between tweener and Clutter.Timeline.
attachment 207712 [details] [review] is for gjs and attachment 207713 [details] [review] builds on the earlier shell patches (which are still needed).
the above two patches didn't take very long to do, and I don't think the proposed solution in comment 18 will take very long either, so I may go ahead and give it a try too so we can see which one we ultimately like better.
Just for easy reference, Mantas filed bug 670164 to track the xkbcomp / vt switch freeze up shenangins
Created attachment 207802 [details] [review] tweener: make timeline loop indefinitely Tweener uses a clutter timeline to manage all active animations running at a given moment. The timeline is mopped up when no animations are going any more. Clutter requires timelines to have a finite duration, but since animations can happen at any moment, no fixed duration can accomodate the shell's needs. To combat this problem, we pick a relatively long duration: 10 seconds. No string of animations should take that long, so that should be good enough in practice. Unfortunately, this tactic fails when the user suspends their machine, or VT switches. A tween can take much longer than 10 seconds to complete in those cases and bad things will happen. This commit changes the clutter timeline to automatically loop back to 0, so that, despite, it's fixed duration property, it effectively never stops. Since the timeline loops, its concept of elapsed time no longer increases monotonically, so we now ignore it and track time ourselves with GLib.get_monotonic_time(). This partially reverts commit 35764fa09e4341e79732409c4e74c226d19f780f.
Okay attachment 207802 [details] [review] is the other approach. Both seem to work okay here. (I artificially test this by changing the timeline duration to 1.5 seconds, then starting an overview animation and quickly changing VTs, waiting, and changing back). attachment 207802 [details] [review] has the advantage of being only 7 lines + comments, the other attachments have the advantage of being seemingly more "right" to me. Owen, thoughts?
Everywhere I wrote "10 seconds" in comment 26 I meant "16 minutes" i think rereading it.
Review of attachment 207802 [details] [review]: "...so that despite its fixed duration..." Overall makes sense to me, just some minor comments. ::: js/ui/tweener.js @@ +241,3 @@ // animation then only do frame dropping from there. if (this._startTime < 0) + this._startTime = GLib.get_monotonic_time(); Note that GLib.get_monotonic_time() returns microseconds, whereas Clutter.TimeLine.get_elapsed_time() returns milliseconds. One unfortunate issue with this is how we handle 64 bit values in JS is we just convert them to regular JS numbers, i.e. doubles. That's probably OK in this case with monotonic time because we only run into precision issues after 2^53 microseconds or 272.38 years. @@ +253,2 @@ getTime : function() { + return this._currentTime; And so now we're making this function return microseconds too. Nothing appears to actually call it right now, but, maybe we should divide by 1000?
(In reply to comment #29) > Review of attachment 207802 [details] [review]: > > "...so that despite its fixed duration..." > > Overall makes sense to me, just some minor comments. So you like this better than the four patch approach above it? > ::: js/ui/tweener.js > @@ +241,3 @@ > // animation then only do frame dropping from there. > if (this._startTime < 0) > + this._startTime = GLib.get_monotonic_time(); > > Note that GLib.get_monotonic_time() returns microseconds, whereas > Clutter.TimeLine.get_elapsed_time() returns milliseconds. > > One unfortunate issue with this is how we handle 64 bit values in JS is we just > convert them to regular JS numbers, i.e. doubles. That's probably OK in this > case with monotonic time because we only run into precision issues after 2^53 > microseconds or 272.38 years. Okay, the precision probably doesn't matter too much as long as it increments linearly, strictly monotonic, and at least precise enough to capture 60fps. > @@ +253,2 @@ > getTime : function() { > + return this._currentTime; > > And so now we're making this function return microseconds too. Nothing appears > to actually call it right now, but, maybe we should divide by 1000? It's called by tweener.js in gjs.
Okay, I'm going to go ahead and push attachment 207802 [details] [review] (with the proposed changes). I don't think we want this bug in 3.4 since it's pretty visible for users who suspend or user switch and the freeze is tomorrow. I actually prefer the 4 patch approach over the one patch approach, but the one patch approach is more conservative than the other patches would be (doesn't create changes across multiple modules, doesn't change gjs api, etc).
Attachment 207802 [details] pushed as 207abe9 - tweener: make timeline loop indefinitely