GNOME Bugzilla – Bug 735101
Becomes unresponsive, spins CPU
Last modified: 2015-06-03 21:14:37 UTC
This has happened to me a few times recently. This time I had plenty (~20-30) g-t windows/tabs opened and started closing them (by pressing ^D to the shell). When I was left with 3 g-t windows, a single tab in each, it became unresponsive (grayed out by the WM), spinning the CPU at 100%. strace prints this repeating endlessly: poll([{fd=5, events=POLLIN}, {fd=6, events=POLLIN}, {fd=3, events=POLLIN}, {fd=33, events=POLLIN}, {fd=59, events=POLLIN}, {fd=14, events=POLLIN}], 6, 4294967295) = 1 ([{fd=6, revents=POLLIN}])
Created attachment 283970 [details] file descriptors from /proc/xx/fd
Seems like there's some problem around fd 6 (-> socket:[40395]). /proc/<g-t-server>/fdinfo/6 contains pos: 0 flags: 02004002 I don't know what it means. Do sockets go in part that neighboring number belong together? There's no socket:40394 under /proc, socket:40396 belong to "/usr/bin/ibus-daemon --daemonize --xim"
The windows are still grayed and don't react to keyboard, mouse, window resize events. However, I can feed data to the vte widgets (echo foo > /dev/pts/28) and they appear correctly.
I killed ibus-daemon. No change in the dead g-t or in its strace output. Executing "gnome-terminal" pops up a new dead window. There's a shell running inside according to ps, but I can't see it (probably the color palette was not set up). I can feed data to the tty, but I can't see that either. The poll() method in strace now also includes the new fds. Killing the shell keeps the window on the screen, frees up the tty and removes the corresponding ptmx fd from the poll call.
I clicked on the X button of the window, and choosed "force quit" when the WM asked. This killed the g-t-server process. That's it for now.
Happened again, again when closing many windows in a row. Bringing them to focus by click or Alt+click, and the closing by Ctrl+D. Maybe it's some weird sequence of these actions (e.g. pressing Ctrl before releasing the mouse button) that causes the hang, maybe not.
I have experienced exact same behavior multiple times.
I've seen this too, and I attached gdb to the running gnome-terminal-server process. Stepping through in gdb it seems that the loop is simply the g_main_loop, but for some reason no callback is called. I haven't worked out what's up. Was reported here: https://bugzilla.redhat.com/show_bug.cgi?id=1176603
Vaidas, Tim: Can you recall what you did right before g-t hung? Did you also close a window (e.g. quit the shell), or something other special action, or just worked inside a terminal?
The most recent time I remember, I'd definitely pressed Ctrl-D in one terminal to exit (but it didn't seem to receive that). The unfortunate thing is that you don't realise quite what's happened until several seconds afterwards.
Okay, at least it's a good starting point. We should assume the bug is connected to the shell attempting to quit, and apparently g-t sometimes fails to properly handle this event. There was a similar fix recently (bug 740929; made it into vte 0.38.3) but I don't think it's related. I'm almost sure the shell does receive that Ctrl-D and acts upon this (otherwise I see absolutely no reason why g-t could hang). Could you maybe test this for me to be totally sure? You should modify the terminal title or prompt to contain the tty number (e.g. PS1='...\l...'), and when the bug occurs next time double-check that bash on the terminal you were about to close is indeed no longer running. Or, without modifying the prompt or the title: As per comment 3, when it hangs you see all your bashes and echo to their ttys, these should appear in various g-t tabs/windows, but none of them should go to the tab where you pressed Ctrl-D.
I have not seen that for a while now. It must have been fixed already. > rpm -qa | egrep '(gnome-terminal|vte)' gnome-terminal-3.14.2-1.fc21.x86_64 vte3-0.36.3-3.fc21.x86_64 vte-0.28.2-14.fc21.x86_64 vte291-0.38.2-3.fc21.x86_64 vte-profile-0.38.2-3.fc21.x86_64
I've seen it in the last couple of weeks with gnome-terminal-3.14.2-1.fc21.x86_64 and vte291-0.38.2-3.fc21.x86_64. OK, I've modified my PS1 and I'll keep an eye on it.
OK, this just happened. I had two Terminal windows open, on /dev/pts/3 and /dev/pts/4. I had just entered 'Ctrl-D' into the one on /dev/pts/3, and gnome-terminal-server is now busy looping. On a VT, I did this: echo test1 > /dev/pts/3 echo test2 > /dev/pts/4 I see *both* test strings, one in each Terminal window. I can also see that gnome-terminal-server has three child processes two of which are bash and the other is gnome-pty-helper.
Created attachment 293955 [details] backtrace from gdb, not very revealing As luck would have it, this is the end of the day for me so I can leave this machine suspended until tomorrow. If there are any tests you'd like me to run on this busy-looping process, I can do that before I reboot tomorrow. Attached is the output from gdb but as mentioned before it doesn't reveal much.
Do you still have that broken gnome-terminal? What I'm curious about is if the bash of pts/3 is still running.
Created attachment 293956 [details] child processes It is.
Hmmmm, this is very surprising. At this moment I'm totally lost. If the shell didn't quit then ctrl-d didn't arrive to the shell, and it has no special meaning to the terminal, so it's not different from any other keypress. In the mean time, https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=774600 is probably the same issue, although the reporter there maximizes the window instead of pressing ctrl-d. Maybe you could try to update to gnome-terminal-3.14.3. Actually it'd be nice if RedHat released an official update, because of https://bugzilla.redhat.com/show_bug.cgi?id=1172693 . I only see a very small chance that the fix to that bug fixes this issue, but it's worth a try.
It may well be that something I did before pressing Ctrl-D actually triggered it. It's only when the window stays around that I notice something's up. In this instance, I'm pretty sure I had clicked on the window in order to bring it in front of another before typing at it.
(In reply to comment #18) > Maybe you could try to update to gnome-terminal-3.14.3. There is no such thing as gnome-terminal-3.14.3. I think you are talking about vte-0.38.3. :) > Actually it'd be nice > if RedHat released an official update, because of > https://bugzilla.redhat.com/show_bug.cgi?id=1172693 . I only see a very small > chance that the fix to that bug fixes this issue, but it's worth a try. We already have the patch for that bug in our vte291-0.38.2-3.fc21 package. We should probably build 0.38.3 but the changes are not much so I was holding off on bothering the users with another update.
(In reply to comment #20) > There is no such thing as gnome-terminal-3.14.3. I think you are talking about > vte-0.38.3. :) Yup I mixed things up :) Thanks for clarifying! > We already have the patch for that bug in our vte291-0.38.2-3.fc21 package. We > should probably build 0.38.3 but the changes are not much so I was holding off > on bothering the users with another update. I overlooked this. Then there's no point in updating, and we can already be sure that the zombie fix did not fix the current issue.
To add further info to my rhbz post. I just had this bug hit while trying to reproduce another unrelated bug :-) I had <10 gnome-terms open on various workspaces. I had one open on current workspace. I did cd /tmp in it and hit SH-CTL-N. In the new window I clicked to get focus and hit ^D. The new window went away properly, but the old window was now hung!!! Opening a new window from XFCE panel showed that all gnome-terms were indeed hung. This (I think) is slightly different triggering behavior. Then I thought maybe it was happening when a lot of output was going to a term in current or another workspace while opening/closing other windows. So I opened a term and started find /. Changed workspaces and opened 1 more window and the bug triggered immediately, I'm pretty sure on the open (not a close, but can't be positive as I'm doing things very fast to recreate the bug)! Yay. But then I killed them all in an xterm, tried to do it again and couldn't :-( So I had it happen to me with almost no terms open, twice in about 2 minutes. But it never wants to recreate for me on demand. Very annoying. I'll keep trying to be aware of exactly what I'm doing when it happens.
Assuming a race condition around when a child quits, I was experimenting with opening multiple tabs/windows, getting the PID of those shells (echo $$), and then terminating them at pretty much the same time (kill pid1 pid2 pid3). Could you guys please try this and see if it manages to lock up g-t?
Hmm, my tcsh's don't respond to SIGINT, strange. They do seem to respond to -HUP or -9. HUP should be close enough to INT for this test, I'm guessing, but -9 might short circuit some things, so that's not ideal. I came up with this, open a ton of terminal windows everywhere and do this in an xterm (not a gnome-term!): kill -HUP `pstree -p XXXXXX |grep tcsh| head -n -1 |perl -pe 's/.*?\((\d+)\).*/$1/'` where XXXXXX is the ps id of your gnome-terminal-server, which won't change even after the kill (so you can try it multiple times quickly) because the head -n -1 will keep 1 term alive. Oh ya, replace "tcsh" with your shell. Check the sanity of my cmd before executing it! (replace kill -HUP with echo, ie) I opened and killed >50 terms at once with the above and couldn't trigger the bug. Also, in case closing it at more human speed is what triggers it, I tried: ~>cat /tmp/killtcsh #!/bin/perl -w @pss=`pstree -p 18230 |grep tcsh| head -n -1 |perl -pe 's/.*?\\((\\d+)\\).*/\$1/'`; foreach $ps (@pss) { chop $ps; system "kill -HUP $ps"; print "killing $ps\n"; sleep 1; } Does the same thing but delays 1s between kills. Still couldn't get the bug to hit, even while opening new terms the entire time the above was running. I have a few more ideas I will try.
Also trying this to automate open and a human-like shell quit: >cat /tmp/killtcsh-open-also #!/bin/perl -w while (1) { $life=rand(10); system 'gnome-terminal --geometry=100x40+0+0 --role=ws4 -e \'tcsh -c "echo '.$life.'; sleep '.$life.'; exit" \' ' ; sleep 1; } So far no bug :-(
Maybe it does need a shell running... to automate something that's even closer to what you'd do manually, you can try to synthesize Ctrl-D keypresses using e.g. wmctrl and xdotool.
It might also be useful if you could configure with --enable-debug and run g-t-server with GNOME_TERMINAL_DEBUG=all. I guess I'll install F21 + sawfish to see if I can finally reproduce the bug again. It stopped happening to me shortly after I reported the bug.
I should have checked it when I faced the bug... since then a lot of things changed, but fds 0..10 still look the same now, and fd 6 belongs to the @"/tmp/.X11-unix/X0" socket, so probably it was the case originally too. This suggests that there's pending data arriving from X which Gtk+'s main loop doesn't want to process for some reason.
(In reply to comment #26) > Maybe it does need a shell running... to automate something that's even closer > to what you'd do manually, you can try to synthesize Ctrl-D keypresses using > e.g. wmctrl and xdotool. My 2nd example (comment #25) is for sure running a shell. "exit" is also very close to ^D, unless this is happening at the X event layer where modifier keypresses matter. Still, I will look at your suggestions to emulate ^D, since indeed every time it happens I do believe I have pressed ^D (I never type "exit").
If the actual poll() strace is important, mine is different than the one in the description. This is the one I captured a few times ago. I haven't been capturing them all, but I will start, to find commonalities. 03:41:15.575609 poll([{fd=4, events=POLLIN}, {fd=3, events=POLLIN}, {fd=50, events=POLLIN}, {fd=169, events=POLLIN}, {fd=216, events=POLLIN}, {fd=21, events=POLLIN}, {fd=166, events=POLLIN}, {fd=111, events=POLLIN}, {fd=107, events=POLLIN}, {fd=41, events=POLLIN}, {fd=30, events=POLLIN}, {fd=25, events=POLLIN}, {fd=43, events=POLLIN}, {fd=57, events=POLLIN}, {fd=93, events=POLLIN}, {fd=136, events=POLLIN}, {fd=35, events=POLLIN}, {fd=16, events=POLLIN}, {fd=177, events=POLLIN}, {fd=123, events=POLLIN}, {fd=10, events=POLLIN}], 21, 4294967295) = 1 ([{fd=3, revents=POLLIN}]) I will try the debug as you suggest, and capture more info next time it happens (like you have).
(In reply to comment #27) > It might also be useful if you could configure with --enable-debug and run > g-t-server with GNOME_TERMINAL_DEBUG=all. I have it compiled and installed, and I've tried every which way to get GNOME_TERMINAL_DEBUG=all set to work on all terms, but it isn't taking. I've got it in /etc/environment and rebooted and it is globally an env var, but my g-t-server ignores it and doesn't have it in its env (per psauxewww)! I see in the g-t-server docs page that you are supposed to launch it manually and it will have debugging set, and I've done that, but it also says only terms launched in 10s will attach to that server, and that is useless for testing a bug like this. I need my default g-t-server to have the debug on globally, so when I hit gnome-term icon in xfce panel (my usual launch method) I'll get a debugged terminal. Also, if I manage to get this going globally, where is the debug output going to go? Launching it graphically attaches no tty to it, I don't think.
(In reply to comment #31) > I have it compiled and installed, and I've tried every which way to get > GNOME_TERMINAL_DEBUG=all set to work on all terms, but it isn't taking. This works for me when this env var is set for the manually launched g-t-server. Make sure it's installed where you expect, e.g. maybe it's installed under /usr/libexec while you're launching the one from your OS at /usr/lib/gnome-terminal, or something like this. > I need my default g-t-server to have the debug on globally, so > when I hit gnome-term icon in xfce panel (my usual launch method) I'll get a > debugged terminal. > > Also, if I manage to get this going globally, where is the debug output going > to go? Launching it graphically attaches no tty to it, I don't think. I'm afraid you'll need to write a wrapper script and launch that from xfce panel. Something along these lines: if g-t-server is not running G_T_DEBUG=all /usr/lib/g-t-server >> output_goes_here & sleep 0.5 fi gnome-terminal I don't know why this needs to be so cumbersome :(
I don't know either, but it is. As per: https://wiki.gnome.org/Apps/Terminal/Debugging this is non-trivial. Sure, it's trivial to make it work for the first term opened, but after that (the 10s interval mentioned) each new launch gets me yet another g-t-server process! (I've tried many variations on your pseudo-code above.) a) I don't think I'm supposed to have 50+ g-t-server ps's running at once, one per term b) I don't think this bug will hit when each g-t has its own g-t-server ps as that may mitigate the multi-window race condition. I tried attaching the new terms to the existing g-t-server using the above web page's idea of --app-id but when I use app-id I get no terminal, just an error: Error creating terminal: The name my.foo.Terminal was not provided by any .service files Doesn't seem to matter what app-id I put in there. Not sure if I'm supposed to use some meaningful name. Looks like systemd weirdness to me. If you can give me some way for the launch script to attach subsequent term launches to The One (debugged) g-t-server ps, then we're in business. Otherwise we have to find some way to make the "default" g-t-server that launches by who knows what means take up an env var of our choice (ie DEBUG).
(In reply to comment #33) > Sure, it's trivial to make it work for the first term > opened, but after that (the 10s interval mentioned) each new launch gets me yet > another g-t-server process! It really shouldn't. Only launching the server and the first window is tricky. From then on, each execution of the gnome-terminal command should connect to the already running server and open a new window. > a) I don't think I'm supposed to have 50+ g-t-server ps's running at once, one > per term > > b) I don't think this bug will hit when each g-t has its own g-t-server ps as > that may mitigate the multi-window race condition. These are correct. > I tried attaching the new terms to the existing g-t-server using the above web > page's idea of --app-id but when I use app-id I get no terminal, just an error: > > Error creating terminal: The name my.foo.Terminal was not provided by any > .service files > > Doesn't seem to matter what app-id I put in there. Not sure if I'm supposed to > use some meaningful name. Looks like systemd weirdness to me. The only thing that matters is that gnome-terminal-server and gnome-terminal must be given the same app_id - or both of them needs to go without this param. I have no clue what its value stands for. There needs to be a dot in it. I usually use "x.y" - or omit it if I can.
Happened to me now. Unfortunately I hadn't set up any debug stuff previously. The usual stuff: heavy work in tons of windows/tabs for hours, and then I closed many of them with Ctrl-D, only 2+1 tabs remained. One of the tabs was playing music with mpg123 in verbose mode, it's a console mp3 player with one line of status info that it updates with approximately 5 FPS. When the music played, and also when I restarted music later (from xterm, redirecting mpg123's stderr to g-t) then g-t consumed about 10-20% CPU and Xorg consumed about 50-70%, the overall load usually staying a little bit under 1.0 (even with the music player and a firefox included). When the music (terminal activity) stops, g-t jumps up to 100% and the load goes up to 1.2-1.5-ish. strace shows the usual stuff, fd 6, with plenty of other things interleaved if there's terminal activity. Attaching with ltrace caused a segfault soon after; but it does so anyway, I guess ltrace is just broken. And there was something else unusual... I'll post details about it after a bit of investigation :) Maybe it's not new, I just happened to discover right now.
So, the two weirdnesses. Setup: g-t running 'mpg123 -v whatever.mp3', one status line keeps updating. Another g-t window contains two tabs of vastly different colors (e.g. a black-on-white and a white-on-black). This window is positioned so that it partially covers the varying text (the mp3's position). Switch between the two tabs. Notice that quite often the area where there's the varying mp3 time underneath gets refreshed before the rest of the window. E.g. you switch from the black-on-white tab to the white-on-black, first this small rectangle becomes black and later the rest. Happens with Unity, doesn't happen with Gnome shell. Whether g-t's transparency patch is applied is irrelevant. So far this is a bit annoying with Unity but probably not interesting, and I can imagine that a WM might easily do this. In the terminal that later crashed, I noticed that when changing tabs, often (like 1 out of 20 or so) there was a complete there-back-there switch. The whole terminal was painted with the other tab's contents, then painted again with the one I was about to leave, then painted again correctly. I could do the same with any two windows of g-t. Now this is something I have absolutely no clue how could happen. Also, I can no longer reproduce it since the crash of that instance. Any correlation with the bug? Probably not, but who knows.
Created attachment 294374 [details] debugging wrapper script Trevor, I started using this script in the hope that g-t will crash again for me, you might find it useful.
Hmm, got the debug mode to work and it seems to not be creating >1 instance anymore, not sure why it was before. The appid didn't seem to be necessary. So now we wait. Weird, but my new debug-on terms launched using this new script all seem to have a menu bar with just 1 entry for "Terminal". This seems separate from the "real" menu bar, which is hidden, and I can make it appear and disappear without affecting that Terminal menu bar. So I can't get rid of it. A bit annoying. Maybe that's intended when debugging? From your new experiences: perhaps the bug is load related? Or aggravated by a higher load causing slower screen/window draws? We could try loading up the CPU with a burn-in program like stress while trying to make the bug happen. I, too, thought screen updates would aggravate it that's why during my tests I usually put a couple of find / in some terms. Also, the weird redraw / race condition things you described sure sound like they may also be related to bug 947847 doesn't it? It's a long shot, but maybe.
(In reply to comment #38) > Weird, but my new debug-on terms launched using this new script all seem to > have a menu bar with just 1 entry for "Terminal". [...] A bit > annoying. Maybe that's intended when debugging? I don't think so. You can try configuring with --enable-distro-packaging, but maybe that's for Ubuntu only, I'm not sure. > From your new experiences: perhaps the bug is load related? Or aggravated by a > higher load causing slower screen/window draws? We could try loading up the > CPU with a burn-in program like stress while trying to make the bug happen. I, > too, thought screen updates would aggravate it that's why during my tests I > usually put a couple of find / in some terms. Could be a load related race condition. Could be that we're leaking some resource on tab switch or window creation or something like that, and we end up having too much of it. I'm really clueless right now. > Also, the weird redraw / race condition things you described sure sound like > they may also be related to bug 947847 doesn't it? It's a long shot, but > maybe. You meant https://bugzilla.redhat.com/show_bug.cgi?id=947847, which in turn boils down to gnome bug 677329. Could even be related. That one sounds like a bug that has a deterministic path to understanding the situation and fixing the bug, it's just way too complicated. Maybe we should somehow convince the guys there who have at least _some_ understanding of the situation to take yet another look.
As stated in that bug, GDK_CORE_DEVICE_EVENTS=1 makes use another method of getting events which doesn't have the focus in/out bug. It would be interesting to see if the unresponsiveness bug still remains then or also goes away.
It's been a week and the bug has not hit again. I've been using the system normally and heavily, but I only did one quickly-close-all-gnome-terminals (for reboot) in that time, and the bug didn't hit. I'll keep trying. Note: Fedora yum updates brought in gnome-terminal-3.14.2-2.fc21.x86_64 so I rebuilt again with debug-enabled. My previous version was gnome-terminal-3.14.2-1.fc21.x86_64. I think that -1 to -2 is a Fedora-only patch release? In any case, I doubt it fixed the bug (can't be that lucky). Sometimes it does take a while for the bug to appear. Also, I fixed the annoying problem of a "Terminal" menu entry and menu bar showing regardless of the settings when debug-enabled, this patch will apply to the Fedora-patched version. I'll attach for others to use. I have no idea why they think that if you want to debug you want around 2 lines of usable terminal wasted on a dumb menu you never will touch. This gets rid of it for people building and testing this bug. Finally, how verbose is the debugging log supposed to be? Mine doesn't give much info. I just want to make sure it's actually on/working. Is there something in it I can look for to make sure? I mostly see size-alloc / size-alloc result lines.
Created attachment 295079 [details] [review] hides the annoying Terminal menu/menuline when configured with --debug-enable
(In reply to comment #41) > Note: Fedora yum updates brought in gnome-terminal-3.14.2-2.fc21.x86_64 so I > rebuilt again with debug-enabled. My previous version was > gnome-terminal-3.14.2-1.fc21.x86_64. I think that -1 to -2 is a Fedora-only > patch release? It is the same as the gnome-3-14 branch on git.gnome.org.
I've just found that according to https://mail.gnome.org/archives/commits-list/2011-October/msg08745.html, you might specify debug flags in addition to "all", and then they'll be removed. So probably with GNOME_TERMINAL_DEBUG=all,appmenu you won't need your patch.
(In reply to comment #41) > I mostly see size-alloc / size-alloc result lines. That's okay. There's log only on focus in/out, opening a new tab and similar events; not when you're working inside the terminal.
This happened again today so I poked it some more. Stepping through the glib polling code I saw that poll() said fd 6 had data ready to read, but none of the sources seemed to care about that file descriptor so no callback was called. I wanted to know what the file descriptor was connected to, so I tried this: [twaugh@rubik:4 ~]$ ls -l /proc/25581/fd | grep -w 6 lrwx------. 1 twaugh twaugh 64 Jan 27 12:15 6 -> socket:[129944] [twaugh@rubik:4 ~]$ lsof | grep -w 129944 gnome-ter 25581 twaugh 6u unix 0xffff88003364d400 0t0 129944 socket gdbus 25581 25582 twaugh 6u unix 0xffff88003364d400 0t0 129944 socket dconf 25581 25583 twaugh 6u unix 0xffff88003364d400 0t0 129944 socket gmain 25581 25591 twaugh 6u unix 0xffff88003364d400 0t0 129944 socket Does that help?
I'm afraid it's not new info since comment 28. Could you start g-t via the wrapper script from comment 37, and next time it hangs attach the log file?
It happened today but I was too busy with work to do something with that - when I came back the terminal was unfrozen. Nothing interesting in log only dozens of lines: (gnome-terminal-server:2652): GLib-CRITICAL **: g_child_watch_add_full: assertion 'pid > 0' failed
> It happened today but I was too busy with work to do something with that - when > I came back the terminal was unfrozen. It's interesting, we never saw it unfreeze from this situation. > Nothing interesting in log only dozens of lines: Hmmm, my bad. I forgot to tell you that you need a debug build of gnome-terminal (./configure --enable-debug --prefix=/usr; make; sudo make install -- or something similar; beware that this overwrites the package from your distro).
Bingo! Had to do a reboot for kernel updates and I had been working for a few weeks straight with probably 50 terminals open. I started closing the windows and each one I closed would make the other terminals lag and slow down. I could swear it also slowed down closing other apps too, but the gkrellms I monitor didn't look weird, so this could just be a red herring. By the 10th workspace of closing terms were taking a while to redraw and/or close. I could almost feel the bug about to happen. Then I closed one (via ^D as usual) and its window contents went solid light-grey and all terms froze. Had to kill on an xterm. But I got the log, and with debug on. I'll attach the lower portion of the log. Since there are no timestamps, I'm guessing at how much log you need. If you need more let me know. I tried to look for when it got "interesting".
Created attachment 296066 [details] debug enabled log output from g-t-server during freeze
It might be cool if you could attach the whole log, starting from the last "------" which separates it from a previous g-t-server process. I have really no idea what I'm looking for, but it seems that things start to go wrong way before the visible bug occurs.
Could you please try to place an ''export GDK_CORE_DEVICE_EVENTS=1'' at the beginning of the wrapper script, and restart (quit from every terminal window)? I'm curious to see if the bug still occurs in this case. (Of course we can't get a confirmative "no" on this, but we might get a "yes", or the bug might not occur for a very long time which will then become suspicious that indeed there might be some correlation.) If you're using a touchpad, you'll see a minor regression that scrolling is not smooth (per line) but 3-4 lines at a time instead.
OK, I'll attach the whole log. It's quite long since it ran for multiple weeks without closing g-t-s. I should note that when the it hit the bug this time, the system was doing moderate I/O, as I always run a yum update before I reboot, while I am closing windows. That plus paging in unused pages could cause the delays needed for this (probably) race condition to hit. I'll test the GDK_CORE_DEVICE_EVENTS next. If that doesn't work, what else do we do to track down this bug? Can we create a more-debug mode? Would running strace on it the whole time be helpful? Perhaps just running strace before doing a system reboot would be enough? Or is the problem going to be in internal stuff, and not sys calls?
Created attachment 296170 [details] full debug enabled log output from g-t-server during freeze
Just a note: the critical warnings "g_child_watch_add_full: assertion 'pid > 0' failed" are due to bug 740929 comment 12 which was fixed shortly after the version you're using, but accidentally might even come handy here. I haven't spotted before, but there are a plenty of warnings "signal 'size-request' is invalid for instance '0xdeadbeef' of type [either 'TerminalScreen' or 'TerminalWindow']". No clue what these are, probably innocent, but who knows...
Hmm, I had to do another shutdown. It was only up a couple of days this time with about 1/3 as many windows, but I got g-t to get laggy and slow by the normal closing routine. This time they got slower and slower and then a 3 or so were sitting saying "exit" but the window wasn't closing. I launched a new window while they were sitting there and it did the usual frame-draw-with-grey-middle and all these g-t's sat there for about 60s I'm guessing and then they recovered and everything was fine. Then closing the rest of the windows went *really* fast (no slowness at all). That's new. Usually when they start getting boggy and I push them with more closes it causes the bug for sure. So maybe GDK_CORE_DEVICE_EVENTS really does mitigate the bug, or maybe it's too soon to tell. But I think this is the first time it recovered from a >10s hang. As usual, I have the log, if you think it might be useful let me know and I'll post it.
So GDK_CORE_DEVICE_EVENTS doesn't make the bug go away, but causes it to happen differently (if we can draw a conclusion from 1 such event). This is what I least expected, I hoped it would either totally fix it, or the freeze would happen the same way. I'm not sure if the log file with GDK_CORE_DEVICE_EVENTS=1 will ever be useful (I'm not sure with the previous log file either) but it cannot hurt. I really don't know which way to continue debugging... If only the bug was reproducible in a few minutes :) I have the habit of quitting all the terminals at the end of my working day, I'll try to change this to keep it running as long as possible. I'll study the log files too and see if I find something suspicious, as my free time permits (probably not within the next few days or even weeks, sorry). Thanks a lot for your continuous support!
Created attachment 296318 [details] debug log when pseudo-freeze (60s) hit while closing windows with GDK_CORE_DEVICE_EVENTS=1
I haven' compiled the debug version yet but my bug is slightly different from some of the notes: 1. GDK_CORE_DEVICE_EVENTS=1 changes nothing for me 2. bug happens when i have several terminals opened for longer time - especially ssh 3. I can see, move terminal, I can write to it via /dev/pts/x, terminal resets, resizes , I can open new tab and the new tab works fine, I can highlight text, copy it - just no pasting. The only thing that happens for me is that there is absolutly no response to thekeys/mouse I input to the terminal (selection works fine). Same no response to things send to /dev/pts/x - they just appear on the screen but there is no reaction.
Also after some longer time (minutes) my terminal unfreezes and works fine. Unfortunately by that time network connectios are gone (Write failed: Broken pipe) I looked in the gnome-terminal source code and from the HACKING file it is VTE actually responsible for the terminal - so it looks like a problem iseither there or on the interface between vte and gnome-terminal. I wonder if there are any other terminals using vte- I could try using them.
(In reply to comment #60) > I can open new tab and the new tab works fine, I can highlight text, > copy it - just no pasting. > > The only thing that happens for me is that there is absolutly no response to > thekeys/mouse I input to the terminal (selection works fine). Could you be please more precise? In which windows/tabs do you have no response to keys or paste? Only the ssh ones? I might be mistaken, but based from your description I have a feeling that probably your ssh sessions stall due a network connectivity problem. In that case it's totally normal for the window not to change its contents - the keypresses are being sent to the server but the response doesn't make it to the terminal. If this happens next time, you could try pressing Enter ~ . (that is, Enter followed by tilde and finally a dot), this should immediately terminate your ssh session and you should get back the local prompt. > Same no response to things send to /dev/pts/x - they just appear on the screen > but there is no reaction. That's normal, writing to /dev/pts/x is just as if the application printed the given text, and not as if you typed it. I don't know if there's a similar way to simulate keyboard input. > I wonder if there are any other terminals using vte- I could try using them. There are quite a few, although I'm not aware of any other supporting the latest vte. Anyway, choose one that uses Gtk+ version 3 (not version 2), and correspondingly vte 0.36-ish (also called libvte 2.90), rather than the ancient 0.28 (sometimes called libvte9). Roxterm is one example you might want to check out.
I think I've found something suspicious in Trevor's log files. Look at the ... 'size-request' is invalid for instance '0x...' of type 'TerminalScreen' lines. In the first log file, there are 61 such lines, with 57 unique IDs, 4 of which are duplicated. In the second files, there are 33 such lines: 32 different values, one of them duplicated. In both cases, the very last such message is with a duplicated ID. That is: a previous, ceased TerminalScreen object used to reside at the same memory location which got reused again for another screen. Could we face something similar to bug 730389? Maybe there's a leftover stale reference which doesn't cause harm until it happens to point again to a valid object of the same type?? Although it'd probably cause a crash much sooner, but I have a hunch that this idea is worth some investigation.
... and pretty much the same story with the same kinds of warning messages related to TerminalWindow. Trevor, could you please attach one or two more log files, if it freezes again for you? I'm just curious if those also exhibit this pattern.
With the GDK_CORE_DEVICE_EVENTS=1 or not? What if GDK_CORE_DEVICE_EVENTS=1 mitigates the bug. I guess I should turn it off again to get another crash sooner?
I don't know, I don't have a preference in this regard... as it's more convenient for you. Maybe 1 more of each kind. Or whatever :)
OK- in my case it was my stupidy - I tested for 1 week and all cases were stalled ssh ( I have dozens of terminals and 95% of them is with ssh). I've learned Enter ~ . trick and will open a ticket to our network group.
Created attachment 297424 [details] my logs after a freeze This happened to me now. I can't remember if I closed a g-t tab/window or not. Just a few seconds before the freeze there was something really suspicious: I started a st (suckless) terminal (git c5f1d74) which was "born sick": didn't immediately act upon my keypressed but seemed to batch up 3-4 of them until it showed all the letters. Exactly as described in bug 730763. I quickly quit this and started another st which worked fine. st is a native X11 app (no Gtk+ or anything alike), so the "born sick" bug must be somewhere in X11. The timing is quite suspicious to me: my system works fine for months and then these two bugs occur within half a minute, I don't think this is coincidence. Which suggests that probably this particular bug also resides somewhere deep inside X. I'll see if by launching st I can trigger it more often.
Created attachment 298505 [details] gnome-terminal-server log with debugging until it freezes This bug hits me every now and then. I've attached a gnome-terminal-server log.
(In reply to Egmont Koblinger from comment #68) > not. Just a few seconds before the freeze there was something really > suspicious: I started a st (suckless) terminal (git c5f1d74) which was "born > sick": didn't immediately act upon my keypressed but seemed to batch up 3-4 > of them until it showed all the letters. Exactly as described in bug > 730763. I quickly quit this and started another st which worked fine. I strongly suspect bug #730763 is a different bug altogether. In fact, I've never had 730763 hit me, ever, and I have this 735101 hit me all the time. > st is a native X11 app (no Gtk+ or anything alike), so the "born sick" bug > must be somewhere in X11. The timing is quite suspicious to me: my system > works fine for months and then these two bugs occur within half a minute, I > don't think this is coincidence. Which suggests that probably this > particular bug also resides somewhere deep inside X. Again, perhaps true for 730763, but I wouldn't yet assume it's X to blame for 735101 yet. If that was true, we'd have xterm, kterm, whateverterm seeing this problem too. Right now it's just gnome-terminal.(?) Lately I've seen that gnome-terminal does a LOT of swapping (as seen on gkrellm in realtime) on my system when I open a new window after not having opened one in a long time. It *feels* like g-t wants to scan through it's entire memory footprint every time a new window opens. If you wait long enough, and are doing enough other things with enough ram pressure and even a bit of g-t's ram gets paged out, it make for a long "grey-window" wait for a new g-t. No freeze though, just a wait. I have a funny feeling that closing some g-t windows (if you can, as all of X tends to bog down) right at that moment (during page-in) would trigger this bug. When I don't have so much important stuff open I'll test that theory. I have a funny feeling that closing windows (esp a lot of them) also pages in all those stale pages too. Note, it should be fairly easy to create some sort of memory load test script that can have g-t get pushed out to swap. Tweaking the kernel swap pressure settings might help too. > I'll see if by launching st I can trigger it more often. Well, anything that might help trigger certainly is worth doing!
I just finished a 30-day session (no logouts, no reboots) with heavy heavy g-t use and did the normal "close every window rapidly with ^D" thing before reboot, and this time the bug did *not* hit. It's not guaranteed, but normally this would trigger the bug. The windows did start to bog down a *tiny* bit, but that's pretty normal as their contents get paged in. Opening a new window in the middle of the closes gave me a 3s grey-window pause, but that was it. Everything was quite orderly. No bug! So either a) I didn't get "lucky" and trigger the bug this time, or b) GDK_CORE_DEVICE_EVENTS=1 did indeed make the bug go away. (I was/am running in GDK_CORE_DEVICE_EVENTS=1 mode for 30+ days ever since you asked me to above.) I have a (big) log I can post if you think a "good" log is of any value.
Strangely I haven't had this problem for a long time anymore, but I'm suddenly curious: could GTK_IM_MODULE=xim cause any such problems? (I'm just wondering because I noticed that even scroll events are handled differently when I use xim...)
Created attachment 299540 [details] my logs after a freeze, 2nd crash Happened again, ironically just when I finished compiling and was about to install 0.39.92, the hard code freeze version. I pressed quickly ^C and ^D in a window to terminate the app and quit. That window did quit, and I was left with 2 other windows, 2 tabs in each when it froze. That is, nothing new. No suspicious sign at all before the crash occurred. I did _not_ have GDK_CORE_DEVICE_EVENT=1. Anyway, I'm attaching the logs, it might be useful one day.
What a day! It froze again. Makes me suspect that something's wrong with X or the window manager, and it began sending stupid events which Gtk+ is not tolerant against. I didn't close any window now, I clicked on the border, perhaps with Alt, to give it focus and bring it forward.
Try with GDK_CORE_DEVICE_EVENT=1 on like I've been using. It hasn't happened since I set that. Not sure if that helps track down the bug at all. Has everytime we've all had it happen occurred when we're using meta keys (alt, ctl, etc)? Seems to me it's always when we're doing some meta-something. Have no idea if that helps either. Could be X, but probably not WM as we are all using vastly different wm's I'm sure (me: sawfish). The hunt continues.
https://bugs.launchpad.net/ubuntu/+source/gnome-terminal/+bug/1437742: "Running `btrfs` commands which cause heavy system load (e.g. `btrfs filesystem defragment`, `btrfs balance`, etc.) in `gnome-terminal` causes the terminal never to return and use one CPU core at 100 % forever. [...]"
Got the same thing today (twice) here with gnome-terminal 3.14.1-1 and libvte 0.38.1-2 on debian. I have no idea how I triggered the issue. Not much new info but: * no swap configured here (and plenty of free ram) * was doing a build at the time (on one core, so some IO load) * no input modules configured * no idea what triggered it (can't remember closing a window, but that seems like a sane as that could cause POLLIN on a closed socket thereby looping) * I may have seen an update while the terminal was hogging CPU already (not sure) * also seeing bug #730763 sometimes I know, not very specific. I guess I'll try to dig deeper the next time it happens. Maybe I can find out why one of the threads has a broken mainloop source.
Ok, got it again, some notes: * fd 6 is the X server (the one that POLLIN is coming) * the mainloop will *not* read (well recvmsg) on FD 6 normally * if an event arrives on another FD, xcb will also do a recvmsg * new command output is drawn immediatly (just echo something to /dev/pts/X) * if the mainloop is looping, then echo >/dev/pts/X will stop the loop for a while I am looking into it some more now.
OK, after tracing this for ages :/ What I now know: * POLLIN happens (well, duh) * gdk_event_source_check returns 0! * event_source->display->event_pause_count == 1 * obviously: (event_source->event_poll_fd.revents & G_IO_IN) == true * looking at _gdk_event_queue_find_first there are only twp events with "flags == 0", these: $41 = {event = {type = GDK_WINDOW_STATE, any = {type = GDK_WINDOW_STATE, window = 0x5307360 [GdkX11Window], send_event = 0 '\000'}, expose = {type = GDK_WINDOW_STATE, window = 0x5307360 [GdkX11Window], send_event = 0 '\000', area = {x = 1, y = 129, width = 32765, height = 864797296}, region = 0x7ffd338bc1f0, count = 91284128}, visibility = { type = GDK_WINDOW_STATE, window = 0x5307360 [GdkX11Window], send_event = 0 '\000', state = GDK_VISIBILITY_PARTIAL}, motion = {type = GDK_WINDOW_STATE, window = 0x5307360 [GdkX11Window], send_event = 0 '\000', time = 1, x = 6.9527192091040882e-310, y = 6.9527619357611717e-310, axes = 0x7ffd338bc1f0, state = 91284128, is_hint = 0, device = 0x7fe34cd5f0b3, x_root = 6.9472657117135523e-310, y_root = 6.9472657071599469e-310}, button = {type = GDK_WINDOW_STATE, window = 0x5307360 [GdkX11Window], send_event = 0 '\000', time = 1, x = 6.9527192091040882e-310, y = 6.9527619357611717e-310, axes = 0x7ffd338bc1f0, state = 91284128, button = 0, device = 0x7fe34cd5f0b3, x_root = 6.9472657117135523e-310, y_root = 6.9472657071599469e-310}, touch = {type = GDK_WINDOW_STATE, window = 0x5307360 [GdkX11Window], send_event = 0 '\000', time = 1, x = 6.9527192091040882e-310, y = 6.9527619357611717e-310, axes = 0x7ffd338bc1f0, state = 91284128, sequence = 0x7fe34cd5f0b3 <g_value_unset+35>, emulating_pointer = 1289133800, device = 0x7fe34cd536e2, x_root = 0, y_root = 0}, scroll = { type = GDK_WINDOW_STATE, window = 0x5307360 [GdkX11Window], send_event = 0 '\000', time = 1, x = 6.9527192091040882e-310, y = 6.9527619357611717e-310, state = 864797168, direction = 32765, device = 0x570e2a0, x_root = 6.9472657095101677e-310, y_root = 6.9472657117135523e-310, delta_x = 6.9472657071599469e-310, delta_y = 0}, key = {type = GDK_WINDOW_STATE, window = 0x5307360 [GdkX11Window], send_event = 0 '\000', time = 1, state = 129, keyval = 32765, length = 864797296, string = 0x7ffd338bc1f0 "", hardware_keycode = 58016, group = 112 'p', is_modifier = 1}, crossing = {type = GDK_WINDOW_STATE, window = 0x5307360 [GdkX11Window], send_event = 0 '\000', subwindow = 0x7ffd00000081, time = 864797296, x = 6.9527619357548477e-310, y = 4.5100351655375017e-316, x_root = 6.9472657095101677e-310, y_root = 6.9472657117135523e-310, mode = 1289041634, detail = 32739, focus = 0, state = 0}, focus_change = { type = GDK_WINDOW_STATE, window = 0x5307360 [GdkX11Window], send_event = 0 '\000', in = 0}, configure = {type = GDK_WINDOW_STATE, window = 0x5307360 [GdkX11Window], send_event = 0 '\000', x = 1, y = 129, width = 32765, height = 864797296}, property = {type = GDK_WINDOW_STATE, window = 0x5307360 [GdkX11Window], send_event = 0 '\000', atom = 0x7ffd00000081, time = 864797296, state = 32765}, selection = {type = GDK_WINDOW_STATE, window = 0x5307360 [GdkX11Window], send_event = 0 '\000', selection = 0x7ffd00000081, target = 0x7ffd338bc270, property = 0x7ffd338bc1f0, time = 91284128, requestor = 0x7fe34cd5f0b3}, owner_change = {type = GDK_WINDOW_STATE, window = 0x5307360 [GdkX11Window], send_event = 0 '\000', owner = 0x7ffd00000081, reason = (unknown: 864797296), selection = 0x7ffd338bc1f0, time = 91284128, selection_time = 0}, proximity = {type = GDK_WINDOW_STATE, window = 0x5307360 [GdkX11Window], send_event = 0 '\000', time = 1, device = 0x7ffd00000081}, dnd = {type = GDK_WINDOW_STATE, window = 0x5307360 [GdkX11Window], send_event = 0 '\000', context = 0x7ffd00000081, time = 864797296, x_root = 32765, y_root = 0}, window_state = {type = GDK_WINDOW_STATE, window = 0x5307360 [GdkX11Window], send_event = 0 '\000', changed_mask = GDK_WINDOW_STATE_WITHDRAWN, new_window_state = (GDK_WINDOW_STATE_WITHDRAWN | GDK_WINDOW_STATE_FOCUSED)}, setting = {type = GDK_WINDOW_STATE, window = 0x5307360 [GdkX11Window], send_event = 0 '\000', action = GDK_SETTING_ACTION_CHANGED, name = 0x7ffd00000081 <error: Cannot access memory at address 0x7ffd00000081>}, grab_broken = {type = GDK_WINDOW_STATE, window = 0x5307360 [GdkX11Window], send_event = 0 '\000', keyboard = 1, implicit = 129, grab_window = 0x7ffd338bc270}}, flags = 0, screen = 0x0, windowing_data = 0x0, device = 0x0, source_device = 0x0} $42 = {event = {type = GDK_UNMAP, any = {type = GDK_UNMAP, window = 0x5307360 [GdkX11Window], send_event = 0 '\000'}, expose = {type = GDK_UNMAP, window = 0x5307360 [GdkX11Window], send_event = 0 '\000', area = {x = 0, y = 0, width = 0, height = 0}, region = 0x0, count = 0}, visibility = {type = GDK_UNMAP, window = 0x5307360 [GdkX11Window], send_event = 0 '\000', state = GDK_VISIBILITY_UNOBSCURED}, motion = {type = GDK_UNMAP, window = 0x5307360 [GdkX11Window], send_event = 0 '\000', time = 0, x = 0, y = 0, axes = 0x0, state = 0, is_hint = 0, device = 0x0, x_root = 0, y_root = 0}, button = {type = GDK_UNMAP, window = 0x5307360 [GdkX11Window], send_event = 0 '\000', time = 0, x = 0, y = 0, axes = 0x0, state = 0, button = 0, device = 0x0, x_root = 0, y_root = 0}, touch = {type = GDK_UNMAP, window = 0x5307360 [GdkX11Window], send_event = 0 '\000', time = 0, x = 0, y = 0, axes = 0x0, state = 0, sequence = 0x0, emulating_pointer = 0, device = 0x0, x_root = 0, y_root = 0}, scroll = {type = GDK_UNMAP, window = 0x5307360 [GdkX11Window], send_event = 0 '\000', time = 0, x = 0, y = 0, state = 0, direction = GDK_SCROLL_UP, device = 0x0, x_root = 0, y_root = 0, delta_x = 0, delta_y = 0}, key = {type = GDK_UNMAP, window = 0x5307360 [GdkX11Window], send_event = 0 '\000', time = 0, state = 0, keyval = 0, length = 0, string = 0x0, hardware_keycode = 0, group = 0 '\000', is_modifier = 0}, crossing = {type = GDK_UNMAP, window = 0x5307360 [GdkX11Window], send_event = 0 '\000', subwindow = 0x0, time = 0, x = 0, y = 0, x_root = 0, y_root = 0, mode = GDK_CROSSING_NORMAL, detail = GDK_NOTIFY_ANCESTOR, focus = 0, state = 0}, focus_change = {type = GDK_UNMAP, window = 0x5307360 [GdkX11Window], send_event = 0 '\000', in = 0}, configure = {type = GDK_UNMAP, window = 0x5307360 [GdkX11Window], send_event = 0 '\000', x = 0, y = 0, width = 0, height = 0}, property = {type = GDK_UNMAP, window = 0x5307360 [GdkX11Window], send_event = 0 '\000', atom = 0x0, time = 0, state = 0}, selection = {type = GDK_UNMAP, window = 0x5307360 [GdkX11Window], send_event = 0 '\000', selection = 0x0, target = 0x0, property = 0x0, time = 0, requestor = 0x0}, owner_change = {type = GDK_UNMAP, window = 0x5307360 [GdkX11Window], send_event = 0 '\000', owner = 0x0, reason = GDK_OWNER_CHANGE_NEW_OWNER, selection = 0x0, time = 0, selection_time = 0}, proximity = {type = GDK_UNMAP, window = 0x5307360 [GdkX11Window], send_event = 0 '\000', time = 0, device = 0x0}, dnd = {type = GDK_UNMAP, window = 0x5307360 [GdkX11Window], send_event = 0 '\000', context = 0x0, time = 0, x_root = 0, y_root = 0}, window_state = {type = GDK_UNMAP, window = 0x5307360 [GdkX11Window], send_event = 0 '\000', changed_mask = (unknown: 0), new_window_state = (unknown: 0)}, setting = {type = GDK_UNMAP, window = 0x5307360 [GdkX11Window], send_event = 0 '\000', action = GDK_SETTING_ACTION_NEW, name = 0x0}, grab_broken = {type = GDK_UNMAP, window = 0x5307360 [GdkX11Window], send_event = 0 '\000', keyboard = 0, implicit = 0, grab_window = 0x0}}, flags = 0, screen = 0x0, windowing_data = 0x0, device = 0x0, source_device = 0x0} I think I am going to give up on this for now.
Questions that I don't know the answer to and won't dig further into for now: * why is display->event_pause_count > 0 (and what does that mean?) * why does no event exist for which the test in _gdk_event_queue_find_first is TRUE (flags & GDK_EVENT_PENDING) && (flags & GDK_EVENT_FLUSHED) as paused is TRUE.
OK, wild theory: * Ctrl+D causes child to exit * Child exit is a HIGH priority event in VTE (-100 or so) * Causes tab to close (maybe even the window) * Underlying GdkWindow for widget is destroyed * GdkFrameClock is between GDK_FRAME_CLOCK_PHASE_FLUSH_EVENTS and GDK_FRAME_CLOCK_PHASE_RESUME_EVENTS * GdkFrameClock events are processed at lower priority, so this *can* happen. * As window is destroyed, it never calls _gdk_display_unpause_events, while _gdk_display_pause_events was called during PHASE_FLUSH If this is true, then it is a GDK bug, which can be fixed by simply making sure that the display is unpaused again when a window is destroyed or the frame clock is changed. I guess I'll poke Owen about it, maybe he has an idea whether the above (or something similar) may be happening.
Created attachment 304326 [details] Script that reproduces the issue So, here is a script that reproduces the issue. Just hit the 'Make it stop working' button a lot, and when the window closes, things are broken :)
Obviously, this has been fixed already ... *** This bug has been marked as a duplicate of bug 742636 ***
Egmont, does that sound like the same bug to you? I guess it could be. What version of GTK and/or VTE/GT do we need installed to check if this (the gnome-terminal bug) is really fixed? I haven't seen the bug in forever, but I'm still running with GDK_CORE_DEVICE_EVENTS=1
Ah, that could explain why I haven't seen this anymore for quite a while either (using -git builds of gtk, vte, etc.)
(In reply to Trevor Cordes from comment #84) > Egmont, does that sound like the same bug to you? I guess it could be. > What version of GTK and/or VTE/GT do we need installed to check if this (the > gnome-terminal bug) is really fixed? The fix is in the master, gtk-3-16 and gtk-3-14 branches of gtk+.
(In reply to Trevor Cordes from comment #84) > Egmont, does that sound like the same bug to you? I guess it could be. > What version of GTK and/or VTE/GT do we need installed to check if this (the > gnome-terminal bug) is really fixed? I haven't seen the bug in forever, but > I'm still running with GDK_CORE_DEVICE_EVENTS=1 I faced this bug on Ubuntu Trusty (gtk+ 3.10.8) and Utopic (3.12.2) but not (yet?) on Vivid (3.14.12) - the distro's gtk+ stack in all cases, only vte and gnome-terminal compiled by myself from git. I don't have the core events env var, not having smooth scroll drove me nuts. Let's really hope that this bug is indeed fixed. This means it mustn't appear on Vivid, since the suspected fix went into 3.14.8.