GNOME Bugzilla – Bug 745561
Geary hides but doesn't exit when main window closed
Last modified: 2019-10-24 04:45:14 UTC
Sometimes when I close the main window, the geary process doesn't actually exit. Instead, the main window just hides. I can get it to reopen by running 'geary' from another terminal. This tends to happen only when Geary has been open for a while. It's possible that this is related to the draft manager, as I only started noticing the problem after it landed.

I've noticed two variations on this problem. In the first, we have composers that aren't getting destroyed. When you re-show the main window, everything comes up properly. If you're running with debugging, you'll see when you create a new composer that there are some phantom ones sticking around. I assume Geary is waiting for those to close, but they're stuck waiting for some call.

In the second, there aren't any phantom composers. When the main window is re-shown, it comes up missing the conversation list and maybe the folder list (I don't remember offhand). You can create a new composer, but on doing so, you find that it's the only one around. Something else must be keeping it alive.

In either case, you can kill geary and restart it without any ill effects.
In the second case, the folder list contains only an empty "Label" category. Also, the close button is broken. You can keep clicking it and nothing happens. My guess is that the GUI has been partly destroyed.
I don't know why there are two cases here. I do know that, as part of Undo, Geary now does a "clean" shutdown (closing folders and account objects) rather than just exiting. It's highly possible the app is hanging in one of these asynchronous close operations. I've added some debug logging to the code that indicates what's happening during application shutdown. If you see this again, it would be great if you could post here (or email me) the tail of the log.
I stumbled on one case where Geary hung: if the app is closing and there's a network problem, outstanding operations (i.e. move, archive, etc.) would be perpetually rescheduled, preventing the folder from closing. Fixed in commit e1b828. Network problems aren't *that* rare, especially since I've seen cases where Gmail will close a connection w/o notifying the client. However, it doesn't happen often enough that I'm convinced this is the entirety of the problem. I'm leaving this ticket open because there may be other cases where the hang occurs.
I've been running the updated code and closing Geary periodically to see what happens. Several times I've had it exit with phantom composers. In those cases, there is no debug output as a result of hitting the close button (but I don't know if there was supposed to be). I haven't gotten into the other state. I don't know if that's because it's fixed, or if I just haven't been (un)lucky.

I have noticed, however, periodic stalls when the UI becomes unresponsive. There seems to be something happening at startup, though I'm not sure quite what. One thing I am more sure of -- every time I open an unread message, there's a stall for a second or two before the message is marked as read and the UI comes back to life. I suspect there's some synchronous network thing happening in the main thread, but I haven't looked to figure out what it is.
I've traced the stall down to a commit that seems unrelated to this. It's now filed as bug 745914.
I just saw the hang-at-close. The new debug logging was never emitted; it was as though Geary didn't recognize it needed to exit and merely hid the main window. Not good. Going back over the trace, I saw one phantom composer that was never destroyed. Also, the Drafts folder was not closed. This could be the DraftsManager hanging, or the composer not closing/destroying the DraftsManager and thereby leaving the folder open.

Regarding the startup pause, I'm not sure what that may be (and have not noticed it). The shutdown delay is as per the new design; implementing Undo requires that the app shut down (more or less) gracefully so operations can be flushed to the network and the connections closed. Unfortunately, with a network hang this can sometimes take 30 seconds, so I've been thinking about putting a timer on the shutdown process that calls Posix.exit() if it takes too long.
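Roughly what I have in mind, as a sketch only -- the constant and method names are placeholders (and the timeout value is arbitrary), not code that exists anywhere yet; it would need the posix vapi for Posix.exit():

    // Arm a watchdog when the clean shutdown sequence starts and force-exit
    // the process if it hasn't finished within the allowed time.
    private const uint FORCE_EXIT_TIMEOUT_SEC = 30;  // placeholder value, to be tuned

    private void start_shutdown_watchdog() {
        GLib.Timeout.add_seconds(FORCE_EXIT_TIMEOUT_SEC, () => {
            debug("Clean shutdown took longer than %u seconds, forcing exit",
                FORCE_EXIT_TIMEOUT_SEC);
            Posix.exit(0);
            return false;  // one-shot timer (never actually reached)
        });
    }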
Caught a hang-at-exit not related to the phantom composers. I'm not sure I caught all of the relevant debug messages, since I didn't see it right away, but here's what I have:

[deb] 20:32:47 106.447721 geary-controller.vala:283: Stopping conversation monitor for Gmail:rschroll@gmail.com:INBOX (open_count=2 remote_opened=true)...
[deb] 20:32:47 0.003245 geary-controller.vala:293: Stopped conversation monitor for Gmail:rschroll@gmail.com:INBOX (open_count=1 remote_opened=true)
[deb] 20:32:47 0.013773 geary-controller.vala:306: Closing Gmail:rschroll@gmail.com:INBOX (open_count=1 remote_opened=true)...
[deb] 20:32:47 0.004879 imap-engine-minimal-folder.vala:883: Closing replay queue for Gmail:rschroll@gmail.com:INBOX (open_count=0 remote_opened=true) (flush_pending=true): ReplayQueue:Gmail:rschroll@gmail.com:INBOX (open_count=0 remote_opened=true) (notification=0 local=0 local_active=true remote=0 remote_active=false)
[deb] 20:32:47 0.000265 imap-engine-replay-queue.vala:452: ReplayQueue.do_replay_local_async ReplayQueue:Gmail:rschroll@gmail.com:INBOX (open_count=0 remote_opened=true) (notification=0 local=0 local_active=false remote=1 remote_active=false) exiting
[deb] 20:32:47 0.000047 geary-controller.vala:311: Waiting for Gmail:rschroll@gmail.com:INBOX (open_count=0 remote_opened=true) to close completely...
[deb] 20:32:47 0.000158 imap-engine-replay-queue.vala:561: ReplayQueue.do_replay_remote_async ReplayQueue:Gmail:rschroll@gmail.com:INBOX (open_count=0 remote_opened=true) (notification=0 local=0 local_active=false remote=0 remote_active=false) exiting
[deb] 20:32:47 0.000056 imap-engine-minimal-folder.vala:886: Closed replay queue for Gmail:rschroll@gmail.com:INBOX (open_count=0 remote_opened=true): ReplayQueue:Gmail:rschroll@gmail.com:INBOX (open_count=0 remote_opened=true) (notification=0 local=0 local_active=false remote=0 remote_active=false)
[deb] 20:32:47 0.000123 imap-engine-minimal-folder.vala:943: Folder Gmail:rschroll@gmail.com:INBOX (open_count=0 remote_opened=false) closed
[deb] 20:32:47 0.098990 imap-engine-minimal-folder.vala:989: Not reestablishing connection to Gmail:rschroll@gmail.com:INBOX (open_count=0 remote_opened=false): closed
[deb] 20:32:47 0.000294 geary-controller.vala:315: Closed Gmail:rschroll@gmail.com:INBOX (open_count=0 remote_opened=false)
[deb] 20:32:47 0.000119 geary-controller.vala:324: Closing account Gmail:rschroll@gmail.com
[deb] 20:32:47 0.000705 folder-list-account-branch.vala:122: Could not remove folder Gmail:rschroll@gmail.com:$GearySearchFolder$
[deb] 20:32:47 0.005886 geary-controller.vala:1306: Folder (null) selected
[deb] 20:32:47 0.002679 folder-list-account-branch.vala:122: Could not remove folder Gmail:rschroll@gmail.com:UMass/Sent (open_count=0 remote_opened=false)
Meant to add that this was on the new conversations branch, but merged with master. Don't think that matters, but who knows.
This log shows Geary hanging as it tries to close the Account object. The good news is, all the open Folders have been closed, meaning all "real" I/O has been flushed to the network and disk. The bad news is, I don't see any obvious place in the code path where the execution would hang. I did spot a couple of suspicious spots but nothing that was "ah-ha, that's it." I've pushed two commits to master and merged them into the conversations branch. One is a defensive commit that is unlikely to have anything to do with this hang but could bite us elsewhere. The other tightens up some locking semantics and adds debug output to ClientSessionManager, which is where I think the problem lies. Let's keep an eye out for this and see if it occurs again.
Created attachment 299878 [details]
debug log

Here's another non-composer hang.
I saw this exact hang this morning. Let me take a look.
Once more, with feeling. A new commit adds even more logging to try and pin this down. I did see a potential deadlock in ClientSessionManager which this commit should fix. If we see this again, I'll probably go with the nuclear option next. Pushed to master, commit f9e8502
I saw this again, but there were no shutdown debug messages logged. It appeared that the window was merely hidden. The last composer message in the log is this:

[deb] 12:02:56 0.370531 geary-controller.vala:2355: Creating composer of type COMPOSER_WIDGET_COMPOSE_TYPE_REPLY_ALL; 1 composers total

I had composed and sent a message with Geary just prior to closing it.
This morning I closed Geary and its last debug output was at imap-account.vala:158, which is emitted just prior to acquiring the account_session_mutex. This tells me that somewhere else the mutex was claimed and never released. The good news is, this might go toward solving a long-standing problem in Geary where the ImapEngine.GenericAccount object stops polling the server for changes to the folders. It's not a vital problem (at the moment) but means that when you close and restart Geary it spends a lot of time at startup resynchronizing with the server. In the optimal use case, Geary is synchronized (more or less) every time you exit. I'm looking closely at this right now but don't expect to hold up the 0.10 release for this unless something obvious magically pops up.
Unfortunately, this bug is serious enough that I need to push a hack to work around it. I've done just that: commit eb8fc9d3. This hack will forcibly close the process if the shutdown sequence takes longer than five seconds to complete. (This number can easily be tweaked if that's deemed too long or too short.) The good news is, my testing shows this solves the problem completely. (I tested by introducing a permanent hang in the shutdown code.) It doesn't kill the process if the app asks the user about discarding drafts, etc. The bad news is, this code hides bugs. There is at least one hang in the Engine (comment 14) and we know there are situations where composers hang around after they've been dismissed (which may be an Engine or a client bug, or both).

Robert, my proposal is to close this ticket and open a new one for the Imap.Account hang (or two if the composer problem isn't ticketed, although I recall that it is). Does this sound okay to you?
I don't think this hack will kill Geary in the phantom composer case. Note that in that case, we never get into the GearyApplication.exit() function. Instead, the controller just hides the window and waits for the last composer to close itself and trigger the exit. I tested this by disconnecting my network after a draft had been saved, and then closing the composer and Geary. It hid but kept running. (The good news is that it closed itself when I reconnected the network and it was able to delete the draft.) Actually, even though this isn't the phantom composer bug, I think it's another related bug -- we should be able to exit even when we've lost the network and can't complete draft actions. Maybe we need kill timers in the composers. As for this temporary fix -- perhaps we should revert it immediately after the release, so that we stop hiding bugs in the development code.
Created attachment 300478 [details] [review] Force app close if composers stuck open This starts a timer when we try to close any composers, which will force any open composers to close after a certain time. This introduces another interval variable; perhaps we should merge it with the one in the previous commit. Also, I wonder if we should elevate the messages from these to warnings.
> I don't think this hack will kill Geary in the phantom composer case.

Yeah, I didn't mean to imply in my last comment that this worked around the composer problem. I was thinking of the hang in the Engine.

> Maybe we need kill timers in the composers.

My concern with that (and with my shutdown hack) is that the low-level networking code has timeouts already. Any network request should eventually timeout. This is my fundamental concern with the Engine hang: it appears a network request has been initiated and never completes. That's not supposed to happen. So, yes, we could introduce a timer into the composer, but it shouldn't be part of the design.

> As for this temporary fix -- perhaps we should revert it immediately after
> the release, so that we stop hiding bugs in the development code.

My only concern is that we have a number of users who run out of trunk (or the Daily Build PPA), and working around this bug requires killing the process from the console or the system monitor. What I'm leaning toward now is to have a command-line option or an environment variable that disables it. We could use that while everyone else uses the timer.
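As a sketch of the disable idea only -- GEARY_DISABLE_EXIT_TIMEOUT and maybe_start_shutdown_watchdog() are made-up names, not anything Geary actually ships:

    // Skip arming the watchdog when a hypothetical environment variable is
    // set, so developer builds keep the underlying hang visible.
    private void maybe_start_shutdown_watchdog() {
        if (GLib.Environment.get_variable("GEARY_DISABLE_EXIT_TIMEOUT") != null) {
            debug("Shutdown watchdog disabled via GEARY_DISABLE_EXIT_TIMEOUT");
            return;
        }
        start_shutdown_watchdog();  // the watchdog from the earlier sketch
    }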
> This starts a timer when we try to close any composers, which will force any
> open composers to close after a certain time. This introduces another
> interval variable; perhaps we should merge it with the one in the previous
> commit.

This looks fine to me. I was going to release 0.10 today. The question is, do we release with or without this patch, and if we include it, do we risk it and go now, or try it out for a few days? Have you been running it over the weekend? Have you seen it kick in and work?

> Also, I wonder if we should elevate the messages from these to warnings.

Warnings automatically trigger gdb breakpoints, so we try to use them with care. I'm fine with these remaining as debug statements.
(In reply to Jim Nelson from comment #19)
> Have you been running it over the weekend? Have you seen it kick in and
> work?

I just tried closing Geary, after having had it open for a few days. There were seven phantom composers. This mechanism kicked in, but started kicking up critical errors about g_closure_unref and then dumped core before getting through them all. I suspect the problem is that we're removing entries from waiting_to_close as we're iterating through waiting_to_close. Presumably, I could solve this by making a copy of waiting_to_close and iterating through the copy, or by getting its length beforehand and then clearing items from the end.

I don't know if you should wait for a fix for this or not. On the one hand, it's clearly a bug. On the other, the user probably won't notice. They hit the close button and Geary goes away. They reopen it later and it pops back up, completely functional. Do they really care that it was running in the background between these two events? I'm fine going either way on this.
(In reply to Jim Nelson from comment #18)
> My concern with that (and with my shutdown hack) is that the low-level
> networking code has timeouts already. Any network request should eventually
> timeout.

Do you know how long these timeouts are offhand? I'm wondering if my test of turning off the network connection and then closing a composer lasted long enough to trigger them or not.
Created attachment 300621 [details] [review]
Force app close if composers stuck open

I tested this with three open composers and the network disabled. Everything worked as it should.
> I suspect the problem is that we're removing entries from
> waiting_to_close as we're iterating through waiting_to_close.

A common problem with Gee. Generally the way we get around it is to call Gee.Collection.to_array() and iterate over that, i.e.:

foreach (B blob in blobs.to_array())

I saw your attachment walking over the list backwards; that's fine too. (There's a short sketch of both patterns at the end of this comment.)

> Do they really care that it was running in the background between these two events?

I agree, the composer bug is different in that Geary is merely hidden and operating and can be brought back by re-execing the app. The other bug hangs the process in a really bad state. Let's hold off on your patch for now.

> Do you know how long these timeouts are offhand? I'm wondering if my test
> of turning off the network connection and then closing a composer lasted
> long enough to trigger them or not.

There are a few timeouts in play here. There is a generic command timeout (30s) that is triggered if the Imap.ServerResponse isn't received for a corresponding Imap.Command. In other words, command(s) need to be outstanding for this to happen. (Connections can sit idle for long periods of time in IMAP.) This used to be 15s, but we had a bug report from users with slow/spotty connections.

If the session has a folder selected, it has a 5m keepalive timeout. If the session has no folder selected (is at the "root", used for listing folders and detecting changes in them) the keepalive timeout is 10m. This timeout means that if no traffic has gone across the wire for that duration, a NOOP command is sent to check if the connection is still alive.

Finally, all IMAP TCP connections are configured at the socket layer with a 10m timeout. This is slightly redundant, as the command timeout and the keepalive timeout can detect connection loss more reliably. (Since Geary pools unused connections for reuse, this is a handy way of freeing connections after they've been unused for 10m.) It also has one more purpose: to time out the initial connect to the server. And now that I think about it, 10m is a helluva long time to wait for a connect. That is conceivably the Engine bug I've been chasing -- a connect hangs for up to 10m and when the user goes to exit, the app hangs due to a held mutex. I can't recall if I ever waited 10m when it occurred. I doubt I did.

Alas, it's too late for me to generate a fix for this today, so let's keep this ticket open for the following work to land post-0.10:

* Your patch
* An environment variable/cmdline arg to disable timeout(s)
* Reduced connect timeout
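For reference, a minimal self-contained sketch of the two safe iteration patterns mentioned above, with placeholder names rather than the actual client code (compiles with valac --pkg gee-0.8):

    void main() {
        var waiting_to_close = new Gee.ArrayList<string>();
        waiting_to_close.add("composer-1");
        waiting_to_close.add("composer-2");

        // Pattern 1: snapshot the collection with to_array(), then mutate
        // the original freely while iterating over the snapshot.
        foreach (string composer in waiting_to_close.to_array())
            waiting_to_close.remove(composer);

        // Pattern 2: walk backwards by index so removals never shift
        // elements that haven't been visited yet.
        waiting_to_close.add("composer-1");
        waiting_to_close.add("composer-2");
        for (int i = waiting_to_close.size - 1; i >= 0; i--)
            waiting_to_close.remove_at(i);
    }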
Shortly after writing my last comment I closed Geary and it hung with composers outstanding. I walked the log carefully and noticed something interesting: when Geary went to dispose of the draft, it never did the "true remove" step for Gmail, i.e. instead of simply deleting the draft, it gets moved to the Trash folder and deleted there, which ensures it's no longer present on the server. (This was introduced some time back.) This step goes through a similar code path as the other hang. There are other steps that could hang too, but it's possible this is hanging on the sessions table mutex as well. In short, these could all be the same bug at heart. I wrote up a quick patch that starts to refactor how the sessions mutex is used in an attempt to avoid holding it. I also want to reduce the connection timeout. Time is limited right now, but I'm thinking about this.
I have had Geary open for several days now, plenty long enough to generate a few phantom composers. I added an account to test another bug, and the conversation viewer was filled with one of those phantoms. I closed it, and it was replaced by another, and then another. Finally, I got the conversation viewer back. I still have a number of phantoms present though, so I'm not sure why those three were picked to be displayed. Weird.
As for Bug 731491, I haven't seen any phantom composers since the draft management rework landed, and I've just pushed some changes to master that I think finally fix the shutdown woes (commit 0125910 and commit 7ad941b). This, along with the better network connectivity handling that recently landed (Bug 789924), means that Geary is pretty reliably shutting down now when asked. As such, I'm going to close this as fixed; please open a new bug if any new issues crop up.
Reopening, sigh. There are still some folders not getting closed; maybe a suspend/resume cycle causes an error to be thrown and hence the folder is never closed. It would be good to fix these issues properly, but we might also need a force-close op for Folders.
*** Bug 788499 has been marked as a duplicate of this bug. ***
The fix for Bug 778276 should have further improved the situation here. I did see the new account synchroniser account operation getting wedged on a folder, but I think commit 7d1de0d should have solved that. Will see if I can repro it again over the next few days, but I'm pretty confident the network-related issues here should now be taken care of.
I think the last issue I've seen is indeed a TCP connection hang - Geary will wait for the 30s command timeout while flushing the command used to exit SELECTED state back to AUTHENTICATED when a folder gives back its connection to the pool, but the connection has already been lost. We probably want to flag the connection pool to not bother doing so when the account is closing.
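Roughly the shape of that idea, as a sketch with placeholder types rather than the real ClientSessionManager API:

    // Purely illustrative; DummySession/DummyPool are stand-ins, not Geary classes.
    class DummySession : GLib.Object {
        public void close_mailbox() { /* would issue CLOSE and wait for the response */ }
        public void drop_connection() { /* would just tear down the socket */ }
    }

    class DummyPool : GLib.Object {
        public bool account_closing = false;
        private Gee.ArrayList<DummySession> free_sessions = new Gee.ArrayList<DummySession>();

        public void release(DummySession session) {
            if (account_closing) {
                // Skip the SELECTED -> AUTHENTICATED flush; on a dead
                // connection that's what waits out the 30s command timeout.
                session.drop_connection();
                return;
            }
            session.close_mailbox();
            free_sessions.add(session);
        }
    }

    void main() { }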
The issue in the previous comment was addressed in commit 07bc441, just pushed to master, and I haven't seen a shutdown hang since, although I still need to test this on my super-unreliable cable connection at home when I get back there next week.
If there's a network connection when Geary starts but the server is not reachable, Geary still hangs on quit until the force-close timeout. It might be necessary to cause a session to be obtained to trigger this, though.
One other possibility is if there is a remote op in the replay queue that gets scheduled just after its remote session has hung, and the user chooses to quit before that times out. E.g. the user comes back to Geary after being away, their flaky Internet connection has caused the session to hang, they delete a message and then quit. Maybe some operation-progress feedback would help make this case clear?
*** Bug 796167 has been marked as a duplicate of this bug. ***
Pretty sure the remainder of the hangs were caused by GLib's networking code, and this has improved substantially in 2.60, so I have not seen one of these for a long time. Closing this again.