Bug 748223 - transition offline to online doesn't complete
Status: RESOLVED FIXED
Product: evolution
Classification: Applications
Component: Mailer
Version: 3.16.x (obsolete)
OS: Other Linux
Importance: Normal (priority), normal (severity)
Target Milestone: ---
Assigned To: evolution-mail-maintainers
QA Contact: Evolution QA team
Depends on:
Blocks:
 
 
Reported: 2015-04-21 01:20 UTC by Carl Schaefer
Modified: 2017-08-09 09:31 UTC
See Also:
GNOME target: ---
GNOME version: ---


Attachments
gdb stack trace output (29.56 KB, text/plain), 2015-04-21 01:20 UTC, Carl Schaefer
another stack trace (27.99 KB, text/plain), 2015-04-21 13:37 UTC, Carl Schaefer
stack trace w/glib2 debug (30.85 KB, text/plain), 2015-04-23 04:53 UTC, Carl Schaefer
Requested backtrace (without glib2 debug package) (27.14 KB, text/plain), 2016-07-04 11:48 UTC, Rann Bar-On
Updated backtrace with libcamel and libglib debugging symbols (34.50 KB, text/plain), 2016-07-04 14:27 UTC, Rann Bar-On
Backtrace with two stuck connections, in case it's useful (34.93 KB, text/plain), 2016-07-04 17:22 UTC, Rann Bar-On

Description Carl Schaefer 2015-04-21 01:20:51 UTC
Created attachment 302041 [details]
gdb stack trace output

[seen with 3.16.1 on Arch]

when Evolution goes offline then online (for example, when my
laptop suspends and resumes) it usually can't finish going online;
there will be one server connection that is stuck in the bar at the
bottom (though not always the same server), e.g.

  Reconnecting to 'Gmail' (cancelling)

after which point fetching mail and sending mail work, but loading
images (Ctrl-I) doesn't, and online/offline toggling appears to do
nothing.  I have to close and restart Evolution to restore normal
operation.  In this state, closing doesn't take effect right away;
after 60 seconds a dialog box appears:

    Close Evolution with pending background operations?

    Evolution is taking a long time to shut down, possibly due to
    network connectivity issues. Would you like to cancel all pending
    operations and close immediately, or keep waiting?

    Close Immediately     Keep Waiting

This problem happens most times that Evolution goes offline, but seems
more likely when a server folder is selected (as opposed to one from
"On This Computer").

Milan asked for the output of 'gdb --batch --ex "t a a bt"' when running evolution and evolution-data-server built with debugging symbols, which I've attached.  Let me know if I can provide other potentially useful information.
Comment 1 Carl Schaefer 2015-04-21 13:37:29 UTC
Created attachment 302068 [details]
another stack trace

this one is the same problem, but the "server" that Evolution is stuck reconnecting to is "On This Computer".  I've only seen this happen once; every other time it's been a remote IMAP server.
Comment 2 Milan Crha 2015-04-22 09:35:46 UTC
Thanks for the bug report. It seems to me that you've run out of free threads in the GTask thread pool, which is limited to 10 running threads. Some operations use these GTask threads but also require other operations to run in yet another thread, so one operation can require two (sometimes more) threads from the GTask pool. If enough accounts are configured and enabled, they can use up all the free GTask threads and make the follow-up threads starve in the GTask thread pool. This issue existed in the past, and it keeps coming back with new releases of GLib.

What is your exact version of GLib (glib2), please? Could you install the debuginfo package for it and gather the backtrace again? That will show whether this is the GTask thread-starvation issue or not.
Comment 3 Carl Schaefer 2015-04-23 04:53:17 UTC
Created attachment 302196 [details]
stack trace w/glib2 debug

my version of glib2 is 2.44.0-1
Comment 4 Milan Crha 2015-04-23 10:20:39 UTC
Thanks for the update. This backtrace looks slightly different. There are 6 imapx_parser_thread threads running, then 4 threads waiting for services to connect, and finally 3 IMAPx threads in an IDLE state (that is when the IMAPx account has "Listen for server change notifications" enabled). There is no starvation in the GTask thread queue, as it had seemed from the previous backtrace.

I'll try to reproduce it here and get back to you when/if I find anything.
Comment 5 Milan Crha 2015-04-23 17:16:13 UTC
I tried to reproduce this by enabling and disabling my WiFi connection (the only active one on the machine) and, as always, I wasn't able to reproduce it. I had 8 accounts enabled, 3 of which require a VPN, which I didn't run, and I didn't get any "freeze" similar to what you face.
Comment 6 theseer 2015-04-23 20:55:53 UTC
I cannot explicitly and reliably force this to happen on my (affected) machine either.

The only pointer I can give so far is that, when it does happen, it mostly happens when I resume from suspend. Simply enabling and disabling WiFi, or switching routes (e.g. by enabling and disabling a VPN), doesn't seem to trigger it.

A bold assumption (by no means supported by any research or debugging) would be that Evolution is simply starting too early, before DNS and/or routing work, so threads pile up. The only support for this notion is that it's more likely to fail/hang when the network has changed, e.g. it got suspended in network A and resumed in network B.

Maybe someone else can confirm this behavioral pattern/observation?
Comment 7 theseer 2015-04-23 21:04:53 UTC
In case it's of any help: I have 12 accounts (including "On This Computer") but only 4 are active (5, if you count "On This Computer"). 3 accounts are of type IMAPx, configured to use SSL on a dedicated port; they all hit the same server using two different hostnames. 1 active account is of type "None", to allow sending under that name.
Comment 8 Carl Schaefer 2015-04-24 01:04:05 UTC
(In reply to theseer from comment #6)
> I cannot explicitly and reliably "enforce" this to happen on my (affected)
> machine either.

nor can I, though it seems to happen most of the time

> A bold (and by no means supported by any research or debugging) assumption
> would be that it's simply starting to early, not having a DNS and/or route
> working yet so threads pile up. The only support for this notion would be
> that it's more likely to fail/hang when the network changed, e.g. it got
> suspended in network A and resumes in network B.
> 
> Maybe someone else can confirm this behavioral pattern/observation?

I can reproduce it with no change in my laptop's network state by toggling Evolution offline then online, though the problem does seem more likely to happen after resume from suspend.
Comment 9 Milan Crha 2015-04-24 04:30:46 UTC
(In reply to theseer from comment #6)
> A bold (and by no means supported by any research or debugging) assumption
> would be that it's simply starting to early, not having a DNS and/or route
> working yet so threads pile up.

I hope that's not the case, because 3.16.x sets a timeout on the network change, so that "network discovery" doesn't start right after the change is noticed, but rather later, to give the connection time to be fully established and to work with it only after it is fully initialized. The timeout is about 5 seconds, if I recall correctly.
Comment 10 theseer 2016-01-13 15:57:17 UTC
Just a small update: I was researching this problem the other day and found a seemingly related issue where the solution was to compare the account source files at the raw level (https://mail.gnome.org/archives/evolution-list/2015-April/msg00090.html).

Comparing various accounts showed that the "problematic" one had a relatively high concurrent-connections value of 7. I reduced it to 5, since that was the value for the other accounts, and so far the problem seems to have vanished.
(vanished == didn't re-occur within the last couple of days)
Comment 11 Carl Schaefer 2016-01-16 03:58:52 UTC
I don't think I've seen this problem since upgrading to 3.18
Comment 12 theseer 2016-01-16 13:22:13 UTC
I did have it happen to me even with 3.18, but so far not after adjusting the settings.
Comment 13 Rann Bar-On 2016-07-04 02:38:13 UTC
This is happening for me in 3.20.3 in accounts with only 3 concurrent connections.
Comment 14 Milan Crha 2016-07-04 10:44:29 UTC
Hi Rann, could you install debuginfo packages for evolution-data-server, evolution, glib2, glib-networking, and possibly also the gnutls library, and then capture a backtrace while evolution is stuck waiting, please?

You can get the backtrace with a command like this:
   $ gdb --batch --ex "t a a bt" --pid=`pidof evolution` &>bt.txt
Please check bt.txt for any private information, like passwords, email addresses, server addresses, ... I usually search for "pass" at least (quotes for clarity only).

The evolution-data-server 3.20.4, to be released next week, contains
this [1] commit, which can be related, though the backtrace in comment #3
doesn't contain that particular function.

[1] https://git.gnome.org/browse/evolution-data-server/commit/?id=594c548fa8
Comment 15 Rann Bar-On 2016-07-04 11:33:14 UTC
I'd be glad to. I'm running Debian, and have added the Automatic Debug Packages repos (https://wiki.debian.org/AutomaticDebugPackages). However, glib2 does not appear there, so I can't get a debug package for that for now.
Comment 16 Rann Bar-On 2016-07-04 11:34:07 UTC
(or maybe I'm entirely misunderstanding you!)
Comment 17 Rann Bar-On 2016-07-04 11:48:05 UTC
Created attachment 330843 [details]
Requested backtrace (without glib2 debug package)
Comment 18 Milan Crha 2016-07-04 14:14:40 UTC
Thanks for the update. It could be that your distribution packages the glib library differently, possibly as libglib (since /lib/x86_64-linux-gnu/libglib-2.0.so.0 can be seen in the backtrace).

Your backtrace is partly useful, partly weird. Again, it may be that your distribution simply packages things that way (I know Fedora, which packages them differently). I see that the libcamel symbols are missing, even though libcamel is part of evolution-data-server just like the IMAPx code, which does show line numbers.

Anyway, the backtrace shows two threads disconnecting, one connecting, and three threads reading (or waiting for) data from a server. I'd guess you're facing the issue fixed by commit [1], but that is part of evolution-data-server 3.20.2 and later, which you should have installed if you use 3.20.3 (unless you have 3.20.3 evolution but an older evolution-data-server).

[1] https://git.gnome.org/browse/evolution-data-server/commit/?h=gnome-3-20&id=b59863d88
Comment 19 Rann Bar-On 2016-07-04 14:26:21 UTC
Thank you!

I have 3.20.3 installed.

$ dpkg -s evolution-data-server
Package: evolution-data-server
Status: install ok installed
Priority: optional
Section: gnome
Installed-Size: 1906
Maintainer: Debian Evolution Maintainers <pkg-evolution-maintainers@lists.alioth.debian.org>
Architecture: amd64
Version: 3.20.3-1

I've installed the debugging symbols for libcamel and libglib. I'll attach the new backtrace in a second.
Comment 20 Rann Bar-On 2016-07-04 14:27:19 UTC
Created attachment 330847 [details]
Updated backtrace with libcamel and libglib debugging symbols
Comment 21 Rann Bar-On 2016-07-04 17:22:50 UTC
Created attachment 330860 [details]
Backtrace with two stuck connections, in case it's useful
Comment 22 Milan Crha 2016-07-07 15:11:41 UTC
Thanks for the update. I see that Thread 19 is waiting for a lock which it already holds; that causes the deadlock. I've fixed it for the next release. I'm not closing this bug yet, because I'm not sure whether it fixes the initial report.

Created commit_8bbffdb in eds master (3.21.4+) [1]
Created commit_e6dca36 in eds gnome-3-20 (3.20.4+)

[1] https://git.gnome.org/browse/evolution-data-server/commit/?id=8bbffdb
Comment 23 André Klapper 2017-08-08 20:48:31 UTC
(In reply to Milan Crha from comment #22)
> Thanks for the update. I see that the Thread 19 is waiting for a lock which
> it already holds. That causes the deadlock. I fixed it for the next release.
> I'm not closing this bug yet, because I'm not sure whether it fixes the
> initial report.

Carl, Rann, theseer: Does the problem still happen in 3.22 or later?
Comment 24 Carl Schaefer 2017-08-08 21:59:18 UTC
I can't reproduce this problem with 3.22, and haven't seen it happen in a long time
Comment 25 theseer 2017-08-09 08:55:13 UTC
As I cannot really force this to happen, it's hard to say whether or not the fix did the trick. I can confirm, though, that I don't recall any hangs in this regard for quite a while.

Currently using 3.24.4 (3.24.4-1.fc26) on Fedora 26, obviously.

I'll report in case it happens again.
Comment 26 André Klapper 2017-08-09 09:31:47 UTC
Thanks for the quick feedback everybody!
Based on the last two comments and comment 22, I declare victory on this ticket for the time being. If it happens again, please file a new ticket.