GNOME Bugzilla – Bug 698912
Deadlocks in get_connection() in soup-session.c
Last modified: 2013-07-12 22:14:00 UTC
I wrote a gravatar.com module for Evolution that uses the SoupRequest API. https://git.gnome.org/browse/evolution/diff/modules/gravatar/e-gravatar-photo-source.c?id=91aee2a805d70c9f7a5f0d089139fbfc710939c1 It's pretty simple; just calls soup_request_send() with a SoupSessionSync. Works fine with libsoup 2.40, but with libsoup 2.42 I'm getting very frequent deadlocks that all look like this:
+ Trace 231856
Thread 111 (Thread 0x92cfbb70 (LWP 1053))
It's not freezing Evolution, but these deadlocked threads keep piling up.
For the time being I worked around the deadlocks by creating a new SoupSession instance for each SoupRequest sent. Not very efficient, but seems to work.
(In reply to comment #1) > For the time being I worked around the deadlocks by creating a new SoupSession > instance for each SoupRequest sent. Not very efficient, but seems to work. Do you use soup_session_abort()? If so, do the deadlocks only occur after doing an abort?
I was not calling soup_session_abort() myself, but perhaps it's used internally within libsoup to implementation cancellation of soup_request_send()? Before the work around, the context was simply a central SoupSessionSync with a SoupRequester feature that I was using to create and submit SoupRequests from various worker threads. g_cancellable_cancel() is called on unfinished requests when the user moves on to another email message, so cancellation is indeed frequent. I did not notice whether cancellation triggered the deadlocks, but it seems plausible.
Tried reproducing this and failed... Is it possible you are leaking streams somewhere? (I tried tracing through the logic of e-photo-cache, but it's pretty hairy, and then you have to consider all of its callers too...) Anyway, when using the streaming APIs, it doesn't actually read the response body from the server until you read from the returned GInputStream. So if you leak the stream without reading from it or closing it first, then the message will remain in-progress forever, and so will never yield its SoupConnection back to the pool, and then after this happens enough times, you'll have hit the max-connections limit, and so all further requests will end up blocked in get_connection() waiting for the situation to change, which it never will. This is a pretty bad failure mode, so libsoup ought to be a little more clever here...
I've worked around the issue (poorly) in the current 3.9 code, but was able to reproduce the deadlock quite easily in the commit cited in comment #0. Is that the revision you were testing? I suppose it's possible the stream is being leaked. e-photo-cache is invoked to handle SoupRequests with a custom "mail://contact-photo/..." URI. The resulting stream is handed off to WebKit via SoupRequest.send_finish() and after that I don't know what happens to it. You can find this higher-level logic in evolution/mail/e-mail-request.c, which should be a little easier to trace through.
(In reply to comment #5) > I've worked around the issue (poorly) in the current 3.9 code, but was able to > reproduce the deadlock quite easily in the commit cited in comment #0. Is that > the revision you were testing? No, I just wrote a program that made lots of requests simultaneously in lots of threads, and cancelled some, etc, but it always completed fine. I just tried jhbuilding evo, but I'm not sure it's actually working. I can't find any place to turn the gravatar support on, and I can't get it to successfully connect to my IMAP server anyway. And trying to install the rawhide packages wants to pull in the entire universe... > You can find this higher-level logic in evolution/mail/e-mail-request.c, which > should be a little easier to trace through. Yeah, but that doesn't help if the leak is happening in the lower-level parts.
Okay thanks. I'll proceed then on the assumption that Evolution is leaking streams somewhere and see if I can track it down before I pester you further.
Starting to see the same SoupSession hang in other contexts now. For example, bug 702747. Maybe we're leaking streams there too.
Another case, reported downstream: https://bugzilla.redhat.com/show_bug.cgi?id=976529
(In reply to comment #9) > Another case, reported downstream: > https://bugzilla.redhat.com/show_bug.cgi?id=976529 In gdata, which doesn't use the streams API, so maybe we're leaking streams internally...
tried and failed to reproduce this again. How do I reproduce the gravatar case? What do I have to do to cause evolution to try to load gravatars? (I've updated to F19 now, so I should be able to rebuild it with the broken version of the code.)
Revert commit 1b2104e in Evolution and in the Headers tab of Mail Preferences, enable: [x] Show the photograph of sender in the message preview (There's no switch for Gravatar photos specifically.) When last I tried this with libsoup 2.42, the deadlocks were frequent. I'm still trying to get around to checking our streams for reference leaks. Sorry for the delay.
(In reply to comment #12) >I'm still trying to get around to checking our streams for reference leaks. Easy way to do this would be to add printfs to soup_body_input_stream_init() and _finalize(). But as mentioned above, the rhbz bug implies the problem is internal to libsoup.
probably a dup of bug 703463, which is now fixed in git. If you could verify that, that would be great
After reverting commit 1b2104e in Evolution and building libsoup from git (version 2.43.2, revision 2081c31), I can still reproduce the deadlocked get_connection() calls almost immediately. Eventually we starve the thread pool, the message pane stops updating and the status bar fills with "Retrieving message" tasks that never finish and don't respond to cancellation. I added some debug output to our send_finish() method in e-mail-request.c and in all cases the reference count on the returned GMemoryInputStream was 2: one reference being transferred to the caller and another held by our internal GSimpleAsyncResult. Also verified the returned streams *are* being finalized. So bug is still present and doesn't appear to be a stream reference leak. Happy to try any other debugging steps you can think of.
(In reply to comment #15) > After reverting commit 1b2104e in Evolution and building libsoup from git > (version 2.43.2, revision 2081c31), I can still reproduce the deadlocked > get_connection() calls almost immediately. Just tried this: with "old" libsoup, I can reproduce the hangs, and gdb shows they're hanging in exactly the same way as bug 703463. ("p soup_connection_get_state (priv->connections->data)" in a stack frame with "SoupSessionPrivate *priv" returns SOUP_CONNECTION_REMOTE_DISCONNECTED). With the patched libsoup, I get no hangs. So I'm thinking you must have been in some sort of jhbuild dll hell where it wasn't pulling in the libsoup.so you thought it was? I'm building rawhide and f19 packages now. I assume you'll get mail from koji when they're done. marking DUPLICATE, but reopen if it's really still not fixed for you. *** This bug has been marked as a duplicate of bug 703463 ***
I don't use jhbuild, but it's entirely possible I was pulling in the wrong libsoup.so. I'll retest with that in mind. Thanks for looking into this.