Bug 708006 – libsoup fails to close connection when message cancelled

After an evaluation, GNOME has moved from Bugzilla to GitLab. Learn more about GitLab.
No new issues can be reported in GNOME Bugzilla anymore.
To report an issue in a GNOME project, go to GNOME GitLab.
Do not go to GNOME Gitlab for: Bluefish, Doxygen, GnuCash, GStreamer, java-gnome, LDTP, NetworkManager, Tomboy.

Bug 708006 - libsoup fails to close connection when message cancelled


Summary:	libsoup fails to close connection when message cancelled


Status:	RESOLVED FIXED

Product:	libsoup
Classification:	Core
Component:	HTTP Transport
Version:	2.42.x
Hardware:	Other Linux

Importance:	Normal normal
Target Milestone:	---
Assigned To:	libsoup-maint@gnome.bugs
QA Contact:	libsoup-maint@gnome.bugs

URL:
Whiteboard:

Depends on:
Blocks:

Reported:	2013-09-13 07:27 UTC by Milan Crha
Modified:	2013-09-13 14:08 UTC

See Also:
GNOME target:	---
GNOME version:	---

Attachments
debug output (267.96 KB, text/plain) 2013-09-13 10:56 UTC, David Woodhouse		Details
soup-connection: fix the "reusable" flag after a message restart (1.80 KB, patch) 2013-09-13 13:07 UTC, Dan Winship	none	Details \| Review

Description Milan Crha 2013-09-13 07:27:55 UTC

Moving this from a downstream bug report:
https://bugzilla.redhat.com/show_bug.cgi?id=1007321

Description of problem:
Evolution show another message

Demonstration video: https://docs.google.com/file/d/0B0nwzlfiB4aQMkF5S1lDMjJTMWs/edit?usp=sharing

------------------------------------------------------------------------------

In turned out that the internal cache has stored a wrong message for the selected in the video. Deleting the message from a cache and let evo-ews download it again shows correct message.

Even it is not fully reproducible, it would be good to investigate it anyway. Possible causes, which comes on mind are thing like lost connection in the middle of message download or similar, but I have no clue at all.

Mikhail, I know you were facing some level of instability of the evolution with evo-ews in the recent past. Is it still so unstable for you, or it helped with disabled/removed plugins from a web browser share folder? (I think it was the main cause). The crash as such might not cause this cache file replacement, simply because the process doesn't have time or any chance to save the message. Another reason might be a collision in the cache hash function, which would be quite interesting. I asked in the downstream bug report about checking both messages, the correct and the wrong, whether download of one will replace body of the other. Did you manage to find both messages from the video?

Comment 1 David Woodhouse 2013-09-13 09:53:13 UTC

I've seen this, but I've not previously managed to reproduce it. I think it might happen when downloading the message is slow, so I navigate *away* from a message while the preview pane still says it's downloading. 

Hm yes, big set of new messages in my inbox this morning. I select the top one and it starts loading. Hit the down cursor a few times, watching it *start* loading the message each time.

Dammit, should have had coffee first. I think I then got to the bottom one and then went back *up* again, paused... and now I'm looking at the penultimate message in the index, but the *body* I see is that of the final message.

FWIW I have also seen this behaviour with the ActiveSync back end, probably in similar circumstances.

Comment 2 David Woodhouse 2013-09-13 10:07:35 UTC

Right now I don't seem to be able to reproduce with ActiveSync. The EAS back end still continues to fetch the message and store it in cache after we navigate away, while EWS seems to cancel the fetch so when I come back to the message later, it's still not fetched.

Comment 3 David Woodhouse 2013-09-13 10:34:50 UTC

I've added some debugging in camel_ews_folder_get_message() and this problem doesn't appear to be on the camel side. We make a request for the message we *wanted*, we get back the response with the MimeContent replaced by a filename, and the contents of the file (printed right there) are the *wrong* message.

Next I'll take a closer look at the 'content stealing' in e-soap-message.c

Comment 4 David Woodhouse 2013-09-13 10:56:52 UTC

Created attachment 254847 [details]
debug output

It's not the content stealing either. It looks like libsoup might be giving us back the *wrong* response to our query. Perhaps there's some confusion in the way we cancel requests?

In this example, 'Message 5' (ItemId ending v80nEAAA=) is the one that's corrupted in my cache.

Instead of Message 5, I see Message 9 (ItemId ending v80nDAAA=).

Look for ESoapMessage 38 and ESoapMessage 39 in the attached log.

ESoapMessage 38 is a fetch for ItemId v80nDAAA= (mail 'Message 9') which never gets a response. Did we cancel it? Does it matter that we cancelled it *while* libsoup was doing the retry-with-NTLM-Authorization-header dance? 

ESoapMessage 39 is a fetch for ItemId v80nEAAA= (mail 'Message 5'). Take a close look at the ItemId which is specified in the response that libsoup hands us. It's v80nDAAA= — the ItemId of the message requested by the *earlier*, presumably aborted, request.

And thus we end up with 'Message 9' in our cache where 'Message 5' should be.

This looks like it might be an issue either with libsoup, or with the way we are cancelling requests. Dan?

Comment 5 David Woodhouse 2013-09-13 11:24:39 UTC

There is a potential race condition in ews_cancel_request() — we check if the job is on the active_jobs_list, then *drop* the queue lock, then get it again a little while later and act on the result of our check. The message might have been queued by then though, surely?

But that isn't the cause of what I'm chasing. This looks like a libsoup issue, I think. If we cancel a message while it's doing the NTLM authorisation resend of that message, it gets a *response* to the cancelled message, and attributed it to the *next* request we submit (the *actual* response to that next request never seems to arrive).

It goes like this...

ews_next_request queues message 0x2845190
> POST /EWS/Exchange.asmx HTTP/1.1
> Soup-Debug: SoupSessionAsync 1 (0x7fdfc83f6290), ESoapMessage 36 (0x2845190), SoupSocket 3 (0x11d0f00)
> Authorization: NTLM <type 1>
>  ... <Get item v80nZAAA=>

< HTTP/1.1 401 Unauthorized
< Soup-Debug: ESoapMessage 36 (0x2845190)
< WWW-Authenticate: NTLM <type 2>

> POST /EWS/Exchange.asmx HTTP/1.1
> Soup-Debug: SoupSessionAsync 1 (0x7fdfc83f6290), ESoapMessage 36 (0x2845190), SoupSocket 3 (0x11d0f00), restarted
> Authorization: NTLM <type 3>
>  ... <Get item v80nZAAA=>

ews_cancel_request cancels message 0x2845190 (found 1)
ews_connection_schedule_cancel_message schedules cancellation of message 0x2845190
ews_connection_scheduled_cb cancels SouplMessage 0x2845190
ews_next_request queues message 0x7fdfc40059b0

> POST /EWS/Exchange.asmx HTTP/1.1
> Soup-Debug: SoupSessionAsync 1 (0x7fdfc83f6290), ESoapMessage 37 (0x7fdfc40059b0), SoupSocket 3 (0x11d0f00)
 ... <Get item v80naAAA=>

< HTTP/1.1 200 OK
< Soup-Debug: ESoapMessage 37 (0x7fdfc40059b0)
 ... <GetItemResponse v80nZAAA=>


And again, we have the *wrong* response.

Comment 6 David Woodhouse 2013-09-13 11:29:36 UTC

I need to get the timing right to reproduce this. I do seem to need to cancel the message *while* it's in the third phase of the NTLM authorisation, resending the message which is being cancelled. And my latest debug run does seem to have observed the *next* request getting an out-of-sync response too; it gets the result that the above ESoapMessage 36 *should* have got.

Comment 7 David Woodhouse 2013-09-13 12:06:35 UTC

Expanding on the debugging a little (different run obviously so pointers are different but the correlation should be obvious):

ews_cancel_request cancels message 0x7f7cdc005930 (found 1)
ews_connection_schedule_cancel_message schedules cancellation of message 0x7f7cdc005930
ews_connection_scheduled_cb cancels SouplMessage 0x7f7cdc005930
soup_session_cancel_message msg 0x7f7cdc005930 found 0x7f7cf4010900
Not finished; call cancel_message
soup_session_async_cancel_message msg 0x7f7cdc005930
soup_session_real_cancel_message msg 0x7f7cdc005930
IO for message 0x7f7cdc005930 is in progress. Shouldn't we close the connection?
ews_next_request queues message 0x1ecd690



So we call into soup_session_cancel_message, which sees that the message is not in state SOUP_MESSAGE_FINISHED, and calls soup_session_async_cancel_message() which calls soup_session_real_cancel_message().

In soup_session_async_cancel_message() we *see* that I/O is in progress on this message. That's my "shouldn't we close?" debug message.

The documentation for soup_session_cancel_message() suggests that "if
@session has started sending the request but has not yet received
the complete response, then it will close the request's connection".

But nothing closes the connection. The next requst is using the same SoupSocket.

Comment 8 David Woodhouse 2013-09-13 13:03:24 UTC

The SoupConnection class maintains a record of whether a connection is reusable (priv->reusable). When we submit a request, this is set to FALSE in the set_current_msg() function.

When the response is received, it's set back to TRUE.

All of which makes sense.

However, if we call soup_connection_send_request() to *restart* a message which was previously sent, it doesn't call set_current_msg(), and priv->reusable doesn't get cleared. So when the message is subsequently cancelled, the connection isn't aborted. It's reused, with results as described above.

This would fix it, but Dan may want to move this *into* the set_current_msg() function and call that function unconditionally from soup_connection_send_request()? That might be nicer.

--- a/libsoup/soup-connection.c
+++ b/libsoup/soup-connection.c
@@ -971,5 +975,8 @@ soup_connection_send_request ()
 
        if (item->msg != priv->current_msg)
                set_current_msg (conn, item->msg);
+       else
+               priv->reusable = FALSE;
+
        soup_message_send_request (item, completion_cb, user_data);
 }

Comment 9 Dan Winship 2013-09-13 13:07:10 UTC

Created attachment 254858 [details] [review]
soup-connection: fix the "reusable" flag after a message restart

When a message got restarted, the "reusable" flag on the connection
wasn't getting cleared, so the connection would think it was still
reusable even if the second request got cancelled part way through.

=====

Try this? I was trying to create a test case to reproduce it, but I
haven't managed to yet...

Comment 10 David Woodhouse 2013-09-13 13:23:32 UTC

Yes, that also seems to do the trick. I can see the same sequence of events (submit message, 401, resubmit, cancel, submit new message) and we *are* correctly tearing down the connection when the pending message is cancelled.

Comment 11 David Woodhouse 2013-09-13 14:08:00 UTC

We both fixed it simultaneously. After IRC consultation, the simpler patch from comment 8 has been committed to master (b63d0847) and gnome-3-8 (33c015ed)