GNOME Bugzilla – Bug 591739
big libsoup i/o rewrite
Last modified: 2012-09-18 17:38:31 UTC
We have a new class. Let's call it "SoupConnection" for now, although the word "Connection" might change, and it has really nothing at all to do with the class that's currently called "SoupConnection". Anyway, it is an abstract class with two implementations: SoupNetworkConnection (which talks to HTTP servers) and SoupCacheConnection (which talks to local disk). SoupCacheConnection is a subclass of SoupNetworkConnection and can just fall back to the network-y methods when the cache misses.

SoupSession takes a SoupMessage, acquires an appropriate SoupConnection, and hands the message to it. If the request has a body, the session gets the SoupConnection's GOutputStream and writes the body to it. The SoupConnection either sends or looks up the request (depending on whether it's a network or cache connection) and fills in the message's status and headers. SoupSession then reads from the GInputStream and fills in the message's SoupMessageBody. The exact mechanics of all this may change.

The input and output streams potentially pass through several layers of filters. Eg, the input stream stack looks something like:

- GSocketInputStream / GFileInputStream
  (data stream consists of unprocessed server data)
- optional lame-server-bugfixing filter
  (data stream is now assumed to be a syntactically valid HTTP-message)

(HTTP parsing occurs)
(filters past this point only see the message-body, not headers)

- Transfer-Encoding filters (automatically added if needed)
  (chunked decoding occurs)
  (filters past this point see the entity-body)
- Content-Encoding filters (automatically added if needed)
  (filters past this point see the actual payload)
- optional misc filters (eg, SoupContentSniffer)

SoupContentSniffer would be redone as a filter; rather than needing all kinds of special hooks in the I/O path, it would just buffer its input until it had enough to sniff, and then after sniffing it would pass everything through transparently.

The "optional lame-server-bugfixing filter" is for stuff like LF->CRLF translation, etc; this lets people who need to deal with broken (mostly embedded) HTTP implementations work around them themselves rather than needing libsoup changes (eg, bug 588658).

Despite what the diagram above claims, SoupCacheConnection probably skips the first few steps of the filter stack, since cache entries would be pre-bugfixed and transfer-decoded.

There will be new APIs that let you access the input/output streams directly rather than mucking with SoupMessageBody; in fact, the streams are the preferred interface. SoupMessageBody will still be there for the simple deal-with-it-all-at-once case, but if you're doing streaming, you should use the streams rather than trying to deal with the signals and SoupBuffers and stuff.
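As a very rough illustration of the stream-based style (a sketch only; soup_message_get_response_stream() is a made-up name standing in for whatever accessor the new API ends up exposing):

#include <gio/gio.h>
#include <libsoup/soup.h>

/* Hypothetical: read the response body through the filtered GInputStream
 * instead of waiting for SoupMessageBody to be filled in. */
static void
read_body_from_stream (SoupMessage *msg, GError **error)
{
  GInputStream *stream;
  char buffer[8192];
  gssize nread;

  /* Made-up accessor for the stream at the top of the filter stack
   * described above (transfer- and content-decoded payload). */
  stream = soup_message_get_response_stream (msg);

  /* Plain GIO reading loop; the filters below us have already handled
   * chunked decoding, Content-Encoding, etc. */
  while ((nread = g_input_stream_read (stream, buffer, sizeof (buffer),
                                       NULL, error)) > 0) {
    /* process 'nread' bytes of payload here */
  }

  g_input_stream_close (stream, NULL, NULL);
  g_object_unref (stream);
}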
I've recently been thinking about SoupMessage. Originally a SoupMessage was just an HTTP request + response, but it grew cache handling and redirects and whatnot, and gained the ability to restart and all this. It would make sense IMO to have a (most likely not exported) soup object that represents an actual message matching what the HTTP RFC calls a message, so that a request is one message and a response is one message. That way you could make a request/response cycle iterate through many messages easily without having to emit weird signals.

It's not well thought out, but here's an example of what an HTTP "request" would look like:

1) construct a SoupRequestMessage (maybe a SoupGetMessage or a SoupPostMessage or whatever)
2) set headers on the message
3) send the message
4) parse enough of the reply (probably headers are needed, maybe status code is enough)
5) construct a SoupResponseMessage subtype based on status code (maybe on headers, too?):
   - 200 gets you a normal SoupResponseMessage
   - 304 gets you a SoupCacheMessage
   - 301/302 gets you a SoupRestartMessage
   - 401 gets you a SoupAuthorizationMessage
   - ...
6) if the message queues a reload, goto 4)
7) grab the output stream of the message and do the things that you listed above

This would make the internal API quite a bit saner, as code would be separated into separate chunks rather than scattered everywhere.
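Purely as an illustration of step 5, a sketch of the status-code dispatch; none of these types or constructors exist, they're just the hypothetical classes named above:

/* Hypothetical factory from the proposal: pick the SoupResponseMessage
 * subtype based on the parsed status code. */
static SoupResponseMessage *
create_response_message (guint status_code)
{
  switch (status_code) {
  case 304:
    return soup_cache_message_new ();          /* serve from the cache */
  case 301:
  case 302:
    return soup_restart_message_new ();        /* follow the redirect */
  case 401:
    return soup_authorization_message_new ();  /* run authentication */
  default:
    return soup_response_message_new ();       /* normal 200-style response */
  }
}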
On a completely unrelated note: I wish SoupSession would do away with the SoupSessionSync vs SoupSessionAsync stuff and behave more like gio.
(In reply to comment #2)
> I wish SoupSession would do away with the SoupSessionSync vs SoupSessionAsync
> stuff and behave more like gio.

More like which parts of gio? Note that both subclasses implement both send_message (sync) and queue_message (async), so part of the problem is just that the names are wrong. "SoupSessionSync" should be "SoupSessionThreaded" (because it assumes you'll have multiple threads, each taking full responsibility for a single message at a time) and "SoupSessionAsync" should be "SoupSessionMainloopBased" (because it's built around a single main loop that processes the entire queue every time it runs). Putting them into the same class would be a mess.

(Although SoupSessionSync really needs to pay a bit more attention to SoupMessageQueue, like SoupSessionAsync does... right now, once you exceed max-connections/max-connections-per-host, messages get processed in essentially random order.)
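For reference, the two entry points look like this in the current libsoup 2.x API (a minimal sketch, error handling omitted):

#include <libsoup/soup.h>

/* Synchronous path (send_message): blocks the calling thread until the
 * response is complete. */
static void
fetch_sync (SoupSession *session)
{
  SoupMessage *msg = soup_message_new ("GET", "http://example.com/");
  guint status = soup_session_send_message (session, msg);  /* blocks */
  g_print ("got status %u\n", status);
  g_object_unref (msg);
}

/* Asynchronous path (queue_message): returns immediately and invokes the
 * callback from the main loop when the message finishes. */
static void
finished_cb (SoupSession *session, SoupMessage *msg, gpointer user_data)
{
  g_print ("got status %u\n", msg->status_code);
}

static void
fetch_async (SoupSession *session)
{
  SoupMessage *msg = soup_message_new ("GET", "http://example.com/");
  /* queue_message takes ownership of msg, so no unref here. */
  soup_session_queue_message (session, msg, finished_cb, NULL);
}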
I think I found the problem with reads that should block instead returning G_IO_ERROR_WOULD_BLOCK; I thought this would be the best bug to describe it in =). The problem is that I have a stream pipeline like this (leaving the "Soup" prefix and "InputStream" suffix off the class names):

  Body -> Multipart -> Filter(blocking) -> Body -> Filter(non-blocking)

When the deepest Filter's read_fn is hit, it does a g_pollable_input_stream_read_nonblocking(), which refuses to block and returns the error. I'm not sure what the best way to fix this is; maybe we should refactor Filter to have different API for blocking/nonblocking instead of having it as a property? I'll attach a patch.
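For reference, the two read styles in play (a minimal sketch using plain GIO; "base" stands in for whatever inner stream the filter wraps):

#include <gio/gio.h>

/* A blocking read waits for data; a non-blocking read on a
 * GPollableInputStream returns -1 with G_IO_ERROR_WOULD_BLOCK if nothing is
 * buffered yet, which is exactly what a blocking caller higher up in the
 * pipeline doesn't expect. */
static gssize
read_some (GInputStream *base, gboolean blocking,
           void *buf, gsize count, GError **error)
{
  if (blocking)
    return g_input_stream_read (base, buf, count, NULL, error);
  else
    return g_pollable_input_stream_read_nonblocking (
        G_POLLABLE_INPUT_STREAM (base), buf, count, NULL, error);
}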
Created attachment 206766
patch with a possible (and probably quite broken) approach

This has made my "mjpeg viewer" substantially more robust (it works as well as chromium, heh), but it breaks a whole lot of tests - I guess there are expectations of implicit non-blocking reads when doing g_input_stream_read on io->body_istream. I noticed that SoupBodyInputStream has a blocking member in its private structure that is currently unused; maybe that's the missing piece of the puzzle?
I was just debugging this too... the problem is that the code assumes that read() will only be called on blocking=TRUE streams, and read_async() will only be called on blocking=FALSE streams. But in some cases, we end up hitting the do-an-async-read-by-doing-a-sync-read-in-a-thread codepath on a blocking=FALSE stream, and then it fails.
I think I have to give up on the "cheating" non-blocking thing and just use the pollable/async stuff properly. It should be easier now given the current organization of soup-message-io...
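For the record, the "proper" pollable pattern looks roughly like this (a sketch, not the actual soup-message-io code): wait for the stream to become readable, then do a real non-blocking read.

#include <gio/gio.h>

static gboolean
stream_readable_cb (GObject *pollable, gpointer user_data)
{
  GPollableInputStream *stream = G_POLLABLE_INPUT_STREAM (pollable);
  char buf[4096];
  GError *error = NULL;
  gssize nread;

  nread = g_pollable_input_stream_read_nonblocking (stream, buf, sizeof (buf),
                                                    NULL, &error);
  if (nread == -1 &&
      g_error_matches (error, G_IO_ERROR, G_IO_ERROR_WOULD_BLOCK)) {
    /* Spurious wakeup; keep the source and wait again. */
    g_clear_error (&error);
    return G_SOURCE_CONTINUE;
  }

  /* ... handle data / EOF / real errors here ... */
  return G_SOURCE_REMOVE;
}

static void
wait_for_data (GPollableInputStream *stream)
{
  /* Fires when a non-blocking read would make progress. */
  GSource *source = g_pollable_input_stream_create_source (stream, NULL);
  g_source_set_callback (source, (GSourceFunc) stream_readable_cb, NULL, NULL);
  g_source_attach (source, NULL);
  g_source_unref (source);
}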
So, this is basically ready. glib and libsoup have to branch for 3.6, then bug 673997 needs to land in glib, and http://git.mysterion.org/libsoup/log/?h=giobased4 can land in libsoup. The WebKit layout tests all pass against it (except for http/tests/misc/redirect-to-about-blank.html, which is actually already broken [try it in GtkLauncher], it's just that the test didn't notice before, and now it does).
giobased4 is now merged to master. Renaming this bug to cover the remaining parts.
the filtering stuff is happening in bug 682112, so I'm unrenaming this bug and marking it fixed!