After an evaluation, GNOME has moved from Bugzilla to GitLab. Learn more about GitLab.
No new issues can be reported in GNOME Bugzilla anymore.
To report an issue in a GNOME project, go to GNOME GitLab.
Do not go to GNOME Gitlab for: Bluefish, Doxygen, GnuCash, GStreamer, java-gnome, LDTP, NetworkManager, Tomboy.
Bug 331214 - g_io_channel async socket io stalls
g_io_channel async socket io stalls
Status: RESOLVED OBSOLETE
Product: glib
Classification: Platform
Component: win32
2.8.x
Other Windows
: Normal normal
: ---
Assigned To: gtk-win32 maintainers
gtk-win32 maintainers
: 337991 (view as bug list)
Depends on:
Blocks: 331224
 
 
Reported: 2006-02-14 23:15 UTC by Kenneth Haley
Modified: 2018-05-24 10:39 UTC
See Also:
GNOME target: ---
GNOME version: ---


Attachments
test-case (1.94 KB, application/x-bzip2)
2006-02-14 23:19 UTC, Kenneth Haley
  Details
proposed patch (1.12 KB, patch)
2006-02-24 10:30 UTC, Kenneth Haley
none Details | Review
possible loss of FD_CLOSE (1.34 KB, patch)
2006-02-24 10:51 UTC, Kenneth Haley
none Details | Review
test program, my version. builds on Linux, too. (6.03 KB, text/plain)
2006-03-31 22:36 UTC, Tor Lillqvist
  Details
diff to giowin32.c: essentially what the bug reporter has suggested (4.04 KB, patch)
2006-03-31 22:38 UTC, Tor Lillqvist
none Details | Review
test program, my version, with modification to read as much as possible each time in the callback (6.07 KB, text/plain)
2006-03-31 23:03 UTC, Tor Lillqvist
  Details
Tarball with two tests (6.96 KB, application/x-gzip)
2006-04-11 13:02 UTC, Martyn Russell
  Details

Description Kenneth Haley 2006-02-14 23:15:45 UTC
Program stops getting watch callbacks before all data is received.
Comment 1 Kenneth Haley 2006-02-14 23:19:09 UTC
Created attachment 59382 [details]
test-case

test program attempts to retreive a list of news groups from a news server.  The program has a 10 second timer which will shut down the program.
Comment 2 Kenneth Haley 2006-02-15 00:34:47 UTC
Problem solved.  g_io_channel_read_line_string needs to be called until it returns G_IO_STATUS_AGAIN.  While this currently appears to be a windows only issue most likely caused by level-triggered versus edge-triggered events, it may appear on other platforms.

The GIOChannel docs need to be updated to tell programmes using buffered async io to read all data from the buffer before returning from the watch routine.
Comment 3 John Ehresman 2006-02-15 00:41:10 UTC
Hmm, is the problem the docs or the win32 impl?  What happens on Unix?  I suspect the watch function will be repeatedly called.
Comment 4 Tor Lillqvist 2006-02-15 07:32:49 UTC
It's hard to say. My gut feeling is that the same problem could occur on Unix too, but probably differences in socket buffer sizes and whatnot might make it harder to reproduce against the same servers (as in the test program).

Or actually, now that I have slept on it, I am not so sure. It might well be that this can't happen on Unix, and really is something that should be fixed in the Win32 implementation. Argh...
Comment 5 Kenneth Haley 2006-02-15 09:35:05 UTC
OK, lets pose it like this.  The server sends one packet containing multiple lines.  The program reads one line each time the watch function is called.  What happens on unix?  On windows it appears that the watch is only called once even though there is still data to be read.  Or at least this seems to be the way things work.  Of course my understanding of the interaction between g_io_channel and sockets could be wrong.
Comment 6 John Ehresman 2006-02-15 15:14:58 UTC
At the level of a select call, the socket will be reported has having data available if a read call would return data.  There is no concept of having been reported as readable previously.  I don't recall glib adding logic to to only report data available once, but I haven't looked at the code in a while.
Comment 7 Kenneth Haley 2006-02-16 11:35:42 UTC
You're right, WSAEventSelect will re-fire the event if only some of the data was read.  This makes me think the problem is with the buffering since read_line calls are being used.  Apparently glib is not indicating that data is available in its buffer.  But a bug like that would appear on all platforms wouldn't it?

Hmm, looking at the output from the test program I see the channel's line buffer being filled, a few network read events being received, and the program stalling once the buffer is empty.  I also notice that both revent & events are 0 during the last check.  This is the only time this occurs.  Of course this doesn't explain why there is no READ from the socket.
Comment 8 Kenneth Haley 2006-02-16 22:55:46 UTC
Perhaps the problem is in g_io_win32_prepare.  Specifically this part

if (WSAEventSelect (channel->fd, (HANDLE) watch->pollfd.fd,
			      event_mask) == SOCKET_ERROR)
	    ;			/* What? */
channel->event_mask = event_mask;

If the call fails, it will never be retried since channel->event_mask == event_mask.  This would prevent receiving network events.  Perhaps the assignment should only happen when the call is succesfull & output a debug msg when it fails.

if (WSAEventSelect (channel->fd, (HANDLE) watch->pollfd.fd,
			      event_mask) != SOCKET_ERROR)
	  channel->event_mask = event_mask;
else if (channel->debug)
	    g_print ("g_io_win32_prepare: WSAEventSelect(%d, %#x, {%s}) failed\n",
		     channel->fd, watch->pollfd.fd,
		     event_mask_to_string (event_mask));
Comment 9 Kenneth Haley 2006-02-22 06:42:45 UTC
After working through the states I think there is a problem in g_io_win32_check.  The event is being reset even if there is still data in the channels buffer.  Lets see, check is called with revents!=0 & there is no network activity but there is still data in the buffer, so the event should remain set since we don't know when it will be emptied.  I'm thinking an addtional condition like !(buffer_condition & FD_READ) should be added so the event is only reset when both the network & channels buffers are empty.

Scenario:
packet arrives with three lines.
check revents!=0 net!=0
dispatch reads all into buffer & takes one line
check revents!=0 net==0
reset event
dispatch extracts one line
event not signaled so no more callbacks

This would explain why the test-case freezes.
Comment 10 Tor Lillqvist 2006-02-22 08:00:55 UTC
Thanks for spending time on this. I hope you don't find the code too disgusting... If you have good ideas how the code could perhaps be restructured, please submit. Or if you have written up any documentation for yourself as you worked through how it works, it would be nice to add that to the source.

I have a distinct memory of discussing this more than my single comment above, but can't find anything in my mail archive, did we talk about this on IRC perhaps?

I did some experimentation with various hacks to giowin32.c to make it work better with the test program even without it trying to read as many lines at a time as possible, but didn't reach any fully satisfying solution. I'll get back and try what you say next week, as I am going for a short trip early tomorrow.

Comment 11 Kenneth Haley 2006-02-23 03:45:17 UTC
We did talk a little on irc before I filed the bug.  As for documentation there were three things I didn't understand till I searched the code.  I didn't realize that check is called when any polled event is triggered, not just the socket event.  I  finally understood that revents!=0 only when check is called because of the socket event.

The third issue is one I still don'r understand.  How can revent!=condition?  Unless I missed something in the code, it goes like this:
1. create watch sets events==condition
2. g_poll sets revents==events
3. check is called.  debug output revents=IN condition=IN|HUP|ERR

From what I understand of the code revents should be the same as condition when check is called.  Either I missed something, which is likely, or there is another bug hidding somewhere.
Comment 12 Kenneth Haley 2006-02-24 10:30:38 UTC
Created attachment 60050 [details] [review]
proposed patch

I verified that the problem was that no events were being triggered by adding a 10ms timeout to the test-case.  This allowed the program to continue reading data.  Since there are only two places that could cause this I've created a patch to giowin32.c based upon my previous suggestions.  Since I'm not set up to compile glib I haven't tested this patch.
Comment 13 Kenneth Haley 2006-02-24 10:51:40 UTC
Created attachment 60053 [details] [review]
possible loss of FD_CLOSE

This patch is for a related problem, it's intended to be applied after the first patch.  Since MSDN is not clear on the sematics of FD_CLOSE, I'm assuming it's not level triggered like FD_READ.  Therefore it's possible the the event could be lost while the app reads data from the channel buffer.  This patch should fix the problem.
Comment 14 Kenneth Haley 2006-03-06 21:53:50 UTC
I just noticed a small bug that prevents g_io_win32_prepare from returning true in most cases.  The last few lines should be :

gboolean ret=(watch->condition & buffer_condition)!=0;
ret |= channel->has_been_closed && (watch->condition & G_IO_HUP);
return ret;

This way prepare will return true if the buffer meets any of the watch conditions.  The origianl version required it to match all of the conditions which is not possible since the buffer only has G_IO_IN & G_IO_OUT.
Comment 15 Kenneth Haley 2006-03-28 06:50:08 UTC
ping.  Just a reminder to try and get this into cvs before 2.12 ships.
Comment 16 Tor Lillqvist 2006-03-31 22:34:59 UTC
BTW, I modified your test program slightly so that I could build it on Linux, too, to see how GLib is supposed to behave properly. Linux is after all the "reference" platform for GLib. I took out the use of a timeout function, as that is not how a real application would behave. (Instead I look for the "." line that finishes the list of newsgroups, and call g_main_loop_quit() then.) I added a call to set the socket to non-blocking mode, as that is what implicitly happens on Win32.

Somewhat surprisingly, it didn't work properly on Linux either. io_func() stopped being called when there was some dozen or so lines left to read. So the logic used in the test program is flawed also on Linux... (But after adding the patches you suggest to giowin32 the program indeed works on Win32...)

So, sure, I can commit the patch, but on the other hand, a program with logic like the sample program is flawed as it doesn't work properly on the "reference platform" ;)

I'll attach my version of the test program, and my current diff to giowin32.c (which is mostly what you suggested, just some non-semantic differences).
Comment 17 Tor Lillqvist 2006-03-31 22:36:58 UTC
Created attachment 62489 [details]
test program, my version. builds on Linux, too.
Comment 18 Tor Lillqvist 2006-03-31 22:38:27 UTC
Created attachment 62490 [details] [review]
diff to giowin32.c: essentially what the bug reporter has suggested
Comment 19 Tor Lillqvist 2006-03-31 22:55:59 UTC
However, if I modify the test program so that when io_func() is called to read, it tries to read as much as possible, until it gets G_IO_STATUS_AGAIN, then the test program works both on Linux and against an unmodified GLib on Win32...
Comment 20 Tor Lillqvist 2006-03-31 23:01:51 UTC
I forgot to mention that if I run the modified test program against GLib with giowin32.c patched as suggested, I get a nasty busy-wait effect: the io_func() is called repeatedly lots of times only to immediately get a G_IO_STATUS_AGAIN when it tries to read. That is not acceptable. And recall that the Linux test would indicate that the modifed test program (which reads as much as possible) is the correct way to do it? 

Thus I think that I'll let giowin32.c be as is... I could add the stickyness of FD_CLOSE.
Comment 21 Tor Lillqvist 2006-03-31 23:03:00 UTC
Created attachment 62492 [details]
test program, my version, with modification to read as much as possible each time in the callback
Comment 22 Kenneth Haley 2006-03-31 23:52:35 UTC
First, the timeout was just to make sure the program shut down on its own instead of my having to kill it.  Second, if you can't get all the data by reading one line at a time in non-blovking mode on linux then it's broken as well.  I suspect that it has the same basic problem as the windows version, ie. prepare is not indicating that there is data ready to be read.  Perhaps I left it out of the patch?  Basically prepare is comparing the buffer state, which is limited to G_IO_READ & G_IO_WRITE, directly to condition which can inlcude the other states.

buffer_cond = G_IO_READ | G_IO_WRITE
cond = G_IO_READ | G_IO_HUP

therefore (buffer_cond & cond)==cond can NEVER be true.  The return should just be 
return (buffer_cond & cond);
This way prepare properly indicates that the buffer already meets a watch condition, therefore poll should return imediatly.  The unix version has the same bug.
Comment 23 Kenneth Haley 2006-04-01 01:07:00 UTC
I just tested this idea on both the test case and the real program by setting the watch condition to G_IO_IN.  This allows both to work correcly.  The g_io_*_prepare functions mistakenly want the buffer condition to match ALL watch conditions instead of matching ANY like it should.

This problem affects both win32 and unix.
Comment 24 Martyn Russell 2006-04-11 13:02:01 UTC
Created attachment 63233 [details]
Tarball with two tests

OK, I am not completely sure if this bug is the same thing I am seeing, but I thought this is the best place to add my comments since Tor pointed me in this direction :)

My bug is not just related to async connections, it occurs on blocking connections too.

My problem is, that I set up a G_IO_OUT watch and connect, then when I get called back I watch for G_IO_IN, and I get input on the socket after sending some output with no problems. But as soon as I connect G_IO_ERR or G_IO_HUP (at the same time I connect G_IO_IN) in the G_IO_OUT callback, I get NO input from the output I write to the socket.

I have included this tarball which is basically a Makefile and two tests.  One is called test-giochannel and the other test-giochannel-simple. The simple test is basically a cut down Windows and Linux compatible test that shows this bug (with a non-blocking connection). The other test is a little more complicated and allows you to test with blocking or non-blocking connectivity. They both show the same bug (the complicated test is basically Loudmouth code cut down).

Note the tests demonstrate how the code works perfectly under Linux and fails under Windows.

Any ideas?
Comment 25 Dominic Lachowicz 2006-04-11 14:45:44 UTC
*** Bug 337991 has been marked as a duplicate of this bug. ***
Comment 26 Kenneth Haley 2006-04-12 06:56:04 UTC
Martyn, the problem you're seeing is due to the way the windows socket api works.  Only one set of watch conditions can be used at a time.  While it may be possible to rewrite the windows version to work around this, the best solution right now would be to change the way you use watches.  Create a single watch that looks for everything your interested in and point it to a proxy function that will call you're condition specific functions.

Rewriting the windows g_io_channel code to work around this problem may be possible, but it won't help you right now.  Also, if your code uses buffered io channels then you may run into the problem from comments 22 & 23.

Hope this helps.
Comment 27 Martyn Russell 2006-04-12 09:01:51 UTC
Mmm, only one set of watch conditions at one time? That seems quite limited and not consistent with the Linux platform's approach. I would have thought that ALL conditions would be watched and only those the user wants to know about are emitted from glib.

I will try it out for the time being since we will need to monitor HUP and ERR, but glib should really be consistent if possible.

My code wasn't buffered, I read your comments about that though, so luckly I think I am avoiding that scenario at the moment.

:) thanks
Comment 28 Tor Lillqvist 2006-04-12 09:56:47 UTC
Well, the implementation of watches on GIOChannels is totally different for Win32 and for Unix. There are many ways to use the GLib GIOChannel and main loop APIs. Without a set of test programs that would exercise all legal ways to use the APIs, it's no wonder if each new use case reveals some new problem.
Comment 29 Thomas Folz-Donahue 2007-01-14 20:34:22 UTC
I was just bitten by this bug trying to run the networked tetris program I wrote in pygtk on windows- whenever I use gobject.add_io_watch on a socket, it replaces all previous watches on that socket.
Comment 30 Tor Lillqvist 2008-08-19 01:01:04 UTC
The multiple watches problem is in bug #338943.
Comment 31 GNOME Infrastructure Team 2018-05-24 10:39:48 UTC
-- GitLab Migration Automatic Message --

This bug has been migrated to GNOME's GitLab instance and has been closed from further activity.

You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.gnome.org/GNOME/glib/issues/40.