Bug 425156 – GIOChannel deadlocks on a win32 socket (patch included)

After an evaluation, GNOME has moved from Bugzilla to GitLab. Learn more about GitLab.
No new issues can be reported in GNOME Bugzilla anymore.
To report an issue in a GNOME project, go to GNOME GitLab.
Do not go to GNOME Gitlab for: Bluefish, Doxygen, GnuCash, GStreamer, java-gnome, LDTP, NetworkManager, Tomboy.

Bug 425156 - GIOChannel deadlocks on a win32 socket (patch included)


Summary:	GIOChannel deadlocks on a win32 socket (patch included)


Status:	RESOLVED FIXED

Product:	glib
Classification:	Platform
Component:	win32
Version:	2.12.x
Hardware:	Other All

Importance:	Normal normal
Target Milestone:	---
Assigned To:	gtk-win32 maintainers
QA Contact:	gtk-win32 maintainers

URL:
Whiteboard:

Depends on:
Blocks:

Reported:	2007-04-01 13:46 UTC by Juho Vähä-Herttua
Modified:	2008-12-10 10:36 UTC

See Also:
GNOME target:	---
GNOME version:	---

Attachments
Suggested fix for the problem (against glib 2.12.11). (621 bytes, patch) 2007-04-01 13:48 UTC, Juho Vähä-Herttua	none	Details \| Review
Suggested fix for the problem (against glib 2.12.11). (1.34 KB, patch) 2007-04-01 13:58 UTC, Juho Vähä-Herttua	none	Details \| Review
Test case by Yu Kuan that reportedly triggers a similar case (2.14 KB, text/x-csrc) 2008-12-10 05:50 UTC, Juho Vähä-Herttua		Details
yu-kuan-1.c, the test case client, slightly modified (3.33 KB, text/plain) 2008-12-10 09:45 UTC, Tor Lillqvist		Details
yu-kuan-server.c, the corresponding trivial server (1.09 KB, text/plain) 2008-12-10 09:46 UTC, Tor Lillqvist		Details

Description Juho Vähä-Herttua 2007-04-01 13:46:11 UTC

Please describe the problem:
I'm doing XMMS2 port to Windows using glib for everything possible, also including the socket handling. We have a program that does a write on a socket, waits for a result and immediately does another write. This repeatedly hung up our server in an undeterministic way. (always at little different phase) So it was apparently a race condition somewhere.

What I found happening was:

1. Server is reading all the input from giochan and _check function determines that there is no input left (the handle is marked as non-signaled)
2. After the handle is marked as non-signaled the reader GSource is replaced with a writer GSource (with same handle as done in create_watch)
3. The poll will go to wait for that handle but it will not be signaled until there is readable data (never going to happen because the client is still waiting for that write)

How I fixed this was that in _prepare function we will check what the events were in last iteration and if they are different on this iteration, we will set the handle as signaled manually. (because nothing from last iteration can be trusted any more anyway) If the last iteration had no handles then we are probably running the iteration first time and SetEvent should not be set manually (so I added an "if (channel->event_mask)" there)

There was also another thing I think can be considered a race condition I included in my patch. I think the watch->pollfd.fd should be reset in WSAEnumNetworkEvents and manually set later if we don't want to reset it. Otherwise there might be another race condition. (WSAEnumNetworkEvents checks for the status and then status is set according to its result value and in between some other thread can set handle back to signaled state) WSAEnumNetworkEvents reset of the handle should be atomic as far as I know so there should be no risk.

Please comment, with this patch I got XMMS2 to work perfectly on Windows and it looks like we need to include our own binary glib for windows users. It would be great to get it upstream for convenience.

Steps to reproduce:
1. Create a new main loop and attach a GSource with G_IO_IN|G_IO_ERR|G_IO_OUT condition.
2. Every time when new data arrives, attach a GSource with G_IO_OUT immediately from another thread. (the old GSource can be destroyed if you wish)
3. Repeat until the program gets totally stuck

Actual results:
The mainloop gets stuck in the WSAWaitForMultipleObjects function with WRITE|CONNECT event never signaled.

Expected results:
The mainloop should call the GSource callback function when there's data writable like usually in G_IO_OUT case.

Does this happen every time?
Yes, every time with the case described above.

Other information:

Comment 1 Juho Vähä-Herttua 2007-04-01 13:48:12 UTC

Created attachment 85653 [details] [review]
Suggested fix for the problem (against glib 2.12.11).

First suggestion that worked just fine for me. Could fix some problems with multiple sources on same iochannel as well. (worked just fine for me when I had reader and writer there at the same time)

Comment 2 Juho Vähä-Herttua 2007-04-01 13:58:08 UTC

Created attachment 85655 [details] [review]
Suggested fix for the problem (against glib 2.12.11).

Same as the last one but in unified format for better readability.

Comment 3 Tor Lillqvist 2007-04-01 16:31:15 UTC

Sigh. I am very afraid of doing something that looks as drastic as you patch. Remember: It is not enough that *your* code happens to start working once you tinker with the giowin32.c code in some fashion. You also would need to show that all other code that uses giowin32.c still works afterwards ;) As much as I hate saying it, I would almost suggest you really do what you say, i.e. include a privately hacked library for your application. Sorry.

There are a couple of other patches waiting for giowin32.c. I really can't say when I will have time to look really carefully into them (and yours).

Comment 4 Juho Vähä-Herttua 2007-04-01 16:51:52 UTC

Well, the second part of the patch is not even necessary, what's really the whole point is just the single SetEvent there. I used three days debugging this stuff, from which over one day went trying to convince myself that glib doesn't have a bug and it's my code at fault. So I tried all possible combinations of glib calls I just could imagine to have the wanted result but all failed and I had to dig into glib itself.

The worst possible scenario that I could imagine could be caused by my patch was that poll will return once uselessly, but this should be handled correctly in _check function. (poll can return uselessly any time anyway because of the wakeup mutex, if the outside code calls g_main_context_wakeup) So personally I don't see it very drastic and I have tried to break it in different ways, although they've been all inside xmms2 and not as separate test cases. (I've managed to break the old code in pretty much all ways) Is there any unit test code for giochannel sockets? I didn't find anything from the tests/ directory.

But yeah, I suppose distributing a patched version is probably what we have to do. I'm not too interested in writing test cases to prove this if the review will still be pending due to the lack of time. But if you can tell me programs that use giowin32.c and sockets together, I can run some tests on them. Didn't find anything interesting with a quick search.

Comment 5 Tor Lillqvist 2007-04-01 20:04:51 UTC

> But if you can tell me programs that use giowin32.c and sockets together,
> I can run some tests on them.

The problem is that there is no definitive collection of test programs that would unambiguougsly define the expected behaviour of giochannels, main loops, and sockets (and that would implicitly define the ways the GLib API in questions are to be called to get defined behaviour).

The application I personally have some interest in is Evolution, but even for that I am not *that* interested in keeping it working on Windows (it's not part of my job any more), and I am afraid it will bitrot away from a Win32 point of view soon anyway. I am more scared what might happen to other people's pet apps. Some small change might mean one of them suddenly then doesn't work any longer.

Comment 6 Juho Vähä-Herttua 2007-04-02 08:05:15 UTC

Well, if evolution has been part of your job I suppose you know it (evolution, evolution-data-server, evolution-exchange) uses GIOChannel very lightly with just a single GSource attached for asynchronous reading, so there's no way this could affect it. However it seems that libsoup indeed does use asynchronous writing exactly the way I have been trying to implement it and I believe I could create an example program that could manage to hang it up with calling soup_socket_write and soup_socket_read rapidly from separate threads, have to see that if I get little extra time which seems unlikely considering how much I spent fighting with this already.

Still about the patch, if you open giowin32.c in your favourite editor and search for "->event_mask" you see that the event mask is always either:

1) Constructed directly from the condition flags in _prepare function
2) Set to zero when we want to block until there is readable data

In the first case my patch would work correctly, because if the old event_mask is different from the current mask, we definitely need to skip the poll and go directly to the _check function (don't worry, it will be reset in next iteration by WSAEnumNetworkEvents or ResetEvent). In the second case my patch would not change current behaviour since when event_mask is zero, it does nothing.

And the second part of the patch, you can google around internet with keywords WSAEnumNetworkEvents and ResetEvent and all forums and instructions can tell you that you shouldn't use ResetEvent with WSAEnumNetworkEvents since all it does is lead to race conditions. Actually the debug message indicates that it would indeed be reset in WSAEnumNetworkEvents but the code doesn't match the debug message. I suppose you could've had a better glance at the code already with the time used for writing two reply messages. :)

I also checked the win32 bug reports that are still open, there's also some things like "Sometimes a GIOFunc on Win32 is called with zero condition" or the mentioned "Callback is called, app expects data, gets none" mentioned by Owen in #338943. It's all just undocumented behaviour. However hangups of documented behaviour should have little higher priority.

I have to say I may be a bit emotionally attached because I've advocated the greatness of glib and its portability for a lot of people and I'm starting to feel it's only true as long as you keep away from Windows, and who wants to find out he's been lying? However I doubt I'll be satisfied with the basic "afraid to change anything because some unknown toy project may fail" answer especially when trying to make an official port of a relatively popular music player here.

But yeah, just try to get a bit time, the #357674 bug also looks quite serious and with waiting the unresolved bugs buffer will just get longer. There's no point in making glib sockets unusable in favor of some toy projects of which existence is disputed now, is there? Personally I believe many of the toy projects ever been using glib sockets on windows can be found right here on the bug tracker.

Comment 7 Daniel Svensson 2008-01-29 12:34:32 UTC

Soon a year has passed. Any ETA yet when this patch will be applied?

Comment 8 Juho Vähä-Herttua 2008-01-31 16:04:57 UTC

Apparently there has been some confusion about what is meant with documented behaviour. So I will paste some parts from the glib documentation at http://library.gnome.org/devel/glib/stable/glib-IO-Channels.html address:

g_io_add_watch_full():

condition :     the condition to watch for
func :          the function to call when the condition is satisfied

enum GIOCondition:

G_IO_IN 	There is data to read.
G_IO_OUT 	Data can be written (without blocking).

To conclude: if data can be written without blocking, but the function is not called even when the G_IO_OUT condition is satisfied, then the behavior is clearly different than in this documentation. I didn't find documentation that would say that the function is never called when the condition is not satisfied, and the other bugs are related to that. That is what makes this different.


The patch can be divided into two parts. The first added 6 lines are the deadlock bug. Everything else in the patch is just a fix to remove a race condition, the handle should be reset by WSAEnumNetworkEvents immediately when it should be reseted and not manually afterwards. For reference:


http://msdn2.microsoft.com/en-us/library/ms741572.aspx

hEventObject
    An optional handle identifying an associated event object to be reset.


http://msdn2.microsoft.com/en-us/library/ms685081(VS.85).aspx

The ResetEvent function is used primarily for manual-reset event objects, which must be set explicitly to the nonsignaled state. Auto-reset event objects automatically change from signaled to nonsignaled after a single waiting thread is released.


So in this case WSAEnumNetworkEvent supports auto-reseting the event, but for some reason GIOChannel code wants to reset it manually which can be racy.

To underline it: The first 6 lines of the patch are the main problem. If it is decided that the other part of the patch is not applied I don't really care, because I find it unlikely to cause any real trouble.

Comment 9 Tilman Sauerbeck 2008-12-09 18:49:12 UTC

Any news on this?

Comment 10 Juho Vähä-Herttua 2008-12-10 05:48:51 UTC

(In reply to comment #9)
> Any news on this?

At least could be mentioned here, because it looks like it wasn't... This very strongly looks like exactly same bug with the mailing list post:

http://mail.gnome.org/archives/gtk-devel-list/2008-May/msg00051.html

The patch there seems to be slightly more refined too. I am also attaching a file that according to Yu Kuan produces the bug. Now that I have access to Windows I could look at it again some time. However as it stands, I'm not sure if it's worthwhile, if the patch is not going to be applied anyway...

Comment 11 Juho Vähä-Herttua 2008-12-10 05:50:15 UTC

Created attachment 124326 [details]
Test case by Yu Kuan that reportedly triggers a similar case

Comment 12 Juho Vähä-Herttua 2008-12-10 09:38:34 UTC

And as a note again, the patch from Yu Kuan had apparently been merged 4,5 months ago in revision 7220:

http://svn.gnome.org/viewvc/glib/trunk/glib/giowin32.c?r1=7051&r2=7220

Because there was no bug report whatsoever in bugzilla, and the according message in the mailing list was not replied, it was kind of hard to find except by searching for SVN logs.

I used the exactly same program that reproduced this bug on win32. I ran it with vanilla glib 2.12.13 and 2.18.3 versions. The old one broke just like I described, the new one works just fine and I don't see problems with it.

Thank god this is finally resolved, it just took some one and half years...

Comment 13 Tor Lillqvist 2008-12-10 09:43:47 UTC

This just shows how important it is to provide a minimal and complete test case. Yu Kuan's one was exactly that. (Although it wasn't perfect; it didn't contain ifdefs for Unix, and he didn't provide the corresponding minimal server program.) If you could have bothered to do the same, this might have been resolved earlier.

Comment 14 Tor Lillqvist 2008-12-10 09:45:15 UTC

Created attachment 124333 [details]
yu-kuan-1.c, the test case client, slightly modified

More debugging output, also builds on Unix.

Comment 15 Tor Lillqvist 2008-12-10 09:46:03 UTC

Created attachment 124334 [details]
yu-kuan-server.c, the corresponding trivial server

Comment 16 Juho Vähä-Herttua 2008-12-10 10:01:40 UTC

Indeed, with minimal and complete test case it took just half years. Personally the main problem was losing access to any windows systems soon after reporting the bug, something gained back only in recent months. Therefore I tried to describe it clearly step by step.

When it comes to test cases and motivation to writing them, I have to end by quoting your own words from the same bug:

"It is not enough that *your* code happens to start working once you
tinker with the giowin32.c code in some fashion. You also would need to show
that all other code that uses giowin32.c still works afterwards ;)"

This was the second biggest reason why I didn't write a test case myself. If what you are saying now is true, please think more carefully what you say later on.

Comment 17 Tor Lillqvist 2008-12-10 10:14:15 UTC

Hmm, Yu Kuan's mail was from 2008-05-15, the patch was applied on 2008-07-20. That is more like, umm, two months to me.

Did you notice the smiley?

Comment 18 Juho Vähä-Herttua 2008-12-10 10:36:49 UTC

Yes, and I didn't find it funny considering the next sentence that followed it. Also the next post actually handled the same topic, but didn't have smileys at all. Call my sense of humor bad, too bad can't blame it on cultural differences here.

If you really want to discuss this topic, all I have to say is try to read your replies from as objective as possible point of view. After finishing that, try to think if they really encourage to write more test cases and work on it. I have nothing to prove, I'm just trying to explain what might help in the future when communicating bugs.

I thank for the patch in any case.