GNOME Bugzilla – Bug 147392
giowin32 leaks threads for socket channels [patch]
Last modified: 2011-02-18 16:13:57 UTC
1. Create a GIOChannel for a socket. 1a. g_io_channel_win32_new_socket 1b. tag = g_io_add_watch 1c. g_io_channel_unref 2. g_source_remove(tag). 3. Watch as the thread doesn't exit. Repeat until out of memory. However, the above steps work fine with g_io_channel_win32_new_fd. This patch sets the need_close to 1 in g_io_win32_finalize, which forces select_thread() out of its loop (it doesn't actually close the socket). I've tested this with XChat, which previously leaked threads, now it does not.
Created attachment 29457 [details] [review] Fixes GIOwin32 not stopping it's thread on socket g_source_remove
Adding the PATCH keyword.
Thanks! Patch applied to glib-2-4 and HEAD.
Won't this exit the select thread prematurely if there's more than one watch on the channel? I think adding an explicit call to channel_close would solve the problem in the motivating example. glib probably should end the select thread if there are no watches on the channel because otherwise channels that have close_on_unref set to true will leak channels. The tricky part is handling the case where all watches are removed and then one or more is re-added.
Hmm. (How does it work on Unix?) Do you think the patch should be reversed?
On Unix, the fd's are put in the main select/poll set so there is no separate thread. I think the patch will break programs that remove & then add watches, but the previous behavior leaked threads and channels when close_on_unref is true. I think it needs to be reopened and (eventually) fixed. I may be able to come with a patch for it.
John, do you still think the patch should be reversed? Do you have a better patch? Or should the documentation be improved, i.e. should GLib expect the application to call g_io_channel_close() and not just g_io_channel_unref()? (Or whatever.)
How about: if (channel->watches == NULL && channel->type == G_IO_WIN32_SOCKET) { /* tell select_thread() to exit */ channel->needs_close = 1; /* wake up select_thread() from its blocking select() */ send (channel->reset_send, send_buffer, sizeof (send_buffer), 0); } Multiple watches on the one socket use the same thread right? So we wait until there's no watches to force the thread to exit.
Yes, looks reasonable. I guess giowin32.c still needs to be fixed to handle the case John mentions, where all watches are removed and then later one or more are re-added. Would it be enough to set channel->thread_id to zero at the end of select_thread()? Then g_io_win32_create_watch() and g_io_channel_win32_make_pollfd() would call create_thread().
Created attachment 35588 [details] [review] Different locking strategy Here's what I've come up with, which is a bit more of a change than I first intended. The select_thread now locks the channel whenever it isn't calling select so it doesn't need to constantly lock & unlock it. It exits if there are no watches. I also drop the lock before creating all threads, which may cause problems for other channel types. Something I just realized, though, is that g_io_channel_ref / _unref are not thread safe. I wonder it the ref / unref could be eliminated from select_thread and the code would rely on the channel finalize to destroy the thread.
libsoup uses socket GIOChannels and watches in such a way that didn't work with current HEAD GLib. It adds and removed watches from the same channel. This breaks the current code completely. As it happens, I have already in two other places (D-BUS and ORBit2) used the WSAEventSelect()/WSAEnumNetworkEvents() API, with good results (once I understood all the quirks). I decided to try to rewrite giowin32.c to use it, too, for sockets. This means we can say goodbye and good riddance to the separate thread for each socket. The code gets much simpler, at least for sockets. (For C library file descriptors a separate reader thread is still needed, sigh.) I'll attach my current giowin32.c, please test it thoroughly. If it works OK for you, I'll commit it to HEAD, and maybe even to 2.6.
Created attachment 45420 [details] New giowin32.c
Does the new approach leave the sockets in blocking mode? As far as I recall, this was the reason for needing to use the select() call.
Nope, WSAEventSelect() automatically puts the socket into non-blocking mode. Does your applications require sockets to be blocking? But I don't think using select() was because of blocking/non-blocking concerns. I didn't find out about the WSAEventSelect() call until late last year, if I recall correctly... Or maybe I had noticed it earlier, but thought that it is too complex or not relevant. Hmm, if one uses a GIOChannel but not any watch, the socket should stay in blocking mode (if it initially is so). Only when the prepare method is first called does WSAEeventSelect() get called. I guess one could add code to put the socket back to blocking mode after the poll? But then for apps/libs that want to use non-blocking mode that would then be unnecessary overhead.
Yes, we are using blocking sockets, which could and maybe should be changed. But the fact that it's permitted means that some number of application out there are doing it and they may fail if the mode is changed. I think I looked into toggling between blocking & non-blocking and don't think you can do it on win32 (though I may be wrong) and what happens if the app code changes the mode while it's being watched?
OK, so maybe we need to keep the old implementation as an alternative? Perhaps selectable by setting an environment variable? For non-blocking applications/libs, I think the WSAEventSelect()-based approach is clearly superior.
I'd say use the new implementation if the socket is non-blocking when added -- this might break some programs that fiddle with the blocking mode, but I suspect they have other portability problems. Also, warn on all platforms that watching blocking sockets is deprecated and plan to remove the next time we can break backwards compatibility. I see this as an api with behavior on Unix that's nearly impossible to implement on win32.
But how does one find out whether a socket is in blocking or non-blocking mode...? ioctlsocket(FIONBIO) doesn't tell. Anyway, I found a bug in the g_io_channel_win32_make_pollfd() implementation in the new giowin32.c that manifested itself in GIMP when one tried to run script-fu scripts. Will attach a fixed version.
Created attachment 46365 [details] New giowin32.c
Created attachment 47495 [details] [review] giowin32.c patch (based on glib v2.4.7) proposed patch for resolving issues in the following cases : - several watches exist and only one watch is removed - new watches are created after the last watch has been removed This as been tested with the following gnet 2.0.5 examples which didn't work with the standard glib v2.4.7 : echoclient-gconn, echoserver-gserver and echoserver-async
Hmm, David, so with this patch you mean that we should keep the separate thread implementation for sockets, and we still be able to do complex watch manipulation? Are you strongly opposed to the WSAEventSelect()-based approach in comment #19 that does away with the threads (for sockets)? John, do you have any fresh follow-up to comment #18?
No, I'm not sure the separate thread implementation is good. I even just discovered some more problems with it. There are race conditions where pollfd.revents is resetted to 0 (in g_main_context_query()) before it is passed to the callback function (in g_io_win32_dispatch()) (even without the threads leak patch). I'm not opposed to the WSAEventSelect()-based approach. I would like to test your new version of giowin32.c but it doesn't compile in glib v2.4.7 : giowin32.c:45:20: gstdio.h: No such file or directory giowin32.c:48:20: galias.h: No such file or directory giowin32.c: In function `g_io_channel_new_file': giowin32.c:1278: warning: implicit declaration of function `g_open' giowin32.c: At top level: giowin32.c:1335: redefinition of `g_io_channel_new_file' giowin32.c:1198: `g_io_channel_new_file' previously defined here giowin32.c: In function `g_io_channel_new_file': giowin32.c:1342: warning: implicit declaration of function `g_io_channel_new_file_utf8' giowin32.c:1342: warning: assignment makes pointer from integer without a cast giowin32.c:1666:23: galiasdef.c: No such file or directory make[3]: *** [giowin32.lo] Error 1 Would it be difficult to make a glib 2.4 version ?
Hmm, for 2.4, you can just remove the inclusions of gstdio.h, galias.h and galiasdef.c, replace g_open() with open(), and remove the "backward compatibility" version of g_io_channel_new_file(). That should hopefully make it build. It might also be a good idea to look at the g_poll() Win32 implementation in gmain.c in 2.6.4 and make sure your g_poll() looks the same. There was at least one fix to it that was essential to make the CORBA and bonobo stuff work (or libsoup, I don't recall), see the ChangeLog entry from 2005-03-29.
I tested the WSAEventSelect()-based approach (your file, modified for v2.4 and update of g_poll in gmain.c, from 2.6.4) It doesn't seem to work with gnet examples echoclient-gconn, echoserver-gserver and echoserver-async, which do some watches manipulations. And to resolve the race condition mentionned in comment #22, I had to remove the following line in g_main_context_query(), in gmain.c : fds[n_poll].revents = 0; But actually I don't understand exactly how this field is used. I don't know if I may do that and I didn't check if it still work under Linux.
Actually removing the revents reset line in g_main_context_query() as stated in comment #24 doesn't resolve the race condition. It had disappeared while I added some g_print traces but reappeared when I removed them...
For the patch specified in comment #24 to work, I had to remove the revents reset in g_poll too. But as I said, I don't understand what I do. Another way to resolve the problem without modifying gmain.c was to modify g_io_win32_dispatch() in giowin32.c to use channel->revents instead of watch->pollfd.revents, where channel is defined as : GIOWin32Channel *channel = (GIOWin32Channel *)watch->channel; But still I don't understand exactly what I do. Could someone explain me the revents handling stuff ? This patch could break other things...
The socket thread should not go away when there are no watches on the channel, since programs like GNet do a drop then add of watches. Any patches should be reversed if they change that behavior. If memory serves, the glib main loop should call g_io_win32_sock_close() after the ref count is 0, which will trigger the thread to go away by casing the select() call to break out, by closesocket (win32_channel->fd); Andrew Lanoix
Andrew, so you think the patch in comment #1 should be reversed? That has been in GLib since 2.5.2... What GLib version do GNet-using applications typically come bundled with? What do you think of the threadless approach in comment #19? I would very much like to use that in GLib 2.8.0. I am building HEAD gnet now against HEAD glib (using the threadless socket giochannel watch), to see myself how the test programs work.
David, you should realize that the 2.4 branch of GLib is not maintained any longer. So suggesting patches for 2.4 is pointless. The current stable branch is 2.6, and it just had a release today, 2.6.5. I don't know whether any more 2.6 releases are likely. (2.8.0 will be released September-ish, I assume.) But anyway, if it would make it more likely that GNet apps would switch to at least 2.6.6 (if and when it appears), I certainly could reverse the patch from comment #1 in the 2.6 branch. (Hmm, assuming there hasn't been any other interfering changes inbetween that have mixed up things...)
Hmm, yes, the gnet Win32 code seems a bit screwed up. I recall pointing out (much) earlier in a mail message that it isn't a good idea to keep Unix and Win32 socket code as separate ifdeffed copies, when they are actually (at least originally) mostly identical. This will only lead to them diverging by accident without deliberate design, and people then thinking the differences are there on purpose. And that indeed seems to have happened at least in tcp.c, if you compare the Unix and Win32 versions of gnet_tcp_socket_new_async_direct() and gnet_tcp_socket_new_async_cb(). OK, maybe it works as it currently is with glib 2.4.7, but 2.4.7 is history. I'll try to fix it, make sure it works with the threadless giowin32.c, and submit a patch to the gnet maintainer. Then hopefully gnet can catch up on Win32 and start using GLib 2.8.0 once it is released.
I can't find any way to determine a socket's blocking mode on win32. Sorry, I simply assumed there was a way to do this when I wrote comment 17. I don't know what the best way to proceed here -- I don't like breaking working apps but the current thread based approach is buggy, though it probably could be fixed.
Tor, I tought glib v2.4 was still recommended on win32 because v2.6 could still contain bugs on win32 as stated on this page : http://www.gimp.org/~tml/gimp/win32/downloads.html If it's stable enough then I'll use glib 2.6. It should be great if you could fix gnet to work with your code without threads, because I assume the race condition bug should not occur if there are not several threads. (This race condition occurs also if we reverse the original patch!)
> Tor, I tought glib v2.4 was still recommended on win32 because v2.6 > could still contain bugs on win32 as stated on this page Hmm, yeah, I guess it's time to change that page ;-) The warning against 2.6 is not so much because of possible bugs (at least not any longer), but because of the file name encoding change. But yes, GLib 2.4.7 works fine for most stuff, but patches for it now are mostly pointless, which was what I meant. I spent some time the gone weekend trying to get my current threadless giowin32.c working with the test programs in HEAD gnet, but without success yet. I guess I was a bit too optimistic when writing comment #30. Oh well, I will hack more on it some time later...
We could depend on nonblocking being set via g_io_win32_set_flags(). I was just looking at this because I need a nonblocking mode for pipes.
Oh wow. This was a real horror to debug and to follow what happens. But now finally gnet's conn_http_test.exe works with a threadless giowin32.c, all tests say "OK". I had to modify giowin32.c still a bit more. The semantics of the network event API is really unfriendly, especially the fact that the FD_WRITE state is handled "slightly differently" as the docs put it. Maybe, for software written from scratch against the WinSock API, it makes sense, but for GLib, which is written with Unix select() semantics in mind, getting similar behaviour out of WinSock means lots of silly bookkeeping, otherwise you will lose events and/or not do the right callbacks at the right times. Anyway, I was able to clean up lots of the Win32 ifdefs and special-case code from gnet: No gnet_Hwnd window and odd PostMessage()s any longer. The same gthread-using code for async DNS lookups can be used as on Unix. I implement inet_pton() and inet_ntop() on Win32 using WSAStringToAddress() and WSAAddressToString(). gethostbyname() and gethostbyaddr() are thread-safe in WinSock. I will attach patches tomorrow once I've slept on this and read through the patches once more. I'll have to check also that other stuff that uses giochannels for sockets still works.
not NEEDINFO, rhight ?
Just a little info from MSDN on setting a socket back to blocking mode: The WSAAsyncSelect and WSAEventSelect functions automatically set a socket to nonblocking mode. To set the socket back to blocking mode, an application must first disable WSAAsyncSelect by calling WSAAsyncSelect with the lEvent parameter equal to zero, or disable WSAEventSelect by calling WSAEventSelect with the lNetworkEvents parameter equal to zero.
Yes, the new giowin32.c does call WSAEventSelect() with a zero lNetworkEvents parameter in g_io_win32_free(), so after the giochannel has gone the caller is free to set the socket back to non-blocking mode if necessary. GLib can't do it as it doesn't know whether it was blocking or non-blocking initially.
GLib 2.8 is getting closer. I will replace the giowin32.c with the new version in HEAD now. About the patch to gnet I promised to attach in comment #35, will do that in a moment.
Created attachment 48877 [details] [review] Umm, yes, here's my diff to gnet. actually from June 14 already
Hmm, is there really any need to keep this bug open? At least the Summary is no longer relevant as there aren't any threads to leak in giowin32.c for sockets any longer.
I think this can be closed.
I have applied the patch to gnet and it works fine, so I'm pretty sure this can be closed.
Great! Resolving this...