GNOME Bugzilla – Bug 792050
GResolver is not thread-safe
Last modified: 2018-02-02 17:12:27 UTC
Created attachment 366087 [details] minimal example It seems if i call lookup_records_async() while there is another query going on, all querys return empty I cannot reproduce this on Linux I attached sample code This is on Mac 10.13.2
It probably doesn’t help that there are no SRV records on _xmpps-client._tcp.jabber.org or _xmppa-client._tcp.jabber.org. What output do you get if you change it to resolve _xmpp-client._tcp.jabber.org and _xmpp-server._tcp.jabber.org?
Created attachment 366329 [details] minimal example
20:54:51 »_xmpp-client._tcp.jabber.org« kann vorübergehend nicht aufgelöst werden 20:54:51 »_xmpp-client._tcp.jabber.org« kann vorübergehend nicht aufgelöst werden 20:54:51 [GLib.Variant('(qqqs)', (30, 30, 5222, 'hermes2.jabber.org')), GLib.Variant('(qqqs)', (31, 30, 5222, 'hermes2v6.jabber.org'))] I pressed the button again to resolve the same records again: 20:54:59 »_xmpp-client._tcp.jabber.org« kann vorübergehend nicht aufgelöst werden 20:55:04 [GLib.Variant('(qqqs)', (30, 30, 5222, 'hermes2.jabber.org')), GLib.Variant('(qqqs)', (31, 30, 5222, 'hermes2v6.jabber.org'))] 20:55:04 [GLib.Variant('(qqqs)', (30, 30, 5222, 'hermes2.jabber.org')), GLib.Variant('(qqqs)', (31, 30, 5222, 'hermes2v6.jabber.org'))] 3 requests, 3 times with the same valid record if it works or not is currently a game by chance. also notice how it needs in the second attempt 5 seconds to resolve 2 of 3 requests i updated my example to reflect this result
That’s pretty convincing. Looking at the man pages, it seems that res_query() is not thread-safe on various platforms (it uses the global _res state), and we should be using res_nquery() instead. http://man7.org/linux/man-pages/man3/resolver.3.html https://stackoverflow.com/questions/3548481/is-res-query-thread-safe/4981981#4981981 I’ll look at making a fix.
(In reply to Philip Withnall from comment #4) > it seems that > res_query() is not thread-safe on various platforms (it uses the global _res > state) It's not thread-safe on *any* platform if people might be modifying _res, but... you know... "don't do that then". > and we should be using res_nquery() instead. Ooh. Pretty sure that didn't exist when we wrote GResolver. Looks like it still doesn't exist on FreeBSD and OpenBSD (based on some quick googling).
Created attachment 366381 [details] [review] build: Add missing config.h entries for HAVE_RES_QUERY Signed-off-by: Philip Withnall <withnall@endlessm.com>
Created attachment 366382 [details] [review] gio: Port GThreadedResolver to use res_nquery() to fix thread-safety res_query() uses global state in the form of the struct __res_state which contains the contents of resolv.conf (and other things). On Linux, this state seems to be thread-local, so there is no problem. On OS X, however, it is not, and hence multiple res_query() calls from parallel threads will compete and return bogus results. The fix for this is to use res_nquery(), introduced in BIND 8.2, which takes an explicit state argument. This allows us to manually store the state thread-locally. If res_nquery() isn’t available, we fall back to res_query(). It should be available on OS X though. As a data point, it’s available on Fedora 27. There’s a slight complication in the fact that OS X requires the state to be freed using res_ndestroy() rather than res_nclose(). Linux uses res_nclose(). (See, for example, the NetBSD man page: https://www.unix.com/man-page/netbsd/3/res_ninit/. The Linux one is incomplete and not so useful: http://man7.org/linux/man-pages/man3/resolver.3.html.) The new code will call res_ninit() once per res_nquery() task. This is not optimal, but no worse than before — since res_query() was being called in a worker thread, on Linux, it would implicitly initialise the thread-local struct __res_state when it was called. We’ve essentially just made that explicit. In practical terms, this means a stat("/etc/resolv.conf") call per res_nquery() task. In future, we could improve this by using an explicit thread pool with some manually-created worker threads, each of which initialises a struct __res_state on spawning, and only updates it on receiving the #GResolver::reload signal. Signed-off-by: Philip Withnall <withnall@endlessm.com>
Dan, what do you think about the approach in these patches (and the potential future plans, although I’m not committing to writing them, so assume they will never be implemented…). I ask since you originally wrote GThreadedResolver. One thing I forgot to note above is that I’ve left the res_init() call in gresolver.c in place (without porting it to res_ninit()) since it seems to be doing no harm, and people might be relying on it updating the main thread’s resolver state for them.
(In reply to Dan Winship from comment #5) > (In reply to Philip Withnall from comment #4) > > it seems that > > res_query() is not thread-safe on various platforms (it uses the global _res > > state) > > It's not thread-safe on *any* platform if people might be modifying _res, > but... you know... "don't do that then". > > > and we should be using res_nquery() instead. > > Ooh. Pretty sure that didn't exist when we wrote GResolver. Looks like it > still doesn't exist on FreeBSD and OpenBSD (based on some quick googling). Oh, hello. You replied while I was writing my patches and I didn’t notice. :-) I read things which say that _res is thread-local on Linux, but tbh a lot of what I read seemed fairly uninformed, so anything could be true. res_nquery() seems to exist on NetBSD but not FreeBSD or OpenBSD. It does exist on OS X though, which is the key thing here.
(In reply to Dan Winship from comment #5) > (In reply to Philip Withnall from comment #4) > > it seems that > > res_query() is not thread-safe on various platforms (it uses the global _res > > state) > > It's not thread-safe on *any* platform if people might be modifying _res, > but... you know... "don't do that then". Does res_query() itself not modify the state?
_res is not *documented* as being modified by anything other than res_init(), and it's implied to just be a representation of resolv.conf. But I guess nothing stops implementations from storing per-query state in it.
Mmm. The documentation for these APIs seems to be a bit sketchy/incomplete, and the weird results on OS X do point towards a race. lovetox, are you able to recompile GLib with my two patches applied and see if they fix your issue? If you could, that would be very helpful.
Comment on attachment 366381 [details] [review] build: Add missing config.h entries for HAVE_RES_QUERY The configure checks for res_query aren't checking to see *if* res_query exists, they're checking to see *where* it exists. We require res_query() to exist (somewhere) on G_OS_UNIX (and we know it doesn't exist on G_OS_WIN32). Note that it's possible that some or all of the res_query-related configure checks only apply to no-longer-supported OSes or OS versions.
Comment on attachment 366382 [details] [review] gio: Port GThreadedResolver to use res_nquery() to fix thread-safety >+ AC_MSG_CHECKING([for res_ninit]) >+ AC_MSG_CHECKING([for res_nquery]) You check for these separately but you assume that res_ninit is available ifdef HAVE_RES_NQUERY in the code. (Which is fine.) >+ * What we have currently is no worse than using res_query() in worker >+ * threads, since it would transparently call res_init() for each new worker >+ * thread. */ FWIW this isn't quite true since GThreadedResolver uses GTask which uses GThreadPool, so threads do get reused. (This isn't an argument against the code; there are [non-glib] codebases that just call res_init() before every resolver call to ensure they pick up resolv.conf changes, and no one notices any performance problems there.) >+#elif defined(HAVE_RES_QUERY) > len = res_query (lrd->rrname, C_IN, rrtype, answer->data, answer->len); >+#else >+#error res_nquery() and res_query() not supported. >+#endif Don't need all this. configure will error out if res_query isn't found. >+#elif defined(HAVE_RES_NINIT) >+#error Your platform has res_ninit() but not res_nclose() or res_ndestroy(). Please file a bug. >+#endif Seems like that should be a configure-time error, not compile-time.
Review of attachment 366381 [details] [review]: Uff, that was stupid.
(In reply to Dan Winship from comment #14) > Comment on attachment 366382 [details] [review] [review] > gio: Port GThreadedResolver to use res_nquery() to fix thread-safety > > >+ AC_MSG_CHECKING([for res_ninit]) > >+ AC_MSG_CHECKING([for res_nquery]) > > You check for these separately but you assume that res_ninit is available > ifdef HAVE_RES_NQUERY in the code. (Which is fine.) Yeah, it seemed to be simplest that way. > >+ * What we have currently is no worse than using res_query() in worker > >+ * threads, since it would transparently call res_init() for each new worker > >+ * thread. */ > > FWIW this isn't quite true since GThreadedResolver uses GTask which uses > GThreadPool, so threads do get reused. (This isn't an argument against the > code; there are [non-glib] codebases that just call res_init() before every > resolver call to ensure they pick up resolv.conf changes, and no one notices > any performance problems there.) I’ll amend the comment. > >+#elif defined(HAVE_RES_QUERY) > > len = res_query (lrd->rrname, C_IN, rrtype, answer->data, answer->len); > >+#else > >+#error res_nquery() and res_query() not supported. > >+#endif > > Don't need all this. configure will error out if res_query isn't found. I put it in much like a g_assert_not_reached(): it’s mostly there as documentation to anyone reading the code that this situation really shouldn’t happen. i.e. Don’t read the code and think that omitting both calls is a possibility. > >+#elif defined(HAVE_RES_NINIT) > >+#error Your platform has res_ninit() but not res_nclose() or res_ndestroy(). Please file a bug. > >+#endif > > Seems like that should be a configure-time error, not compile-time. It seems unlikely that this would happen (so I don’t want to complicate the configure script with it), but like above I do want to put something in the code like a g_assert_not_reached().
Created attachment 366544 [details] [review] gio: Port GThreadedResolver to use res_nquery() to fix thread-safety res_query() uses global state in the form of the struct __res_state which contains the contents of resolv.conf (and other things). On Linux, this state seems to be thread-local, so there is no problem. On OS X, however, it is not, and hence multiple res_query() calls from parallel threads will compete and return bogus results. The fix for this is to use res_nquery(), introduced in BIND 8.2, which takes an explicit state argument. This allows us to manually store the state thread-locally. If res_nquery() isn’t available, we fall back to res_query(). It should be available on OS X though. As a data point, it’s available on Fedora 27. There’s a slight complication in the fact that OS X requires the state to be freed using res_ndestroy() rather than res_nclose(). Linux uses res_nclose(). (See, for example, the NetBSD man page: https://www.unix.com/man-page/netbsd/3/res_ninit/. The Linux one is incomplete and not so useful: http://man7.org/linux/man-pages/man3/resolver.3.html.) The new code will call res_ninit() once per res_nquery() task. This is not optimal, but no worse than before — since res_query() was being called in a worker thread, on Linux, it would implicitly initialise the thread-local struct __res_state when it was called. We’ve essentially just made that explicit. In practical terms, this means a stat("/etc/resolv.conf") call per res_nquery() task. In future, we could improve this by using an explicit thread pool with some manually-created worker threads, each of which initialises a struct __res_state on spawning, and only updates it on receiving the #GResolver::reload signal. Signed-off-by: Philip Withnall <withnall@endlessm.com>
(In reply to Philip Withnall from comment #16) > > You check for these separately but you assume that res_ninit is available > > ifdef HAVE_RES_NQUERY in the code. (Which is fine.) > > Yeah, it seemed to be simplest that way. I meant to imply "so don't even bother checking for res_ninit". Your call. > > Don't need all this. configure will error out if res_query isn't found. > > I put it in much like a g_assert_not_reached(): it’s mostly there as > documentation to anyone reading the code that this situation really > shouldn’t happen. i.e. Don’t read the code and think that omitting both > calls is a possibility. Hm. It doesn't look like "assert not reached" to me; it looks like "TODO". Maybe because it reminds me of the #else clause in gcredentials.c: #else #ifdef __GNUC__ #warning Please add GCredentials support for your OS #endif #endif But here we know that if they don't have res_nquery() then they definitely have res_query(). So just write: #ifdef HAVE_RES_NQUERY len = res_nquery (... #else len = res_query (... #endif
(In reply to Dan Winship from comment #18) > (In reply to Philip Withnall from comment #16) > > > You check for these separately but you assume that res_ninit is available > > > ifdef HAVE_RES_NQUERY in the code. (Which is fine.) > > > > Yeah, it seemed to be simplest that way. > > I meant to imply "so don't even bother checking for res_ninit". Your call. I’d marginally prefer to have lack of res_ninit() caught by configure than at compile time. Also, HAVE_RES_NINIT is used in the #error section for if they don’t have res_ndestroy() or res_nclose(). > > > Don't need all this. configure will error out if res_query isn't found. > > > > I put it in much like a g_assert_not_reached(): it’s mostly there as > > documentation to anyone reading the code that this situation really > > shouldn’t happen. i.e. Don’t read the code and think that omitting both > > calls is a possibility. > > Hm. It doesn't look like "assert not reached" to me; it looks like "TODO". > Maybe because it reminds me of the #else clause in gcredentials.c: > > #else > #ifdef __GNUC__ > #warning Please add GCredentials support for your OS > #endif > #endif > > But here we know that if they don't have res_nquery() then they definitely > have res_query(). So just write: > > #ifdef HAVE_RES_NQUERY > len = res_nquery (... > #else > len = res_query (... > #endif Ah, indeed, thanks. Fixed.
Created attachment 366646 [details] [review] gio: Port GThreadedResolver to use res_nquery() to fix thread-safety res_query() uses global state in the form of the struct __res_state which contains the contents of resolv.conf (and other things). On Linux, this state seems to be thread-local, so there is no problem. On OS X, however, it is not, and hence multiple res_query() calls from parallel threads will compete and return bogus results. The fix for this is to use res_nquery(), introduced in BIND 8.2, which takes an explicit state argument. This allows us to manually store the state thread-locally. If res_nquery() isn’t available, we fall back to res_query(). It should be available on OS X though. As a data point, it’s available on Fedora 27. There’s a slight complication in the fact that OS X requires the state to be freed using res_ndestroy() rather than res_nclose(). Linux uses res_nclose(). (See, for example, the NetBSD man page: https://www.unix.com/man-page/netbsd/3/res_ninit/. The Linux one is incomplete and not so useful: http://man7.org/linux/man-pages/man3/resolver.3.html.) The new code will call res_ninit() once per res_nquery() task. This is not optimal, but no worse than before — since res_query() was being called in a worker thread, on Linux, it would implicitly initialise the thread-local struct __res_state when it was called. We’ve essentially just made that explicit. In practical terms, this means a stat("/etc/resolv.conf") call per res_nquery() task. In future, we could improve this by using an explicit thread pool with some manually-created worker threads, each of which initialises a struct __res_state on spawning, and only updates it on receiving the #GResolver::reload signal. Signed-off-by: Philip Withnall <withnall@endlessm.com>
Any feedback on this version, Dan?
Please someone review or commit that
Review of attachment 366646 [details] [review]: This looks okay to me; the error message could do with a better description of what a user is supposed to do, but that can be fixed when pushing it. ::: gio/gthreadedresolver.c @@ +875,3 @@ + res_nclose (&res); +#elif defined(HAVE_RES_NINIT) +#error Your platform has res_ninit() but not res_nclose() or res_ndestroy(). Please file a bug. "Please file a bug" where?
Thanks for the review; pushed with a link to Bugzilla added to the error message. Attachment 366646 [details] pushed as 40be86b - gio: Port GThreadedResolver to use res_nquery() to fix thread-safety