After an evaluation, GNOME has moved from Bugzilla to GitLab. Learn more about GitLab.
No new issues can be reported in GNOME Bugzilla anymore.
To report an issue in a GNOME project, go to GNOME GitLab.
Do not go to GNOME Gitlab for: Bluefish, Doxygen, GnuCash, GStreamer, java-gnome, LDTP, NetworkManager, Tomboy.
Bug 792050 - GResolver is not thread-safe
GResolver is not thread-safe
Status: RESOLVED FIXED
Product: glib
Classification: Platform
Component: gio
2.53.x
Other Mac OS
: High major
: 2.56
Assigned To: gtkdev
gtkdev
Depends on:
Blocks:
 
 
Reported: 2017-12-29 17:52 UTC by lovetox
Modified: 2018-02-02 17:12 UTC
See Also:
GNOME target: ---
GNOME version: ---


Attachments
minimal example (1.43 KB, text/plain)
2017-12-29 17:52 UTC, lovetox
  Details
minimal example (1.53 KB, text/x-python-script)
2018-01-04 21:01 UTC, lovetox
  Details
build: Add missing config.h entries for HAVE_RES_QUERY (1.22 KB, patch)
2018-01-05 14:37 UTC, Philip Withnall
rejected Details | Review
gio: Port GThreadedResolver to use res_nquery() to fix thread-safety (10.25 KB, patch)
2018-01-05 14:38 UTC, Philip Withnall
none Details | Review
gio: Port GThreadedResolver to use res_nquery() to fix thread-safety (10.33 KB, patch)
2018-01-09 12:04 UTC, Philip Withnall
none Details | Review
gio: Port GThreadedResolver to use res_nquery() to fix thread-safety (10.25 KB, patch)
2018-01-11 11:21 UTC, Philip Withnall
committed Details | Review

Description lovetox 2017-12-29 17:52:34 UTC
Created attachment 366087 [details]
minimal example

It seems if i call lookup_records_async() while there is another query going on, all querys return empty

I cannot reproduce this on Linux

I attached sample code

This is on Mac 10.13.2
Comment 1 Philip Withnall 2018-01-03 11:37:40 UTC
It probably doesn’t help that there are no SRV records on _xmpps-client._tcp.jabber.org or _xmppa-client._tcp.jabber.org.

What output do you get if you change it to resolve _xmpp-client._tcp.jabber.org and _xmpp-server._tcp.jabber.org?
Comment 2 lovetox 2018-01-04 21:01:16 UTC
Created attachment 366329 [details]
minimal example
Comment 3 lovetox 2018-01-04 21:01:47 UTC
20:54:51 »_xmpp-client._tcp.jabber.org« kann vorübergehend nicht aufgelöst werden
20:54:51 »_xmpp-client._tcp.jabber.org« kann vorübergehend nicht aufgelöst werden
20:54:51 [GLib.Variant('(qqqs)', (30, 30, 5222, 'hermes2.jabber.org')), GLib.Variant('(qqqs)', (31, 30, 5222, 'hermes2v6.jabber.org'))]

I pressed the button again to resolve the same records again:

20:54:59 »_xmpp-client._tcp.jabber.org« kann vorübergehend nicht aufgelöst werden
20:55:04 [GLib.Variant('(qqqs)', (30, 30, 5222, 'hermes2.jabber.org')), GLib.Variant('(qqqs)', (31, 30, 5222, 'hermes2v6.jabber.org'))]
20:55:04 [GLib.Variant('(qqqs)', (30, 30, 5222, 'hermes2.jabber.org')), GLib.Variant('(qqqs)', (31, 30, 5222, 'hermes2v6.jabber.org'))]

3 requests, 3 times with the same valid record

if it works or not is currently a game by chance.

also notice how it needs in the second attempt 5 seconds to resolve 2 of 3 requests

i updated my example to reflect this result
Comment 4 Philip Withnall 2018-01-05 13:26:13 UTC
That’s pretty convincing. Looking at the man pages, it seems that res_query() is not thread-safe on various platforms (it uses the global _res state), and we should be using res_nquery() instead.

http://man7.org/linux/man-pages/man3/resolver.3.html
https://stackoverflow.com/questions/3548481/is-res-query-thread-safe/4981981#4981981

I’ll look at making a fix.
Comment 5 Dan Winship 2018-01-05 14:00:01 UTC
(In reply to Philip Withnall from comment #4)
> it seems that
> res_query() is not thread-safe on various platforms (it uses the global _res
> state)

It's not thread-safe on *any* platform if people might be modifying _res, but... you know... "don't do that then".

> and we should be using res_nquery() instead.

Ooh. Pretty sure that didn't exist when we wrote GResolver. Looks like it still doesn't exist on FreeBSD and OpenBSD (based on some quick googling).
Comment 6 Philip Withnall 2018-01-05 14:37:56 UTC
Created attachment 366381 [details] [review]
build: Add missing config.h entries for HAVE_RES_QUERY

Signed-off-by: Philip Withnall <withnall@endlessm.com>
Comment 7 Philip Withnall 2018-01-05 14:38:03 UTC
Created attachment 366382 [details] [review]
gio: Port GThreadedResolver to use res_nquery() to fix thread-safety

res_query() uses global state in the form of the struct __res_state
which contains the contents of resolv.conf (and other things). On Linux,
this state seems to be thread-local, so there is no problem. On OS X,
however, it is not, and hence multiple res_query() calls from parallel
threads will compete and return bogus results.

The fix for this is to use res_nquery(), introduced in BIND 8.2, which
takes an explicit state argument. This allows us to manually store the
state thread-locally. If res_nquery() isn’t available, we fall back to
res_query(). It should be available on OS X though. As a data point,
it’s available on Fedora 27.

There’s a slight complication in the fact that OS X requires the state
to be freed using res_ndestroy() rather than res_nclose(). Linux uses
res_nclose().

(See, for example, the NetBSD man page:
https://www.unix.com/man-page/netbsd/3/res_ninit/. The Linux one is
incomplete and not so useful:
http://man7.org/linux/man-pages/man3/resolver.3.html.)

The new code will call res_ninit() once per res_nquery() task. This is
not optimal, but no worse than before — since res_query() was being
called in a worker thread, on Linux, it would implicitly initialise the
thread-local struct __res_state when it was called. We’ve essentially
just made that explicit. In practical terms, this means a
stat("/etc/resolv.conf") call per res_nquery() task.

In future, we could improve this by using an explicit thread pool with
some manually-created worker threads, each of which initialises a struct
__res_state on spawning, and only updates it on receiving
the #GResolver::reload signal.

Signed-off-by: Philip Withnall <withnall@endlessm.com>
Comment 8 Philip Withnall 2018-01-05 14:40:02 UTC
Dan, what do you think about the approach in these patches (and the potential future plans, although I’m not committing to writing them, so assume they will never be implemented…). I ask since you originally wrote GThreadedResolver.

One thing I forgot to note above is that I’ve left the res_init() call in gresolver.c in place (without porting it to res_ninit()) since it seems to be doing no harm, and people might be relying on it updating the main thread’s resolver state for them.
Comment 9 Philip Withnall 2018-01-05 14:41:43 UTC
(In reply to Dan Winship from comment #5)
> (In reply to Philip Withnall from comment #4)
> > it seems that
> > res_query() is not thread-safe on various platforms (it uses the global _res
> > state)
> 
> It's not thread-safe on *any* platform if people might be modifying _res,
> but... you know... "don't do that then".
> 
> > and we should be using res_nquery() instead.
> 
> Ooh. Pretty sure that didn't exist when we wrote GResolver. Looks like it
> still doesn't exist on FreeBSD and OpenBSD (based on some quick googling).

Oh, hello. You replied while I was writing my patches and I didn’t notice. :-)

I read things which say that _res is thread-local on Linux, but tbh a lot of what I read seemed fairly uninformed, so anything could be true.

res_nquery() seems to exist on NetBSD but not FreeBSD or OpenBSD. It does exist on OS X though, which is the key thing here.
Comment 10 Philip Withnall 2018-01-05 14:42:17 UTC
(In reply to Dan Winship from comment #5)
> (In reply to Philip Withnall from comment #4)
> > it seems that
> > res_query() is not thread-safe on various platforms (it uses the global _res
> > state)
> 
> It's not thread-safe on *any* platform if people might be modifying _res,
> but... you know... "don't do that then".

Does res_query() itself not modify the state?
Comment 11 Dan Winship 2018-01-07 16:34:34 UTC
_res is not *documented* as being modified by anything other than res_init(), and it's implied to just be a representation of resolv.conf. But I guess nothing stops implementations from storing per-query state in it.
Comment 12 Philip Withnall 2018-01-07 16:46:52 UTC
Mmm. The documentation for these APIs seems to be a bit sketchy/incomplete, and the weird results on OS X do point towards a race.

lovetox, are you able to recompile GLib with my two patches applied and see if they fix your issue? If you could, that would be very helpful.
Comment 13 Dan Winship 2018-01-08 22:50:19 UTC
Comment on attachment 366381 [details] [review]
build: Add missing config.h entries for HAVE_RES_QUERY

The configure checks for res_query aren't checking to see *if* res_query exists, they're checking to see *where* it exists. We require res_query() to exist (somewhere) on G_OS_UNIX (and we know it doesn't exist on G_OS_WIN32).

Note that it's possible that some or all of the res_query-related configure checks only apply to no-longer-supported OSes or OS versions.
Comment 14 Dan Winship 2018-01-08 23:00:14 UTC
Comment on attachment 366382 [details] [review]
gio: Port GThreadedResolver to use res_nquery() to fix thread-safety

>+  AC_MSG_CHECKING([for res_ninit])
>+  AC_MSG_CHECKING([for res_nquery])

You check for these separately but you assume that res_ninit is available ifdef HAVE_RES_NQUERY in the code. (Which is fine.)

>+   * What we have currently is no worse than using res_query() in worker
>+   * threads, since it would transparently call res_init() for each new worker
>+   * thread. */

FWIW this isn't quite true since GThreadedResolver uses GTask which uses GThreadPool, so threads do get reused. (This isn't an argument against the code; there are [non-glib] codebases that just call res_init() before every resolver call to ensure they pick up resolv.conf changes, and no one notices any performance problems there.)

>+#elif defined(HAVE_RES_QUERY)
>       len = res_query (lrd->rrname, C_IN, rrtype, answer->data, answer->len);
>+#else
>+#error res_nquery() and res_query() not supported.
>+#endif

Don't need all this. configure will error out if res_query isn't found.

>+#elif defined(HAVE_RES_NINIT)
>+#error Your platform has res_ninit() but not res_nclose() or res_ndestroy(). Please file a bug.
>+#endif

Seems like that should be a configure-time error, not compile-time.
Comment 15 Philip Withnall 2018-01-09 11:56:34 UTC
Review of attachment 366381 [details] [review]:

Uff, that was stupid.
Comment 16 Philip Withnall 2018-01-09 12:03:56 UTC
(In reply to Dan Winship from comment #14)
> Comment on attachment 366382 [details] [review] [review]
> gio: Port GThreadedResolver to use res_nquery() to fix thread-safety
> 
> >+  AC_MSG_CHECKING([for res_ninit])
> >+  AC_MSG_CHECKING([for res_nquery])
> 
> You check for these separately but you assume that res_ninit is available
> ifdef HAVE_RES_NQUERY in the code. (Which is fine.)

Yeah, it seemed to be simplest that way.

> >+   * What we have currently is no worse than using res_query() in worker
> >+   * threads, since it would transparently call res_init() for each new worker
> >+   * thread. */
> 
> FWIW this isn't quite true since GThreadedResolver uses GTask which uses
> GThreadPool, so threads do get reused. (This isn't an argument against the
> code; there are [non-glib] codebases that just call res_init() before every
> resolver call to ensure they pick up resolv.conf changes, and no one notices
> any performance problems there.)

I’ll amend the comment.

> >+#elif defined(HAVE_RES_QUERY)
> >       len = res_query (lrd->rrname, C_IN, rrtype, answer->data, answer->len);
> >+#else
> >+#error res_nquery() and res_query() not supported.
> >+#endif
> 
> Don't need all this. configure will error out if res_query isn't found.

I put it in much like a g_assert_not_reached(): it’s mostly there as documentation to anyone reading the code that this situation really shouldn’t happen. i.e. Don’t read the code and think that omitting both calls is a possibility.

> >+#elif defined(HAVE_RES_NINIT)
> >+#error Your platform has res_ninit() but not res_nclose() or res_ndestroy(). Please file a bug.
> >+#endif
> 
> Seems like that should be a configure-time error, not compile-time.

It seems unlikely that this would happen (so I don’t want to complicate the configure script with it), but like above I do want to put something in the code like a g_assert_not_reached().
Comment 17 Philip Withnall 2018-01-09 12:04:47 UTC
Created attachment 366544 [details] [review]
gio: Port GThreadedResolver to use res_nquery() to fix thread-safety

res_query() uses global state in the form of the struct __res_state
which contains the contents of resolv.conf (and other things). On Linux,
this state seems to be thread-local, so there is no problem. On OS X,
however, it is not, and hence multiple res_query() calls from parallel
threads will compete and return bogus results.

The fix for this is to use res_nquery(), introduced in BIND 8.2, which
takes an explicit state argument. This allows us to manually store the
state thread-locally. If res_nquery() isn’t available, we fall back to
res_query(). It should be available on OS X though. As a data point,
it’s available on Fedora 27.

There’s a slight complication in the fact that OS X requires the state
to be freed using res_ndestroy() rather than res_nclose(). Linux uses
res_nclose().

(See, for example, the NetBSD man page:
https://www.unix.com/man-page/netbsd/3/res_ninit/. The Linux one is
incomplete and not so useful:
http://man7.org/linux/man-pages/man3/resolver.3.html.)

The new code will call res_ninit() once per res_nquery() task. This is
not optimal, but no worse than before — since res_query() was being
called in a worker thread, on Linux, it would implicitly initialise the
thread-local struct __res_state when it was called. We’ve essentially
just made that explicit. In practical terms, this means a
stat("/etc/resolv.conf") call per res_nquery() task.

In future, we could improve this by using an explicit thread pool with
some manually-created worker threads, each of which initialises a struct
__res_state on spawning, and only updates it on receiving
the #GResolver::reload signal.

Signed-off-by: Philip Withnall <withnall@endlessm.com>
Comment 18 Dan Winship 2018-01-09 14:57:27 UTC
(In reply to Philip Withnall from comment #16)
> > You check for these separately but you assume that res_ninit is available
> > ifdef HAVE_RES_NQUERY in the code. (Which is fine.)
> 
> Yeah, it seemed to be simplest that way.

I meant to imply "so don't even bother checking for res_ninit". Your call.

> > Don't need all this. configure will error out if res_query isn't found.
> 
> I put it in much like a g_assert_not_reached(): it’s mostly there as
> documentation to anyone reading the code that this situation really
> shouldn’t happen. i.e. Don’t read the code and think that omitting both
> calls is a possibility.

Hm. It doesn't look like "assert not reached" to me; it looks like "TODO". Maybe because it reminds me of the #else clause in gcredentials.c:

    #else
      #ifdef __GNUC__
      #warning Please add GCredentials support for your OS
      #endif
    #endif

But here we know that if they don't have res_nquery() then they definitely have res_query(). So just write:

    #ifdef HAVE_RES_NQUERY
            len = res_nquery (...
    #else
            len = res_query (...
    #endif
Comment 19 Philip Withnall 2018-01-11 11:14:48 UTC
(In reply to Dan Winship from comment #18)
> (In reply to Philip Withnall from comment #16)
> > > You check for these separately but you assume that res_ninit is available
> > > ifdef HAVE_RES_NQUERY in the code. (Which is fine.)
> > 
> > Yeah, it seemed to be simplest that way.
> 
> I meant to imply "so don't even bother checking for res_ninit". Your call.

I’d marginally prefer to have lack of res_ninit() caught by configure than at compile time.

Also, HAVE_RES_NINIT is used in the #error section for if they don’t have res_ndestroy() or res_nclose().

> > > Don't need all this. configure will error out if res_query isn't found.
> > 
> > I put it in much like a g_assert_not_reached(): it’s mostly there as
> > documentation to anyone reading the code that this situation really
> > shouldn’t happen. i.e. Don’t read the code and think that omitting both
> > calls is a possibility.
> 
> Hm. It doesn't look like "assert not reached" to me; it looks like "TODO".
> Maybe because it reminds me of the #else clause in gcredentials.c:
> 
>     #else
>       #ifdef __GNUC__
>       #warning Please add GCredentials support for your OS
>       #endif
>     #endif
> 
> But here we know that if they don't have res_nquery() then they definitely
> have res_query(). So just write:
> 
>     #ifdef HAVE_RES_NQUERY
>             len = res_nquery (...
>     #else
>             len = res_query (...
>     #endif

Ah, indeed, thanks. Fixed.
Comment 20 Philip Withnall 2018-01-11 11:21:26 UTC
Created attachment 366646 [details] [review]
gio: Port GThreadedResolver to use res_nquery() to fix thread-safety

res_query() uses global state in the form of the struct __res_state
which contains the contents of resolv.conf (and other things). On Linux,
this state seems to be thread-local, so there is no problem. On OS X,
however, it is not, and hence multiple res_query() calls from parallel
threads will compete and return bogus results.

The fix for this is to use res_nquery(), introduced in BIND 8.2, which
takes an explicit state argument. This allows us to manually store the
state thread-locally. If res_nquery() isn’t available, we fall back to
res_query(). It should be available on OS X though. As a data point,
it’s available on Fedora 27.

There’s a slight complication in the fact that OS X requires the state
to be freed using res_ndestroy() rather than res_nclose(). Linux uses
res_nclose().

(See, for example, the NetBSD man page:
https://www.unix.com/man-page/netbsd/3/res_ninit/. The Linux one is
incomplete and not so useful:
http://man7.org/linux/man-pages/man3/resolver.3.html.)

The new code will call res_ninit() once per res_nquery() task. This is
not optimal, but no worse than before — since res_query() was being
called in a worker thread, on Linux, it would implicitly initialise the
thread-local struct __res_state when it was called. We’ve essentially
just made that explicit. In practical terms, this means a
stat("/etc/resolv.conf") call per res_nquery() task.

In future, we could improve this by using an explicit thread pool with
some manually-created worker threads, each of which initialises a struct
__res_state on spawning, and only updates it on receiving
the #GResolver::reload signal.

Signed-off-by: Philip Withnall <withnall@endlessm.com>
Comment 21 Philip Withnall 2018-01-17 16:12:07 UTC
Any feedback on this version, Dan?
Comment 22 lovetox 2018-01-26 23:51:24 UTC
Please someone review or commit that
Comment 23 Emmanuele Bassi (:ebassi) 2018-02-02 16:43:25 UTC
Review of attachment 366646 [details] [review]:

This looks okay to me; the error message could do with a better description of what a user is supposed to do, but that can be fixed when pushing it.

::: gio/gthreadedresolver.c
@@ +875,3 @@
+  res_nclose (&res);
+#elif defined(HAVE_RES_NINIT)
+#error Your platform has res_ninit() but not res_nclose() or res_ndestroy(). Please file a bug.

"Please file a bug" where?
Comment 24 Philip Withnall 2018-02-02 17:12:22 UTC
Thanks for the review; pushed with a link to Bugzilla added to the error message.

Attachment 366646 [details] pushed as 40be86b - gio: Port GThreadedResolver to use res_nquery() to fix thread-safety