GNOME Bugzilla – Bug 756430
g_rw_lock_reader_lock() can return without locking, or error
Last modified: 2017-11-03 20:52:57 UTC
g_rw_lock_reader_lock() returns void. On POSIX systems it just calls pthread_rwlock_rdlock() which can return several different error values, any of which indicate that the lock is not taken. These errors are not propagated or visible to the caller of the glib function. Code that relies on the declared semantics of g_rw_lock_read_lock() can end up causing memory corruption because it does not in fact ensure that the protected data/code is locked as intended. OS X supports recursive RW locks, and so this does not occur on that particular variant of POSIX, but Linux does not. We discovered this by (incorrectly) having a thread which took a read/write lock and then attempted to take a read lock. Using pthreads directly on Linux, this should result in EDEADLCK. Using glib's wrappers, g_rw_lock_reader_lock() returns as if there is no problem, and the lock is now held. This also results in undefined behaviour for the unlock, since the lock is not actually held.
There are earlier bugs about the differences between how glibc pthreads doesn't do error checking by default, but e.g. FreeBSD does etc. I think in the end the best bet is to use platform debug tools for threading (e.g. helgrind), rather than GLib attempting to impose a platform-independent threading error checking mechanism.
This isn't about debugging. The declared semantics of g_rw_lock_reader_lock() do not match its behaviour. Those semantics say that on return, the lock is held. This isn't true.
> Those semantics say that on return, the lock is held. This isn't true. Its true unless an error occurred. I think your original complaint was more to the point: > These errors are not propagated or visible to the caller of the glib function.
The operation invoked by g_rw_lock_reader_lock() can fail. But the function does not propagate or make visible the failure. Using helgrind can work once you have some idea that a problem is lock-related. But we were debugging random memory corruption, occuring in a single thread. Note that the documentation for the unlock function notes the risks of unlocking an unlocked mutex. We don't know for sure if the corruption was caused by the unlock itself, or something else, but it was fixed when we stopped causing (silent) failure in the lock call. To have this function present in the glib API is basically offering the illusion of its functionality with no suggestion or hint as to the problems it could cause. Is this acceptable?
It can't return *useful* errors. If it returns an error, then there's a bug in your program, and there's nothing you can programmatically do to recover from it. But we could make the glib function warn/return-if-fail/abort if the pthread function returned an error.
Could we handle EAGAIN meaningfully ?
I think EAGAIN here is like EMFILE: it means that either you have a resource leak, or else you need to bump up your process limits. I don't think there's a meaningful way to actually handle it.
(In reply to Dan Winship from comment #5) > It can't return *useful* errors. If it returns an error, then there's a bug > in your program, and there's nothing you can programmatically do to recover > from it. The GThreads RW lock API is clearly modelled on POSIX, and POSIX clearly states that a rdlock can fail to be acquired. Any sane consideration of RW locks from a theoretical perspective would also indicate that an attempt to acquire the lock can fail. Why would glib use a void type for a function that may not be able to do what it claims? Are you suggesting that it makes no sense for pthread_rwlock_rdlock() to return an error either, since there's nothing you can programmatically do to recover from it? If it is really believed that there's no sensible code path after a failed rdlock attempt, then the glib version should abort, if for no reason that to prevent the program from entering "undefined behaviour" territory when it unlocks.
Is anything going to be done about this? Nobody has responded to my comparison of the POSIX RW Lock API and Glib behaviour. I cannot see how the current glib behaviour can be remotely acceptable: there is NOTHING wrong with a read lock failing to be acquired (it just means that the write lock is currently held).
(In reply to Paul Davis from comment #9) > there is NOTHING wrong > with a read lock failing to be acquired (it just means that the write lock > is currently held). The pthread_rwlock_rdlock() man page does seem to go out of its way to imply that for some reason, but it eventually clarifies that "If the read lock is not acquired, the calling thread shall block until it can acquire the lock." The only cases where it returns without acquiring the lock are in situations that you can't plausibly recover from (programmer error or resource exhaustion).
I think you simply want pthread_rwlock_tryrdlock ?
Created attachment 362909 [details] [review] gthread: Emit a critical if g_rw_lock_reader_lock() fails It can only fail if there’s been a leak or programmer error, so this is really unlikely to happen. At least make it obvious something has gone wrong, though, rather than silently carrying on and returning as if the reader lock has been acquired. Signed-off-by: Philip Withnall <withnall@endlessm.com>
As others have said, you’re only going to get an error return here if there’s a programmer error. We can’t change the API of g_rw_lock_reader_lock() to expose the failure, but we can emit a critical warning, which will at least help with debugging the programmer error (and is consistent with our reporting of other programmer errors).
Created attachment 362913 [details] [review] gthread: Emit a critical if g_rw_lock_reader_lock() fails It can only fail if there’s been a leak or programmer error, so this is really unlikely to happen. At least make it obvious something has gone wrong, though, rather than silently carrying on and returning as if the reader lock has been acquired. Do the same for g_rw_lock_writer_lock(). It should be safe to use g_critical() for reporting the problems, since GRWLock is not used in gmessages.c, and printing a critical seems better than aborting, just in case we do hit the ‘maximum number of reader locks’ error code. Signed-off-by: Philip Withnall <withnall@endlessm.com>
Review of attachment 362913 [details] [review]: ::: glib/gthread-posix.c @@ +542,3 @@ + + if (retval != 0) + g_critical ("Failed to get RW lock (error %d)", retval); Let's use %p here so if the process dumps core a developer can see *which* lock? And why not `g_strerror (errno)`?
Created attachment 362925 [details] [review] gthread: Emit a critical if g_rw_lock_reader_lock() fails It can only fail if there’s been a leak or programmer error, so this is really unlikely to happen. At least make it obvious something has gone wrong, though, rather than silently carrying on and returning as if the reader lock has been acquired. Do the same for g_rw_lock_writer_lock(). It should be safe to use g_critical() for reporting the problems, since GRWLock is not used in gmessages.c, and printing a critical seems better than aborting, just in case we do hit the ‘maximum number of reader locks’ error code. Signed-off-by: Philip Withnall <withnall@endlessm.com>
Review of attachment 362925 [details] [review]: LGTM.
Attachment 362925 [details] pushed as fc817eb - gthread: Emit a critical if g_rw_lock_reader_lock() fails