GNOME Bugzilla – Bug 697316
abort trying to unlock an non existent mutex
Last modified: 2013-05-01 08:45:37 UTC
Hi. I'm having a crash in tracker due to what it seems like unlocking a non existing or already unlocked mutex... On BSD systems, this is an issue because PTHREAD_MUTEX_DEFAULT is set to PTHREAD_MUTEX_STRICT_NP which means we end up with an abort(). $ /usr/local/libexec/tracker-miner-fs (tracker-miner-fs:14000): Tracker-WARNING **: Locale '0' is not set, defaulting to C locale Abort trap (core dumped) (gdb) bt
+ Trace 231730
Ok it seems that the issue is that tracker_sparql_backend_get() does this: g_mutex_lock (&tracker_sparql_backend_door); { ... if something bad, then goto __finally3 } result = _result_; { g_mutex_unlock (&tracker_sparql_backend_door); } return result; { __finally3: { g_mutex_unlock (&tracker_sparql_backend_door); } So when tracker_sparql_backend_get behaves correctly, you will hit the second g_mutex_unlock() in __finally3 on a mutex that has already been unlocked. I am not familiar with vala so I cannot provide a patch for tracker-backend.vala, but directly removing the first g_mutex_unlock() call from tracker-backend.c does the trick for me and AFAIU, __finally3 is always reached so there is no chance to leave a locked mutex around. At least, that's my understanding...
(In reply to comment #1) > So when tracker_sparql_backend_get behaves correctly, you will hit the > second g_mutex_unlock() in __finally3 on a mutex that has already been > unlocked. There is a return statement before __finally3, which means that __finally3 will never be reached without 'goto __finally3'.
(In reply to comment #2) > (In reply to comment #1) > > So when tracker_sparql_backend_get behaves correctly, you will hit the > > second g_mutex_unlock() in __finally3 on a mutex that has already been > > unlocked. > > There is a return statement before __finally3, which means that __finally3 will > never be reached without 'goto __finally3'. Ah right, I missed that. The C file is quite hard to read with all the line numbers comments :-/ The issue still stands though; I'm not quite sure why I'm getting this double unlock.
So, I recompiled our thread library using #define PTHREAD_MUTEX_DEFAULT PTHREAD_MUTEX_NORMAL With this, the crash does not occur. So there is indeed an issue with unlocking an unlocked/uninitialized mutex... I am really lost in vala, and could use some help :-/ Thanks.
After some investigation with Antoine, this is down to this in Glib: #ifdef PTHREAD_ADAPTIVE_MUTEX_INITIALIZER_NP { pthread_mutexattr_t attr; pthread_mutexattr_init (&attr); pthread_mutexattr_settype (&attr, PTHREAD_MUTEX_ADAPTIVE_NP); pattr = &attr; } #endif if G_UNLIKELY ((status = pthread_mutex_init (mutex, pattr)) != 0) g_thread_abort (status, "pthread_mutex_init"); #ifdef PTHREAD_ADAPTIVE_MUTEX_NP pthread_mutexattr_destroy (&attr); #endif -- On Linux systems, the definition _INITIALIZER_NP exists so we don't have a problem with the _unlock() issue because it just fails if it can't _unlock(). For BSD (and likely freebsd, netbsd) it seems _DEFAULT is set to PTHREAD_MUTEX_STRICT_NP which will abort on _unlock() failures. That's why Antoine's #define fixes it for him in libpthread. I would suggest that we add a line there after the if G_UNLIKELY (...), something like: + #ifndef PTHREAD_ADAPTIVE_MUTEX_INITIALIZER_NP + pthread_mutexattr_settype (&attr, PTHREAD_MUTEX_NORMAL); + #endif Otherwise, the behaviour is largely inconsistent across GLib platforms. Jürg, any thoughts before I see what the GLib team have to say about this idea?
> On Linux systems, the definition _INITIALIZER_NP exists so we don't have a > problem with the _unlock() issue because it just fails if it can't _unlock(). > For BSD (and likely freebsd, netbsd) it seems _DEFAULT is set to > PTHREAD_MUTEX_STRICT_NP which will abort on _unlock() failures. That's why > Antoine's #define fixes it for him in libpthread. I forgot to add in case it wasn't obvious, but it seems _INITIALIZER_NP is undefined, hence _DEFAULT being used, which then infers _STRICT_NP.
To extend a bit, if you set the mutex type to PTHREAD_MUTEX_ERRORCHECK on Linux, you should see the same error as I do: GLib (gthread-posix.c): Unexpected error from C library during 'pthread_mutex_unlock': Operation not permitted. Aborting.
Hi Martyn, Antoine asked me to take a look at this. It appears to be a bug in vala, but I'm not 100% on that since this my first time looking at a vala program. In anycase, what is happening is the static initilization of the 'door' mutex is occuring *after* it has been locked in the get() method. This has the effect of wiping out the existing lock and leaving the lock in an unlocked state. Antoine is correct, this bug should be visable when the mutex type is set to PTHREAD_MUTEX_ERRORCHECK on any platform. Take a look at the following stack trace which occurs after the 'door' lock has been locked (more comments after trace below):
+ Trace 231798
In frame 6 calls tracker_sparql_connection_get(), door is locked (however nothing has called tracker_sparql_backend_class_init() yet. Then at line 224 the static initialization of the class is triggered: #5 0x000013490ee2e83b in tracker_sparql_backend_get (cancellable=0x0, error=0x7f7ffffccd18) at /home/martyn/Source/tracker/src/libtracker-sparql-backend/tracker-backend.vala:224 224 result = new Tracker.Sparql.Backend (); This leads down to tracker_sparql_backend_class_init() being called after the lock was taken. tracker_sparql_backend_class_init() initializes the mutex and leaves the mutex unlocked. Later then it is unlocked but it is already unlocked. It seems to me vala should automatically trigger static initialization of static member variables as the first step of executing a static class function. Regards, -Kurt
I am not so sure this is Vala. I recently changed this code in commit 1e1d8b0cd8996702695eabf9073810e88257195e I did this because the upstream GLib mutex APIs have been deprecated (GStaticMutex), so I switched to GMutex. Previously (as you can see), we were using GStaticMutex and it was using static initialisation with G_STATIC_MUTEX_INIT (which AFAICS) just sets the structure to NULL (or 0s it). However, when we called g_static_mutex_lock(), that API creates the new lock _BEFORE_ locking it. You can see this with the g_static_mutex_get_mutex_impl() function and the functions defined in glib/gthread.h like G_LOCK_DEFINE_STATIC. So my question was during all this, is this really a Vala thing (i.e. race condition or sequencing issue) or is it a GMutex vs GStaticMutex thing. The answer is definitely the later. I don't believe this is a Vala thing at all, but what's clear is, we were abusing the static mutex code paths before and now we can't rely on those. It seems to me that the g_mutex_lock() API we're now using has no checks for initialisation which the former g_static_mutex_lock() had. Sadly, the documentation states we should use g_mutex_lock() but doesn't make any point about race conditions or initialisation cases like this one that may occur. So we're probably not the only ones seeing this issue. The solution here is to either add code to check for initialisation before using the door.lock() calls or add some way to have a GStaticMutex type functionality in again. These are really the same solution. The GStaticMutex functionality was exactly what we needed here because we have a singleton and don't want the creation of that to be a race condition. We could potentially use some g_atomic_pointer_compare_and_exchange() type API here to fix this (which is replicating what some of the mutex APIs were/are doing anyway). I don't consider this a Vala issue though, this is all about the underlying GLib mutex APIs we're using and the _get() API for the connection doesn't require an existing singleton or TrackerSparqlConnection object/class to exist before calling _get(). Jürg any comments?
hi Martyn, GMutex indeed handles the 'static initialisation' case and is documented as such, even including example code to demonstrate: """ Example 3. Using GMutex to protected a shared variable int give_me_next_number (void) { static GMutex mutex; static int current_number = 0; int ret_val; g_mutex_lock (&mutex); ret_val = current_number = calc_next_number (current_number); g_mutex_unlock (&mutex); return ret_val; } Notice that the GMutex is not initialised to any particular value. Its placement in static storage ensures that it will be initialised to all-zeros, which is appropriate. If a GMutex is placed in other contexts (eg: embedded in a struct) then it must be explicitly initialised using g_mutex_init(). """ It might be that you have found a really tricky situation in which there is a race in this initialisation but very many people are using it in exactly this way already, without problems. I think that Antoine's analysis of the problem is very likely correct: you're probably unlocking a mutex that's not locked. We've seen many issues like this and the BSD guys are always arguing that we should change to error-checking mutexes in Linux by default specifically so developers would catch these kinds of problems sooner.
(In reply to comment #10) > hi Martyn, > > GMutex indeed handles the 'static initialisation' case and is documented as > such, even including example code to demonstrate: > > """ > > Example 3. Using GMutex to protected a shared variable > > int > give_me_next_number (void) > { > static GMutex mutex; > static int current_number = 0; > int ret_val; > > g_mutex_lock (&mutex); > ret_val = current_number = calc_next_number (current_number); > g_mutex_unlock (&mutex); > > return ret_val; > } > > > Notice that the GMutex is not initialised to any particular value. Its > placement in static storage ensures that it will be initialised to all-zeros, > which is appropriate. Yea, we're actually using it in Vala like this: https://git.gnome.org/browse/tracker/tree/src/libtracker-sparql-backend/tracker-backend.vala#n209 So the: static Mutex door = Mutex (); The corresponding C is: static GMutex tracker_sparql_backend_door; static GMutex tracker_sparql_backend_door = {0}; > If a GMutex is placed in other contexts (eg: embedded in a struct) then it must > be explicitly initialised using g_mutex_init(). Yea, we're not doing that. It's part of a class's private members which needs to be initiated before anything else to avoid singleton creation race conditions. > I think that Antoine's analysis of the problem is very likely correct: you're > probably unlocking a mutex that's not locked. We've seen many issues like this Antoine's analysis is the next logical progression from what i've described. That is to say that what he describes happens next anyway. I was trying to identify why. I checked his stack trace and the sources and confirmed what Antoine was finding. So run through this with me then Ryan because I am not totally convinced: 1. We have a GMutex initialised with {0}; 2. We call _lock() which looks like this: g_mutex_lock (GMutex *mutex) { if G_UNLIKELY ((status = pthread_mutex_lock (g_mutex_get_impl (mutex))) != 0) ... } 3. Here is the _impl() function: g_mutex_get_impl (GMutex *mutex) { pthread_mutex_t *impl = g_atomic_pointer_get (&mutex->p); ... } So mutex->p should be NULL, where &mutex has a value and the contents of the union have been zeroed with {0}. So looking at the code right now, this should work as you say in your example without any changes. So what am I missing?
Reading your code, I'm pretty sure I found the problem: In the class_init C code emitted by Vala, I see this: g_mutex_init (&_tmp0_); tracker_sparql_backend_door = _tmp0_; } And I notice that you are using the '_door' lock from static functions (ie: it's possible that you take the lock before the class is initialised). I think this sequence of events is extremely likely: 1) user calls static new Connection get() 2) lock is taken 3) new instance of object is created while lock is held, causing class_init to happen 4) class_init does _mutex_init(), replacing the already-existing mutex with a new one 5) you unlock this new mutex. boom.
(In reply to comment #12) > Reading your code, I'm pretty sure I found the problem: > > In the class_init C code emitted by Vala, I see this: > > g_mutex_init (&_tmp0_); > tracker_sparql_backend_door = _tmp0_; > } > > > And I notice that you are using the '_door' lock from static functions (ie: > it's possible that you take the lock before the class is initialised). > > I think this sequence of events is extremely likely: > > 1) user calls static new Connection get() > 2) lock is taken > 3) new instance of object is created while lock is held, causing class_init > to happen > 4) class_init does _mutex_init(), replacing the already-existing mutex with > a new one > 5) you unlock this new mutex. boom. Yes. This is exactly what is happening. It is also the reason I think this is a vala bug. Accessing a static member variable should trigger static initialization of the class prior to using that variable - not after.
It certainly does seem like Vala could be doing better here. It should either: a) use a ctor to setup the value, instead of class_init b) use some sort of 'ensure' g_once_* type of thing at all points of access c) similarly (as you suggest) hold a class ref at all points of access I really like 'a', but it would depend on constructor support in the C compiler, which is a non-standard feature.
Ryan, Kurt, thank you both for the additional comments and analysis. I will cook up a patch here and do some testing to see what makes sense. My initial thought is to try and NOT initialise with Mutex(), I suspect this is what is adding the g_mutex_init(&_tmp0_); C code. Failing that, I quite like the look of 'b'. I think 'a' still allows for a race condition of the singleton creation and you could end up with 2 classes competing there. Thanks again for the help! :)
(In reply to comment #15) > I think 'a' still allows for a race condition of the singleton > creation and you could end up with 2 classes competing there. When I said ctor, I meant in the __attribute__((constructor)) case (ie: run once, before main() runs).
(In reply to comment #14) > It certainly does seem like Vala could be doing better here. (a) is not an option for Vala due to the lack of portability you already mentioned. One option may be to simply make the issue obvious to the developer. E.g., only accept static variable initializers that are accepted as initializer expression in C as well. Require the developer to explicitly add an assignment to class_init or some other function for more complex initializers.
Created attachment 242960 [details] [review] Suggest patch to fix this bug Antoine, Kurt, any chance you could test this patch? Ryan, I tried this patch and it seems to work here on Ubuntu 13.04 with GLib 2.36. The patch seems to remove the g_mutex_init() call from the generated sources that Vala spits out. This should avoid the re-initialisation of the mutex. The atomic operations in the mutex internals should avoid race conditions too AFAICS. Jürg, if you think this isn't enough or prefer we tackle this another way, let me know. Thanks all.
(In reply to comment #18) > Created an attachment (id=242960) [details] [review] > Suggest patch to fix this bug > > Antoine, Kurt, any chance you could test this patch? Sure but: CDPATH="${ZSH_VERSION+.}:" && cd . && /usr/local/bin/valac --pkg gio-2.0 -g --target-glib=2.34 ../../src/libtracker-sparql/tracker-sparql-0.16.vapi ../../src/libtracker-bus/tracker-bus.vapi ../../src/libtracker-direct/tracker-direct.vapi -C tracker-backend.vala error: ../../src/libtracker-bus/tracker-bus.vapi not found error: ../../src/libtracker-direct/tracker-direct.vapi not found Compilation failed: 2 error(s), 0 warning(s) I suppose I need to use a git checkout and cannot just patch the latest released tarball?
Bah, removing the vala.stamp files did the trick and I was able to package this. IT WORKS :-) No more abort(). I've already committed it to the OpenBSD ports tree since even if it is eventually not accepted as the correct solution, it is still better than what we have right now. Thank you guys!
(In reply to comment #20) > Bah, removing the vala.stamp files did the trick and I was able to package > this. > IT WORKS :-) No more abort(). > > I've already committed it to the OpenBSD ports tree since even if it is > eventually not accepted as the correct solution, it is still better than what > we have right now. > > Thank you guys! Great. I am going to commit this anyway. I think if we need a different solution, we will likely have to build on this patch anyway. Since it works here and fixes your issue, it's better than no commit at all in the short term. This problem has been fixed in the development version. The fix will be available in the next major software release. Thank you for your bug report. I should be releasing with this later today (0.16.1).
I should add, I checked with Jürg on IRC before pushing.