GNOME Bugzilla – Bug 709223
problem with toggleref thread-safety
Last modified: 2013-11-27 22:51:41 UTC
See https://bugzilla.redhat.com/show_bug.cgi?id=1008099 Truncated backtrace: Thread no. 1 (10 frames) #0 sem_post at ../nptl/sysdeps/unix/sysv/linux/x86_64/sem_post.S:33 #1 PyThread_release_lock at /usr/src/debug/Python-2.7.5/Python/thread_pthread.h:349 #2 g_object_ref at gobject.c:3069 #3 g_settings_backend_dispatch_signal at gsettingsbackend.c:329 #4 g_settings_backend_changed at gsettingsbackend.c:400 #5 dconf_engine_change_notify at dconfsettingsbackend.c:250 #6 dconf_engine_handle_dbus_signal at dconf-engine.c:1164 #7 emit_signal_instance_in_idle_cb at gdbusconnection.c:3743 #11 g_main_context_iteration at gmain.c:3773 #12 dconf_gdbus_worker_thread at dconf-gdbus-thread.c:81 <mclasen> desrt: didn't you fix this dconf issue a while ago ? https://bugzilla.redhat.com/show_bug.cgi?id=1008099 ? <Services> Bug 1008099: unspecified, unspecified, ---, pikachu.2014, NEW , [abrt] gnome-tweak-tool-3.9.91-1.fc20: sem_post: Process /usr/bin/python2.7 was killed by signal 11 (SIGSEGV) <desrt> mclasen: interesting <desrt> mclasen: i think we're looking at the lack of threadsafety in our signal disconnection.... <desrt> for g_signal_connect_object(), i mean <mclasen> hmm <desrt> actually, this is a weird trace <desrt> g_object_ref() calls into PyThread_release_lock()?! <desrt> is this some toggleref thing going down? <mclasen> probably, thats what i thought <desrt> so the toggleref could be happening from the dconf worker thread <desrt> maybe python doesn't like that <desrt> lol. comment above this piece of code in gsettings is priceless <desrt> /* We're in a little bit of a tricky situation here. We need to hold <desrt> * a lock while traversing the list, but we don't want to hold the <desrt> * lock while calling back into user code. <desrt> g_mutex_lock (&backend->priv->lock); <desrt> ... <desrt> g_object_ref (watch->target); <desrt> i wonder if it's documented anywhere at all what the expectations are with respect to togglerefs and deadlocks <desrt> i think it's caused by gsettings, under its lock, taking a ref on an object that just had its last reference dropped from another thread <desrt> which is done holding a different lock <desrt> switching this over to GWeakRef may fix it <desrt> because the weak ref will already have been cleared by that point, so we won't even attempt to take the ref <desrt> ignoring bindings, this code is safe because it only does a harmless signal dispatch to no listeners once it gets into the thread that owns that gsettings object -- as is permitted on a disposed object <desrt> but in light of togglerefs, i guess the situation is more complicated <owen> desrt: I'm not sure how you think toggle references changes things here <desrt> owen: imagine i'm the python binding... <owen> desrt: they don't generally cause resurrenction <desrt> owen: the user's code is holding onto a GSettings python proxy object, and it just drops the reference <desrt> that was the only ref -- there was no ref from gobject land <desrt> so now python destroys the proxy, and this leads to the gobject getting unref called <desrt> good so far? <owen> desrt: sure, sounds right <desrt> at the same time, someone in another thread calls g_object_ref() <owen> desrt: This is impossible - "there was no ref from gobject land" <desrt> increasing the refcount 1->2 <owen> (leaving aside weakrefs) <desrt> owen: weakref <desrt> this is a weakref which i assume is safe because i access it under a lock and only call g_object_ref() -- which is completely valid up to and including the time that weak callbacks are made <owen> desrt: weak refs are pretty evil in combination with toggle refs - https://bugzilla.gnome.org/show_bug.cgi?id=670200 was about this - I think with the GDBus bus objects IIRC (halfline may recall better) <Services> Bug 670200: major, Normal, ---, gnome-shell-maint, RESOLVED FIXED, Hang during garbage collection <desrt> i may end up with a ref to a disposed object, but calling g_signal_emit() on that is also defined as safe <desrt> on the other hand... the toggle callback ariving in pythonland to the already-GC'd python object .... that's not going to go over so well <desrt> and i think even using GWeakRef may not help here <desrt> unless we can figure out a way to cancel the finalize of the python object if the dropping of the last ref on the gobject didn't destroy it <desrt> and/or maybe synchronise the registration/removal of toggle notifies with gweakrefs <owen> desrt: that doesn't sound plausible. I think the best you can do is set things up so you get a new python proxy for the resurrected object <desrt> no matter what we do we can be in a case where we have an incoming toggle notify from a random thread after the python runtime has called finalize on the python object <owen> this is problematical only if the resurrected object is a python derivation, but that won't be the case for you <desrt> owen: i'd rather avoid reviving the object, actually.... if it's on the way to its death, i'm not as happy not to dispatch the signal to it <desrt> clearly nobody cares anymore <desrt> the problem is that we have no synchronisation between the point where python tells us "okay, this object is getting finalized" and gobject.c <owen> desrt: Are you sure that the problem isn't more simply toggle refs arriving from different threads and the python runtime not handling that <desrt> owen: of course it could be that... but thinking about this problem has made me realise that there is a near-unsolvable subproblem here <owen> desrt: We did a *very* detailed analysis of this for the gjs case <owen> desrt: And in that case, deferring toggle references back to the main thread at idle did work <desrt> hmm. python has a two-stage destroy, like we do <owen> desrt: If the python code is actually multithreaded itself, then that's an extra layer, of course <desrt> i'm not sure how that's possible <desrt> in main thread: <desrt> - user drops last ref on python object <desrt> - python lock acquired <desrt> <context switch> <desrt> in dconf thread: <desrt> - we dispatch a signal, and see that we have a weakref on a gobject that has only 1 ref on it (from pythonland), which has not yet been dropped. <desrt> - we g_object_ref() this object, which results in a toggle notify being called <desrt> - the refcount is now 2. we enter python land, and attempt to acquire the lock, blocking because the main thread is holding it <desrt> back to the main thread: <desrt> - we handle the dispose, removing the toggle ref notify function (which is already dispatched in the other thread....) <desrt> - we called g_object_unref() but that doesn't destroy the object... only reduces its refcount to 1. therefore dispose is not called[1] <desrt> - we finish up and drop the lock <desrt> back in dconf thread: <desrt> - we succeed in acquiring the lock, but the object is no longer there <desrt> [1]: if dispose() _had_ have been called, then our weak notify for gsettings would have fired and tried to remove the object from the list of objects to dispatch, acquiring the same lock we're already holding while calling g_object_ref() and deadlocking, rather than crashing <mclasen> at which point did the object get finalized ? <desrt> never <mclasen> you said further up 'but that doesn't destroy the object' <desrt> the python object got freed at 'we finish up and drop the lock' <mclasen> oh, python object <desrt> pyg_toggle_notify in pygobject looks like this: <desrt> PyGObject *self = (PyGObject*) data; /* data is the callback user_data */ <desrt> acquire_global_lock(); <desrt> Py_INCREF(self); <desrt> *boom* <desrt> that self pointer is no longer valid <owen> desrt: shouldn't the toggle ref have been removed *before* self became no longer valid (at least in the sense of undefined memory)? <desrt> owen: it was... but it doesn't matter... the callback had already been dispatched in the other thread (where it was waiting on the python lock) <desrt> and as soon as python finishes up its destruction process and releases the lock, that thread will start running <desrt> toggleref notifies are completely unsynchronised with registration/removal of toggleref functions, and once we dispatch the callback we have no way to cancel that... <desrt> and pygobject utterly fails to entertain this possibility <desrt> mclasen: in any case, i'm pretty sure this is a pygobject bug upstream, if you want to forward it... <desrt> it's going to be a bastard to fix <owen> desrt: seems like essentially the same bug that we had and fixed with gjs, but not clear that there is a corresponding fix here <desrt> owen: the only thing i can think of is to somehow store a ref to the python object in qdata of the gobject so that it doesn't get freed until the gobject is truly dead.... and when we get to python land, we can call get_qdata() and see it return NULL, and deal with that gracefully
further discussion happened, and the game plan, I think is: 1) pass NULL for the userdata to g_object_add_toggle_ref . I doesn't have a hard reference on the PyObject, so it can't pass it as user data. 2) store a weak reference to the PyObject on the GObject (with e.g. g_object_set_data (object, "pyobject", self); or the same idea but with set_qdata) 3) clear the weak reference to the PyObject on the GObject when the PyObject is finalized (with e.g. g_object_set_data (object, "pyobject", self) or the same idea but with steal_qdata) 4) in the toggle reference notify function retrieve the pyobject from the GObject after the lock is acquired (with e.g. g_object_get_data (object, "pyobject") or the same idea but with get_qdata) 5) If the result is NULL then return early
(In reply to comment #1) > 3) clear the weak reference to the PyObject on the GObject when the PyObject is > finalized (with e.g. g_object_set_data (object, "pyobject", self) s/self/NULL/ obviously
Makes sense, it looks like we already store the py wrapper as qdata, so its just a matter of using it in the toggle ref notify as opposed to the userdata (and not passing userdata in general).
Created attachment 256225 [details] [review] Use qdata for wrapper retrieval in toggle reference notifications Replace usage of user data holding PyGObject wrappers in toggle ref notifications with GObject qdata retrieval. This fixes thread safety issues where a toggle notify may be called from another thread during the PyGObject wrappers dealloc. In this case the toggle notify is blocked because the GIL is held in dealloc, and when it continues, the user data would be holding an invalid PyGObject wrapper. Using qdata solves this by ensuring the wrapper retrieval is done within the safety of the GIL and may turn up as NULL. Notes: I've run this change through the PyGI test suite in both Python 2.7 and 3.3 (which exercises the toggle refs). But real world testing is still needed to ensure it actually fixes the problem. Beyond code review, review of the commit log summary both in terms of accurately describing the problem and solution would be helpful because this is tricky stuff.
*** Bug 709481 has been marked as a duplicate of this bug. ***
Review of attachment 256225 [details] [review]: patch seems to follow the script outlined above and otherwise looks okay to me
Pushing to 3.10 and master branches. Attachment 256225 [details] pushed as 90beeee - Use qdata for wrapper retrieval in toggle reference notifications
*** Bug 709876 has been marked as a duplicate of this bug. ***
*** Bug 710530 has been marked as a duplicate of this bug. ***
Considering that my bug was DUP'ed towards this one and in light that this report was made against GNOME 3.10.1 (where pygobject 3.10.1 contains this fix), I am afraid either the DUP is wrong or this bug not fixed.
(In reply to comment #10) > Considering that my bug was DUP'ed towards this one and in light that this > report was made against GNOME 3.10.1 (where pygobject 3.10.1 contains this > fix), I am afraid either the DUP is wrong or this bug not fixed. Okay thanks. The stack trace seems the same hence the dup. I'm going to re-open bug 710530 and provide a patch with a temporary hack there, help testing would be appreciated. Details: I'm fairly certain most of the problems we are seeing is simply because Pythons GIL is not enabled by default. So regardless of how we factor the toggle ref code, there is never a real Python lock being held (even if it looks like it in the toggle ref callback). GIL creation needs to be done explicitly for extension modules which use non-Python originating threads calling into Python (code not using the "threading" module). http://docs.python.org/2/c-api/init.html#PyEval_InitThreads To add confusion, we have GObject.threads_init() which creates the GIL and I fumbled in 3.10 by deprecating it (bug 710447). However, I don't believe this to be part of the cause of the gnome-tweak-tool problems because I don't see that gnome-tweak-tool has ever used it. If something as fundamental as dconf can cause toggle notifies in Python from different threads, I think we should just always enable the GIL when using PyGI. Since we cannot predict what GI repositories might call back into Python from a thread, not doing this will basically become a support drag. The only aversion is noted in the PyEval_InitThreads documentation: "...the lock operations slow the interpreter down a bit. Therefore, the lock is not created initially." For my own curiosity and to gain a better understanding, it would be nice to know why these problems are showing up in 3.10 and not seen in earlier versions of gnome-tweak-tool? Was something changed in the stack that suddenly exposed these problems?
> For my own curiosity and to gain a better understanding, it would be nice to > know why these problems are showing up in 3.10 and not seen in earlier versions > of gnome-tweak-tool? Was something changed in the stack that suddenly exposed > these problems? I had a look, and I could not see any low-in-the-stack changes in gnome-tweak-tool; 3.10 was a UI refactor release. The only new API I am using in g-t-t this cycle is GDBus. FWIW I think PyGI should always enable threads, especially if the intent of deprecating Glib.threads_init() was for PyGI to 'always work'. Requiring a new, different, python-only call seems counter to that.
I've committed a fix which always enables the GIL for PyGI: https://git.gnome.org/browse/pygobject/commit/?id=65b8f7b It is still a bit unclear to me if we want to do this for 3.10.2 though.
Hey John, Have you tested this patch out yet? How do I apply it myself to test it out by the way? Do I need to build python to be able to do that?
Created attachment 259546 [details] [review] Fix toggleref safety problems by always enabling the GIL Call PyEval_InitThreads for the base gi module import. This forces the Python internals create the GIL and always support threading with the various thread state enter/exit funcs. This is needed since we cannot predict which GI repositories might accept Python callbacks and run them in non-Python threads or trigger toggle ref notifications in a thread other than main. Notes: This is essentially a revert of: https://git.gnome.org/browse/pygobject/commit/?h=pygobject-3-10&id=7036e9d79ed49f with the addition of always forcing Python threads to be enabled.
(the last patch was for 3.10)
Comment on attachment 259546 [details] [review] Fix toggleref safety problems by always enabling the GIL Thanks Simon!
*** Bug 712557 has been marked as a duplicate of this bug. ***