GNOME Bugzilla – Bug 695960
Leaked GSources in user's GMainContext after EBookClient finalizes
Last modified: 2013-09-14 16:56:06 UTC
Created attachment 239022 [details] stack traces from test-client-self Here is where the fun starts when implementing this manual threading system on top of D-Bus... The problem can be found by running tests/libebook/client/test-client-self Run the test 20 times in a row, I can usually reproduce the failure/crash after about 6 or 7 times. There are a two notable problems here: a.) e-book-client.c needs to use g_weak_ref_init()/g_weak_ref_clear() to be safe, since it allocates the weak refs on the heap and not statically. b.) When the EBookClient finalizes, it can leave GSources remaining in the callers' GMainContext. While 'a' is easily fixed, 'b' causes more problems (and can be seen easily with test-client-self) To fix 'b', you must track all of the source ids added to the GMainContext which belongs to the caller (it's not true that unreffing the main context from EClient->dispose() will cause any sources to be removed from the context, the EClient implementations do not own the caller's context and thus is responsible for removing any sources added to it at finalize time). I tried hacking an AyncQueue into EBookClient for this but gave up, a better solution would be add api to EClient which would automatically track the added source ids which it is responsible for removing later (and probably simplifying the code in EBookClient which adds idle callbacks). An even better and safer solution would be to just give up on the whole complex dedicated D-Bus thread, and just use normal GDBusProxy semantics (IMO, the whole complexity around this threading is still not worth whatever benefit it seems to bring).
Somehow I just don't know how to attach a stack trace properly, instead I'll just paste it here. This is what happens when you run the second iteration of test-client-self, while the first iteration of test-client-self has completed, but irresponsibly leaked a GSource into the caller's main context which immediately fires as soon as the main loop restarts. It will happen in any case that the EBookClient manages to finalize before the source had a chance to get triggered in the callers' main context.
+ Trace 231650
(In reply to comment #0) > There are a two notable problems here: > a.) e-book-client.c needs to use g_weak_ref_init()/g_weak_ref_clear() > to be safe, since it allocates the weak refs on the heap and not > statically. g_weak_ref_init() is only needed when allocating on a stack frame where there's garbage in memory. Zero-filled memory is sufficient. I asked Ryan exactly this awhile ago: Apr 16 16:41:39 <mbarnes>»······desrt: quick question: a GWeakRef embedded in zero-filled memory is sufficiently initialized to not have to call g_weak_ref_init(..., NULL)... correct? Apr 16 16:59:12 <desrt>»mbarnes: correct Apr 16 16:59:17 <desrt>»mbarnes: that's one of the nice features of the API, indeed Similarly, g_weak_ref_set(weakref, NULL) is sufficient to release the weak reference. > b.) When the EBookClient finalizes, it can leave GSources remaining > in the callers' GMainContext. Which the GMainContext frees when its own reference count hits zero. The callback functions attached to those GSources are essentially no-ops when the GWeakRef carried in the payload for the callback turns out to be empty. I intended it to be unnecessary to track all the GSources we attach. Although the name vanished case does seem to be racing here. I'll look at fixing that.
Fixed in: https://git.gnome.org/browse/evolution-data-server/commit/?id=98fb014dba84094ccb82356a7d6d5eb16b26bd59
(In reply to comment #2) > (In reply to comment #0) > > b.) When the EBookClient finalizes, it can leave GSources remaining > > in the callers' GMainContext. > > Which the GMainContext frees when its own reference count hits zero. The > callback functions attached to those GSources are essentially no-ops when the > GWeakRef carried in the payload for the callback turns out to be empty. I > intended it to be unnecessary to track all the GSources we attach. Actually I thought that too but no. Remember that the GMainContext which is referred to by EClient is *not* created by the EClient, it is not owned by the EClient and when the EClient finalizes... the GMainContext (which belongs to the user) reference count does *not* reach zero. So it is not finalized, and any sources added to it by the EClient are not automatically removed. Rather, the EClient calls g_main_context_ref_thread_default() when it starts up... adding an additional ref to the built-in GLib default GMainContext... any sources added to the user's GMainContext must be removed by the EClient before finalizing... because the GSources added to the client's mainloop belong to the EClient, while the GMainContext does not.
Applied patch and recompiled. As I suspected, I can still reproduce the race, new stack trace here. Mostly the same, only now book_client_name_vanished_cb() is on line 488 instead of 464.
+ Trace 231651
I never said the GMainContext is disposed at the same time as the EClient. The GSources we attach are always removed after they are dispatched, since the idle callbacks always return FALSE. If the idle callback is dispatched after the EClient is disposed, then the callback's attempt to acquire a strong reference on the EClient fails and it simply returns FALSE. It's a no-op, in other words. Any undispatched GSources are cleaned up when the GMainContext itself is finalized. So I still don't see any need to track them.
(In reply to comment #6) > I never said the GMainContext is disposed at the same time as the EClient. > > The GSources we attach are always removed after they are dispatched, since the > idle callbacks always return FALSE. > > If the idle callback is dispatched after the EClient is disposed, then the > callback's attempt to acquire a strong reference on the EClient fails and it > simply returns FALSE. It's a no-op, in other words. > > Any undispatched GSources are cleaned up when the GMainContext itself is > finalized. Right, however the GMainContext is _not_ finalized at the time that the EClient is finalized. Which means that at any time that the EBookClient is unreffed and there remains an undispatched GSource, once the main loop continues and dispatches the GSource (which the EBookClient created)... we get this crash. This will happen regardless of whether: a.) The EBookClient is unreffed in the middle of a program with a running main loop b.) The EBookClient is unreffed and then the main loop quit, and then resumed later on (which happens to be the case in our tests) > > So I still don't see any need to track them. So long as the EBookClient leaves GSources leaked into the calling code's GMainContext after the EBookClient finalizes, we will see this crash
Alternatively, one would think that the GWeakRef would make this code safe, and that g_weak_ref_get() in the latent GSource would simply return NULL. If that's the expectation of the GWeakRef API, then I would venture to say that it's the GWeakRef code which is broken (however it makes more sense to me that GWeakRef simply does not work like that, and that it's up to us to ensure we don't access a finalized client after it was finalized).
The whole purpose of GWeakRef is to be thread-safe when the object it's referencing is finalized.
Indeed it's possible we're looking at a bug in GWeakRef then. Seems pretty clear in the docs that it should be working: "A structure containing a weak reference to a GObject. It can either be empty (i.e. point to NULL), or point to an object for as long as at least one "strong" reference to that object exists. Before the object's GObjectClass.dispose method is called, every GWeakRef associated with becomes empty (i.e. points to NULL)" However, I find it odd that in the stack trace we find that g_weak_ref_get() calls g_object_ref(), causing the G_IS_OBJECT() assertion to raise. Perhaps I should fashion a test case against glib for this as well.
Just updated glib to fresh master. Bug still is reproducible, just do: a.) ulimit -c unlimited b.) ./test-client-self Repeat 'b' many times... it will crash eventually, 'a' will of course ensure that there is a 'core' file for gdb... I'll look into creating a test case reproducing this for glib tomorrow, then I'll open a bug against glib and make this one block on it (perhaps it would be better to transfer this bug to glib, but then it would be gone from the EDS buglist, and as a result we could risk forgetting about it... so better make a new bug I think).
I have a theory. In the D-Bus dedicated thread... the EBookClient is given directly to the SignalClosure using g_weak_ref_set(). Now... when g_weak_ref_set() enters... it does: g_return_if_fail (G_IS_OBJECT ()); And we don't see any assertions... but *then* it goes and acquires the GRWLock for 'weak_ref_locations'. Simultaneously, g_object_unref() is being called in the main thread/context, once g_object_unref() releases the lock... the D-Bus worker thread unlocks and proceeds to store a finalized/invalid pointer in the 'weak_ref_locations' list. THAT is where the bug starts to show. Solution: o The worker thread must also store a thread-wide GWeakRef on the EClient pointer o When creating a SignalClosure in e-book-client.c, we should call g_weak_ref_get() on the thread-wide GWeakRef, this will give us a temporary reference which we can carry until the GWeakRef is given to the SignalClosure o Then unref the temporary strong reference after creating the SignalClosure I'm pretty convinced that this is where things are actually going wrong.
Then again... it might be safer all around to just bookkeep any GSources which are added to the client's main loop. Consider that; to do this reliably: o The worker thread must hold a strong reference at least temporarily in order to create the GWeakRef o Because the thread holds a temporary strong reference, it's possible for the client to finalize *from the D-Bus worker thread*, probably quite an undesirable can of worms open from there o We couldn't properly assert that EClient classes are finalized directly after removing the last reference (as we currently do in test cases). An interesting alternative, would be to use something like 'g_weak_ref_copy (&thread_ref, &closure_ref);', in order to safely copy the weak reference without ever mandating a strong reference. However currently no such function exists.
Created attachment 239060 [details] [review] Synchronize removal of watcher in D-Bus thread in dispose WARNING: This patch does not work, it's meant more as an example I'm sure I can see a part of what is going wrong here, and the attached patch is an attempt to fix it. NOTE: I did also try the basic queueing of the removal of the watcher_id from the appropriate D-Bus thread, but that also had problems, so I tried to synchronize it with the client (user) thread, still has problems. Now, the explanation of the previously attached stack traces is as follows: o When e_book_client_finalize() runs, it calls g_bus_unwatch_name() on the watcher_id directly... from "whatever thread" the client is finalized in. o This causes the GWeakRef attached to the watcher_id to be cleared and freed "immediately" o Consequently, in the already attached stack traces, the book_client_name_vanished_cb() is already preparing to run in it's D-Bus thread... theres nothing about g_bus_unwatch_name() being called from another thread which protects it from being run. o book_client_name_vanished_cb() proceeds to access the GWeakRef which was "immediately freed" by e_book_client_finalize() in the main thread. The patch I'm attaching here, is an _attempt_ to address that, actually it does address that problem but then unmasks further problems related to the D-Bus threading (which I'll post stack traces for in further comments).
Now first problem I encountered is a lockup, while the code attached in the example patch seems very straight forward (no way that that code will cause a lockup so long as the book client is finalized in the main thread as expected), there seems to be a lockup anyway. This lockup shows that for some reason, the EBookClient is *not* finalized in the main client thread, but rather in a deep D-Bus thread as a consequence of a D-Bus object property change notification. Here are the stack traces for all active threads at the time of the lockup (with the afore mentioned example patch applied): Thread 1:
+ Trace 231653
Gah, buzilla ate up and mangled my stack trace... view the raw stack trace for a more comprehensive log. Also, the rest of the comments which bugzilla happily hides from the above comment are these: In most cases the lockup does not occur and everything passes, but in some other cases we also get a crash (also with the above patch applied), this crash (which I've seen a couple times, same crash) occurs deep inside e_book_client_remove_contact_by_uid(). Again related to the GWeakRef:
+ Trace 231654
Here are some thoughts on how we can simplify things and make it a bit safer (if we still insist on keeping this threading model instead of simply writing out Async calls, which I can't stress enough, would be much safer). Basically, we could simplify things by creating one D-Bus thread per EBookClient: o Creating an EBookClient would create it's own D-Bus thread, the async/sync initable code could block until the newly created thread finishes creating the underlying GDBusProxy and connecting to signals in it's local thread. o The EBookClient's thread data pointer would belong to the thread and it would include a single GWeakRef to the EBookClient. o Since the thread data created by the EBookClient is created before starting the thread, and the GWeakRef also created before starting the thread, the GWeakRef becomes safer (no races in the creation of the GWeakRef since it's created safely in the client-side main thread). The thread-side functions would only ever access EBookClient via it's own GWeakRef. o When the EBookClient finalizes, it would schedule an idle in the D-Bus related thread which would *exit* the D-Bus thread... in that case the client would be finalized and the GWeakRef invalidated... and the thread would asynchronously disconnect it's signals from the GDBusProxy, unref the proxy, quit it's internal mainloop and return.
(In reply to comment #17) > Basically, we could simplify things by creating one D-Bus thread > per EBookClient: Please please please, NO! If you ever happen to read any crash reports, or simply backtraces, it's always a pain to see all unrelated threads in the backtrace. It was done in a similar way with CamelFolderSummary, and it was a mistake, which I fixed (after introduced myself), after some releases when I found the db-summary threads being too awful for seeing them in backtraces, adding just a noise there. That said, using thread per EClient would be just a poor workaround for a wrong design/usage/.... And, of course, this would be one more good reason why not do [1], which Matthew reintroduced for Contacts view [2], while the [1] gives good reasons why not do that. [1] https://git.gnome.org/browse/evolution/commit/?id=662ed02cc773a4fd2d90ed [2] https://git.gnome.org/browse/evolution/commit/?id=7bd9b878a383628dcebda2
(In reply to comment #18) > (In reply to comment #17) > > Basically, we could simplify things by creating one D-Bus thread > > per EBookClient: > > Please please please, NO! I think you know that my opinion is that the threads are altogether an added complexity in execution that we don't need at all in the first place (async calls and signal notifications can be made directly in the caller's GMainContext, no need for threads at all). But that idea aside, to make things less complex while using threads, either a.) One thread per client ... or at least b.) One thread data per client all belonging to the single "D-Bus thread" Either way... the GWeakRef to the client should be created at the time that the client is created, in the main calling thread context, where a strong reference to the client exists. After that, the threaded functions should only ever try to access the EBookClient via the GWeakRef which was created in the calling context. At dispose time, a message should be queued into the D-Bus thread context to displose of the thread-side data for a given client... the GWeakRef which was created in the user's thread context (where a strong reference existed) will eventually be destroyed in the D-Bus thread's main context, perhaps after the underlying GDBusProxy is safely finalized (as I understand it, the GDBusProxy belongs to the thread data related to a Client... but belongs to the thread... along with the GWeakRef... those need to be cleaned up in the D-Bus thread).
Closing this as NOTABUG since the description mis-characterizes the behavior as a leak, when in fact all GSources are accounted for at all times and free either after dispatching or when the GMainContext is finalized.
Woah wait a minute. Does ./test-client-self have no race condition any more ? Can we close this with a reference to the commit which causes the EBookClient finalization to be safe ?
The good news is that I cannot get test-client-self to crash anymore. It took me a while, but I bisected it down to this commit: https://git.gnome.org/browse/evolution-data-server/commit/?id=35645dc4b8f80e0a398ae395c6c21bfc6f70807d Which probably fixes the crash combined with the commit Matthew mentions in comment 3 It worries me however that I can't properly explain how the above commit would fix the race at finalize, the one which I see (and still seems to present a risk in today's code) is as follows: o e_book_client_name_vanished_cb is scheduled but does not yet run o e_book_client_finalize() calls g_bus_unwatch_name(), which does two things: a.) Prevents further invocations of e_book_client_name_vanished_cb(), but it's too late at this point, cause it's already scheduled in the D-Bus thread b.) Frees the GWeakRef which the name_vanished_cb() still needs o e_book_client_name_vanished_cb() runs in the D-Bus thread and accesses the GWeakRef which has recently been freed by the main thread. I still have a hard time trusting this D-Bus thread, but I suppose it will take further testing to track down the races.