GNOME Bugzilla – Bug 690126
Do not quit mainloop while there are pending GSources
Last modified: 2014-06-23 18:56:53 UTC
This is something between GLib and GDBus. GDBusProxy calls g_dbus_connection_signal_unsubscribe() on its g_dbus_proxy_finalize(), which calls call_destroy_notify(), which schedules idle callback with reffed main context. This is fine until the main context's life time is limited for the life time of the GDBusProxy descendant, and this main context is also set as a default thread main context. Here's an IRC chat we had today: <mcrha> hi all, who can I ask for a glib/gdbus issue, leaking GMainContext is it's just a one-time context created with its own mainloop in a thread which creates (and frees) a GDBusProxy descendant? <davidz> mcrha: I can help. Best is to ask on #gtk+ on GimpNet since a lot of GLib contributors are not on RH irc (desrt, ebassi etc.) <davidz> but we can chat here if you prefer (less traffic) <mcrha> hi davidz, I do not care, I can move there :) <davidz> there's a discussion going on there, let's just chat here <davidz> so, is the thread short- or long-lived? * mcrha was aout to say the same <mcrha> short time, see lifetime of this: http://git.gnome.org/browse/evolution-data-server/tree/libedataserver/e-source-registry.c#n1711 <davidz> if it's short-lived, my gut feeling is that it sounds wrong to have a GMainContext attached <davidz> I mean, if it's a worker thread you should be using the _sync() version of various functions <mcrha> it's just a local main_context, which is freed at the end of the function <davidz> anyway. it should still work <davidz> ok <davidz> should work, what's the problem? * davidz seems to remember a problem like this that the vala guys had <mcrha> the problem is with GDBus <mcrha> the gdbusconnect.c's call_destroy_notify() schedules a GSource <mcrha> and re's the context <davidz> right, yeah, I remember that <mcrha> *refs <mcrha> ok, then you know that after killing GMainLoop the old main context is left out there <davidz> so the problem is that your since the thread is short-lived, you kill the GMainContext while there is work pending <davidz> right <mcrha> yes <davidz> you are not the first one to run into this problem <mcrha> :) <mcrha> am I the only one feeling it's not my fault? :) <davidz> kinda, yeah <mcrha> I mean, in a nice way <davidz> me too :-) <mcrha> :) <mcrha> ok, what is the recommendation? <halfline> if code is scheduled to run, though, isn't the problem that the loop is getting quit before that code has a chance to run? <davidz> halfline: I think my conclusion was something like that <halfline> i mean it's not just a leaked reference, it's code designed to run not getting run <mcrha> hi halfline, code on which side? <davidz> IIRC the problem is that GDBusProxy's finalizer calls GDBusConnection.signal_unsubscribe() <halfline> mcrha: you said gdbusconnect.c scheduled code to run, didn't you? <mcrha> halfline, yes, I did <mcrha> davidz, true <halfline> mcrha: so that side <mcrha> halfline, good :) <davidz> the solution, I think, is simply, to run the mainloop until there are no more events <mcrha> should I propose such fix to GLib? will anyone listen to it? <davidz> there is already a bug open IIRC <davidz> let me check <davidz> I can't find the bug <mcrha> to give you a little background on this, how I found it, I've an RH report about "out of file descriptors" after certain actions, and I debugged this down to GDBus - and it was quite painful debugging :) <mcrha> no problem, I can file one myself <davidz> well <davidz> file a bug, please... but I'm not convinced it's a GLib bug <halfline> so can you work around the problem by just doing g_clear_object (&auth_context->dbus_auth) before you quit the main loop ? <davidz> I mean, the problem is that you are destroying a GMainContext while there are events pending... <mcrha> the thing is whether it'll be useful, I know how it works with bugs in evo world, ... <mcrha> halfline, it's done within auth_context_free(), which is done ... (checking) <halfline> i'm saying don't do it in auth_context_free <halfline> do it earlier <mcrha> I see, the mainloop is quit already in that call <halfline> as i understand it, the problem is gbus is schedule something on an event loop that's no longer running <halfline> *gdbus <halfline> right <mcrha> true <mcrha> halfline, should there be any waiting? <mcrha> I mean I cannot just "unref GDBusProxy; stop_mainloop()", can I? <mcrha> hmm <halfline> i was hoping that would work, but looking in gmain.c it won't <davidz> mcrha, halfline: the reason this is happening is GDBusConnection.signal_subscribe() calls the passed in @user_data_free_func in the thread-default context <davidz> which I think is the right thing to do <davidz> the problem is that GDBusProxy is using this <halfline> so you could add a weak ref on the proxy object <halfline> and main loop quit from there <davidz> well <davidz> the problem is that mcrha's application is making an assumption <davidz> which is that there are no more events pending at the time it discards the GMainContext <davidz> or, rather, at the time it decides to no longer iterate any loops associated with it <davidz> I do not think that is a good assumption for an application to make <davidz> that's why I don't think there's a problem with GDBus here <halfline> at the time it calls g_main_loop_quit you mean <davidz> well, you can say that <halfline> but there's no good way to know when to call g_main_loop_quit <halfline> in general <davidz> (but it's not really accurate as it could be creating a new GMainLoop and run that) <davidz> but, yes, that is the problem, in a nutshell <davidz> ideally we would change GDBusProxy so it does not use a GDestroyNotify <davidz> but it's kinda tricky <halfline> maybe we need a g_main_loop_quit_when_there_are_no_more_attached_sources <davidz> right * mcrha doesn't like the whole concept of a thread default main context and its usage over glib applications, but that's completely different question :) He prefers explicit parameters for this <davidz> g_main_context_pending() ? <halfline> i don't think that's good enough * davidz neither... <davidz> mcrha: well, thread-default main context was kinda bolted on after the fact... but even, then, having _each_ operation take an additional parameter is just... cumbersome and annoying <davidz> so I think it's a nice trade-off... <davidz> anyway <davidz> mcrha: please do file a glib bug (component gdbus) and we can take it from there... <mcrha> it could be like libsoup, you define one for the object, and that's it <davidz> mcrha: (please also copy paste this IRC conversation) <mcrha> davidz, I thought I'll just do "flush pending ops" un g_main_loop_unref(), if there are any pending, and that's it. <mcrha> in time of the last reference <mcrha> I'm not sure what might be wrong with g_main_context_pending() <mcrha> unless g_timeout_add() not triggering in to the pending <halfline> because it doesn't tell you if there are attached sources, it just tells you if the attached sources are good to dispatch now <halfline> mcrha: right <mcrha> I see :-/ <halfline> i think g_main_context_pending() would work for this case probably <halfline> or even just g_main_loop_quit from an idle handler would probably work <davidz> yeah, events are ordered... <mcrha> considering that any library can attach its sources in any time... (if I generalize the issue with the _pending()) <halfline> (of course you'll still need the g_clear_object (&auth_context->dbus_auth) first) <davidz> so that would work if, and only if, no new events are put on in event handlers <davidz> (we must go deeper) <davidz> ideally we could change GDBusProxy to not suffer from this <mcrha> davidz, do you know the call_destroy_notify()? <davidz> e.g. pass a NULL GDestroyNotify to GDBusConnection.signal_subscribe ... <halfline> really seems like the right fix is a new g_main_loop_quit_when_sourceless function <davidz> halfline: yes, g_main_loop_quit_when_sourceless() would be nice... <mcrha> may it do the g_main_loop_quit() automatically? <mcrha> I mean, it should not hurt, only help, if it does <davidz> it's a behavioral change so essentially an ABI break (can easily imagine it breaking some apps) <halfline> right, g_main_loop_quit_when_sourceless would quit automatically when the source list goes empty <mcrha> davidz, I hope not :) <davidz> mcrha: just saying that I don't think we can change g_main_quit() to do it ... e.g. that we need to add a new function as halfline says <mcrha> davidz, or a GMainLoop flag defaulting to "quit only when sourceless"? <mcrha> my point is that I'm not going to change all the g_main_loop_quit() in all the applications <mcrha> because I believe the broken applications will be significantly less than those which will benefit from this <davidz> mcrha: well, easiest is to see if we can fix GDBusProxy so it doesn't cause this problem
Kiid of like https://bugzilla.gnome.org/show_bug.cgi?id=639770 except for worker threads...
FWIW, tracker ran into the same problem a while ago http://permalink.gmane.org/gmane.comp.gnome.svn/528142
FWIW, I think the solution is to add a new flag to GDBusSignalFlags G_DBUS_SIGNAL_FLAGS_CALL_DESTROY_NOTIFY_IN_GDBUS_WORKER_THREAD which will cause the GDestroyNotify functions (called when unsubscribing from the signal) in the GDBus worker thread instead of the thread-default GMainContext when registering. In the GDBusProxy case, it's this simple function http://git.gnome.org/browse/glib/tree/gio/gdbusproxy.c?id=2.35.1#n116 which only frees memory meaning that it's fine.
Created attachment 231516 [details] [review] Patch Here's a patch that does this. Milan, is it possible you can test it? Thanks!
(In reply to comment #3) > FWIW, I think the solution is to add a new flag to GDBusSignalFlags > > G_DBUS_SIGNAL_FLAGS_CALL_DESTROY_NOTIFY_IN_GDBUS_WORKER_THREAD The question I have is though - is GDBus the only thing of this form? What about e.g. GSettings? I think I like the "workaround" of scheduling an idle to quit the mainloop as someone mentioned, if it works. That'd be pretty easy to enshrine as g_main_loop_quit_on_idle() or whatever. But if this really is specific to the interaction of the GDBus worker thread and the app's thread, then maybe a GDBus flag is right.
(In reply to comment #5) > (In reply to comment #3) > > FWIW, I think the solution is to add a new flag to GDBusSignalFlags > > > > G_DBUS_SIGNAL_FLAGS_CALL_DESTROY_NOTIFY_IN_GDBUS_WORKER_THREAD > > The question I have is though - is GDBus the only thing of this form? What > about e.g. GSettings? I actually think GDBus is somewhat unique in running GDestroyNotify callbacks on the thread-default context (for GDBusConnection's signal_subscribe(), register_object() and register_subtree() methods at least). Other code seems not specify when notifiers are run, for better or worse. But in GDBus I ended up needing it. > I think I like the "workaround" of scheduling an idle to quit the mainloop as > someone mentioned, if it works. I think it will fail if one of the callbacks called after scheduling the idle, schedules another callback (if you know what I mean). > That'd be pretty easy to enshrine as > g_main_loop_quit_on_idle() or whatever. > > But if this really is specific to the interaction of the GDBus worker thread > and the app's thread, then maybe a GDBus flag is right. It's a bit ugly, yea, but it solves a concrete problem so... but yeah.. anyway, since GDBusProxy is the only user, we could make the flag private and not expose it to user apps (use e.g. bit 31 or something).
(In reply to comment #6) > > > I think I like the "workaround" of scheduling an idle to quit the mainloop as > > someone mentioned, if it works. > > I think it will fail if one of the callbacks called after scheduling the idle, > schedules another callback (if you know what I mean). also falls over if there are any timeouts scheduled before the idle is scheduled.
The patch works for me. Here [1] is a scratch build with it for F18, in case anyone else is interested (until it's deleted). [1] http://koji.fedoraproject.org/koji/taskinfo?taskID=4791508
I'll throw up a more general patch to implement the g_main_loop_quit_when_there_are_no_more_attached_sources I meantioned in the irc conversation.
Created attachment 231582 [details] [review] mainloop: add API for quitting automitically main loops are shared resources, so it's not always easy to know when everyone is done using them and it's safe to quit. The main loop itself knows this information, though. This commit adds new API for telling the main loop to quit automatically when all sources are removed.
Note I'm not specifically proposing we do attachment 231582 [details] [review] instead of attachment 231516 [details] [review] . We could do either, or both, depending on consensus.
I've wanted something like this a bunch of times when writing test cases. Especially when trying to make them properly-valgrindable; you need to make sure all pending callbacks have been run before finishing, since the callbacks may be holding refs on random objects.
Review of attachment 231582 [details] [review]: I like the idea, dislike the name. "_queue_quit()"? "_quit_idle()"?
Those two names sort of imply that it's equivalent to g_idle_add (g_main_loop_quit, loop) (leaving aside g_idle_add doesn't use the loop's context, and g_main_loop_quit doesn't return FALSE); but this does something different than that. I mean, it's not "quit when no sources have pending events", it's "quit when there are no sources". Also, another difference, is this is something that can be done up front before running the main loop, not just when the caller is ready to quit. I'm not tied to _set_quit_automatically though. We could do "set_auto_quit", or "set_quit_on_no_sources", or "set_quit_when_unused" or "set_quit_after_use" or something else.
GDBus-specific solutions aside for a moment... Would it not be possible for users of GMainContext in these short-lived-thread situations to ensure that they do: while (g_main_context_iteration (context, FALSE)); before freeing the GMainContext as a new "best practice"? The only API I could imagine adding here is one that does the above as part of some sort of unref() or pop_thread_default() combined operation -- and I'm not sure we really need that...
that suffers from the same problem as walters g_main_loop_queue_quit proposal. It doesn't account for a pending timeout.
g_main_context_flush()?
I could imagine a g_main_context_flush() call, but it would have to do more than flush pending events to completely solve the problem. It would have to run until there are no more timeouts attached, so it would be more of a g_main_context_finish call than a g_main_context_flush call. Of course, in scenarios where g_main_context_finish() would be used, if it did work that way, there would be little need for a g_main_loop_run() call or even GMainLoop at all anymore. Or maybe we don't care about the pending timeout case. We call that more rare and not our problem. In which case, g_main_loop_queue_quit works, or g_main_context_flush or g_main_context_flush_pop_and_unref or whatever.
(In reply to comment #18) > I could imagine a g_main_context_flush() call, but it would have to do more > than flush pending events to completely solve the problem. It would have to > run until there are no more timeouts attached, so it would be more of a > g_main_context_finish call than a g_main_context_flush call. I think these are all good points (and should definitely go into the docs for the method) but I don't see a good reason why it we can't call it flush(). After all - unless there's a compelling reason - we usually don't encode every little bit of semantic into the method name. For example, in this case, I don't think we're going to add other methods that are almost-similar-but-not-quite so I think we're safe. (Btw, I also think that all GMainLoop instances attached to a flushing GMainContext would quit when flushing is done. But we don't need to encode that into the method name either - but we do need to document it in the docs.)
(In reply to comment #18) > I could imagine a g_main_context_flush() call, but it would have to do more > than flush pending events to completely solve the problem. Yes, the intended meaning was "run until there are no more sources or pollfds attached". "g_main_loop_flush()" would probably make more sense. Although that doesn't sound as right... I agree with Colin in not liking "automatic" though. we could just add g_main_context_has_sources() and then people could loop running g_main_context_iterate() until g_main_context_has_sources() returns FALSE... or "g_main_context_is_empty()" > Or maybe we don't care about the pending timeout case. we do
oh okay, I don't care what name we use as long as we can agree on the semantics and document it.
I talked to desrt, walters, and danw about this today on irc. desrt pointed out that the main loop doesn't exactly know when it's safe to quit. It only knows about attached sources, but a worker thread could plan to attach a source to the context later. The only way the main loop could know for sure that it's okay to quit is if the source list was empty and all outside refs to the loop's context were removed. Quitting when the main loop's context goes away has a bit of a foul odor to it, though. We would probably have to invent some new kind of "don't quit yet" reference or something. So that's hard and not obviously right. Another alternative is to say the timeout case doesn't *really* matter, the future source from another thread case doesn't *really* matter, and the only case that matters is the queued idle case. Solving that case might be just be good enough. Or we could punt entirely and say the programmer should know how the code it calls uses the main loop it provides (and urge libraries to document that).
I'm planning to push the patch in comment 4 since it's kinda orthogonal to the more general problem - if anybody has any strong objections to this, now would be a good time to voice them. Thanks!
Just for reference, the IRC conversation: <desrt> halfline: can you explain the timeout usecase to me? <desrt> (if you are awake yet) :) <desrt> and in particular: do you propose that we dispatch timeouts immediately, or block and wait for them to fire? <halfline> desrt: hey <halfline> so the point is, a main loop is a shared resource <halfline> and it could be the creator of the main loop doesn't know fully when other users of that main loop are done with it <halfline> but the creator of the main loop is the one doing the quitting <halfline> if the creator quits prematurely problems can crop up <halfline> so my proposal was that the creator of the main loop shouldn't be the one that quits the main loop <halfline> instead the main loop should quit when there are no users left <halfline> (sources) <halfline> i don't think timeouts should get blasted early <halfline> i think the main loop should run until the timeouts are removed <halfline> we put the onus on the code creating the main loop to be the one to quit the main loop, but don't provide any api to know when the loop can be safely quit <halfline> we could provide that api, or we could remove the burden. my proposal argues for removing the burden <desrt> halfline: so you're suggesting that the mainloop should control the life of the thread, effectively <halfline> not the thread, the life of the loop <halfline> the main loop should control its own life <halfline> it knows when it's not needed anymore <desrt> we're talking about the case of a worker thread, though, right? <desrt> well... not worker, but not the main thread, either <halfline> right, but the thread goes on after g_main_loop_run() finishes <halfline> in fact, in the original case that spurred the bug <desrt> you're effectively asking that g_main_loop_quit() stop existing <halfline> the problem was because of gdbus attaching a source after g_main_loop_run finished <desrt> which, as the author of GApplication, i have to admit i understand the appeal of... <walters> were we to start again, i'd probably have made g_main_loop_quit() have those semantics <halfline> well my proposal puts a boolean on the loop object <halfline> if the boolean is true, g_main_loop_quit shouldn't rarely get used <halfline> if the boolean is false, g_main_loop_quit still needs to get used <halfline> s/shouldn't rarely/should rarely/ <halfline> basically you tell the loop whether or not it manages itself <desrt> i'm not happy <halfline> and of course even if it is managing itself, g_main_loop_quit would still work <desrt> consider this... <desrt> i push the default main context in a thread <desrt> do an async call <desrt> the async result object for the call take a ref on the context <desrt> i pop it <desrt> there are no sources on the loop <desrt> i want to quit.... <desrt> but meanwhile there is this pending event that just happens not to be a GSource yet <desrt> that kind of separation seems a bit arbitrary to me <desrt> ie: what we'd really want is to have the mainloop run for as long as a ref is held on the context -- and that's just insane <desrt> unless we define some new sort of ref for contexts that says "i will be adding a callback here soon, maybe" <desrt> and now we're getting really close to g_application_hold().... <halfline> can you explain the "pending even that just happens not to be a GSource yet" in greater detail for me? <desrt> the usual way that async calls for something like GDBus work is like so: <desrt> i have a thread-default main context that's pushed to a TLS pointer -- it gets read at the time of the _async() call and a ref is held on that context <desrt> usually by the GAsyncResult object that will eventually be used to report the completion (by convention we typically create GSimpleAsyncResult/GTask at the time of the async call and attach the context there) <desrt> so now i have a ref on a context but absolutely nothing else <desrt> this async call is exactly the same as a timeout, really: in the future, the main context will dispatch a callback <desrt> so then the async call becomes finished <desrt> what happens then is that we schedule an idle on the main context in order to report it <desrt> the main context that was the thread-default one at the time the call was made... <halfline> oh i see where you're going with this <desrt> that main context could be 'dangling' <desrt> ie: the only ref held on it at that point may be the one that was held by the async result state object <desrt> so we'll schedule the idle into the context and unref it -- destroying it <desrt> meanwhile the loop will have 'automatically' quit long ago for lack of having any actual sources in it <halfline> hmm <desrt> i really think we either need to have a much larger hammer here or we need to just tell programmers that we have higher expentations of them <desrt> ie: if there may be something "pending" on the context (in the sense of an actual source, or a potential future source, as above) then they are responsible for continuing to run it <halfline> yea i think i agree with you <walters> desrt: in this case though, the app programmer knows when all async calls they've made are complete; it'd be unexpected if there was an async op queued internally by glib outstanding when all ops they'd started had completed <desrt> walters: could easily be the case that some library has queued an 'internal' async call on their context, though... <desrt> we do that sort of thing in libraries all the time.... <halfline> though if we expect more from the programmer, we need to expect more from the provided documentation too <desrt> halfline: the other alternative is that we get even more pragmatic <desrt> and we realise that idles really are special... <desrt> ie: stop trying to solve the problem for timeouts <desrt> afaik there's no "real problem" here that involves timeouts <halfline> yea i meantioned that as a possibility on the bug report <desrt> so that goes back to my suggestion of doing a while(iteration(false)) <halfline> danw was pretty adamant we solve the timeout case too <desrt> and maybe having some API based on that <desrt> i don't agree with that <desrt> timeouts are even less likely to be a real problem than pending async callbacks <halfline> well he didn't give his reasons, so maybe he just hasn't persuaded you yet. :-) <halfline> danw: around? <danw> yes <halfline> danw: can you read scrollback and chime in with your thoughts ? <walters> maybe async ops should hold a weak ref on the context - if it gets destroyed, the callbacks get moved to the glib worker thread <desrt> walters: no way dude <walters> yeah...it's kinda crazy <desrt> gotta fly <desrt> will talk later <halfline> oh ok <danw> halfline, desrt: i wouldn't say i was "adamant" about timeouts, it just seemed arbitrary to me to make it work for some sources but not others <danw> but desrt's point about async callbacks is a good one. there might not be any sane API that has the semantics I want. And I don't remember the detailed specifics of what I was up to when I was doing these things
Probably too late for 2.36 on this one now, but... (In reply to comment #24) > <desrt> consider this... > <desrt> i push the default main context in a thread > <desrt> do an async call > <desrt> the async result object for the call take a ref on the context > <desrt> i pop it > <desrt> there are no sources on the loop > <desrt> i want to quit.... > <desrt> but meanwhile there is this pending event that just happens not to be a > GSource yet > <desrt> that kind of separation seems a bit arbitrary to me > <desrt> ie: what we'd really want is to have the mainloop run for as long as a > ref is held on the context -- and that's just insane > <desrt> unless we define some new sort of ref for contexts that says "i will be > adding a callback here soon, maybe" Or we could reorganize how GTask and similar APIs work. It could attach the completion source to the GMainContext right away (ie, in g_task_new()), but then call g_source_set_ready_time(-1). Then when it was ready to dispatch it, call g_source_set_ready_time(0). Then a "wait until all sources have been removed from the context" API could work again.
This is WONTFIX territory for me... I'm not sure it's worth expending that much effort.
FWIW, we hit a bug in GJS where this would have been useful, bug 697436. Maybe we could add the g_main_context_has_sources() call Dan proposed in comment 20. That callers can provide "good enough" solutions for their own uses even if it's not sufficient to give a fool proof quit-when-ready api (that could be put in glib proper).
interesting side discussion: <desrt> i consider GMainLoop semi-deprecated <walters> woah wha? <desrt> seriously <walters> what replaces it? <desrt> i don't think it's a very good thing to use for... almost anything, really <desrt> patterns like i did in this patch <desrt> GApplication has some similar logic in side of it <desrt> while (use_count) g_main_context_iterate(); <walters> interesting <desrt> the idea of having some weird object that iterates a maincontext until you quit() it from somewhere else is weird and tends to cause bugs like this one... <walters> i am unable to fit the possible ramifications of this in my mental model of application programming at the moment <walters> i definitely like a lot how it makes it easy to encode the termination condition in exactly one place <walters> that being the part i keep screwing up <desrt> it also is less code....