GNOME Bugzilla – Bug 736806
gtask: Fix reference count loop causing leaks
Last modified: 2014-09-23 07:09:42 UTC
Patches attached. Bit of an edge case, but it’s possible to hit this by firing off a lot of GTasks to limited-size thread pools very quickly.
Created attachment 286379 [details] [review] gtask: Fix a signed/unsigned integer comparison The GSource times assigned to creation_time are always signed.
Created attachment 286380 [details] [review] gtask: Eliminate a reference counting loop between GTask and cancellable The GTask holds a reference to its cancellable, so for the cancelled signal handler to hold a reference to the GTask creates a reference loop. This reference counting loop was causing GTasks which have failed to be added to the thread pool queue (g_thread_pool_push() in g_task_start_task_thread()) to be leaked, as the signal handler was never removed in that case. A follow-up commit fixes that issue too.
Created attachment 286381 [details] [review] gtask: Fix cleanup on an error path in g_task_thread_complete() If pushing a task to the thread pool queue fails in g_task_start_task_thread(), the task is left in a half-constructed state, with its cancelled signal handler connected, but thread_complete set to TRUE. g_task_thread_complete() was previously bailing immediately, causing the signal handler to never be disconnected. In combination with a reference loop this was causing GTasks in this state to be leaked. Change g_task_thread_complete() to always clean up the GTask’s state, regardless of whether thread_complete is set. thread_complete now only controls whether g_task_return() is called at the end of g_task_thread_complete(), which is probably its originally intended purpose.
Comment on attachment 286380 [details] [review] gtask: Eliminate a reference counting loop between GTask and cancellable >The GTask holds a reference to its cancellable, so for the cancelled >signal handler to hold a reference to the GTask creates a reference >loop. The additional reference is needed for thread-safety reasons. If the task completes and is cancelled at roughly the same time, then task_thread_cancelled() might get called in the cancelling thread before g_task_thread_complete() in the task thread can remove the signal handler. We need to guarantee that task continues to exist until task_thread_cancelled() returns, and the only way to do that is for that thread to have its own ref. Though looking at it more carefully now, the current code is wrong, because the cancellation thread's ref gets dropped when the signal handler is removed (from the task thread). So that needs to be fixed... >This reference counting loop was causing GTasks which have failed to be >added to the thread pool queue (g_thread_pool_push() in >g_task_start_task_thread()) to be leaked, as the signal handler was >never removed in that case. A follow-up commit fixes that issue too. So with that follow-up commit, this change is not necessary, right?
Comment on attachment 286381 [details] [review] gtask: Fix cleanup on an error path in g_task_thread_complete() > thread_complete now only controls whether g_task_return() is called at > the end of g_task_thread_complete() The task->blocking_other_task part needs to happen only if (!task_already_complete) too. (And the cancellation removal is only necessary if !task_already_complete, although it doesn't *hurt* to do it in both cases...) > which is probably its originally intended purpose. The original purpose of thread_complete was to deal with the complete/cancel race condition as mentioned in my last comment. Using it for tasks-that-should-never-run was sort of a hack... >+ /* The task belatedly completed after having been cancelled >+ * (or was cancelled in the midst of being completed). >+ */ >+ task_already_complete = task->thread_complete; That comment was previously inside an if(), meaning at that point in the code, the statement it made was always true. Now it's not. It needs to say something like "Check if the task belatedly completed..."
(In reply to comment #4) > Though looking at it more carefully now, the current code is wrong, because the > cancellation thread's ref gets dropped when the signal handler is removed (from > the task thread). So that needs to be fixed... No, my bad; the code is right. g_signal_connect_data() uses a GClosure internally, so the GDestroyNotify can't be called while the signal handler is running.
(In reply to comment #5) > (From update of attachment 286381 [details] [review]) > > thread_complete now only controls whether g_task_return() is called at > > the end of g_task_thread_complete() > > The task->blocking_other_task part needs to happen only if > (!task_already_complete) too. I was treating blocking_other_task as a flag that the thread pool’s max threads had been changed, and hence that it needs resetting on task completion. If task_already_complete is TRUE, blocking_other_task is FALSE, so having the condition there is harmless. But doing it this way round (without explicit reference to task_already_completed in the condition) is more robust to future changes, e.g. if the thread pool was changed even when task_already_complete is TRUE. > (And the cancellation removal is only necessary > if !task_already_complete, although it doesn't *hurt* to do it in both > cases...) The cancellation removal is the entire point of the patch. The code: g_thread_pool_push (task_pool, g_object_ref (task), &task->error); if (task->error) task->thread_complete = TRUE; in g_task_start_task_thread() is called after the signal handler is connected, and hence the signal handler *has* to be removed regardless of whether thread_complete is TRUE or FALSE. > >+ /* The task belatedly completed after having been cancelled > >+ * (or was cancelled in the midst of being completed). > >+ */ > >+ task_already_complete = task->thread_complete; > > That comment was previously inside an if(), meaning at that point in the code, > the statement it made was always true. Now it's not. It needs to say something > like "Check if the task belatedly completed..." Good point, changed.
Created attachment 286404 [details] [review] gtask: Fix a signed/unsigned integer comparison The GSource times assigned to creation_time are always signed.
Created attachment 286405 [details] [review] gtask: Document signal handler reference counting Explain why the signal handler holds a reference to the GTask, even though that causes a reference loop at first glance.
Created attachment 286406 [details] [review] gtask: Fix cleanup on an error path in g_task_thread_complete() If pushing a task to the thread pool queue fails in g_task_start_task_thread(), the task is left in a half-constructed state, with its cancelled signal handler connected, but thread_complete set to TRUE. g_task_thread_complete() was previously bailing immediately, causing the signal handler to never be disconnected. In combination with a reference loop this was causing GTasks in this state to be leaked. Change g_task_thread_complete() to always clean up the GTask’s state, regardless of whether thread_complete is set. thread_complete now only controls whether g_task_return() is called at the end of g_task_thread_complete(), which is probably its originally intended purpose.
(In reply to comment #4) > >This reference counting loop was causing GTasks which have failed to be > >added to the thread pool queue (g_thread_pool_push() in > >g_task_start_task_thread()) to be leaked, as the signal handler was > >never removed in that case. A follow-up commit fixes that issue too. > > So with that follow-up commit, this change is not necessary, right? Yup. I’ve changed it into a commit adding a comment so hopefully nobody makes the same mistake in future.
Comment on attachment 286405 [details] [review] gtask: Document signal handler reference counting >+ * Accordingly, the signal handler *must* be removed once the task has >+ * completed. */ put the "*/" on the next line please
(In reply to comment #7) > > (And the cancellation removal is only necessary > > if !task_already_complete, although it doesn't *hurt* to do it in both > > cases...) > > The cancellation removal is the entire point of the patch. Ah, well, yes, I was mistakenly referring to the original semantics of task_complete again... Anyway, my point was that in the both-completed-and-cancelled case, we will now end up calling g_signal_handlers_disconnect_by_func() twice. Which is harmless, but unnecessary. (But which also suggests that we're overloading the meaning of task_complete too much...)
(In reply to comment #7) > > The task->blocking_other_task part needs to happen only if > > (!task_already_complete) too. > > I was treating blocking_other_task as a flag that the thread pool’s max threads > had been changed, and hence that it needs resetting on task completion. If > task_already_complete is TRUE, blocking_other_task is FALSE, so having the > condition there is harmless. No, that's only true in the complete-immediately case. In the both-completed-and-cancelled case, blocking_other_task might be TRUE, and if so, your patch would result in the thread pool being shrunk twice instead of just once. Hm... should we even be passing a GError to g_thread_pool_push()? The docs say: * @error can be %NULL to ignore errors, or non-%NULL to report * errors. An error can only occur when a new thread couldn't be * created. In that case @data is simply appended to the queue of * work to do. So as long as there are *any* threads in the pool, the task will eventually get run, so we shouldn't return an error to the caller. And it can only fail if you hit the thread limit, in which case your program will probably crash soon anyway... Were you actually hitting this error?
Created attachment 286589 [details] [review] gtask: Fix a signed/unsigned integer comparison The GSource times assigned to creation_time are always signed.
Created attachment 286590 [details] [review] gtask: Document signal handler reference counting Explain why the signal handler holds a reference to the GTask, even though that causes a reference loop at first glance.
Created attachment 286591 [details] [review] gtask: Fix cleanup on an error path in g_task_thread_complete() If pushing a task to the thread pool queue fails in g_task_start_task_thread(), the task is left in a half-constructed state, with its cancelled signal handler connected, but thread_complete set to TRUE. g_task_thread_complete() was previously bailing immediately, causing the signal handler to never be disconnected. In combination with a reference loop this was causing GTasks in this state to be leaked. Change g_task_thread_complete() to always clean up the GTask’s state, regardless of whether thread_complete is set. thread_complete now only controls whether g_task_return() is called at the end of g_task_thread_complete(), which is probably its originally intended purpose.
Created attachment 286592 [details] [review] gtask: Ignore errors from g_thread_pool_push() g_thread_pool_push() only returns an error if it fails to spawn a new thread. However, it unconditionally adds the task to its worker queue, so: • if _any_ threads exist in the pool, the task will eventually be handled; and • if _no_ threads exist in the pool, the task will be handled if one is eventually successfully spawned. If no more threads are ever spawned, the process probably has bigger problems than a single GTask which is taking forever to complete.
Created attachment 286598 [details] [review] gtask: Fix a signed/unsigned integer comparison The GSource times assigned to creation_time are always signed.
Created attachment 286599 [details] [review] gtask: Document signal handler reference counting Explain why the signal handler holds a reference to the GTask, even though that causes a reference loop at first glance.
Created attachment 286600 [details] [review] gtask: Fix cleanup on an error path in g_task_thread_complete() If pushing a task to the thread pool queue fails in g_task_start_task_thread(), the task is left in a half-constructed state, with its cancelled signal handler connected, but thread_complete set to TRUE. g_task_thread_complete() was previously bailing immediately, causing the signal handler to never be disconnected. In combination with a reference loop this was causing GTasks in this state to be leaked. Change g_task_thread_complete() to always clean up the GTask’s state, regardless of whether thread_complete is set. thread_complete now only controls whether g_task_return() is called at the end of g_task_thread_complete(), which is probably its originally intended purpose.
Created attachment 286601 [details] [review] gtask: Ignore errors from g_thread_pool_push() g_thread_pool_push() only returns an error if it fails to spawn a new thread. However, it unconditionally adds the task to its worker queue, so: • if _any_ threads exist in the pool, the task will eventually be handled; and • if _no_ threads exist in the pool, the task will be handled if one is eventually successfully spawned. If no more threads are ever spawned, the process probably has bigger problems than a single GTask which is taking forever to complete.
(In reply to comment #14) > (In reply to comment #7) > > > The task->blocking_other_task part needs to happen only if > > > (!task_already_complete) too. > > > > I was treating blocking_other_task as a flag that the thread pool’s max threads > > had been changed, and hence that it needs resetting on task completion. If > > task_already_complete is TRUE, blocking_other_task is FALSE, so having the > > condition there is harmless. > > No, that's only true in the complete-immediately case. In the > both-completed-and-cancelled case, blocking_other_task might be TRUE, and if > so, your patch would result in the thread pool being shrunk twice instead of > just once. Ah, yes. Protecting the max threads decrement with an if (!task_already_complete) should do it. Updated patch above. > Hm... should we even be passing a GError to g_thread_pool_push()? The docs say: > > * @error can be %NULL to ignore errors, or non-%NULL to report > * errors. An error can only occur when a new thread couldn't be > * created. In that case @data is simply appended to the queue of > * work to do. > > So as long as there are *any* threads in the pool, the task will eventually get > run, so we shouldn't return an error to the caller. But perhaps we should return an error to the caller in the case that there are *no* threads in the pool, and creating the first one fails. Except that even in that case, the task is still added to the pool’s queue, and could eventually be executed if the pool manages to spawn a thread in future. So I guess we should handle the case of zero threads in the pool by assuming it will create some in future, and hence completely ignore errors. Patch above. > And it can only fail if you hit the thread limit, in which case your program > will probably crash soon anyway... > > Were you actually hitting this error? Pretty sure we were. Due to bug #736809 we were ending up with hundreds of TLS worker threads, so could have been hitting the limit…but the limit is typically 100000s, so perhaps not. :-\ In any case, the state of the GTask after it had stalled exactly matches the fix, and I couldn’t find any other cause. This is a less than ideal diagnosis though.
Comment on attachment 286600 [details] [review] gtask: Fix cleanup on an error path in g_task_thread_complete() The "Ignore errors from g_thread_pool_push()" patch removes the code path that would result in the signal handler getting leaked, so this is now unnecessary. (And it makes g_task_thread_complete() kind of a mess, so I'm happier without it.)
Sorted. Thanks for the reviews. :-) Attachment 286598 [details] pushed as c6838ff - gtask: Fix a signed/unsigned integer comparison Attachment 286599 [details] pushed as 925913d - gtask: Document signal handler reference counting Attachment 286601 [details] pushed as f41ebeb - gtask: Ignore errors from g_thread_pool_push()