Bug 736806 – gtask: Fix reference count loop causing leaks

After an evaluation, GNOME has moved from Bugzilla to GitLab. Learn more about GitLab.
No new issues can be reported in GNOME Bugzilla anymore.
To report an issue in a GNOME project, go to GNOME GitLab.
Do not go to GNOME Gitlab for: Bluefish, Doxygen, GnuCash, GStreamer, java-gnome, LDTP, NetworkManager, Tomboy.

Bug 736806 - gtask: Fix reference count loop causing leaks


Summary:	gtask: Fix reference count loop causing leaks


Status:	RESOLVED FIXED

Product:	glib
Classification:	Platform
Component:	gio
Version:	unspecified
Hardware:	Other All

Importance:	Normal normal
Target Milestone:	---
Assigned To:	gtkdev
QA Contact:	gtkdev

URL:
Whiteboard:

Depends on:
Blocks:

Reported:	2014-09-17 14:26 UTC by Philip Withnall
Modified:	2014-09-23 07:09 UTC

See Also:
GNOME target:	---
GNOME version:	---

Attachments
gtask: Fix a signed/unsigned integer comparison (764 bytes, patch) 2014-09-17 14:26 UTC, Philip Withnall	accepted-commit_after_freeze	Details \| Review
gtask: Eliminate a reference counting loop between GTask and cancellable (1.69 KB, patch) 2014-09-17 14:26 UTC, Philip Withnall	rejected	Details \| Review
gtask: Fix cleanup on an error path in g_task_thread_complete() (2.31 KB, patch) 2014-09-17 14:26 UTC, Philip Withnall	needs-work	Details \| Review
gtask: Fix a signed/unsigned integer comparison (764 bytes, patch) 2014-09-17 16:33 UTC, Philip Withnall	accepted-commit_after_freeze	Details \| Review
gtask: Document signal handler reference counting (1.21 KB, patch) 2014-09-17 16:33 UTC, Philip Withnall	accepted-commit_after_freeze	Details \| Review
gtask: Fix cleanup on an error path in g_task_thread_complete() (2.32 KB, patch) 2014-09-17 16:33 UTC, Philip Withnall	none	Details \| Review
gtask: Fix a signed/unsigned integer comparison (764 bytes, patch) 2014-09-19 07:56 UTC, Philip Withnall	none	Details \| Review
gtask: Document signal handler reference counting (1.21 KB, patch) 2014-09-19 07:57 UTC, Philip Withnall	none	Details \| Review
gtask: Fix cleanup on an error path in g_task_thread_complete() (2.48 KB, patch) 2014-09-19 07:57 UTC, Philip Withnall	none	Details \| Review
gtask: Ignore errors from g_thread_pool_push() (1.55 KB, patch) 2014-09-19 07:57 UTC, Philip Withnall	none	Details \| Review
gtask: Fix a signed/unsigned integer comparison (764 bytes, patch) 2014-09-19 08:49 UTC, Philip Withnall	committed	Details \| Review
gtask: Document signal handler reference counting (1.22 KB, patch) 2014-09-19 08:49 UTC, Philip Withnall	committed	Details \| Review
gtask: Fix cleanup on an error path in g_task_thread_complete() (2.48 KB, patch) 2014-09-19 08:49 UTC, Philip Withnall	rejected	Details \| Review
gtask: Ignore errors from g_thread_pool_push() (1.55 KB, patch) 2014-09-19 08:49 UTC, Philip Withnall	committed	Details \| Review

Description Philip Withnall 2014-09-17 14:26:11 UTC

Patches attached. Bit of an edge case, but it’s possible to hit this by firing off a lot of GTasks to limited-size thread pools very quickly.

Comment 1 Philip Withnall 2014-09-17 14:26:13 UTC

Created attachment 286379 [details] [review]
gtask: Fix a signed/unsigned integer comparison

The GSource times assigned to creation_time are always signed.

Comment 2 Philip Withnall 2014-09-17 14:26:17 UTC

Created attachment 286380 [details] [review]
gtask: Eliminate a reference counting loop between GTask and cancellable

The GTask holds a reference to its cancellable, so for the cancelled
signal handler to hold a reference to the GTask creates a reference
loop.

This reference counting loop was causing GTasks which have failed to be
added to the thread pool queue (g_thread_pool_push() in
g_task_start_task_thread()) to be leaked, as the signal handler was
never removed in that case. A follow-up commit fixes that issue too.

Comment 3 Philip Withnall 2014-09-17 14:26:20 UTC

Created attachment 286381 [details] [review]
gtask: Fix cleanup on an error path in g_task_thread_complete()

If pushing a task to the thread pool queue fails in
g_task_start_task_thread(), the task is left in a half-constructed
state, with its cancelled signal handler connected, but thread_complete
set to TRUE.

g_task_thread_complete() was previously bailing immediately, causing the
signal handler to never be disconnected. In combination with a reference
loop this was causing GTasks in this state to be leaked. Change
g_task_thread_complete() to always clean up the GTask’s state,
regardless of whether thread_complete is set.

thread_complete now only controls whether g_task_return() is called at
the end of g_task_thread_complete(), which is probably its originally
intended purpose.

Comment 4 Dan Winship 2014-09-17 15:29:14 UTC

Comment on attachment 286380 [details] [review]
gtask: Eliminate a reference counting loop between GTask and cancellable

>The GTask holds a reference to its cancellable, so for the cancelled
>signal handler to hold a reference to the GTask creates a reference
>loop.

The additional reference is needed for thread-safety reasons. If the task completes and is cancelled at roughly the same time, then task_thread_cancelled() might get called in the cancelling thread before g_task_thread_complete() in the task thread can remove the signal handler. We need to guarantee that task continues to exist until task_thread_cancelled() returns, and the only way to do that is for that thread to have its own ref.

Though looking at it more carefully now, the current code is wrong, because the cancellation thread's ref gets dropped when the signal handler is removed (from the task thread). So that needs to be fixed...

>This reference counting loop was causing GTasks which have failed to be
>added to the thread pool queue (g_thread_pool_push() in
>g_task_start_task_thread()) to be leaked, as the signal handler was
>never removed in that case. A follow-up commit fixes that issue too.

So with that follow-up commit, this change is not necessary, right?

Comment 5 Dan Winship 2014-09-17 15:36:13 UTC

Comment on attachment 286381 [details] [review]
gtask: Fix cleanup on an error path in g_task_thread_complete()

> thread_complete now only controls whether g_task_return() is called at
> the end of g_task_thread_complete()

The task->blocking_other_task part needs to happen only if (!task_already_complete) too. (And the cancellation removal is only necessary if !task_already_complete, although it doesn't *hurt* to do it in both cases...)

> which is probably its originally intended purpose.

The original purpose of thread_complete was to deal with the complete/cancel race condition as mentioned in my last comment. Using it for tasks-that-should-never-run was sort of a hack...

>+  /* The task belatedly completed after having been cancelled
>+   * (or was cancelled in the midst of being completed).
>+   */
>+  task_already_complete = task->thread_complete;

That comment was previously inside an if(), meaning at that point in the code, the statement it made was always true. Now it's not. It needs to say something like "Check if the task belatedly completed..."

Comment 6 Dan Winship 2014-09-17 15:50:41 UTC

(In reply to comment #4)
> Though looking at it more carefully now, the current code is wrong, because the
> cancellation thread's ref gets dropped when the signal handler is removed (from
> the task thread). So that needs to be fixed...

No, my bad; the code is right. g_signal_connect_data() uses a GClosure internally, so the GDestroyNotify can't be called while the signal handler is running.

Comment 7 Philip Withnall 2014-09-17 16:28:01 UTC

(In reply to comment #5)
> (From update of attachment 286381 [details] [review])
> > thread_complete now only controls whether g_task_return() is called at
> > the end of g_task_thread_complete()
> 
> The task->blocking_other_task part needs to happen only if
> (!task_already_complete) too.

I was treating blocking_other_task as a flag that the thread pool’s max threads had been changed, and hence that it needs resetting on task completion. If task_already_complete is TRUE, blocking_other_task is FALSE, so having the condition there is harmless. But doing it this way round (without explicit reference to task_already_completed in the condition) is more robust to future changes, e.g. if the thread pool was changed even when task_already_complete is TRUE.

> (And the cancellation removal is only necessary
> if !task_already_complete, although it doesn't *hurt* to do it in both
> cases...)

The cancellation removal is the entire point of the patch.

The code:
  g_thread_pool_push (task_pool, g_object_ref (task), &task->error);
  if (task->error)
    task->thread_complete = TRUE;
in g_task_start_task_thread() is called after the signal handler is connected, and hence the signal handler *has* to be removed regardless of whether thread_complete is TRUE or FALSE.

> >+  /* The task belatedly completed after having been cancelled
> >+   * (or was cancelled in the midst of being completed).
> >+   */
> >+  task_already_complete = task->thread_complete;
> 
> That comment was previously inside an if(), meaning at that point in the code,
> the statement it made was always true. Now it's not. It needs to say something
> like "Check if the task belatedly completed..."

Good point, changed.

Comment 8 Philip Withnall 2014-09-17 16:33:13 UTC

Created attachment 286404 [details] [review]
gtask: Fix a signed/unsigned integer comparison

The GSource times assigned to creation_time are always signed.

Comment 9 Philip Withnall 2014-09-17 16:33:17 UTC

Created attachment 286405 [details] [review]
gtask: Document signal handler reference counting

Explain why the signal handler holds a reference to the GTask, even
though that causes a reference loop at first glance.

Comment 10 Philip Withnall 2014-09-17 16:33:21 UTC

Created attachment 286406 [details] [review]
gtask: Fix cleanup on an error path in g_task_thread_complete()

If pushing a task to the thread pool queue fails in
g_task_start_task_thread(), the task is left in a half-constructed
state, with its cancelled signal handler connected, but thread_complete
set to TRUE.

g_task_thread_complete() was previously bailing immediately, causing the
signal handler to never be disconnected. In combination with a reference
loop this was causing GTasks in this state to be leaked. Change
g_task_thread_complete() to always clean up the GTask’s state,
regardless of whether thread_complete is set.

thread_complete now only controls whether g_task_return() is called at
the end of g_task_thread_complete(), which is probably its originally
intended purpose.

Comment 11 Philip Withnall 2014-09-17 16:34:10 UTC

(In reply to comment #4)
> >This reference counting loop was causing GTasks which have failed to be
> >added to the thread pool queue (g_thread_pool_push() in
> >g_task_start_task_thread()) to be leaked, as the signal handler was
> >never removed in that case. A follow-up commit fixes that issue too.
> 
> So with that follow-up commit, this change is not necessary, right?

Yup. I’ve changed it into a commit adding a comment so hopefully nobody makes the same mistake in future.

Comment 12 Dan Winship 2014-09-18 17:47:28 UTC

Comment on attachment 286405 [details] [review]
gtask: Document signal handler reference counting

>+       * Accordingly, the signal handler *must* be removed once the task has
>+       * completed. */

put the "*/" on the next line please

Comment 13 Dan Winship 2014-09-18 17:50:12 UTC

(In reply to comment #7)
> > (And the cancellation removal is only necessary
> > if !task_already_complete, although it doesn't *hurt* to do it in both
> > cases...)
> 
> The cancellation removal is the entire point of the patch.

Ah, well, yes, I was mistakenly referring to the original semantics of task_complete again... Anyway, my point was that in the both-completed-and-cancelled case, we will now end up calling g_signal_handlers_disconnect_by_func() twice. Which is harmless, but unnecessary. (But which also suggests that we're overloading the meaning of task_complete too much...)

Comment 14 Dan Winship 2014-09-18 18:09:38 UTC

(In reply to comment #7)
> > The task->blocking_other_task part needs to happen only if
> > (!task_already_complete) too.
> 
> I was treating blocking_other_task as a flag that the thread pool’s max threads
> had been changed, and hence that it needs resetting on task completion. If
> task_already_complete is TRUE, blocking_other_task is FALSE, so having the
> condition there is harmless.

No, that's only true in the complete-immediately case. In the both-completed-and-cancelled case, blocking_other_task might be TRUE, and if so, your patch would result in the thread pool being shrunk twice instead of just once.


Hm... should we even be passing a GError to g_thread_pool_push()? The docs say:

 * @error can be %NULL to ignore errors, or non-%NULL to report
 * errors. An error can only occur when a new thread couldn't be
 * created. In that case @data is simply appended to the queue of
 * work to do.

So as long as there are *any* threads in the pool, the task will eventually get run, so we shouldn't return an error to the caller.

And it can only fail if you hit the thread limit, in which case your program will probably crash soon anyway...

Were you actually hitting this error?

Comment 15 Philip Withnall 2014-09-19 07:56:55 UTC

Created attachment 286589 [details] [review]
gtask: Fix a signed/unsigned integer comparison

The GSource times assigned to creation_time are always signed.

Comment 16 Philip Withnall 2014-09-19 07:57:00 UTC

Created attachment 286590 [details] [review]
gtask: Document signal handler reference counting

Explain why the signal handler holds a reference to the GTask, even
though that causes a reference loop at first glance.

Comment 17 Philip Withnall 2014-09-19 07:57:03 UTC

Created attachment 286591 [details] [review]
gtask: Fix cleanup on an error path in g_task_thread_complete()

If pushing a task to the thread pool queue fails in
g_task_start_task_thread(), the task is left in a half-constructed
state, with its cancelled signal handler connected, but thread_complete
set to TRUE.

g_task_thread_complete() was previously bailing immediately, causing the
signal handler to never be disconnected. In combination with a reference
loop this was causing GTasks in this state to be leaked. Change
g_task_thread_complete() to always clean up the GTask’s state,
regardless of whether thread_complete is set.

thread_complete now only controls whether g_task_return() is called at
the end of g_task_thread_complete(), which is probably its originally
intended purpose.

Comment 18 Philip Withnall 2014-09-19 07:57:07 UTC

Created attachment 286592 [details] [review]
gtask: Ignore errors from g_thread_pool_push()

g_thread_pool_push() only returns an error if it fails to spawn a new
thread. However, it unconditionally adds the task to its worker queue,
so:
 • if _any_ threads exist in the pool, the task will eventually be
   handled; and
 • if _no_ threads exist in the pool, the task will be handled if one
   is eventually successfully spawned.
If no more threads are ever spawned, the process probably has bigger
problems than a single GTask which is taking forever to complete.

Comment 19 Philip Withnall 2014-09-19 08:49:42 UTC

Created attachment 286598 [details] [review]
gtask: Fix a signed/unsigned integer comparison

The GSource times assigned to creation_time are always signed.

Comment 20 Philip Withnall 2014-09-19 08:49:46 UTC

Created attachment 286599 [details] [review]
gtask: Document signal handler reference counting

Explain why the signal handler holds a reference to the GTask, even
though that causes a reference loop at first glance.

Comment 21 Philip Withnall 2014-09-19 08:49:51 UTC

Created attachment 286600 [details] [review]
gtask: Fix cleanup on an error path in g_task_thread_complete()

If pushing a task to the thread pool queue fails in
g_task_start_task_thread(), the task is left in a half-constructed
state, with its cancelled signal handler connected, but thread_complete
set to TRUE.

g_task_thread_complete() was previously bailing immediately, causing the
signal handler to never be disconnected. In combination with a reference
loop this was causing GTasks in this state to be leaked. Change
g_task_thread_complete() to always clean up the GTask’s state,
regardless of whether thread_complete is set.

thread_complete now only controls whether g_task_return() is called at
the end of g_task_thread_complete(), which is probably its originally
intended purpose.

Comment 22 Philip Withnall 2014-09-19 08:49:56 UTC

Created attachment 286601 [details] [review]
gtask: Ignore errors from g_thread_pool_push()

g_thread_pool_push() only returns an error if it fails to spawn a new
thread. However, it unconditionally adds the task to its worker queue,
so:
 • if _any_ threads exist in the pool, the task will eventually be
   handled; and
 • if _no_ threads exist in the pool, the task will be handled if one
   is eventually successfully spawned.
If no more threads are ever spawned, the process probably has bigger
problems than a single GTask which is taking forever to complete.

Comment 23 Philip Withnall 2014-09-19 12:46:40 UTC

(In reply to comment #14)
> (In reply to comment #7)
> > > The task->blocking_other_task part needs to happen only if
> > > (!task_already_complete) too.
> > 
> > I was treating blocking_other_task as a flag that the thread pool’s max threads
> > had been changed, and hence that it needs resetting on task completion. If
> > task_already_complete is TRUE, blocking_other_task is FALSE, so having the
> > condition there is harmless.
> 
> No, that's only true in the complete-immediately case. In the
> both-completed-and-cancelled case, blocking_other_task might be TRUE, and if
> so, your patch would result in the thread pool being shrunk twice instead of
> just once.

Ah, yes. Protecting the max threads decrement with an if (!task_already_complete) should do it. Updated patch above.

> Hm... should we even be passing a GError to g_thread_pool_push()? The docs say:
> 
>  * @error can be %NULL to ignore errors, or non-%NULL to report
>  * errors. An error can only occur when a new thread couldn't be
>  * created. In that case @data is simply appended to the queue of
>  * work to do.
> 
> So as long as there are *any* threads in the pool, the task will eventually get
> run, so we shouldn't return an error to the caller.

But perhaps we should return an error to the caller in the case that there are *no* threads in the pool, and creating the first one fails. Except that even in that case, the task is still added to the pool’s queue, and could eventually be executed if the pool manages to spawn a thread in future.

So I guess we should handle the case of zero threads in the pool by assuming it will create some in future, and hence completely ignore errors. Patch above.

> And it can only fail if you hit the thread limit, in which case your program
> will probably crash soon anyway...
> 
> Were you actually hitting this error?

Pretty sure we were. Due to bug #736809 we were ending up with hundreds of TLS worker threads, so could have been hitting the limit…but the limit is typically 100000s, so perhaps not. :-\  In any case, the state of the GTask after it had stalled exactly matches the fix, and I couldn’t find any other cause. This is a less than ideal diagnosis though.

Comment 24 Dan Winship 2014-09-20 21:27:45 UTC

Comment on attachment 286600 [details] [review]
gtask: Fix cleanup on an error path in g_task_thread_complete()

The "Ignore errors from g_thread_pool_push()" patch removes the code path that would result in the signal handler getting leaked, so this is now unnecessary. (And it makes g_task_thread_complete() kind of a mess, so I'm happier without it.)

Comment 25 Philip Withnall 2014-09-23 07:09:24 UTC

Sorted. Thanks for the reviews. :-)

Attachment 286598 [details] pushed as c6838ff - gtask: Fix a signed/unsigned integer comparison
Attachment 286599 [details] pushed as 925913d - gtask: Document signal handler reference counting
Attachment 286601 [details] pushed as f41ebeb - gtask: Ignore errors from g_thread_pool_push()