Bug 503574 - huge thread usage ...
Status: RESOLVED FIXED
Product: evolution-data-server
Classification: Platform
Component: general
Version: 2.22.x (obsolete)
Hardware: Other Linux
Importance: High critical
Target Milestone: ---
Assigned To: Chenthill P
QA Contact: Evolution QA team
Depends on:
Blocks:
Reported: 2007-12-14 11:03 UTC by Michael Meeks
Modified: 2013-09-14 16:50 UTC
See Also:
GNOME target: ---
GNOME version: ---


Attachments
masses of threads (132.04 KB, text/plain) - 2007-12-14 11:03 UTC, Michael Meeks
Patch from Chenthill (544 bytes, patch) - 2008-01-07 05:33 UTC, Srinivasa Ragavan
Patch to run deltas in a single thread. (6.62 KB, patch) - 2008-01-16 05:47 UTC, Chenthill P
updated patch with review comments incorporated. (6.77 KB, patch) - 2008-01-17 04:44 UTC, Chenthill P (committed)

Description Michael Meeks 2007-12-14 11:03:23 UTC
For me, often e-d-s appears to grow wrt. memory use over time, though very intermittently, until it turns into some vast giant of a process :-)

Recently I had the beasty deadlock on me, and I was surprised to see:
"Thread 195" in the gdb trace (attached).

Is it possible that our memory usage problems could be caused by a runaway / unrestrained thread-pool problem? Almost all of the threads were at:


which looks odd: reading the code - there is apparently a *lot* of spawning of this thread going on - sometimes it's spawned after a 60-second timeout too (why?).

The logic here is unclear to me, but it seems that if get_deltas doesn't return inside a minute (the default) then we go on endlessly spawning these threads :-) - which is not cool: all of them blocking on the static 'connecting' mutex: ie. what we see.

It seems clear that we shouldn't be launching duplicate threads to do the same work on a timeout here. The quick fix is to have a [ mutex protected ] list of threads already doing work here, and not dispatch duplicate threads. Of course - it's possible we also want some cancellation for very-long-running (failed) get_deltas threads too - not sure.

HTH.
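
A minimal sketch of the guard described above - the names maybe_dispatch_get_deltas, get_deltas_worker and get_deltas_running are hypothetical, and it uses the current GLib threading API (g_thread_new) rather than what e-d-s used at the time:

#include <glib.h>

static GMutex   dispatch_lock;          /* protects get_deltas_running */
static gboolean get_deltas_running = FALSE;

static gpointer
get_deltas_worker (gpointer data)
{
	/* ... the (possibly slow) get_deltas work goes here ... */

	g_mutex_lock (&dispatch_lock);
	get_deltas_running = FALSE;	/* allow the next dispatch */
	g_mutex_unlock (&dispatch_lock);
	return NULL;
}

/* Called from the periodic timeout: only spawn a worker if none is running. */
static gboolean
maybe_dispatch_get_deltas (gpointer data)
{
	g_mutex_lock (&dispatch_lock);
	if (!get_deltas_running) {
		get_deltas_running = TRUE;
		g_thread_unref (g_thread_new ("get-deltas", get_deltas_worker, data));
	}
	g_mutex_unlock (&dispatch_lock);
	return TRUE;			/* keep the timeout source installed */
}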
Comment 1 Michael Meeks 2007-12-14 11:03:59 UTC
Created attachment 100941 [details]
masses of threads
Comment 2 Srinivasa Ragavan 2007-12-14 11:11:30 UTC
Hmm.. were you resuming from a suspend? I know there were some issues like that.
 Sounds serious though. I have heard/seen it at times. Confirming though.
Comment 3 Srinivasa Ragavan 2007-12-14 11:12:03 UTC
Target for 2.12.3 and 2.21.4/5
Comment 4 Michael Meeks 2007-12-14 11:13:46 UTC
Not resuming from suspend no, just left the laptop running overnight with the calendar open [ as I always do ]. Good news is: should be easy to fix.

Comment 5 Michael Meeks 2007-12-14 11:27:01 UTC
Having said that, it looks like no thread is actually executing the get_deltas method which is even more curious: since the lock being blocked on is of purely local scope ;-)

a quick review shows ~all explicit 'return's to unlock the lock - which is good [ albeit I'd prefer a 'goto exit;' style pattern to avoid cut/paste, errors & code bloat ].

Unless there is some strange memory corruption going on (quite possible given the FIXME comment there), that is most odd.
Comment 6 Srinivasa Ragavan 2007-12-14 12:46:27 UTC
Meeks, I had a chat with Chen and he is looking into it. Incidentally he was working on the same area of code, implementing refresh/fetch-deltas to be done on a Send/Receive. He will be looking at both and it should be in for 2.21.4, probably.

Nice to see that you are using trunk :)
Comment 7 Srinivasa Ragavan 2008-01-07 05:33:55 UTC
Created attachment 102298 [details] [review]
Patch from Chenthill
Comment 8 Srinivasa Ragavan 2008-01-07 05:39:21 UTC
Committed to stable/trunk.
Comment 9 Michael Meeks 2008-01-07 11:45:05 UTC
Getting into the habit of using trylock is really not a wonderful idea, though it may work here.

The *root* question is - wtf. is going on here ? :-) why do we have this static local lock, what does it protect ? there seem to be no 'static' local members for it to protect.

Are we really just using it to exclude multiple threads from entering this method ? if so, why ?

Also, this still really concerns me:

	if (cache_keys) {
		/*FIXME this is a memory leak, but free'ing it causes crash in gslice */
//		g_slist_free (cache_keys);
	}

:-)
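
The trylock pattern under discussion looks roughly like this - a sketch only, not the committed patch, with the lock name borrowed from the static 'connecting' mutex mentioned in the description:

static GMutex connecting;	/* the static, function-local style lock being questioned */

static gboolean
get_deltas (gpointer data)
{
	/* A second caller skips the work instead of queueing up behind the lock. */
	if (!g_mutex_trylock (&connecting))
		return FALSE;

	/* ... fetch deltas from the server, update the cache ... */

	g_mutex_unlock (&connecting);
	return TRUE;
}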
Comment 10 Srinivasa Ragavan 2008-01-07 13:41:20 UTC
Meeks right. Chen is on his vacation and this is his quick patch for the moment. I'm sure that he will be back to answer/fix it all.

My fault - let's not close the bug; keep it open to fix it right.
Comment 11 Chenthill P 2008-01-09 10:59:36 UTC
Ah I never knew the bug id for this particular issue to see the description. I was searching all through bugzilla.novell.com. The GroupWise server used to crash when get_deltas was called on multiple threads for backends such as calendar, tasks and memos. That was a temporary fix made to avoid the server-side crash. I will remove the lock and test it with the latest server.

Am obsoleting the fix as it would not solve the issue. I will attach an updated fix for solving the above mentioned issues.
Comment 12 Michael Meeks 2008-01-09 11:09:05 UTC
chenthill - heh :-) of course - we still don't want to create more than 1 thread doing this at idle I guess.

Just a thought - why don't we create a thread that just loops doing this refresh ? of the form:

while (TRUE) {
    if (!g_cond_timed_wait (exit_cond, dummy_mutex, <timeout-seconds>))
        return;
    ... do refresh as before ...
};

I forget if there are lots of starts / stops required for this one.
Comment 13 Chenthill P 2008-01-10 10:04:20 UTC
(In reply to comment #12)
> chenthill - heh :-) of course - we still don't want to create more than 1
> thread doing this at idle I guess.
We would require three threads for calendar, tasks and memos since they don't share the same memory.

> 
> Just a thought - why don't we create a thread that just loops doing this
> refresh ? of the form:
> 
> while (TRUE) {
>     if (!g_cond_timed_wait (exit_cond, dummy_mutex, <timeout-seconds>))
>         return;
>     ... do refresh as before ...
> };
> 
> I forget if there are lots of starts / stops required for this one.
The above approach is a better one and I am writing a patch along those lines.

Comment 14 Chenthill P 2008-01-16 05:47:05 UTC
Created attachment 102973 [details] [review]
Patch to run deltas in a single thread. 

Will be submitting another patch for canceling the server requests while going offline or when the backend goes down.
Comment 15 Michael Meeks 2008-01-16 12:33:31 UTC
For the delta_thread, I would expect something more like:

delta_thread (gpointer data)
{
	ECalBackendGroupwise *cbgw = data;
	ECalBackendGroupwisePrivate *priv = cbgw->priv;
	GTimeVal timeout;

	timeout.tv_sec = 0;
	timeout.tv_usec = 0;

	while (TRUE) {
	  gboolean succeeded = get_deltas (cbgw);

	  g_mutex_lock (priv->dlock->mutex);
	  if (!succeeded || priv->dlock->exit)
		break;
		
	  g_get_current_time (&timeout);
	  g_time_val_add (&timeout, CACHE_REFRESH_INTERVAL * 1000);
	  g_cond_timed_wait (priv->dlock->cond, priv->dlock->mutex, &timeout);
		
	  if (priv->dlock->exit)
		break;

	  g_mutex_unlock (priv->dlock->mutex);
	}
	g_mutex_unlock (priv->dlock->mutex);
	priv->dthread = NULL;
	return NULL;
}

ie. not accessing ->exit or ->dthread without the lock held. I'd also tend to expect fetch_deltas to take a lock before doing the thread creation. In refresh_calender and elsewhere - there is no need to take the lock before signalling the condition :-)

HTH.
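
The shutdown side of that sketch might look like the following - again only an illustration (a hypothetical stop_delta_thread), reusing the dlock fields from the code above; the exit flag is set under the lock and the condition signalled afterwards, matching the note that the lock need not be held while signalling:

static void
stop_delta_thread (ECalBackendGroupwise *cbgw)
{
	ECalBackendGroupwisePrivate *priv = cbgw->priv;

	g_mutex_lock (priv->dlock->mutex);
	priv->dlock->exit = TRUE;		/* ask delta_thread to finish */
	g_mutex_unlock (priv->dlock->mutex);

	g_cond_signal (priv->dlock->cond);	/* wake it from the timed wait */
}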
Comment 16 Chenthill P 2008-01-17 04:44:07 UTC
Created attachment 103041 [details] [review]
updated patch with review comments incorporated.
Comment 17 Srinivasa Ragavan 2008-01-24 04:27:30 UTC
Chen, commit it and apply it for addressbook also. 
Comment 18 Chenthill P 2008-01-28 08:10:25 UTC
Patch has been committed.