Bug 503574 - huge thread usage ...
Status: RESOLVED FIXED
Product: evolution-data-server
Classification: Platform
Component: general
Version: 2.22.x (obsolete)
Hardware: Other Linux
Importance: High critical
Target Milestone: ---
Assigned To: Chenthill P
QA Contact: Evolution QA team
Depends on:
Blocks:
Reported: 2007-12-14 11:03 UTC by Michael Meeks
Modified: 2013-09-14 16:50 UTC
See Also:
GNOME target: ---
GNOME version: ---


Attachments
masses of threads (132.04 KB, text/plain) - 2007-12-14 11:03 UTC, Michael Meeks
Patch from Chenthill (544 bytes, patch) - 2008-01-07 05:33 UTC, Srinivasa Ragavan
Patch to run deltas in a single thread. (6.62 KB, patch) - 2008-01-16 05:47 UTC, Chenthill P
updated patch with review comments incorporated. (6.77 KB, patch) - 2008-01-17 04:44 UTC, Chenthill P (committed)

Description Michael Meeks 2007-12-14 11:03:23 UTC
For me, often e-d-s appears to grow wrt. memory use over time, though very intermittently, until it turns into some vast giant of a process :-)

Recently I had the beasty deadlock on me, and I was surprised to see:
"Thread 195" in the gdb trace (attached).

Is it possible that our memory usage problems could be caused by a runaway / unrestrained thread-pool problem? Almost all of the threads were at:


which looks odd: reading the code - there is apparently a *lot* of spawning of this thread going on - sometimes it's spawned after a 60-second timeout too (why?).

The logic here is unclear to me, but it seems that if get_deltas doesn't return inside a minute (the default) then we go on endlessly spawning these threads :-) - which is not cool: all of them blocking on the static 'connecting' mutex: ie. what we see.

It seems clear that we shouldn't be launching duplicate threads to do the same work on a timeout here. The quick fix is to have a [ mutex protected ] list of threads already doing work here, and not dispatch duplicate threads. Of course - it's possible we also want some cancellation for very-long-running (failed) get_deltas threads too - not sure.

HTH.
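
A minimal sketch of the guard described above - the names maybe_dispatch_get_deltas, get_deltas_worker and get_deltas_running are hypothetical, and it uses the current GLib threading API (g_thread_new) rather than what e-d-s used at the time:

#include <glib.h>

static GMutex   dispatch_lock;          /* protects get_deltas_running */
static gboolean get_deltas_running = FALSE;

static gpointer
get_deltas_worker (gpointer data)
{
	/* ... the (possibly slow) get_deltas work goes here ... */

	g_mutex_lock (&dispatch_lock);
	get_deltas_running = FALSE;	/* allow the next dispatch */
	g_mutex_unlock (&dispatch_lock);
	return NULL;
}

/* Called from the periodic timeout: only spawn a worker if none is running. */
static gboolean
maybe_dispatch_get_deltas (gpointer data)
{
	g_mutex_lock (&dispatch_lock);
	if (!get_deltas_running) {
		get_deltas_running = TRUE;
		g_thread_unref (g_thread_new ("get-deltas", get_deltas_worker, data));
	}
	g_mutex_unlock (&dispatch_lock);
	return TRUE;			/* keep the timeout source installed */
}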
Comment 1 Michael Meeks 2007-12-14 11:03:59 UTC
Created attachment 100941 [details]
masses of threads
Comment 2 Srinivasa Ragavan 2007-12-14 11:11:30 UTC
Hmm.. were you resuming from a suspend? I know there were some issues like that.
 Sounds serious though. I have heard/seen it at times. Confirming though.
Comment 3 Srinivasa Ragavan 2007-12-14 11:12:03 UTC
Target for 2.12.3 and 2.21.4/5
Comment 4 Michael Meeks 2007-12-14 11:13:46 UTC
Not resuming from suspend no, just left the laptop running overnight with the calendar open [ as I always do ]. Good news is: should be easy to fix.

Comment 5 Michael Meeks 2007-12-14 11:27:01 UTC
Having said that, it looks like no thread is actually executing the get_deltas method which is even more curious: since the lock being blocked on is of purely local scope ;-)

a quick review shows ~all explicit 'return's to unlock the lock - which is good [ albeit I'd prefer a 'goto exit;' style pattern to avoid cut/paste, errors & code bloat ].

Unless there is some strange memory corruption going on (quite possible given the FIXME comment there), that is most odd.
Comment 6 Srinivasa Ragavan 2007-12-14 12:46:27 UTC
Meeks, I had a chat with Chen and he is looking into it. Incidentally he was working on the same area of code, implementing refresh/fetch-deltas to be done on a Send/Receive. He will be looking at both and it should be in for 2.21.4, probably.

Nice to see that you are using trunk :)
Comment 7 Srinivasa Ragavan 2008-01-07 05:33:55 UTC
Created attachment 102298 [details] [review]
Patch from Chenthill
Comment 8 Srinivasa Ragavan 2008-01-07 05:39:21 UTC
Committed to stable/trunk.
Comment 9 Michael Meeks 2008-01-07 11:45:05 UTC
Getting into the habit of using trylock is really not a wonderful idea, though it may work here.

The *root* question is - wtf. is going on here ? :-) why do we have this static local lock, what does it protect ? there seem to be no 'static' local members for it to protect.

Are we really just using it to exclude multiple threads from entering this method ? if so, why ?

Also, this still really concerns me:

	if (cache_keys) {
		/*FIXME this is a memory leak, but free'ing it causes crash in gslice */
//		g_slist_free (cache_keys);
	}

:-)
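
The trylock pattern under discussion looks roughly like this - a sketch only, not the committed patch, with the lock name borrowed from the static 'connecting' mutex mentioned in the description:

static GMutex connecting;	/* the static, function-local style lock being questioned */

static gboolean
get_deltas (gpointer data)
{
	/* A second caller skips the work instead of queueing up behind the lock. */
	if (!g_mutex_trylock (&connecting))
		return FALSE;

	/* ... fetch deltas from the server, update the cache ... */

	g_mutex_unlock (&connecting);
	return TRUE;
}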
Comment 10 Srinivasa Ragavan 2008-01-07 13:41:20 UTC
Meeks right. Chen is on his vacation and this is his quick patch for the moment. I'm sure that he will be back to answer/fix it all.

My fault - let's not close the bug; keep it open to fix it right.
Comment 11 Chenthill P 2008-01-09 10:59:36 UTC
Ah I never knew the bug id for this particular issue to see the description. I was searching all through bugzilla.novell.com. The GroupWise server used to crash when get_deltas was called on multiple threads for backends such as calendar, tasks and memos. That was a temporary fix made to avoid the server-side crash. I will remove the lock and test it with the latest server.

Am obsoleting the fix as it would not solve the issue. I will attach an updated fix for solving the above mentioned issues.
Comment 12 Michael Meeks 2008-01-09 11:09:05 UTC
chenthill - heh :-) of course - we still don't want to create more than 1 thread doing this at idle I guess.

Just a thought - why don't we create a thread that just loops doing this refresh ? of the form:

while (TRUE) {
    if (!g_cond_timed_wait (exit_cond, dummy_mutex, <timeout-seconds>))
        return;
    ... do refresh as before ...
};

I forget if there are lots of starts / stops required for this one.
Comment 13 Chenthill P 2008-01-10 10:04:20 UTC
(In reply to comment #12)
> chenthill - heh :-) of course - we still don't want to create more than 1
> thread doing this at idle I guess.
We would require three threads for calendar, tasks and memos since they don't share the same memory.

> 
> Just a thought - why don't we create a thread that just loops doing this
> refresh ? of the form:
> 
> while (TRUE) {
>     if (!g_cond_timed_wait (exit_cond, dummy_mutex, <timeout-seconds>))
>         return;
>     ... do refresh as before ...
> };
> 
> I forget if there are lots of starts / stops required for this one.
The above approach is a better one and I am writing a patch along those lines.

Comment 14 Chenthill P 2008-01-16 05:47:05 UTC
Created attachment 102973 [details] [review]
Patch to run deltas in a single thread. 

Will be submitting another patch for canceling the server requests while going offline or when the backend goes down.
Comment 15 Michael Meeks 2008-01-16 12:33:31 UTC
For the delta_thread, I would expect something more like:

delta_thread (gpointer data)
{
	ECalBackendGroupwise *cbgw = data;
	ECalBackendGroupwisePrivate *priv = cbgw->priv;
	GTimeVal timeout;

	timeout.tv_sec = 0;
	timeout.tv_usec = 0;

	while (TRUE) {
	  gboolean succeeded = get_deltas (cbgw);

	  g_mutex_lock (priv->dlock->mutex);
	  if (!succeeded || priv->dlock->exit)
		break;
		
	  g_get_current_time (&timeout);
	  g_time_val_add (&timeout, CACHE_REFRESH_INTERVAL * 1000);
	  g_cond_timed_wait (priv->dlock->cond, priv->dlock->mutex, &timeout);
		
	  if (priv->dlock->exit)
		break;

	  g_mutex_unlock (priv->dlock->mutex);
	}
	g_mutex_unlock (priv->dlock->mutex);
	priv->dthread = NULL;
	return NULL;
}

ie. not accessing ->exit or ->dthread without the lock held. I'd also tend to expect fetch_deltas to take a lock before doing the thread creation. In refresh_calender and elsewhere - there is no need to take the lock before signalling the condition :-)

HTH.
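
The shutdown side of that sketch might look like the following - again only an illustration (a hypothetical stop_delta_thread), reusing the dlock fields from the code above; the exit flag is set under the lock and the condition signalled afterwards, matching the note that the lock need not be held while signalling:

static void
stop_delta_thread (ECalBackendGroupwise *cbgw)
{
	ECalBackendGroupwisePrivate *priv = cbgw->priv;

	g_mutex_lock (priv->dlock->mutex);
	priv->dlock->exit = TRUE;		/* ask delta_thread to finish */
	g_mutex_unlock (priv->dlock->mutex);

	g_cond_signal (priv->dlock->cond);	/* wake it from the timed wait */
}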
Comment 16 Chenthill P 2008-01-17 04:44:07 UTC
Created attachment 103041 [details] [review]
updated patch with review comments incorporated.
Comment 17 Srinivasa Ragavan 2008-01-24 04:27:30 UTC
Chen, commit it and apply it for addressbook also. 
Comment 18 Chenthill P 2008-01-28 08:10:25 UTC
Patch has been committed.