GNOME Bugzilla – Bug 503574
huge thread usage ...
Last modified: 2013-09-14 16:50:13 UTC
For me, e-d-s often appears to grow in memory use over time, though very intermittently, until it turns into some vast giant of a process :-) Recently I had the beasty deadlock on me, and I was surprised to see "Thread 195" in the gdb trace (attached). Is it possible that our memory usage problems could be caused by a runaway / unrestrained thread-pool problem? Almost all of the threads were at:
+ Trace 181855
which looks odd: reading the code, there is apparently a *lot* of spawning of this thread going on - sometimes it's spawned after a 60 second timeout too (why?). The logic here is unclear to me, but it seems that if get_deltas doesn't return inside a minute (the default) then we go on endlessly spawning these threads :-) - which is not cool: all of them block on the static 'connecting' mutex, i.e. what we see. It seems clear that we shouldn't be launching duplicate threads to do the same work on a timeout here. The quick fix is to have a [ mutex protected ] list of threads already doing work here, and not dispatch duplicate threads - a sketch of that idea is below. Of course, it's possible we also want some cancellation for very-long-running (failed) get_deltas threads too - not sure. HTH.
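To make that concrete, a minimal sketch of the guard - purely illustrative; deltas_lock, deltas_running and the function names here are invented, not the actual backend code, and it assumes the GLib thread system is already initialised:

#include <glib.h>

static GStaticMutex deltas_lock    = G_STATIC_MUTEX_INIT;
static gboolean     deltas_running = FALSE;

static gpointer
get_deltas_worker (gpointer data)
{
	/* ... fetch & apply deltas from the server as before ... */

	g_static_mutex_lock (&deltas_lock);
	deltas_running = FALSE;	/* allow the next fetch once we are done */
	g_static_mutex_unlock (&deltas_lock);
	return NULL;
}

static void
maybe_spawn_get_deltas (gpointer backend)
{
	g_static_mutex_lock (&deltas_lock);
	if (deltas_running) {
		/* a fetch is already in flight - don't pile up threads */
		g_static_mutex_unlock (&deltas_lock);
		return;
	}
	deltas_running = TRUE;
	g_static_mutex_unlock (&deltas_lock);

	g_thread_create (get_deltas_worker, backend, FALSE, NULL);
}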
Created attachment 100941 [details] masses of threads
Hmm.. were you resuming from a suspend? I know there were some issues like that. Sounds serious though. I have heard/seen it at times. Confirming though.
Target for 2.12.3 and 2.21.4/5
Not resuming from suspend, no - just left the laptop running overnight with the calendar open [ as I always do ]. Good news is: should be easy to fix.
Having said that, it looks like no thread is actually executing the get_deltas method, which is even more curious, since the lock being blocked on is of purely local scope ;-) A quick review shows ~all explicit 'return's unlock the lock - which is good [ albeit I'd prefer a 'goto exit;' style pattern to avoid cut/paste, errors & code bloat ]. Unless there is some strange memory corruption going on (quite possible given the FIXME comment there), that is: most odd.
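For the record, the 'goto exit;' style I mean is roughly this - a toy sketch only, with invented helper names, not the real get_deltas:

#include <glib.h>

/* hypothetical stand-ins for the real server / cache calls */
static gboolean fetch_server_changes     (gpointer backend) { return backend != NULL; }
static gboolean merge_changes_into_cache (gpointer backend) { return backend != NULL; }

static gboolean
get_deltas_sketch (gpointer backend)
{
	static GStaticMutex connecting = G_STATIC_MUTEX_INIT;
	gboolean success = FALSE;

	g_static_mutex_lock (&connecting);

	if (!fetch_server_changes (backend))
		goto exit;
	if (!merge_changes_into_cache (backend))
		goto exit;

	success = TRUE;

 exit:
	/* one unlock / cleanup point instead of one per early 'return' */
	g_static_mutex_unlock (&connecting);
	return success;
}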
Meeks, I had a chat with Chen and he is looking into it. Incidentally, he was working on the same area of code to implement refresh/fetch-deltas on a Send/Receive. He will be looking at both and it should be in for 2.21.4 probably. Nice to see that you are using trunk :)
Created attachment 102298 [details] [review] Patch from Chenthill
Committed to stable/trunk.
Getting into the habit of using trylock is really not a wonderful idea, though it may work here. The *root* question is - wtf. is going on here? :-) Why do we have this static local lock, what does it protect? There seem to be no 'static' local members for it to protect. Are we really just using it to exclude multiple threads from entering this method? If so, why?

Also, this still really concerns me:

	if (cache_keys) {
		/* FIXME this is a memory leak, but free'ing it causes crash in gslice */
		// g_slist_free (cache_keys);
	}

:-)
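Just to spell out the trylock concern - a toy illustration, not the backend code (names invented): a caller that loses the race simply skips the work rather than waiting for it, which is easy to end up relying on by accident later:

#include <glib.h>

static GStaticMutex connecting = G_STATIC_MUTEX_INIT;

static gboolean
refresh_with_trylock (void)
{
	if (!g_static_mutex_trylock (&connecting))
		return FALSE;	/* someone else holds it - we silently skip the refresh */

	/* ... fetch / apply deltas ... */

	g_static_mutex_unlock (&connecting);
	return TRUE;
}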
Meeks, right. Chen is on his vacation and this is his quick patch for the moment. I'm sure that he will be back to answer/fix it all. My fault - let's not close the bug; keep it open until it is fixed right.
Ah, I never knew the bug id for this particular issue to see the description. I was searching all through bugzilla.novell.com. The GroupWise server used to crash when get_deltas was called on multiple threads for the backends such as calendar, tasks and memos. That was a temporary fix made to avoid the server-side crash. I will remove the lock and test it with the latest server. I am obsoleting the fix as it would not solve the issue. I will attach an updated fix for solving the above mentioned issues.
chenthill - heh :-) of course - we still don't want to create more than 1 thread doing this at idle I guess.

Just a thought - why don't we create a thread that just loops doing this refresh? Of the form:

	while (TRUE) {
		if (!g_cond_timed_wait (exit_cond, dummy_mutex, <timeout-seconds>))
			return;
		... do refresh as before ...
	};

I forget if there are lots of starts / stops required for this one.
(In reply to comment #12)
> chenthill - heh :-) of course - we still don't want to create more than 1
> thread doing this at idle I guess.

We would require three threads for calendar, tasks and memos since they don't share the same memory.

> Just a thought - why don't we create a thread that just loops doing this
> refresh? Of the form:
>
> 	while (TRUE) {
> 		if (!g_cond_timed_wait (exit_cond, dummy_mutex, <timeout-seconds>))
> 			return;
> 		... do refresh as before ...
> 	};
>
> I forget if there are lots of starts / stops required for this one.

The above approach is a better one and I am writing a patch along those lines.
Created attachment 102973 [details] [review] Patch to run deltas in a single thread. Will be submitting another patch for canceling the server requests while going offline or when the backend goes down.
For the delta_thread, I would expect something more like:

static gpointer
delta_thread (gpointer data)
{
	ECalBackendGroupwise *cbgw = data;
	ECalBackendGroupwisePrivate *priv = cbgw->priv;
	GTimeVal timeout;

	timeout.tv_sec = 0;
	timeout.tv_usec = 0;

	while (TRUE) {
		gboolean succeeded = get_deltas (cbgw);

		g_mutex_lock (priv->dlock->mutex);

		if (!succeeded || priv->dlock->exit)
			break;

		g_get_current_time (&timeout);
		g_time_val_add (&timeout, CACHE_REFRESH_INTERVAL * 1000);
		g_cond_timed_wait (priv->dlock->cond, priv->dlock->mutex, &timeout);

		if (priv->dlock->exit)
			break;

		g_mutex_unlock (priv->dlock->mutex);
	}

	/* clear ->dthread while we still hold the lock */
	priv->dthread = NULL;
	g_mutex_unlock (priv->dlock->mutex);

	return NULL;
}

i.e. not accessing ->exit or ->dthread without the lock held. I'd also tend to expect fetch_deltas to take a lock before doing the thread creation. In refresh_calender and elsewhere - there is no need to take the lock before signalling the condition :-) HTH.
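To make the fetch_deltas point concrete, something along these lines - a sketch only; the dlock/dthread field names follow the patch above, the rest (function shape, joinable flag) is assumed:

static void
fetch_deltas (ECalBackendGroupwise *cbgw)
{
	ECalBackendGroupwisePrivate *priv = cbgw->priv;

	/* take the lock before checking / creating the thread, so two
	   callers can't race and spawn the delta thread twice */
	g_mutex_lock (priv->dlock->mutex);
	if (!priv->dthread)
		priv->dthread = g_thread_create (delta_thread, cbgw, TRUE, NULL);
	g_mutex_unlock (priv->dlock->mutex);
}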
Created attachment 103041 [details] [review] Updated patch with review comments incorporated.
Chen, commit it and apply it for addressbook also.
Patch has been committed.