GNOME Bugzilla – Bug 595480
Since adding MAPI, evolution-data-server crashes, 100% CPU useage, or OOM kills.
Last modified: 2009-11-23 05:51:47 UTC
Using 0.27.92 plus a few mcrha patches for message headers. Yesterday, my CPU useage maxed out for a long time due to EDS. Connecting to the EDS process yielded this interesting activity, every time I connected.
+ Trace 217645
I'd like to note that the trace parsing didn't realise I had two separate traces in there. Please expand Trace 217645 above - there is quite a bit more description of the bug in it.
strings ~/.evolution/addressbook/local/system/addressbook.db | grep 'EMAIL' | sort | uniq | wc -l yields 75 entries, all of which I recognise. So I'm not sure what's up with EDS now, and also where all those unrecognised addresses came from. I'm going to disable MAPI and retest.
With mapi disabled, EDS works perfectly, personal address list appears to be just fine. GAL also works perfectly if I switch to that (using OWA). Note that above I had *not* switched to GAL in the address book dialog. I do have duplicate addresses in my personal address book that appeared sometime after using mapi that I will have to clean up, but it does not seem serious.
Aha, I see I'm wrong, I thought the build_cache function is missing e_file_cache_freeze_changes/e_file_cache_thaw_changes, but I see it's there. So the CPU usage is something else. For the crash itself, two threads are trying to use the global_mapi_session pointer, which might result it this. I added locking there (attaching patch now).
Created attachment 143369 [details] [review] proposed ema patch (for a crash) for evolution-mapi; Please try with this patch. Should help with the crash itself.
Yep. Crash is fixed, address book works again, however periodically EDS still hits 100% CPU usage while sucking up every scrap of memory until OOM killed or --force-shutdown
Could you try few snapshots (backtraces) while eds is in stuck, please? When it happens something like: $ gdb --batch --ex "t a a bt" -pid=PID &>/tmp/bt.log where PID is a process ID of running (and CPU chewing) evolution-data-server? Say around 3 snapshots in different time? Maybe I overlooked something. Also, do you know how many contacts are stored in your GAL? We can try to figure out later, probably.
The first trace that I posted (the one I had posted in IRC) was from repeatedly connecting to the EDS process and doing a backtrace. Each time, the same trace was displayed, the polling loop and this e_book_backend_cache_add_contact thing. As for the number of people in the GAL. No idea. Thousands? Maybe tens of thousands?
FWIW, I tried your snapshot suggestion. for i in `seq 10`;do gdb --batch --ex "t a a bt" -pid=$(ps auwx | grep evolution-data-server | grep -v grep | awk '{print $2}') &>/tmp/eds${i}.log;sleep 2;done I fired this off the instant my CPU, network and memory started skyrocketing. It wasn't too hard to trigger, I just started a new mail and fiddled w/ autocomplete of addresses, personal vs GAL address lists and such until it happened. (it happens even without my doing this unfortunately, so MAPI is fairly unreliable right now). I did --force-shutdown after, to beat the OOM killer to the punch. All 10 traces were identical. Just looked like polling stuff. Unlike the time I connected and suspended, where every single time I did that and resumed it was in e_book_backend_cache_add_contact. Also, the "9644": No such file or directory seems odd. Pretty sure my fetching a pid above is reasonable, and I wouldn'tve had much time to act after it went wild. 9644: No such file or directory. [Thread debugging using libthread_db enabled] [New Thread 0xb66dcb70 (LWP 8263)] [New Thread 0xb7f20b70 (LWP 8224)] 0x00fb7422 in __kernel_vsyscall ()
+ Trace 217669
EDS just started spazzing out again. This time no network activity and memory usage was relatively stable. 10678 nemo 20 0 972m 864m 9268 S 87.1 43.0 3:21.20 evolution-data- The traces were also more varied, and they included similar content as my first trace. This full trace was the first one. I'll go over the others to see if they were doing anything at all different. [Thread debugging using libthread_db enabled] [New Thread 0xb771eb70 (LWP 10963)] [New Thread 0xb6f1db70 (LWP 10895)] [New Thread 0xb2cfeb70 (LWP 10809)] [New Thread 0xb04f9b70 (LWP 10808)] [New Thread 0xb0cfab70 (LWP 10807)] [New Thread 0xb3ef0b70 (LWP 10803)] [New Thread 0xb5f1bb70 (LWP 10719)] [New Thread 0xb7f60b70 (LWP 10679)] 0x00404422 in __kernel_vsyscall ()
+ Trace 217674
Thread 3 (Thread 0xb6f1db70 (LWP 10895))
Thread 2 (Thread 0xb771eb70 (LWP 10963))
Most of the other traces were indeed essentially the same apart with fetching other entries from our huge GAL. A couple of the differences: One thread on one of the logs was in:
+ Trace 217675
All the rest were essentially the same with minor address differences.
Thanks for the update. I guess the comment #10 has a clue. Two threads are trying to build a local cache of GAL contacts simultaneously. If one thread finished the operation then in unlocks the cache (the EFileCache doesn't use "recursive" locking), and any other operation is on an unfrozen cache, thus makes all the direct saving to file. Bharath, you wrote this part recently, what is your opinion on it?
Comment on attachment 143369 [details] [review] proposed ema patch (for a crash) Created commit 056b109 in ema master (0.29.1+) Created commit 31252e2 in ema gnome-2-28 (0.28.1+) The part of the build_cache thread still applies.
Thanks Milan. I'll take a look at the thread issues.
Created attachment 145274 [details] [review] proposed ema patch (partial) for evolution-mapi; This is an interim partial solution to the CPU usage issue, but the whole cache system in ema seems to need a little rewrite. Note the GAL will not work probably, as this is just for not-to-chew-my-CPU hot-fix. Please give it a try, maybe together with a GAL testing. Thanks in advance.
Comment on attachment 145274 [details] [review] proposed ema patch (partial) I feel the build_thread needs to be locked in authenticate_user. isn it?
whole cache in gal or in ema ? would be good if you put more specifically..
(In reply to comment #16) > I feel the build_thread needs to be locked in authenticate_user. isn it? What do you mean with "locked"? It is created only in the authenticate user function, and the authenticate user function ignores re-creating of the thread when there is some already. (In reply to comment #17) > whole cache in gal or in ema ? would be good if you put more specifically.. I'm not sure what exactly I meant. Might be for ema's GAL only. By the way, the book_view_thread function is not using the cache at all, it's always fetching all contacts from GAL. The proper way might be to use cache or search for them, than reading whole GAL.
see bug 598769 also
authenticate_user can be called by multiple clients/threads at the same time isn it? Bharath will be looking into the second part..
(In reply to comment #20) > authenticate_user can be called by multiple clients/threads at the same time > isn it? There should be always one backend per server/uri, thus technically this shouldn't happen, as long as the build_cache thread will be stopped properly. What I guess happened to Nemo is that the build thread was kept running while one backend stopped, and the second started to update the cache, but the first build_cache thread just finished, thus unlocked the cache nd then the mess began. (I hope it makes sense.)
Committed to master and gnome-2-28 branch http://bit.ly/IRYbT I will look at delta fetching for GAL in libmapi. But with this fix and Milan's cache locking, things should be better.
Comment on attachment 145274 [details] [review] proposed ema patch (partial) Committed to master a05fc8481bfd48c941efb19a3a592bc30f297577 Committed to gnome-2-28 branch http://git.gnome.org/cgit/evolution-mapi/commit/?h=gnome-2-28&id=598f318109f44b8649269c100f51f98fc554d3c4
Closing the report