GNOME Bugzilla – Bug 506457
e-d-s 2.21.4 crash in e_cal_backend_finalize(): "priv->clients == NULL"
Last modified: 2013-09-14 16:50:01 UTC
Original Ubuntu bug: https://bugs.edge.launchpad.net/ubuntu/+source/evolution-data-server/+bug/179132. We are getting some of these crashes, and I would expect more as more test users move to 2.21.4.
Trace 183555: Thread 1 (process 7813)
Good trace. Confirming. We don't have a large number of duplicates (yet), hence I set severity blocker -> critical.
No prob, Christian. Just for the sake of clarity, we have had 5 such SIGABRTs on Launchpad so far, on Hardy alpha (which means not many people are running it) -- and this is why I set it as blocker.
there haven't been any changes to e-cal-backend.c since may 2007, so i expect the latest change (fix for bug 494314) in e-data-cal-factory.c to be a possible reason. CC'ing milan and ondřej on this report, because they worked on bug 494314. especially the line "uri_type_string = 0x64d4e0 "contacts:///:4"" looks suspicious to me.
"contacts://" is an URI for the 'Birthdays & Anniversaries' calendar backend. This bug is failed assertion in e_cal_backend_finalize. Birthdays backend obejct is unreffed while the priv->clients list is not NULL. My patch in above mentioned bug 494314 probably uncovered locking problem that is mentioned in the e_cal_backend_remove_client function in e-cal-backend.c:414. I see two possible fixes: 1) fix locking in e_cal_backend_remove_client() 2) revert patch from bug 494314 -> this will ensure that backends are leaked again and thus e_cal_backend_finalize() is never called I'll post a patch to fix locking in e_cal_backend_remove_client() soon and we'll see if it fixes the bug.
Created attachment 101875 [details] [review] possible fix for the bug

I'm sorry I can't test the fix right now. It seems obvious, but I'm not 100% sure. It is modelled after a similar function in the addressbook code.
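(The attachment itself is not inlined in this report. As a rough idea of the direction only, a fix for option 1 above could keep the list manipulation under clients_mutex and call the last_client_gone helper once the list is empty. The field and helper names below are taken from this report, the actual attachment 101875 may look different, and this sketch deliberately releases the mutex before the notification -- a point that comes up again further down.)

void
e_cal_backend_remove_client (ECalBackend *backend, EDataCal *cal)
{
	ECalBackendPrivate *priv = backend->priv;
	gboolean last_client;

	g_return_if_fail (E_IS_CAL_BACKEND (backend));

	/* Manipulate the client list only while holding clients_mutex. */
	g_mutex_lock (priv->clients_mutex);
	priv->clients = g_list_remove (priv->clients, cal);
	last_client = (priv->clients == NULL);
	g_mutex_unlock (priv->clients_mutex);

	/* Notify outside the lock, so listeners may call back into the
	 * backend without deadlocking. */
	if (last_client)
		last_client_gone (backend);
}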
I guess I refreshed the wrong page and put this bug back into severity blocker. My fault, and sorry, folks (Christian, I did not mean to look like I was overriding you). Back to critical.
To Ondrej: From my point of view, you should not hold clients_mutex during the call to the last_client_gone function; that is not what the mutex is meant for. Also, isn't it possible that there is more than one Birthdays & Anniversaries calendar, which causes the trouble?
To hggdh: Do you have any exact steps to reproduce this, please? Maybe the key from /apps/evolution/calendar/sources would help here?
@mcrha: No, I do not have steps for this, and it has not happened again so far (or, if it did, it left no core). Per some of the duplicates, this seems to happen on Evolution startup, perhaps after a login. Here are all of my entries in /apps/evolution/calendar/sources; the only entry I really added was Weather, the others were created automagically by Evolution. Also, I never entered any birthday data in the calendar:

<?xml version="1.0"?>
<group uid="1176403146.15515.0@xango" name="CalDAV" base_uri="caldav://" readonly="no"/>

<?xml version="1.0"?>
<group uid="1176403147.15515.1@xango" name="On This Computer" base_uri="file:///home/hggdh/.evolution/calendar/local" readonly="no"><source uid="1176403147.15515.2@xango" name="Personal" relative_uri="system" color_spec="#becedd"><properties><property name="alarm" value="true"/><property name="conflict" value="true"/></properties></source></group>

<?xml version="1.0"?>
<group uid="1176403147.15515.3@xango" name="On The Web" base_uri="webcal://" readonly="no"/>

<?xml version="1.0"?>
<group uid="1176403147.15515.4@xango" name="Contacts" base_uri="contacts://" readonly="no"><properties><property name="create_source" value="no"/></properties><source uid="1176403147.15515.5@xango" name="Birthdays & Anniversaries" relative_uri="/" color_spec="#DDBECE"><properties><property name="alarm" value="true"/><property name="delete" value="no"/><property name="conflict" value="true"/></properties></source></group>

<?xml version="1.0"?>
<group uid="1176403147.15515.6@xango" name="Weather" base_uri="weather://" readonly="no"><source uid="1188933766.2708.0@xango2" name="Weather" relative_uri="ccf/DAL/Dallas" color_spec="#F0B8B7"><properties><property name="alarm" value="true"/><property name="refresh" value="15"/><property name="conflict" value="true"/><property name="units" value="imperial"/><property name="offline_sync" value="1"/></properties></source></group>

<?xml version="1.0"?>
<group uid="1196873711.20662.0@xango2" name="Google" base_uri="Google://" readonly="no"/>
If it helps any... I have been able to reproduce it as follows:
1. log in to GNOME
2. start Evolution
3. here I went on working as usual for a while; I do not know whether this step is necessary
4. close Evolution (File/Quit)
5. log out of GNOME
6. log in to GNOME again

e-d-s crashes. It is worth noting that e-d-s was left running after logout. I repeated this 3 times and got 3 different crashes. I have not yet tried to reboot & repeat. YMMV.
(In reply to comment #7)
> To Ondrej: From my point of view, you should not hold the clients_mutex in the
> time of call to last_client_gone function. It's not the meaning of the mutex.
> Also, isn't is possible that there are more than one Birthdays&Anniversaries
> calendars, which causes the trouble?
> To hggdh: Do you have any exact steps how to reproduce this, please? Maybe a
> key from /apps/evolution/calendar/sources will help here?

I don't think it is specific to the 'Birthdays & Anniversaries' backend, because the failed assertion is in the base backend class (ECalBackend). I see what you are talking about. The problem is in e-data-cal-factory.c: there might be a race between a new ECal client connecting to a backend that is being removed because the previous client disconnected. The right solution is to introduce locking for the factory->priv->backends hash table. The race:

1) thread1: client disconnects, the "last client gone" callback is called
2) thread1: the ->backends hash table is looked up for the backend the client was connected to
3) thread2: another client connects
4) thread2: looks up the existing backend in the ->backends hash table (successfully)
5) thread2: adds itself to the ->clients list in the backend
6) thread1: the backend is removed from the ->backends hash table and finalize is called
7) thread1: the assertion fails, because the ->clients list is not empty

I might not be catching the race myself, because I have some debug g_print() calls -- which take a mutex -- in various places that are not present upstream. This might break the timing.
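In code, steps 2-6 boil down to an unguarded lookup-or-create on the ->backends hash table. A minimal sketch of that shape (the getCal() and priv->backends names come from this report, create_backend_for_uri() is a hypothetical helper, and the real factory code is more involved):

/* Illustrative only -- the unsynchronized pattern described above. */
backend = g_hash_table_lookup (factory->priv->backends, uri_type_string);
if (!backend) {
	/* hypothetical helper standing in for the real construction code */
	backend = create_backend_for_uri (factory, uri_type_string);
	g_hash_table_insert (factory->priv->backends,
	                     g_strdup (uri_type_string), backend);
}
/* Between the lookup above and attaching the new client here, the
 * "last client gone" path in another thread may remove the same key
 * and drop the last reference, so the client ends up on a backend
 * that is about to be finalized. */
e_cal_backend_add_client (backend, cal);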
Created attachment 102022 [details] race test client

Attached is a program that might trigger the race. It does not trigger it for me, though.
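(The attached program is not inlined here. A test along these lines would hammer the factory from two threads with repeated open/close cycles; the sketch below uses the public libecal calls e_cal_new_from_uri()/e_cal_open() and a "contacts:///" URI purely as an example -- it is my reconstruction, not the actual attachment 102022.)

#include <libecal/e-cal.h>

/* Repeatedly open and drop a calendar client, so that the factory's
 * "last client gone" teardown can overlap with a fresh connection
 * made by the other thread. */
static gpointer
hammer (gpointer data)
{
	const gchar *uri = data;
	gint i;

	for (i = 0; i < 1000; i++) {
		GError *error = NULL;
		ECal *cal = e_cal_new_from_uri (uri, E_CAL_SOURCE_TYPE_EVENT);

		if (cal) {
			e_cal_open (cal, FALSE, &error);
			g_object_unref (cal);
		}
		g_clear_error (&error);
	}

	return NULL;
}

int
main (void)
{
	GThread *t1, *t2;

	g_type_init ();
	if (!g_thread_supported ())
		g_thread_init (NULL);

	t1 = g_thread_create (hammer, (gpointer) "contacts:///", TRUE, NULL);
	t2 = g_thread_create (hammer, (gpointer) "contacts:///", TRUE, NULL);
	g_thread_join (t1);
	g_thread_join (t2);

	return 0;
}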
I guess you're right, Ondrej. From the code, a lock on factory->priv->backends should help, because what happened here is that e-d-s tried to insert the same backend twice, due to the thread race condition, as you said. Will you create a patch, please? hggdh, are you able to test such a patch? Or at least check whether it's harder to reproduce the crash with it?
I will probably be able to test it, yes.
Created attachment 102054 [details] [review] possible fix for the race

OK, here is the patch. Hope it helps.
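(The patch is only attached, not inlined. Its general direction, per the discussion above, is to serialize access to factory->priv->backends. A minimal sketch, assuming a hypothetical backends_mutex field and reusing the names from the earlier sketch; the actual patch may structure this differently:)

/* In the getCal() path: look up or create the backend and attach the
 * new client while holding the lock, so the removal path cannot
 * interleave.  backends_mutex is a hypothetical field name. */
g_mutex_lock (factory->priv->backends_mutex);

backend = g_hash_table_lookup (factory->priv->backends, uri_type_string);
if (!backend) {
	backend = create_backend_for_uri (factory, uri_type_string);
	g_hash_table_insert (factory->priv->backends,
	                     g_strdup (uri_type_string), backend);
}
e_cal_backend_add_client (backend, cal);

g_mutex_unlock (factory->priv->backends_mutex);

/* ...and symmetrically in the "last client gone" handler: */
g_mutex_lock (factory->priv->backends_mutex);
g_hash_table_remove (factory->priv->backends, uri_type_string);
g_mutex_unlock (factory->priv->backends_mutex);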
I applied the patch against e-d-s 2.21.4-0ubuntu2; it applied cleanly, and the packages rebuilt cleanly. I then stopped Evolution via evolution --shutdown and restarted it. I then tried the process I noted earlier (close Evo, logout, login) some 5 times so far, with no crashes. Obviously, since this is a race, I might not have hit the perfect scenario. But so far, so good.
hggdh: Thank you for testing the patch. I'm glad it works. My speculation is that my other patch, for bug 494314, affected the timing of code execution in the getCal() function and thus made the race more probable on some systems. I'll post a patch containing a ChangeLog entry for application to SVN trunk.
Created attachment 102067 [details] [review] changelog entry patch
hggdh: thanks for testing it. Ondrej: do you really want to hold that mutex even when emitting a signal? You know, it's similar to your previous patch: at the time of the signal emission the value of backends does not matter, so the mutex can be unlocked. If some application re-establishes the backend when the last one is gone, it will starve on the lock. A strange scenario, I agree, but keep the lock only as long as really necessary. What do you think?
I'm sorry, I don't understand the situation you are talking about. If you can break it down into steps like in comment #10, it may become clearer.
Sure. I see in your patch that you hold the mutex even when emitting the LAST_CALENDAR_GONE signal. If an application is connected to this signal and needs to reopen at least one backend, it will starve on locking, because the factory still holds the lock. Like:
a) closing the last backend, sending the signal, still holding the backends mutex
b) the app receives this signal and needs to reopen a backend
c) but the data cal factory will wait for the backends mutex to be unlocked, which will not happen

I know, a somewhat contrived scenario, but the only thing I want to say here is that the lock should be held only as long as really necessary, and because signal handlers can do "anything", I would be happier if mutexes were unlocked during signal invocations, just to prevent locking issues.
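In code, the pattern being asked for is simply to drop the mutex before the emission. A minimal sketch with illustrative names (the real handler in e-data-cal-factory.c is structured differently):

/* Remove the backend and compute the "was it the last one?" answer
 * under the lock, but emit the signal only after unlocking, so a
 * handler that re-enters the factory cannot deadlock on the same
 * mutex.  Names are illustrative. */
gboolean last_calendar;

g_mutex_lock (priv->backends_mutex);
g_hash_table_remove (priv->backends, uri_type_string);
last_calendar = (g_hash_table_size (priv->backends) == 0);
g_mutex_unlock (priv->backends_mutex);

if (last_calendar)
	g_signal_emit (factory, signals[LAST_CALENDAR_GONE], 0);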
Created attachment 102102 [details] [review] move mutex higher

I see. I think this will never happen, but anyway. This patch should be applied after all the others.
Created attachment 102105 [details] [review] merged and a bit extended patch for evolution-data-server

OK, I merged those three patches into one and, because the mutex should guard access to the backends variable, I extended it a bit. Ondrej, can you review this patch, please? If you approve, we can commit it to trunk.
Looks fine. Please apply. Thanks!
Committed to trunk. Committed revision 8330.
*** Bug 452795 has been marked as a duplicate of this bug. ***
*** Bug 449430 has been marked as a duplicate of this bug. ***