GNOME Bugzilla – Bug 506457
e-d-s 2.21.4 crash in e_cal_backend_finalize(): "priv->clients == NULL"
Last modified: 2013-09-14 16:50:01 UTC
Original Ubuntu bug: https://bugs.edge.launchpad.net/ubuntu/+source/evolution-data-server/+bug/179132. We are getting some of these crashes, and I would expect more as more test users move to 2.21.4.
Trace 183555: Thread 1 (process 7813)
Good trace. Confirming. We don't have a large number of duplicates (yet), hence I set severity blocker -> critical.
No prob, Christian. Just for the sake of clarity, we have had 5 such SIGABRTs on Launchpad so far, on Hardy alpha (which means not many people are running it) -- and this is why I set it as blocker.
there haven't been any changes to e-cal-backend.c since may 2007, so i expect the latest change (fix for bug 494314) in e-data-cal-factory.c to be a possible reason. CC'ing milan and ondřej on this report, because they worked on bug 494314. especially the line "uri_type_string = 0x64d4e0 "contacts:///:4"" looks suspicious to me.
"contacts://" is an URI for the 'Birthdays & Anniversaries' calendar backend. This bug is failed assertion in e_cal_backend_finalize. Birthdays backend obejct is unreffed while the priv->clients list is not NULL. My patch in above mentioned bug 494314 probably uncovered locking problem that is mentioned in the e_cal_backend_remove_client function in e-cal-backend.c:414. I see two possible fixes: 1) fix locking in e_cal_backend_remove_client() 2) revert patch from bug 494314 -> this will ensure that backends are leaked again and thus e_cal_backend_finalize() is never called I'll post a patch to fix locking in e_cal_backend_remove_client() soon and we'll see if it fixes the bug.
Created attachment 101875 [details] [review] possible fix for the bug

I'm sorry I can't test the fix right now. It seems obvious, but I'm not 100% sure. It is modelled after a similar function in the addressbook code.
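(The attachment itself is not inlined in this report. As a rough idea of the direction only, a fix for option 1 above could keep the list manipulation under clients_mutex and call the last_client_gone helper once the list is empty. The field and helper names below are taken from this report, the actual attachment 101875 may look different, and this sketch deliberately releases the mutex before the notification -- a point that comes up again further down.)

void
e_cal_backend_remove_client (ECalBackend *backend, EDataCal *cal)
{
	ECalBackendPrivate *priv = backend->priv;
	gboolean last_client;

	g_return_if_fail (E_IS_CAL_BACKEND (backend));

	/* Manipulate the client list only while holding clients_mutex. */
	g_mutex_lock (priv->clients_mutex);
	priv->clients = g_list_remove (priv->clients, cal);
	last_client = (priv->clients == NULL);
	g_mutex_unlock (priv->clients_mutex);

	/* Notify outside the lock, so listeners may call back into the
	 * backend without deadlocking. */
	if (last_client)
		last_client_gone (backend);
}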
I guess I refreshed the wrong page and put this bug back into severity blocker. My fault, and sorry, folks (Christian, I did not mean to look like I was overriding you). Back to critical.
To Ondrej: From my point of view, you should not hold clients_mutex during the call to the last_client_gone function; that is not what the mutex is meant for. Also, isn't it possible that there is more than one Birthdays & Anniversaries calendar, which causes the trouble?
To hggdh: Do you have any exact steps to reproduce this, please? Maybe the key from /apps/evolution/calendar/sources would help here?
@mcrha: No, I do not have steps for this, and it has not happened again so far (or, if it did, it left no core). Per some of the duplicates, this seems to happen on Evolution startup, perhaps after a login. Here are all of my entries in /apps/evolution/calendar/sources; the only entry I really added was Weather, the others were created automagically by Evolution. Also, I never entered any birthday data in the calendar:

<?xml version="1.0"?>
<group uid="1176403146.15515.0@xango" name="CalDAV" base_uri="caldav://" readonly="no"/>

<?xml version="1.0"?>
<group uid="1176403147.15515.1@xango" name="On This Computer" base_uri="file:///home/hggdh/.evolution/calendar/local" readonly="no"><source uid="1176403147.15515.2@xango" name="Personal" relative_uri="system" color_spec="#becedd"><properties><property name="alarm" value="true"/><property name="conflict" value="true"/></properties></source></group>

<?xml version="1.0"?>
<group uid="1176403147.15515.3@xango" name="On The Web" base_uri="webcal://" readonly="no"/>

<?xml version="1.0"?>
<group uid="1176403147.15515.4@xango" name="Contacts" base_uri="contacts://" readonly="no"><properties><property name="create_source" value="no"/></properties><source uid="1176403147.15515.5@xango" name="Birthdays & Anniversaries" relative_uri="/" color_spec="#DDBECE"><properties><property name="alarm" value="true"/><property name="delete" value="no"/><property name="conflict" value="true"/></properties></source></group>

<?xml version="1.0"?>
<group uid="1176403147.15515.6@xango" name="Weather" base_uri="weather://" readonly="no"><source uid="1188933766.2708.0@xango2" name="Weather" relative_uri="ccf/DAL/Dallas" color_spec="#F0B8B7"><properties><property name="alarm" value="true"/><property name="refresh" value="15"/><property name="conflict" value="true"/><property name="units" value="imperial"/><property name="offline_sync" value="1"/></properties></source></group>

<?xml version="1.0"?>
<group uid="1196873711.20662.0@xango2" name="Google" base_uri="Google://" readonly="no"/>
If it helps any... I have been able to reproduce it as follows:
1. log in to GNOME
2. start Evolution
3. here I went on working as usual for a while; I do not know whether this step is necessary
4. close Evolution (File/Quit)
5. log out of GNOME
6. log in to GNOME again

e-d-s crashes. It is worth noting that e-d-s was left running after logout. I repeated this 3 times and got 3 different crashes. I have not yet tried to reboot & repeat. YMMV.
(In reply to comment #7)
> To Ondrej: From my point of view, you should not hold the clients_mutex in the
> time of call to last_client_gone function. It's not the meaning of the mutex.
> Also, isn't is possible that there are more than one Birthdays&Anniversaries
> calendars, which causes the trouble?
> To hggdh: Do you have any exact steps how to reproduce this, please? Maybe a
> key from /apps/evolution/calendar/sources will help here?

I don't think it is specific to the 'Birthdays & Anniversaries' backend, because the failed assertion is in the base backend class (ECalBackend). I see what you are talking about. The problem is in e-data-cal-factory.c: there might be a race between a new ECal client connecting to a backend that is being removed because the previous client disconnected. The right solution is to introduce locking for the factory->priv->backends hash table. The race:

1) thread1: client disconnects, the "last client gone" callback is called
2) thread1: the ->backends hash table is looked up for the backend the client was connected to
3) thread2: another client connects
4) thread2: looks up the existing backend in the ->backends hash table (successfully)
5) thread2: adds itself to the ->clients list in the backend
6) thread1: the backend is removed from the ->backends hash table and finalize is called
7) thread1: the assertion fails, because the ->clients list is not empty

I might not be catching the race myself, because I have some debug g_print() calls -- which take a mutex -- in various places that are not present upstream. This might break the timing.
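In code, steps 2-6 boil down to an unguarded lookup-or-create on the ->backends hash table. A minimal sketch of that shape (the getCal() and priv->backends names come from this report, create_backend_for_uri() is a hypothetical helper, and the real factory code is more involved):

/* Illustrative only -- the unsynchronized pattern described above. */
backend = g_hash_table_lookup (factory->priv->backends, uri_type_string);
if (!backend) {
	/* hypothetical helper standing in for the real construction code */
	backend = create_backend_for_uri (factory, uri_type_string);
	g_hash_table_insert (factory->priv->backends,
	                     g_strdup (uri_type_string), backend);
}
/* Between the lookup above and attaching the new client here, the
 * "last client gone" path in another thread may remove the same key
 * and drop the last reference, so the client ends up on a backend
 * that is about to be finalized. */
e_cal_backend_add_client (backend, cal);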
Created attachment 102022 [details] race test client

Attached is a program that might trigger the race. It does not trigger it for me, though.
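(The attached program is not inlined here. A test along these lines would hammer the factory from two threads with repeated open/close cycles; the sketch below uses the public libecal calls e_cal_new_from_uri()/e_cal_open() and a "contacts:///" URI purely as an example -- it is my reconstruction, not the actual attachment 102022.)

#include <libecal/e-cal.h>

/* Repeatedly open and drop a calendar client, so that the factory's
 * "last client gone" teardown can overlap with a fresh connection
 * made by the other thread. */
static gpointer
hammer (gpointer data)
{
	const gchar *uri = data;
	gint i;

	for (i = 0; i < 1000; i++) {
		GError *error = NULL;
		ECal *cal = e_cal_new_from_uri (uri, E_CAL_SOURCE_TYPE_EVENT);

		if (cal) {
			e_cal_open (cal, FALSE, &error);
			g_object_unref (cal);
		}
		g_clear_error (&error);
	}

	return NULL;
}

int
main (void)
{
	GThread *t1, *t2;

	g_type_init ();
	if (!g_thread_supported ())
		g_thread_init (NULL);

	t1 = g_thread_create (hammer, (gpointer) "contacts:///", TRUE, NULL);
	t2 = g_thread_create (hammer, (gpointer) "contacts:///", TRUE, NULL);
	g_thread_join (t1);
	g_thread_join (t2);

	return 0;
}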
I guess you're right, Ondrej. From the code, a lock on factory->priv->backends should help, because what happened here is that e-d-s tried to insert the same backend twice, due to the thread race condition, as you said. Will you create a patch, please? hggdh, are you able to test such a patch? Or at least check whether it's harder to reproduce the crash with it?
I will probably be able to test it, yes.
Created attachment 102054 [details] [review] possible fix for the race

OK, here is the patch. Hope it helps.
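(The patch is only attached, not inlined. Its general direction, per the discussion above, is to serialize access to factory->priv->backends. A minimal sketch, assuming a hypothetical backends_mutex field and reusing the names from the earlier sketch; the actual patch may structure this differently:)

/* In the getCal() path: look up or create the backend and attach the
 * new client while holding the lock, so the removal path cannot
 * interleave.  backends_mutex is a hypothetical field name. */
g_mutex_lock (factory->priv->backends_mutex);

backend = g_hash_table_lookup (factory->priv->backends, uri_type_string);
if (!backend) {
	backend = create_backend_for_uri (factory, uri_type_string);
	g_hash_table_insert (factory->priv->backends,
	                     g_strdup (uri_type_string), backend);
}
e_cal_backend_add_client (backend, cal);

g_mutex_unlock (factory->priv->backends_mutex);

/* ...and symmetrically in the "last client gone" handler: */
g_mutex_lock (factory->priv->backends_mutex);
g_hash_table_remove (factory->priv->backends, uri_type_string);
g_mutex_unlock (factory->priv->backends_mutex);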
I applied the patch against e-d-s 2.21.4-0ubuntu2; it applied cleanly, and the packages rebuilt cleanly. I then stopped Evolution via evolution --shutdown and restarted it. I then tried the process I noted earlier (close Evo, logout, login) some 5 times so far, with no crashes. Obviously, since this is a race, I might not have hit the perfect scenario. But so far, so good.
hggdh: Thank you for testing the patch. I'm glad it works. My speculation is that my other patch, for bug 494314, affected the timing of code execution in the getCal() function and thus made the race more probable on some systems. I'll post a patch containing a ChangeLog entry for application to SVN trunk.
Created attachment 102067 [details] [review] changelog entry patch
hggdh: thanks for testing it. Ondrej: do you really want to hold that mutex even when emitting a signal? You know, it's similar to your previous patch: at the time of the signal emission the value of backends does not matter, so the mutex can be unlocked. If some application re-establishes the backend when the last one is gone, it will starve on the lock. A strange scenario, I agree, but keep the lock only as long as really necessary. What do you think?
I'm sorry, I don't understand the situation you are talking about. If you can break it down into steps like in comment #10, it may become clearer.
Sure. I see in your patch that you hold the mutex even when emitting the LAST_CALENDAR_GONE signal. If an application is connected to this signal and needs to reopen at least one backend, it will starve on locking, because the factory still holds the lock. Like:
a) closing the last backend, sending the signal, still holding the backends mutex
b) the app receives this signal and needs to reopen a backend
c) but the data cal factory will wait for the backends mutex to be unlocked, which will not happen

I know, a somewhat contrived scenario, but the only thing I want to say here is that the lock should be held only as long as really necessary, and because signal handlers can do "anything", I would be happier if mutexes were unlocked during signal invocations, just to prevent locking issues.
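In code, the pattern being asked for is simply to drop the mutex before the emission. A minimal sketch with illustrative names (the real handler in e-data-cal-factory.c is structured differently):

/* Remove the backend and compute the "was it the last one?" answer
 * under the lock, but emit the signal only after unlocking, so a
 * handler that re-enters the factory cannot deadlock on the same
 * mutex.  Names are illustrative. */
gboolean last_calendar;

g_mutex_lock (priv->backends_mutex);
g_hash_table_remove (priv->backends, uri_type_string);
last_calendar = (g_hash_table_size (priv->backends) == 0);
g_mutex_unlock (priv->backends_mutex);

if (last_calendar)
	g_signal_emit (factory, signals[LAST_CALENDAR_GONE], 0);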
Created attachment 102102 [details] [review] move mutex higher

I see. I think this will never happen, but anyway. This patch should be applied after all the others.
Created attachment 102105 [details] [review] merged and a bit extended patch for evolution-data-server

OK, I merged those three patches into one and, because the mutex should guard access to the backends variable, I extended it a bit. Ondrej, can you review this patch, please? If you approve, we can commit it to trunk.
Looks fine. Please apply. Thanks!
Committed to trunk. Committed revision 8330.
*** Bug 452795 has been marked as a duplicate of this bug. ***
*** Bug 449430 has been marked as a duplicate of this bug. ***