GNOME Bugzilla – Bug 732556
pad: Race condition when removing sticky events
Last modified: 2014-07-04 12:57:37 UTC
Created attachment 279693 [details] [review] gstreamer patch

Don't unlock inside the loop in remove_events (gstpad.c): if the lock is released inside the loop, there is no protection for the list while the loop is still iterating over it. In my case this caused an event to be unreffed twice, resulting in a `mini_object->refcount > 0' assertion failure, but I think many other unpredictable things can happen.
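For illustration, a minimal sketch (GLib only, compile-only, no main) of the pattern the patch removes. This is not the real gstpad.c code: FakePad, its fields, and the notify callback are invented stand-ins for the pad's object lock, its sticky-event array, its events cookie and the "caps" property notification.

#include <glib.h>

/* Invented stand-ins for the pad's object lock, sticky-event storage
 * and events cookie; not the real GstPad internals. */
typedef struct {
  GMutex     lock;
  GPtrArray *events;   /* holds event-name strings in this model */
  guint      cookie;
} FakePad;

/* Unsafe shape of the loop: the lock is dropped inside the loop so a
 * "caps changed" notification can run unlocked, but the iteration
 * keeps using the index and length computed while the list could not
 * change.  While unlocked, another thread is free to add or remove
 * entries. */
static void
remove_events_unlock_in_loop (FakePad *pad, void (*notify) (FakePad *))
{
  guint i, len;

  g_mutex_lock (&pad->lock);
  len = pad->events->len;

  for (i = 0; i < len; i++) {
    const gchar *name = g_ptr_array_index (pad->events, i);

    g_print ("unref %s\n", name);

    if (g_strcmp0 (name, "caps") == 0) {
      g_mutex_unlock (&pad->lock);
      notify (pad);              /* the list is unprotected here */
      g_mutex_lock (&pad->lock);
      /* i and len may no longer describe the array at this point */
    }
  }

  g_ptr_array_set_size (pad->events, 0);
  pad->cookie++;
  g_mutex_unlock (&pad->lock);
}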
Comment on attachment 279693 [details] [review] gstreamer patch

The g_object_notify() should probably happen at the very end, when everything else is done too, i.e. after events_cookie is changed.
I updated the patch and pushed it... but I wonder how you could even get into such a situation. That would most likely be a reference counting problem with the pad involved, as the only way to get there is to release the last reference to the pad while it is currently being deactivated, which should not happen, since the code that deactivates the pad should still hold a reference. Can you provide a testcase or explain in what situation this happened?

commit d0a808cdc87c60f3dea4ac8d458324fd9e458ae7
Author: Göran Jönsson <goranjn@axis.com>
Date:   Tue Jul 1 12:22:56 2014 +0200

    pad: Don't unlock while iterating over all sticky events for removal

    Otherwise we might end up getting the event removed from elsewhere at
    the same time while we're unlocked for g_object_notify().

    https://bugzilla.gnome.org/show_bug.cgi?id=732556
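For comparison, continuing the FakePad sketch above, this is roughly the ordering the review asked for and the commit describes: remember whether a caps event was seen, finish all bookkeeping (clearing the array, bumping the cookie) under the lock, and only notify at the very end. In the real function the lock is already held by the caller and re-taken after the notify, but the ordering idea is the same; the names remain invented stand-ins.

/* Fixed shape of the loop, same toy model as above: no unlock inside
 * the iteration.  Whether a caps event was present is only remembered;
 * the array is cleared and the cookie bumped under the lock, and the
 * notification runs once everything else is done. */
static void
remove_events_notify_last (FakePad *pad, void (*notify) (FakePad *))
{
  gboolean saw_caps = FALSE;
  guint i;

  g_mutex_lock (&pad->lock);

  for (i = 0; i < pad->events->len; i++) {
    const gchar *name = g_ptr_array_index (pad->events, i);

    if (g_strcmp0 (name, "caps") == 0)
      saw_caps = TRUE;
    g_print ("unref %s\n", name);
  }

  g_ptr_array_set_size (pad->events, 0);
  pad->cookie++;

  g_mutex_unlock (&pad->lock);

  if (saw_caps)
    notify (pad);   /* list state is already consistent here */
}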
I do not have a complete explanation of what happens, but I have more information. We have set up a test that we run several hundred times before this happens.

First, the test case: it is an RTSP session where the RTP packets are carried in TCP packets, and we use something we call pull mode, meaning we try to stream all data as fast as possible rather than at the speed at which it should be shown to the end user. We have a test client that does not read any data at all at the beginning; this initial period of not reading is followed by a period of reading slowly, then a short period where the client reads as fast as possible, then the client pauses and finally does a teardown. This leaves the queues and other elements fairly loaded with data when the session is torn down.

What I have seen in the remove_events function several times (I have a function that prints all events in the list) is that before the loop there are 5 events. When entering the loop, the first event is handled as it should be, and the second is a GST_EVENT_CAPS; after the unlock/notify/lock sequence, strange things happen. The loop then processes the 4th event, then the 5th, and after that another event that is the same as the 5th, and there we get the double unref. My interpretation is that during the unlock/notify/lock sequence someone manipulates the list, probably removing event 3 and changing the size of the list, so the last iteration is done on data outside the actual list size, causing the double unref of the last event.

With the code change the problem disappears. If I had access to trace tools like Lauterbach or similar I could analyze this in more detail, but I don't have access to such tools. I have tried to use my function that prints the list of events inside the loop, but then it just crashes.
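The indexing described above can be reproduced deterministically in a small standalone model. This is an illustration of the suspected mechanism, not the actual GStreamer code path: five "events" sit in an array, the loop caches the length up front, and while it is "unlocked" at the CAPS event the third entry is removed. Because the vacated tail slot is not cleared, the stale cached length makes the last iteration land on the old last element again, giving exactly the ev1, caps, ev4, ev5, ev5 sequence and the double unref described above.

#include <stdio.h>
#include <string.h>

int
main (void)
{
  /* Five sticky "events"; the second one is the CAPS event, matching
   * the situation in the report. */
  const char *storage[5] = { "ev1", "caps", "ev3", "ev4", "ev5" };
  size_t len = 5;              /* live length of the list           */
  size_t cached_len = len;     /* length the loop computed up front */
  size_t i;

  for (i = 0; i < cached_len; i++) {
    printf ("unref %s (slot %zu)\n", storage[i], i);

    if (strcmp (storage[i], "caps") == 0) {
      /* Simulates the unlock/notify/lock window: while "unlocked",
       * another thread removes the third event.  The elements after
       * it shift down, but the vacated tail slot keeps its old
       * contents, much like the real array would by default. */
      memmove (&storage[2], &storage[3], (len - 3) * sizeof (storage[0]));
      len -= 1;                /* list is now ev1 caps ev4 ev5 */
    }
  }

  /* Prints: ev1, caps, ev4, ev5, ev5 -- the last event is visited
   * twice because the loop still uses the pre-removal length, which
   * matches the reported double unref. */
  return 0;
}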
It would be useful to understand from where the event list is changed while notifying about the caps. It shouldn't be possible and probably hints at another bug elsewhere :)