GNOME Bugzilla – Bug 732556
pad: Race condition when removing sticky events
Last modified: 2014-07-04 12:57:37 UTC
Created attachment 279693 [details] [review] gstreamer patch

Don't unlock inside the loop in remove_events (gstpad.c): if the lock is released inside the loop, there is no protection for the list while the loop is still iterating over it. In my case this caused an event to be unreffed twice, resulting in a `mini_object->refcount > 0' assertion failure, but I think many other unpredictable things can happen.
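For illustration, a minimal sketch (GLib only, compile-only, no main) of the pattern the patch removes. This is not the real gstpad.c code: FakePad, its fields, and the notify callback are invented stand-ins for the pad's object lock, its sticky-event array, its events cookie and the "caps" property notification.

#include <glib.h>

/* Invented stand-ins for the pad's object lock, sticky-event storage
 * and events cookie; not the real GstPad internals. */
typedef struct {
  GMutex     lock;
  GPtrArray *events;   /* holds event-name strings in this model */
  guint      cookie;
} FakePad;

/* Unsafe shape of the loop: the lock is dropped inside the loop so a
 * "caps changed" notification can run unlocked, but the iteration
 * keeps using the index and length computed while the list could not
 * change.  While unlocked, another thread is free to add or remove
 * entries. */
static void
remove_events_unlock_in_loop (FakePad *pad, void (*notify) (FakePad *))
{
  guint i, len;

  g_mutex_lock (&pad->lock);
  len = pad->events->len;

  for (i = 0; i < len; i++) {
    const gchar *name = g_ptr_array_index (pad->events, i);

    g_print ("unref %s\n", name);

    if (g_strcmp0 (name, "caps") == 0) {
      g_mutex_unlock (&pad->lock);
      notify (pad);              /* the list is unprotected here */
      g_mutex_lock (&pad->lock);
      /* i and len may no longer describe the array at this point */
    }
  }

  g_ptr_array_set_size (pad->events, 0);
  pad->cookie++;
  g_mutex_unlock (&pad->lock);
}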
Comment on attachment 279693 [details] [review] gstreamer patch

The g_object_notify() should probably happen at the very end, when everything else is done too, i.e. after events_cookie is changed.
I updated the patch and pushed it... but I wonder how you could even get into such a situation. That would most likely be a reference counting problem with the pad involved, as the only way to get there is to release the last reference to the pad while it is currently being deactivated, which should not happen, since the code that deactivates the pad should still hold a reference. Can you provide a testcase or explain in what situation this happened?

commit d0a808cdc87c60f3dea4ac8d458324fd9e458ae7
Author: Göran Jönsson <goranjn@axis.com>
Date:   Tue Jul 1 12:22:56 2014 +0200

    pad: Don't unlock while iterating over all sticky events for removal

    Otherwise we might end up getting the event removed from elsewhere at
    the same time while we're unlocked for g_object_notify().

    https://bugzilla.gnome.org/show_bug.cgi?id=732556
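For comparison, continuing the FakePad sketch above, this is roughly the ordering the review asked for and the commit describes: remember whether a caps event was seen, finish all bookkeeping (clearing the array, bumping the cookie) under the lock, and only notify at the very end. In the real function the lock is already held by the caller and re-taken after the notify, but the ordering idea is the same; the names remain invented stand-ins.

/* Fixed shape of the loop, same toy model as above: no unlock inside
 * the iteration.  Whether a caps event was present is only remembered;
 * the array is cleared and the cookie bumped under the lock, and the
 * notification runs once everything else is done. */
static void
remove_events_notify_last (FakePad *pad, void (*notify) (FakePad *))
{
  gboolean saw_caps = FALSE;
  guint i;

  g_mutex_lock (&pad->lock);

  for (i = 0; i < pad->events->len; i++) {
    const gchar *name = g_ptr_array_index (pad->events, i);

    if (g_strcmp0 (name, "caps") == 0)
      saw_caps = TRUE;
    g_print ("unref %s\n", name);
  }

  g_ptr_array_set_size (pad->events, 0);
  pad->cookie++;

  g_mutex_unlock (&pad->lock);

  if (saw_caps)
    notify (pad);   /* list state is already consistent here */
}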
I do not have a complete explanation of what happens, but I have more information. We have set up a test that we run several hundred times before this happens.

First, the test case: it is an RTSP session where the RTP packets are carried in TCP packets, and we use something we call pull mode, meaning we try to stream all data as fast as possible rather than at the speed at which it should be shown to the end user. We have a test client that does not read any data at all at the beginning; this initial period of not reading is followed by a period of reading slowly, then a short period where the client reads as fast as possible, then the client pauses and finally does a teardown. This leaves the queues and other elements fairly loaded with data when the session is torn down.

What I have seen in the remove_events function several times (I have a function that prints all events in the list) is that before the loop there are 5 events. When entering the loop, the first event is handled as it should be, and the second is a GST_EVENT_CAPS; after the unlock/notify/lock sequence, strange things happen. The loop then processes the 4th event, then the 5th, and after that another event that is the same as the 5th, and there we get the double unref. My interpretation is that during the unlock/notify/lock sequence someone manipulates the list, probably removing event 3 and changing the size of the list, so the last iteration is done on data outside the actual list size, causing the double unref of the last event.

With the code change the problem disappears. If I had access to trace tools like Lauterbach or similar I could analyze this in more detail, but I don't have access to such tools. I have tried to use my function that prints the list of events inside the loop, but then it just crashes.
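The indexing described above can be reproduced deterministically in a small standalone model. This is an illustration of the suspected mechanism, not the actual GStreamer code path: five "events" sit in an array, the loop caches the length up front, and while it is "unlocked" at the CAPS event the third entry is removed. Because the vacated tail slot is not cleared, the stale cached length makes the last iteration land on the old last element again, giving exactly the ev1, caps, ev4, ev5, ev5 sequence and the double unref described above.

#include <stdio.h>
#include <string.h>

int
main (void)
{
  /* Five sticky "events"; the second one is the CAPS event, matching
   * the situation in the report. */
  const char *storage[5] = { "ev1", "caps", "ev3", "ev4", "ev5" };
  size_t len = 5;              /* live length of the list           */
  size_t cached_len = len;     /* length the loop computed up front */
  size_t i;

  for (i = 0; i < cached_len; i++) {
    printf ("unref %s (slot %zu)\n", storage[i], i);

    if (strcmp (storage[i], "caps") == 0) {
      /* Simulates the unlock/notify/lock window: while "unlocked",
       * another thread removes the third event.  The elements after
       * it shift down, but the vacated tail slot keeps its old
       * contents, much like the real array would by default. */
      memmove (&storage[2], &storage[3], (len - 3) * sizeof (storage[0]));
      len -= 1;                /* list is now ev1 caps ev4 ev5 */
    }
  }

  /* Prints: ev1, caps, ev4, ev5, ev5 -- the last event is visited
   * twice because the loop still uses the pre-removal length, which
   * matches the reported double unref. */
  return 0;
}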
It would be useful to understand from where the event list is changed while notifying about the caps. It shouldn't be possible and probably hints at another bug elsewhere :)