GNOME Bugzilla – Bug 796737
ASYNC_DONE message is dropped leading to hanging application
Last modified: 2018-11-03 12:47:10 UTC
Created attachment 372914 [details] Problem with ASYNC_DONE

The ASYNC_DONE message is dropped in gstbin in this scenario (for more detailed information, see the attached sequence). If change_state is executing at the same time as ASYNC_START is received, the "polling" flag indicates that gstbin is busy with a state change, so the PENDING state is not set, but the ASYNC_START message is still posted. When the ASYNC_DONE is later handled, pending == VOID_PENDING (the initial value set when the pipeline is created) and the ASYNC_DONE is dropped, leaving the application hanging. This might be the same problem as reported in #740121; I do not have a working solution for the problem.
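For reference, a minimal sketch of the application side of this hang, assuming a simple pipeline built with gst_parse_launch (the pipeline string and variable names are placeholders, not taken from this report): the application blocks on the bus for ASYNC_DONE after requesting PAUSED, so if gstbin drops that message the call below never returns.

#include <gst/gst.h>

int
main (int argc, char **argv)
{
  GstElement *pipeline;
  GstBus *bus;
  GstMessage *msg;

  gst_init (&argc, &argv);

  /* placeholder pipeline; any pipeline with an async sink behaves the same */
  pipeline = gst_parse_launch ("videotestsrc ! autovideosink", NULL);
  gst_element_set_state (pipeline, GST_STATE_PAUSED);

  bus = gst_element_get_bus (pipeline);
  /* blocks forever (GST_CLOCK_TIME_NONE) waiting for the preroll to finish;
   * a dropped ASYNC_DONE means the application hangs right here */
  msg = gst_bus_timed_pop_filtered (bus, GST_CLOCK_TIME_NONE,
      GST_MESSAGE_ASYNC_DONE | GST_MESSAGE_ERROR);

  if (msg)
    gst_message_unref (msg);
  gst_object_unref (bus);
  gst_element_set_state (pipeline, GST_STATE_NULL);
  gst_object_unref (pipeline);

  return 0;
}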
Created attachment 372980 [details] [review] patch adding sending of ASYNC_DONE message
I have tried some different solutions to this race condition, e.g. waiting for the polling flag to be cleared before continuing with the async_start handling, but that caused other problems and did not solve this race condition. However, there seems to be one simple solution: I have run my test 1 million times without a single failure, and the race condition scenario was triggered 30 times. All I needed to do in the end was to send the ASYNC_DONE message to the parent when pending == VOID_PENDING, which is basically how ASYNC_START is handled in the "was_busy" branch. I have attached this patch for review.
Review of attachment 372980 [details] [review]:

::: gst/gstbin.c
@@ +3512,3 @@
+      /* post our combined ASYNC_DONE when all is ASYNC_DONE. */
+      GST_DEBUG_OBJECT (bin, "posting ASYNC_DONE to parent");
+      gst_element_post_message (GST_ELEMENT_CAST (bin), amessage);

I believe this can cause multiple async-done messages to be posted for the same state change (and that's a problem, as we currently guarantee that there is only one, and there is code expecting that behaviour).

Generally, most code paths leading to "nothing pending" should post the message themselves already, or not? Can we distinguish between your buggy case and the other code paths? Or is the actual bug that we always end up here without ever having posted the message?

I remember that I tried something like your patch a longer time ago and it broke something. We'll have to try the core/base/etc. and especially the gst-rtsp-server tests, and also do a gst-validate run, I guess.
Well, I do not know about other test scenarios, but in this scenario it is not the normal case that pending == VOID_PENDING when handling async_done. Normally the transition is not finished; in my case the state is PAUSED when the async_done is received from gstbasesink. So normally the async_done is sent, but if we have hit this race condition where the polling flag was set when receiving async_start, then we sometimes (~1/20) get the async_done in VOID_PENDING instead of PAUSED. As I have said, I have run this code successfully 1 million times, but then again, I do not think that our application would run into a problem if we receive two ASYNC_DONE messages. I understand your concern, and perhaps we could add some code to the patch to make it a little more certain that this is the buggy case: add a new flag was_busy that is set when async_start is received and polling == TRUE or state != VOID_PENDING, and only send the ASYNC_DONE if nothing is pending AND was_busy == TRUE.
Looking closer through all this, it seems like exactly one of the problems I was looking into a long time ago. There should be another bug about it.

The problem already exists at the moment the async-start message arrives while we are in change_state(). It is unclear what should happen at all at that point: it might be that change_state() triggered the message (all good then), it might be that change_state() still has to go over that element (possibly invalidating the message), etc. It is not clear how async-start and the currently running state change relate to each other, if at all. When async-done then arrives we don't know more: it might very well be that the async state change is now actually done, it might be that it was invalidated (and the message we get is from before), anything really.

The underlying problem here is that we allow multiple state changes to happen at once, and more specifically state changes from the bottom (async-start/done, lost-state) and state changes from the top (gst_element_set_state()). I don't think this generally leads to good results and it should be prevented (I have ideas what to do about that in 2.0, but that's not a solution for now).

Maybe the async-start message should already have been delayed until change_state() is done (and then any pending messages handled from there), but I'm 100% sure that I actually wrote a patch like that and it broke other things. Specifically, it breaks the case where the async state change was invalidated by the change_state() (and there would never be a corresponding async-done, for example). async-start would trigger the start of an async state change that might not even happen anymore.

I guess a potential solution (and close to my 2.0 idea) would be the following:
1) Queue up async-start (and async-done) if they happen during "polling".
2) If an element posted async-start and that is somehow invalidated by change_state(), it must post async-done (I believe all elements do that already) as part of change_state() (!).
3) After unsetting polling, check if there are pending messages. If there is an async-start, first check if the corresponding async-done was received and if so do nothing at all. If there is no async-done, behave as normal. If there is an async-done without a corresponding async-start, drop it.

What do you think, do you see a problem with that approach?
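To make step 3 a bit more concrete, here is a purely illustrative sketch of the dispatch logic that could run after "polling" is unset. None of this is existing GstBin code; the message queue, the helper and handle_async_start() are hypothetical names standing in for whatever the real implementation would use.

#include <gst/gst.h>

/* Hypothetical stand-in for the existing async-start handling path. */
static void
handle_async_start (GstBin * bin, GstMessage * msg)
{
  GST_DEBUG_OBJECT (bin, "handling queued async-start from %s",
      GST_OBJECT_NAME (GST_MESSAGE_SRC (msg)));
}

/* Find and steal a queued ASYNC_DONE posted by 'src', or return NULL. */
static GstMessage *
steal_async_done_from (GQueue * pending, GstObject * src)
{
  GList *l;

  for (l = pending->head; l; l = l->next) {
    GstMessage *m = l->data;

    if (GST_MESSAGE_TYPE (m) == GST_MESSAGE_ASYNC_DONE &&
        GST_MESSAGE_SRC (m) == src) {
      g_queue_delete_link (pending, l);
      return m;
    }
  }
  return NULL;
}

/* Step 3: runs once the "polling" flag has been unset again. */
static void
flush_pending_async_messages (GstBin * bin, GQueue * pending)
{
  GstMessage *msg, *done;

  while ((msg = g_queue_pop_head (pending))) {
    switch (GST_MESSAGE_TYPE (msg)) {
      case GST_MESSAGE_ASYNC_START:
        if ((done = steal_async_done_from (pending, GST_MESSAGE_SRC (msg)))) {
          /* start and done both arrived while polling: the async state
           * change is already over, so drop the pair without acting on it */
          gst_message_unref (done);
        } else {
          /* only async-start arrived: handle it as in the normal path */
          handle_async_start (bin, msg);
        }
        break;
      case GST_MESSAGE_ASYNC_DONE:
        /* async-done whose async-start was never queued: drop it, but warn */
        GST_WARNING_OBJECT (bin, "dropping async-done without async-start");
        break;
      default:
        break;
    }
    gst_message_unref (msg);
  }
}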
I also tried to delay async_start with a wait condition on the polling flag, but I ran into problems with that solution. I really do not have a good enough overview of the code to say whether your proposed solution will work, but I am happy to test a patch that you provide and see if it solves our problem.

However, I have some thoughts about bullet three:
1) I do not think that you will get both the async_start and async_done message while "polling", but what do you mean by "do nothing at all" in that case? If the async_done is not sent, then our application will hang. Shouldn't the async_start code be executed first, and when that is finished, the async_done code?
2) I assume this will be the "normal" branch where we have a queued async_start but have not yet received async_done. What do you mean by "behave as normal"? Start the execution of async_start, but how will you handle an async_done that is received while handling the "delayed" async_start?
3) Something must have gone wrong if we get an async_done without a corresponding async_start. Is it correct to only drop it? It sounds like an assert to me, to find the error.
(In reply to Joakim Johansson from comment #6)
> However, I have some thoughts about bullet three:
> 1) I do not think that you will get both the async_start and async_done
> message while "polling", but what do you mean by "do nothing at all" in
> that case? If the async_done is not sent, then our application will hang.
> Shouldn't the async_start code be executed first, and when that is
> finished, the async_done code?

If the corresponding async-done was already received, both can be dropped without doing anything. The state change happened after all, it's all done. Or am I missing something here? What would your application wait for, what would have triggered the waiting? Or is the problem that a state change could have returned ASYNC but there would never be an async-done message corresponding to it?

The problem I see with handling both normally is that it causes spurious state changes although it's all finished already.

> 2) I assume this will be the "normal" branch where we have a queued
> async_start but have not yet received async_done. What do you mean by
> "behave as normal"? Start the execution of async_start, but how will you
> handle an async_done that is received while handling the "delayed"
> async_start?

Yes, and async-done would also have to be delayed then until async-start is done.

> 3) Something must have gone wrong if we get an async_done without a
> corresponding async_start. Is it correct to only drop it? It sounds like
> an assert to me, to find the error.

It currently would be dropped already, but there should be at least a warning about it, yes.
In our case the "application" is sending a seek event, then waits for the sink to post async-done after the seek, and then retrieves the position ("wait for preroll"). The problem we have is that sometimes the application never receives any async_done message on the bus and hangs forever.

Sure, the state change is correct in gstbin, but the async_done message MUST always be sent to the application and cannot be dropped internally by gstbin. Or is the application wrong to expect to always get the async_done message when sending a seek event?

gst_element_seek (pb->demuxbin, pb->rate, GST_FORMAT_TIME,
    GST_SEEK_FLAG_FLUSH | GST_SEEK_FLAG_KEY_UNIT | GST_SEEK_FLAG_SNAP_BEFORE,
    GST_SEEK_TYPE_SET, pb->timeoffset, GST_SEEK_TYPE_SET, stop))

Note: The seek event is sent after the state of the pipeline has been set to PAUSED.
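For context, a sketch of that sequence, assuming the pipeline is already PAUSED before the seek. 'pipeline', 'demuxbin', 'offset' and 'stop' are placeholders for the application's own objects and values; the finite bus timeout is only a defensive measure against the hang described here, not a fix for the race.

#include <gst/gst.h>

static gboolean
seek_and_wait_preroll (GstElement * pipeline, GstElement * demuxbin,
    gdouble rate, gint64 offset, gint64 stop, gint64 * position)
{
  GstBus *bus;
  GstMessage *msg;
  gboolean ok = FALSE;

  if (!gst_element_seek (demuxbin, rate, GST_FORMAT_TIME,
          GST_SEEK_FLAG_FLUSH | GST_SEEK_FLAG_KEY_UNIT |
          GST_SEEK_FLAG_SNAP_BEFORE,
          GST_SEEK_TYPE_SET, offset, GST_SEEK_TYPE_SET, stop))
    return FALSE;

  bus = gst_element_get_bus (pipeline);
  /* wait for the preroll after the flushing seek */
  msg = gst_bus_timed_pop_filtered (bus, 10 * GST_SECOND,
      GST_MESSAGE_ASYNC_DONE | GST_MESSAGE_ERROR);

  if (msg && GST_MESSAGE_TYPE (msg) == GST_MESSAGE_ASYNC_DONE)
    ok = gst_element_query_position (pipeline, GST_FORMAT_TIME, position);

  if (msg)
    gst_message_unref (msg);
  gst_object_unref (bus);

  return ok;
}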
Where exactly do the async-start and the change_state() come from in your case? What should happen is that lost-state occurs (i.e. async-starts) as a result of the seek event, and the pipeline waits for the async-dones to go to PAUSED again. In your case there also seems to be another state change happening at the same time? When pausing the pipeline before seeking, are you waiting until the PAUSED state is reached before sending the seek?
We are not waiting until the PAUSED state is reached. However, the pads of the recordingdemuxbin are blocked before the seek operation. As we perform the seek on recordingdemuxbin, we have to block its pads to prevent any buffers/events from reaching the sink element.

But now that I look closer at the code, I realize that the sink is linked to the muxer when the first pad is blocked, and the state is at this time synchronized with:

gst_element_sync_state_with_parent (pb->muxer);
gst_element_sync_state_with_parent (pb->sink);

Maybe this is causing the race condition, and the race condition could be avoided by a different implementation in the application, for example by waiting with linking the sink until the preroll buffer is received after the seek.

I will see if I can solve our problem from the application side, but then again, you still have a race condition in gstreamer where the async_done can be dropped.
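A rough sketch of the kind of restructuring meant here (and confirmed to work in a later comment): keep the blocked pad blocked across the seek, and only link and sync the sink once the seek has prerolled, i.e. after the ASYNC_DONE was received. The Playback struct and function names are invented for the example; only the GStreamer calls are real API.

#include <gst/gst.h>

typedef struct {
  GstElement *pipeline, *demuxbin, *muxer, *sink;
  GstPad *blocked_pad;            /* demuxbin src pad */
  gulong block_probe_id;
} Playback;

static GstPadProbeReturn
block_cb (GstPad * pad, GstPadProbeInfo * info, gpointer user_data)
{
  /* keep the pad blocked; the probe is removed explicitly later */
  return GST_PAD_PROBE_OK;
}

/* Block the demuxbin src pad before issuing the flushing seek. */
static void
block_before_seek (Playback * pb)
{
  pb->block_probe_id = gst_pad_add_probe (pb->blocked_pad,
      GST_PAD_PROBE_TYPE_BLOCK_DOWNSTREAM, block_cb, NULL, NULL);
}

/* Call this only after the ASYNC_DONE for the seek has been received:
 * link the sink, bring it to the pipeline state, then unblock. */
static void
link_sink_after_preroll (Playback * pb)
{
  gst_bin_add (GST_BIN (pb->pipeline), pb->sink);
  gst_element_link (pb->muxer, pb->sink);
  gst_element_sync_state_with_parent (pb->sink);

  gst_pad_remove_probe (pb->blocked_pad, pb->block_probe_id);
}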
(In reply to Joakim Johansson from comment #10)
> I will see if I can solve our problem from the application side

It would be a workaround, yes :)

> but then again, you still have a race condition in gstreamer where the
> async_done can be dropped.

Agreed, but I'm not sure at this point what the correct behaviour would even be in this case. There are conflicting requirements here.

Generally you will run into problems with the current state design if you have bottom-up state changes (flushing/lost-state and async-start) mixed with top-down state changes (set_state()). Giving preference to either of the two over the other, or even performing them both, can lead to wrong situations from what I can see.
The change in the application to link the sink only when the seek operation is finished and after receiving the async_done is working fine for us. So, it is up to you to decide whether this bug shall be fixed now or whether it waits for GStreamer 2.0.
Now would be better of course, but I'm not sure how to fix it without breaking expected behaviour in other cases.
-- GitLab Migration Automatic Message -- This bug has been migrated to freedesktop.org's GitLab instance and has been closed from further activity. You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.freedesktop.org/gstreamer/gstreamer/issues/301.