GNOME Bugzilla – Bug 757548
aggregator: Forced to produce data in paused state
Last modified: 2017-05-20 14:23:32 UTC
Created attachment 314772 [details] [review] A workaround patch (with a locking fix mixed in, sorry)

This issue is a bit complicated, so I might be explaining it badly. Here's the pipeline I'm using (rtpsrc is a simplification):

( rtpsrc ! decoder ) ! ( audiointerleave ! alsasink )

audiointerleave has 3 additional pads requested, but currently unlinked. So the only way for that element to produce data is to trigger a timeout (because of the unlinked pads). Now, to schedule a timeout we need a clock and a base time, which are normally available when the PAUSED_TO_PLAYING transition is called on the element. Currently we try to get this information at pad activation, and then every time a buffer comes in. We don't sync with the state lock (in fact, it's probably hard to do so), and we may end up with half the information we want (a clock with base_time = 0). In that case we schedule the next timeout anyway, but it's quite wrong and everything falls apart.

As you can see, in my pipeline the interleaver is next to the sink and contained in the same bin. I'm no expert, but that bin won't transition to PLAYING until data has reached the sink. Effectively, the transition to PLAYING never happens on the interleaver, unless we are lucky and base_time was set. For such a live element, the normal way is to not produce data until we reach the PLAYING state, and to return NO_PREROLL in the TO_PAUSED and TO_PLAYING transitions. But to decide whether we should return that, we'd need to know if we are live or not. This is usually done with a latency query. Unfortunately, the READY_TO_PAUSED transition is too soon for that.

Find attached my current workaround. Note that it brings back the idea that latency > 0 means live. That's a bit unfortunate, since latency = 0 now works fine in live pipelines, as we have fixed the queue to never block before 2 buffers were queued.
Olivier, Thibault, any idea on this one?
Created attachment 325883 [details] [review] This is really a tentative patch, I didn't go around everything to make sure it is right. What I try is simply to move the query (which is done only once) from pad activation to the moment we receive the STREAM_START event.

aggregator: Start the task when linked

Until now we would start the task when the pad is activated. Part of the activation consists of testing whether the pipeline is live or not. Unfortunately, this is often too soon, as the pad is likely to get activated before it is fully linked in a dynamic pipeline. Instead, start the task when the first serialized event arrives. This is a safe moment, as we know that the upstream chain is complete and, just like at pad activation, the pads are locked and hence cannot change.
Comment on attachment 325883 [details] [review] This is really a tentative patch, I didn't go around everything to make sure

While it generally seems like a good idea, this could break something. Consider the case where you have e.g. a compositor in a live pipeline and none of the inputs linked to it have provided the STREAM_START event yet. You still might want the compositor to output empty background frames whenever the timeout triggers, as it does right now.
(In reply to Sebastian Dröge (slomo) from comment #3)
> Comment on attachment 325883 [details] [review] [review]
> This is really a tentative patch, I didn't go around everything to make sure
>
> While it generally seems like a good idea, this could break something.
> Consider the case where you have e.g. a compositor in a live pipeline and
> none of the inputs linked to it have provided the STREAM_START event yet.
> You still might want the compositor to output empty background frames
> whenever the timeout triggers, as it does right now.

This use case didn't work before. There is no way to make an aggregator go live without any pads being linked. I think that for automatic detection of a pipeline being live this is closer to being correct, but only a property can let you cover everything.
The use case is *linked* pads without having any data or stream-start or anything yet. That did work before, but I don't know how important that is to keep working. I think your patch is acceptable as is (linked but no stream-start is very unlikely), we should just keep an eye on people complaining about things breaking :)
(In reply to Sebastian Dröge (slomo) from comment #5)
> The use case is *linked* pads without having any data or stream-start or
> anything yet. That did work before, but I don't know how important that is
> to keep working.

I see now, I had misunderstood.

> I think your patch is acceptable as is (linked but no stream-start is very
> unlikely), we should just keep an eye on people complaining about things
> breaking :)

The other thing I had in mind, and I'd like to consider first, is to figure out a way to know that upstream is not linked (do a link test first). Do you have an idea how this could be tested?
That case seems to be really rare in fact. If I read correctly, GstBaseSrc will send the stream-start before it calls create/fill, which is the one that may block if there is no data coming. For demuxers, the non-enforced rule is that exposed pads should have caps, which implies that stream-start has already been "stuck" on the pad. So I agree the use case might exist, but it's probably avoidable in most cases.

Note: while testing I've hit some clock assertions which I didn't have with the previous patch. Apparently the locking fixes that are hidden in that previous patch are indeed needed, so I'll extract them into their own patch now. It's a bit unrelated though.
(In reply to Nicolas Dufresne (stormer) from comment #7)
> That case seems to be really rare in fact. If I read correctly, GstBaseSrc
> will send the stream-start before it calls create/fill, which is the one
> that may block if there is no data coming. For demuxers, the non-enforced
> rule is that exposed pads should have caps, which implies that stream-start
> has already been "stuck" on the pad. So I agree the use case might exist,
> but it's probably avoidable in most cases.

Note that for demuxers it's not that easy. They are expected to have stream-start on the pad when they expose it... but the event is only forwarded downstream after linking, when the next buffer (or serialized event) is pushed. So maybe much later.
Created attachment 326133 [details] [review] aggregator: Fix locking when using the clock

This fixes a race where we check if there is a clock, then it gets removed, and we end up calling gst_clock_new_single_shit_id() with a NULL pointer instead of a valid clock, and also calling gst_object_unref() with a NULL pointer later.
Created attachment 326139 [details] [review] aggregator: Fix locking when using the clock

This fixes a race where we check if there is a clock, then it gets removed, and we end up calling gst_clock_new_single_shit_id() with a NULL pointer instead of a valid clock, and also calling gst_object_unref() with a NULL pointer later.
Created attachment 326573 [details] [review] aggregator: Start the task when linked

Until now we would start the task when the pad is activated. Part of the activation consists of testing whether the pipeline is live or not. Unfortunately, this is often too soon, as the pad is likely to get activated before it is fully linked in a dynamic pipeline. Instead, start the task when the first serialized event arrives. This is a safe moment, as we know that the upstream chain is complete and, just like at pad activation, the pads are locked and hence cannot change.
Comment on attachment 326573 [details] [review] aggregator: Start the task when linked

Marking as needs work, as it breaks the aggregator unit test. It deadlocks in the state_intensive test. The cause is not yet clear to me; here is the backtrace: https://paste.fedoraproject.org/358714/61357513/
I think the deadlock is the following:
- the srcpad task is being started
- we start deactivating the pad, hence signal on the SRC_COND
- the srcpad task takes the SRC_LOCK and WAITs
- then we continue deactivating the pad, so we try to take the stream lock

This deadlocks, because the stream lock is only released once per call to the GstTask function. This race already existed before, but since the task used to be started much earlier, it didn't trigger in the test. There is possibly a missing condition check between SRC_LOCK and WAIT, or some locking is misplaced.
Ok, actually we simply receive stream-start while the pad is being deactivated. Checking whether the pad is active fixes it, but it makes this patch ugly. I'm starting to think that the real issue is that there exists a race where the latency query succeeds with wrong information instead of failing while linking (which would allow a retry later). Let me post a working patch nevertheless.
Created attachment 326578 [details] [review] aggregator: Start the task when linked

Until now we would start the task when the pad is activated. Part of the activation consists of testing whether the pipeline is live or not. Unfortunately, this is often too soon, as the pad is likely to get activated before it is fully linked in a dynamic pipeline. Instead, start the task when the first serialized event arrives. This is a safe moment, as we know that the upstream chain is complete and, just like at pad activation, the pads are locked and hence cannot change.
Comment on attachment 326139 [details] [review] aggregator: Fix locking when using the clock Typo in commit message (shit -> shot), otherwise good to go :)
Comment on attachment 326578 [details] [review] aggregator: Start the task when linked This still seems to have the problem that you wait for a stream-start event before starting, but might already want to produce data in a live scenario. Consider the case where we link a live upstream that is already running and has sent its sticky events to the srcpads already before linking (like all demuxers do) but receives no data for a while. It would successfully answer the LATENCY query, but you will only see the stream-start event once the first buffer is sent (or another serialized event). We don't forward sticky events automatically when linking as that would forward them from the application thread and potentially block the application for a while.
(In reply to Sebastian Dröge (slomo) from comment #16)
> Comment on attachment 326139 [details] [review] [review]
> aggregator: Fix locking when using the clock
>
> Typo in commit message (shit -> shot), otherwise good to go :)

Oops, will fix.

(In reply to Sebastian Dröge (slomo) from comment #17)
> Comment on attachment 326578 [details] [review] [review]
> aggregator: Start the task when linked
>
> This still seems to have the problem that you wait for a stream-start event
> before starting, but might already want to produce data in a live scenario.
> Consider the case where we link a live upstream that is already running and
> has sent its sticky events to the srcpads already before linking (like all
> demuxers do) but receives no data for a while. It would successfully answer
> the LATENCY query, but you will only see the stream-start event once the
> first buffer is sent (or another serialized event). We don't forward sticky
> events automatically when linking as that would forward them from the
> application thread and potentially block the application for a while.

I agree. I now have a better understanding of the strategy that was initially used in aggregator: the query is expected to fail until the data is valid. So the real bug is probably that an element replied non-live to the latency query instead of failing in the first place. I'll do another round of investigation. There could be another bug around latency updates, as the latency should have been updated in that case. Tracking this down is hard, because enabling traces often prevents triggering the bug.
Comment on attachment 326139 [details] [review] aggregator: Fix locking when using the clock Attachment 326139 [details] pushed as fbd824a - aggregator: Fix locking when using the clock
Comment on attachment 326578 [details] [review] aggregator: Start the task when linked

Leaving open, as it only fixes some parts. Attachment 326578 [details] pushed as 302580c - aggregator: Start the task when linked
Oops, my bad, I'm reverting.
Created attachment 352218 [details] [review] aggregator: Reset upstream latency on first buffer

We think the real bug is a race where the query happens before any of the live sources are linked and is never re-done later; this should fix it.

In the case where an aggregator is created and pads are requested but only linked later, we end up never updating the upstream latency. This was because latency queries on pads that are not linked succeed, so we never did a new query once a live source had been linked, and the thread was never started.
Pushed as: Author: Nicolas Dufresne <nicolas.dufresne@collabora.com> Date: Sat May 20 13:10:53 2017 +0200 aggregator: Reset upstream latency on first buffer In the case an aggregator is created and pads are requested but only linked later, we end up never updating the upstream latency. This was because latency queries on pads that are not linked succeed, so we never did a new query once a live source has been linked, so the thread was never started. https://bugzilla.gnome.org/show_bug.cgi?id=757548