GNOME Bugzilla – Bug 757548
aggregator: Forced to produce data in paused state
Last modified: 2017-05-20 14:23:32 UTC
Created attachment 314772 [details] [review] A workaround patch (with a locking fix mixed in, sorry)

This issue is a bit complicated, so I might be explaining it badly. Here's the pipeline I'm using (rtpsrc is a simplification):

( rtpsrc ! decoder ) ! ( audiointerleave ! alsasink )

audiointerleave has 3 additional pads requested, but currently unlinked. So the only way for that element to produce data is to trigger a timeout (because of the unlinked pads). Now, to schedule a timeout we need a clock and a base time, which are normally available when the PAUSED_TO_PLAYING transition is called on the element. Currently we try to get this information at pad activation, and then every time a buffer comes in. We don't sync with the state lock (in fact, it's probably hard to do so), and we may end up with half the information we want (a clock with base_time = 0). In that case we schedule the next timeout anyway, but it's quite wrong and everything falls apart.

As you can see, in my pipeline the interleaver is next to the sink and contained in the same bin. I'm no expert, but that bin won't transition to PLAYING until data has reached the sink. Effectively, the transition to PLAYING never happens on the interleaver, unless we are lucky and base_time was set. For such a live element, the normal way is to not produce data until we reach the PLAYING state, and to return NO_PREROLL in the TO_PAUSED and TO_PLAYING transitions. But to decide whether we should return that, we'd need to know if we are live or not. This is usually done with a latency query. Unfortunately, the READY_TO_PAUSED transition is too soon for that.

Find attached my current workaround. Note that it brings back the idea that latency > 0 means live. That's a bit unfortunate, since latency = 0 now works fine in live pipelines, as we have fixed the queue to never block before 2 buffers were queued.
Olivier, Thibault, any idea on this one?
Created attachment 325883 [details] [review] This is really a tentative patch, I didn't go around everything to make sure it is right. What I try is simply to move the query (which is done only once) from pad activation to the moment we receive the STREAM_START event.

aggregator: Start the task when linked

Until now we would start the task when the pad is activated. Part of the activation consists of testing whether the pipeline is live or not. Unfortunately, this is often too soon, as the pad is likely to get activated before it is fully linked in a dynamic pipeline. Instead, start the task when the first serialized event arrives. This is a safe moment, as we know that the upstream chain is complete and, just like at pad activation, the pads are locked and hence cannot change.
Comment on attachment 325883 [details] [review] This is really a tentative patch, I didn't go around everything to make sure

While it generally seems like a good idea, this could break something. Consider the case where you have e.g. a compositor in a live pipeline and none of the inputs linked to it have provided the STREAM_START event yet. You still might want the compositor to output empty background frames whenever the timeout triggers, as it does right now.
(In reply to Sebastian Dröge (slomo) from comment #3)
> Comment on attachment 325883 [details] [review] [review]
> This is really a tentative patch, I didn't go around everything to make sure
>
> While it generally seems like a good idea, this could break something.
> Consider the case where you have e.g. a compositor in a live pipeline and
> none of the inputs linked to it have provided the STREAM_START event yet.
> You still might want the compositor to output empty background frames
> whenever the timeout triggers, as it does right now.

This use case didn't work before. There is no way to make an aggregator go live without any pads being linked. I think that for automatic detection of a pipeline being live this is closer to being correct, but only a property can let you cover everything.
The use case is *linked* pads without having any data or stream-start or anything yet. That did work before, but I don't know how important that is to keep working. I think your patch is acceptable as is (linked but no stream-start is very unlikely), we should just keep an eye on people complaining about things breaking :)
(In reply to Sebastian Dröge (slomo) from comment #5)
> The use case is *linked* pads without having any data or stream-start or
> anything yet. That did work before, but I don't know how important that is
> to keep working.

I see now, I had misunderstood.

> I think your patch is acceptable as is (linked but no stream-start is very
> unlikely), we should just keep an eye on people complaining about things
> breaking :)

The other thing I had in mind, and I'd like to consider first, is to figure out a way to know that upstream is not linked (do a link test first). Do you have an idea how this could be tested?
That case seems to be really rare in fact. If I read correctly, GstBaseSrc will send the stream-start before it calls create/fill, which is the one that may block if there is no data coming. For demuxers, the non-enforced rule is that exposed pads should have caps, which implies that stream-start has already been "stuck" on the pad. So I agree the use case might exist, but it's probably avoidable in most cases.

Note: while testing I've hit some clock assertions which I didn't have with the previous patch. Apparently the locking fixes that are hidden in that previous patch are indeed needed, so I'll extract them into their own patch now. It's a bit unrelated though.
(In reply to Nicolas Dufresne (stormer) from comment #7)
> That case seems to be really rare in fact. If I read correctly, GstBaseSrc
> will send the stream-start before it calls create/fill, which is the one
> that may block if there is no data coming. For demuxers, the non-enforced
> rule is that exposed pads should have caps, which implies that stream-start
> has already been "stuck" on the pad. So I agree the use case might exist,
> but it's probably avoidable in most cases.

Note that for demuxers it's not that easy. They are expected to have stream-start on the pad when they expose it... but the event is only forwarded downstream after linking, when the next buffer (or serialized event) is pushed. So maybe much later.
Created attachment 326133 [details] [review] aggregator: Fix locking when using the clock

This fixes a race where we check if there is a clock, then it gets removed, and we end up calling gst_clock_new_single_shit_id() with a NULL pointer instead of a valid clock, and also calling gst_object_unref() with a NULL pointer later.
Created attachment 326139 [details] [review] aggregator: Fix locking when using the clock

This fixes a race where we check if there is a clock, then it gets removed, and we end up calling gst_clock_new_single_shit_id() with a NULL pointer instead of a valid clock, and also calling gst_object_unref() with a NULL pointer later.
Created attachment 326573 [details] [review] aggregator: Start the task when linked

Until now we would start the task when the pad is activated. Part of the activation consists of testing whether the pipeline is live or not. Unfortunately, this is often too soon, as the pad is likely to get activated before it is fully linked in a dynamic pipeline. Instead, start the task when the first serialized event arrives. This is a safe moment, as we know that the upstream chain is complete and, just like at pad activation, the pads are locked and hence cannot change.
Comment on attachment 326573 [details] [review] aggregator: Start the task when linked

Marking as needs work, as it breaks the aggregator unit test. It deadlocks in the state_intensive test. The cause is not yet clear to me; here is the backtrace: https://paste.fedoraproject.org/358714/61357513/
I think the deadlock is the following:
- the srcpad task is being started
- we start deactivating the pad, hence signal on the SRC_COND
- the srcpad task takes the SRC_LOCK and WAITs
- then we continue deactivating the pad, so we try to take the stream lock

This deadlocks, because the stream lock is only released once per call to the GstTask function. This race already existed before, but since the task used to be started much earlier, it didn't trigger in the test. There is possibly a missing condition check between SRC_LOCK and WAIT, or some locking is misplaced.
Ok, actually we simply receive stream-start while the pad is being deactivated. Checking whether the pad is active fixes it, but it makes this patch ugly. I'm starting to think that the real issue is that there exists a race where the latency query succeeds with wrong information instead of failing while linking (which would allow a retry later). Let me post a working patch nevertheless.
Created attachment 326578 [details] [review] aggregator: Start the task when linked

Until now we would start the task when the pad is activated. Part of the activation consists of testing whether the pipeline is live or not. Unfortunately, this is often too soon, as the pad is likely to get activated before it is fully linked in a dynamic pipeline. Instead, start the task when the first serialized event arrives. This is a safe moment, as we know that the upstream chain is complete and, just like at pad activation, the pads are locked and hence cannot change.
Comment on attachment 326139 [details] [review] aggregator: Fix locking when using the clock Typo in commit message (shit -> shot), otherwise good to go :)
Comment on attachment 326578 [details] [review] aggregator: Start the task when linked This still seems to have the problem that you wait for a stream-start event before starting, but might already want to produce data in a live scenario. Consider the case where we link a live upstream that is already running and has sent its sticky events to the srcpads already before linking (like all demuxers do) but receives no data for a while. It would successfully answer the LATENCY query, but you will only see the stream-start event once the first buffer is sent (or another serialized event). We don't forward sticky events automatically when linking as that would forward them from the application thread and potentially block the application for a while.
(In reply to Sebastian Dröge (slomo) from comment #16)
> Comment on attachment 326139 [details] [review] [review]
> aggregator: Fix locking when using the clock
>
> Typo in commit message (shit -> shot), otherwise good to go :)

Oops, will fix.

(In reply to Sebastian Dröge (slomo) from comment #17)
> Comment on attachment 326578 [details] [review] [review]
> aggregator: Start the task when linked
>
> This still seems to have the problem that you wait for a stream-start event
> before starting, but might already want to produce data in a live scenario.
> Consider the case where we link a live upstream that is already running and
> has sent its sticky events to the srcpads already before linking (like all
> demuxers do) but receives no data for a while. It would successfully answer
> the LATENCY query, but you will only see the stream-start event once the
> first buffer is sent (or another serialized event). We don't forward sticky
> events automatically when linking as that would forward them from the
> application thread and potentially block the application for a while.

I agree. I now have a better understanding of the strategy that was initially used in aggregator: the query is expected to fail until the data is valid. So the real bug is probably that an element replied non-live to the latency query instead of failing in the first place. I'll do another round of investigation. There could be another bug around latency updates, as the latency should have been updated in that case. Tracking this down is hard, because enabling traces often prevents triggering the bug.
Comment on attachment 326139 [details] [review] aggregator: Fix locking when using the clock Attachment 326139 [details] pushed as fbd824a - aggregator: Fix locking when using the clock
Comment on attachment 326578 [details] [review] aggregator: Start the task when linked

Leaving open, as it only fixes some parts. Attachment 326578 [details] pushed as 302580c - aggregator: Start the task when linked
Oops, my bad, I'm reverting.
Created attachment 352218 [details] [review] aggregator: Reset upstream latency on first buffer

We think the real bug is a race where the query happens before any of the live sources are linked and is never re-done later; this should fix it.

In the case where an aggregator is created and pads are requested but only linked later, we end up never updating the upstream latency. This was because latency queries on pads that are not linked succeed, so we never did a new query once a live source had been linked, and the thread was never started.
Pushed as: Author: Nicolas Dufresne <nicolas.dufresne@collabora.com> Date: Sat May 20 13:10:53 2017 +0200 aggregator: Reset upstream latency on first buffer In the case an aggregator is created and pads are requested but only linked later, we end up never updating the upstream latency. This was because latency queries on pads that are not linked succeed, so we never did a new query once a live source has been linked, so the thread was never started. https://bugzilla.gnome.org/show_bug.cgi?id=757548