GNOME Bugzilla – Bug 606382
decodebin: Upstream events (seek,qos) get lost when switching groups/chains
Last modified: 2018-05-05 15:47:27 UTC
Take attachment #151008 of bug #606348. This is a chained ogg file with a new chain starting at position 17s. Seeking beyond the 17s boundary causes a decoder error.
oggdemux needs to cache the headers of the chains and push them right after activating the new chains.
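For illustration, a minimal sketch of that header-caching idea (the struct and helper names are invented for the example and are not the actual oggdemux code): each stream keeps its header packets, and they are replayed on the pad before any data when the chain is activated.

```c
#include <gst/gst.h>

/* Hypothetical per-stream bookkeeping; not the real oggdemux structures. */
typedef struct {
  GstPad *pad;      /* exposed source pad for this stream */
  GList  *headers;  /* cached GstBuffer* header packets, in order */
} OggStreamCtx;

/* Push the cached headers before any data of the newly activated chain,
 * so downstream decoders can sync. */
static GstFlowReturn
push_cached_headers (OggStreamCtx *stream)
{
  GList *l;

  for (l = stream->headers; l != NULL; l = l->next) {
    GstFlowReturn ret;

    /* gst_pad_push() takes ownership, so push a reference and keep the
     * cached copy around for a possible later re-activation. */
    ret = gst_pad_push (stream->pad, gst_buffer_ref (GST_BUFFER (l->data)));
    if (ret != GST_FLOW_OK)
      return ret;
  }
  return GST_FLOW_OK;
}
```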
This commit does not completely fix it but makes it mostly work. I think there is a race left in playbin2 when it actually switches chains, which makes it fail sometimes.

commit 0201326db12ac1dfd224385789ae36bbe62f052e
Author: Wim Taymans <wim.taymans@collabora.co.uk>
Date: Fri Jan 8 16:57:40 2010 +0100

oggdemux: push headers when activating chains

Keep a list of headers for each stream of a chain. When a chain is activated, push the headers before pushing the data so that decoders can sync.

Fix seeking in chains, take the chain start time into account when comparing timestamps.

See #606382
It's a problem in decodebin2 somewhere. It seems that it doesn't mark chains and groups that got a new pad after no_more_pads as pending/incomplete/unexposed.
(In reply to comment #3)
> It a problem in decodebin2 somewhere. It seems that it doesn't mark chains and
> groups, that got a new pad after no_more_pads, as pending/incomplete/unexposed.

Why are there new pads after no-more-pads at all? Adding a new pad after no-more-pads will make decodebin2 add that new pad to the next/a new group. How should it detect that this new pad belongs to the currently active group?
There are two types of pad groups:

1) unclosed groups: pads are added to an element but at no time is there ever a no-more-pads signal emitted. The demuxer usually has no idea when all pads are added. Decodebin2 should assume that when more pads are added, they belong to the current group of pads. A similar thing happens when pads are removed. Decodebin2 also uses the multiqueues to decide when enough data has been processed so that a reasonable amount of pads are added to the group and playback can begin. This is usually the case for mpeg(ts)demux.

2) closed groups: pads are added to an element and when they are all added, a no-more-pads signal is emitted. In this situation a new group can be started by first adding the new pads (they go to a new group because the previous group was closed), removing the old pads and then signaling no-more-pads. This is currently common for chained oggs (from live radio or local streams) and chained asf. In this case, decodebin2 uses the no-more-pads signal to remove the old group and activate a new group. No multiqueue overrun is needed to decide when playback can start.
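As a rough sketch of the closed-group handover described in 2), from a hypothetical demuxer's point of view (the helper and pad lists are made up; only the GStreamer calls are real):

```c
#include <gst/gst.h>

/* Switch from the pads of the finished chain to the pads of the new one.
 * Because the previous group was closed with no-more-pads, the new pads
 * land in a new pending group inside decodebin2. */
static void
switch_to_new_chain (GstElement *demux, GList *old_pads, GList *new_pads)
{
  GList *l;

  /* 1) add the pads of the new chain first */
  for (l = new_pads; l != NULL; l = l->next)
    gst_element_add_pad (demux, GST_PAD (l->data));

  /* 2) remove the pads of the old chain */
  for (l = old_pads; l != NULL; l = l->next)
    gst_element_remove_pad (demux, GST_PAD (l->data));

  /* 3) tell downstream that the new group is complete */
  gst_element_no_more_pads (demux);
}
```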
Yes, this is exactly what is implemented in decodebin2. But you said: "It seems that it doesn't mark chains and groups, that got a new pad after no_more_pads, as pending/incomplete/unexposed." That would be against 2) because new pads that are added after no-more-pads should open a new group.
What I saw in debug logs is that the no-more-pads did somehow not expose the new group, leaving the new pads unlinked and erroring out.
Either I'm looking at something wrong or I have a different behavior here. I'm using totem and if I seek beyond 17 secs I see in the logs that the new pads are exposed and the position bar keeps moving, but no audio is played. And there are lots of pulsering overflows happening.
And that happens every time.
Ok, for me something completely different happens every time :) Seeking after 17s sometimes works, but most of the time playback continues at some position before 17s. And seeking in general is not very accurate.
Oh, and is this actually a regression? I would be surprised if seeking to another chain had worked at some point.
I don't think it's a regression. I would also be surprised if it ever worked before at any point, but it would still be nice to get this fixed up. It feels like we're really close to a fix, but I might be wrong of course :)
A thing I just noticed: when I seek to 22 secs it goes to 5 secs. When I seeked to 31, it went to 13.

22 = 17 + 5
31 ~ 17 + 13
(In reply to comment #13)
> A thing I just noticed. When I seek to 22 secs it goes to 5 secs. When I seeked
> to 31, it went to 13.
>
> 22 = 17 + 5
> 31 ~ 17 + 13

Oh, good catch. It's the same for me. So somewhere the chain start timestamp is getting subtracted or something.
When reverting the begintime chunk of 0201326db12ac1dfd224385789ae36bbe62f052e, the reported timestamps (as reported by totem) are more correct, but there's no sound.
The commit below fixes the position reporting, but still no sound.

Module: gst-plugins-base
Branch: master
Commit: 125f7dfdb058888a4f2ba0e837259be1e1a2ff2b
URL: http://cgit.freedesktop.org/gstreamer/gst-plugins-base/commit/?id=125f7dfdb058888a4f2ba0e837259be1e1a2ff2b

Author: Thiago Santos <thiago.sousa.santos@collabora.co.uk>
Date: Mon Jan 18 15:22:52 2010 -0300

oggdemux: granulepos is relative to its chain

When performing seeks, the granulepos should be offset by its chain start time to avoid using wrong values to update segment's last_stop. A sample file is indicated on bug #606382
Ok, this seems to fix the seeking, but wtay still has errors with decodebin2.

Module: gst-plugins-base
Branch: master
Commit: 4b771bff7a224d179c67cdf8ab105f12079b3286
URL: http://cgit.freedesktop.org/gstreamer/gst-plugins-base/commit/?id=4b771bff7a224d179c67cdf8ab105f12079b3286

Author: Thiago Santos <thiago.sousa.santos@collabora.co.uk>
Date: Tue Jan 19 08:39:14 2010 -0300

oggdemux: No need to subtract begin time

Last stop is already based on the chain start and there is no need to subtract the chain start as it may lead to a negative overflow.

This was causing seeking issues when the target chain was not the first one (that has chain start = 0)

Fixes #606382
Created attachment 151825 [details]
deadlock trace

I'm getting this deadlock when seeking around the chain switch (17 secs)
Created attachment 151826 [details] [review]
patch for the deadlock

This patch fixes the deadlock, but I still see lots of "seek failed" when using seek.c example.

Dunno if the locking is optimal, but at least it works.
(In reply to comment #19)
> Created an attachment (id=151826) [details] [review]
> patch for the deadlock
>
> This patch fixes the deadlock, but I still see lots of "seek failed" when using
> seek.c example.
>
> Dunno if the locking is optimal, but at least it works.

You attached the backtrace again :)
Created attachment 151828 [details] [review]
Patch for the deadlock

Oops. Now the real patch.
Created attachment 151833 [details]
debug log that leads to "seek failed"

This is part of the debug log showing why the "seek failed" error happens with seek.c: the seek event flows upstream until it reaches the multiqueue, which has already been unlinked due to the chain switching.
Comment on attachment 151833 [details]
debug log that leads to "seek failed"

Bah, copy and paste fail.
Created attachment 151834 [details]
debug log that leads to "seek failed"

The seek event fails because of the chain switch. It goes upstream up to the multiqueue, which is already unlinked from the demuxer, so it fails.
What seems to happen is this:

* oggdemux detects a new chain, therefore
* oggdemux emits no-more-pads
* the current decodebin group is marked as closed, but data is still going out (the multiqueue isn't empty)
* oggdemux removes the old source pads
* the current decodebin group isn't linked to anything upstream anymore
* oggdemux adds new pads
* decodebin creates a new group with multiqueue/decoder
* oggdemux pushes data from the second chain
* the data goes into the new decodebin2 group, which will only be exposed when the first group is drained

The problem is that:
* the first group is still pushing out data, but isn't linked to oggdemux anymore
* the second group (i.e. where we want to seek) isn't exposed yet

The seek arrives through the first group... which isn't linked anymore... and therefore oggdemux never gets the seek :(

So basically... we need a way for upstream events (SEEK, but also QOS would be good) coming from outside decodebin2 to make their way up to the demuxer.

I'll think about how this can be fixed... but it feels quite tricky.
Reminder: There's still a patch for oggdemux's deadlock on chain seeking pending review
Sorry, completely forgot about that. Bit late to put it in now for the next pre-release though, need to look at it again with a clear head. Not entirely sure about the possibility of a locking order problem with the stream lock and chain lock.
Comment on attachment 151828 [details] [review]
Patch for the deadlock

I think Wim was also not convinced the locking/locking order is entirely right here.
Unsetting blocker status: while a deadlock is quite bad, it should only happen in corner cases (who has chained oggs as local files? heavy scrubbing/seeking) and I don't think we'll have a solution for this until the release. And it's not really a regression.
Still fails in current git, but in slightly different ways.
How does it fail now?
This has become more evident with adaptive demuxers where group switches are frequent. The smaller the fragments, the more it happens.

Here is an idea I had for implementation:

Keep the removed group around and attach probes to the sink pads, then push a dummy custom upstream event/query to flow through the branch and only remove the group after it has reached the probe. Events received in this probe should be forwarded to the new group
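A rough sketch of that drain-marker idea (all names below are hypothetical; only the probe API is real GStreamer). The marker could be created with gst_event_new_custom (GST_EVENT_CUSTOM_UPSTREAM, gst_structure_new_empty (DRAIN_MARKER)) and sent to the old group's exposed source pads with gst_pad_send_event(); the sink-pad probe tears the group down once it sees the marker come back.

```c
#include <gst/gst.h>

#define DRAIN_MARKER "x-decodebin-drain-marker"   /* hypothetical name */

/* Hypothetical helpers, not real GStreamer API: */
static void     schedule_group_removal  (gpointer group);
static gboolean forward_to_linked_group (gpointer group, GstEvent *event);

static GstPadProbeReturn
old_group_sink_probe (GstPad *pad, GstPadProbeInfo *info, gpointer user_data)
{
  gpointer old_group = user_data;   /* hypothetical bookkeeping for the group */

  if (GST_PAD_PROBE_INFO_TYPE (info) & GST_PAD_PROBE_TYPE_EVENT_UPSTREAM) {
    GstEvent *event = GST_PAD_PROBE_INFO_EVENT (info);

    if (gst_event_has_name (event, DRAIN_MARKER)) {
      /* Everything that was queued before the marker has now drained,
       * so it is safe to schedule removal of this group. */
      schedule_group_removal (old_group);
      return GST_PAD_PROBE_DROP;
    }

    /* Any other upstream event is proxied to the group that is still linked
     * to the demuxer (the helper is assumed to consume the ref it is given);
     * GST_PAD_PROBE_HANDLED leaves the original event to the probe, so drop
     * our reference on success. */
    if (forward_to_linked_group (old_group, gst_event_ref (event))) {
      gst_event_unref (event);
      return GST_PAD_PROBE_HANDLED;
    }
  }
  return GST_PAD_PROBE_OK;
}
```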
We've just stumbled across this problem as well, when streaming over HLS. We often experience variant switches at the start of playback, and therefore seeking early in the piece is not terribly reliable, as the seek events often get lost.

A question: if we implement a solution along the lines that Thiago has described, does anyone know how decodebin2 will behave when it receives a FLUSH_START event from the upstream element which has now received the SEEK event? I think we would want it to destroy the new groups (which are not yet linked downstream), and push the event into the old linked group (which is linked downstream).
(In reply to Thiago Sousa Santos from comment #32)
> This has become more evident with adaptive demuxers where group switches are
> frequent. The smaller the fragments, the more it happens.
>
> Here is an idea I had for implementation:
>
> Keep the removed group around

I'm a bit confused as to what you mean by "removed group". If it's still got data (i.e. EOS hasn't been seen on the source pads of that group), then it needs to stay around. If it received EOS and there's a pending group... there's no reason to keep it around ?

> and attach probes to the sink pads

Having probes on the sink pads is definitely something we need in order not to "lose" events/queries that are travelling upstream. But I'd say we should put those in immediately before unlinking the demuxer source pads (i.e. to handle events/queries travelling upstream between the moment we unlink from the demuxer and the moment we un-expose that group).

> , then push
> a dummy custom upstream event/query to flow through the branch and only
> remove the group after it has reached the probe.

At what moment do you intend to push that custom upstream event ? When un-exposing the group ?

> Events received in this probe should be forwarded to the new group

(In reply to Duncan Palmer from comment #33)
> A question; if we implement a solution along the lines that Thiago has
> described, does anyone know how decodebin2 will behave when it receives a
> FLUSH_START event from the upstream element which has now received the SEEK
> event? I think we would want it to destroy the new groups (which are not yet
> linked downstream), and push the event into the old linked group (which is
> linked downstream).

Until now this bug was looking at not losing events/queries coming from downstream, but you are indeed right that there is some special handling required for events coming from upstream.

In the case of FLUSH_START I would propose the following:

In group event probes (whether linked or unlinked) if we see a FLUSH_START we should forward that to *all* other groups. The rationale behind this is that upstream (the demuxer) asked us to flush, therefore we should flush everything downstream from that... which includes all the groups.

By doing so we:
1) Ensure all groups are flushed
2) Downstream of decodebin gets the FLUSH_START event (there is at least one group exposed which will push that through decodebin's current source pads)

Once that is done, we should also expose the latest group (which is the one currently linked to the demuxer) and remove the previous ones.
Assigning it to myself since I plan to fix it !
(In reply to Edward Hervey from comment #34)
> In the case of FLUSH_START I would propose the following:
>
> In group event probes (whether linked or unlinked) if we see a FLUSH_START
> we should forward that to *all* other groups. The rationale behind this is
> that upstream (the demuxer) asked us to flush, therefore we should flush
> everything downstream from that... which includes all the groups.
> By doing so we:
> 1) Ensure all groups are flushed
> 2) Downstream of decodebin gets the FLUSH_START event (there is at least
> one group exposed which will push that through decodebin's current source
> pads)
> Once that is done, we should also expose the latest group (which is the
> one currently linked to the demuxer) and remove the previous ones.

Would there be any problems if upstream did a variant switch after FLUSH_STOP, and before pushing any data into the exposed group? I don't think there would be, as the exposed group (now flushed) would receive an EOS, and things would proceed in the way they usually do for a variant switch.
(In reply to Duncan Palmer from comment #36)
> Would there be any problems if upstream did a variant switch after
> FLUSH_STOP, and before pushing any data into the exposed group? I don't
> think there would be, as the exposed group (now flushed) would receive an
> EOS, and things would proceed in the way they usually do for a variant
> switch.

Not that I can think of (for the reason you stated). I'll start giving it a go today.
Created attachment 302568 [details] [review]
WIP: decodebin2: Forward event/queries for unlinked groups

When upstream events/queries reach sinkpads of unlinked groups (i.e. no longer linked to the upstream demuxer), this patch attempts to find the linked group and forward it upstream of that group.

This is done by adding upstream event/query probes on new group sinkpads and then:
* Checking if the pad is linked or not (has a peer or not)
* If there is a peer, just let the event/query follow through normally
* If there is no peer, we find a pad to which to proxy it and return GST_PROBE_HANDLED if it succeeded (allowing the event/query to be properly returned to the initial called)

Note that this is definitely not thread-safe for the time being
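For readers who want to see roughly what such a probe looks like, here is a condensed sketch under the same assumptions as the patch description; find_linked_group_sinkpad() is a hypothetical stand-in for decodebin2's group lookup, and the probe would be attached with gst_pad_add_probe() using GST_PAD_PROBE_TYPE_EVENT_UPSTREAM | GST_PAD_PROBE_TYPE_QUERY_UPSTREAM.

```c
#include <gst/gst.h>

/* Hypothetical helper: returns a (reffed) sink pad belonging to the group
 * that is currently linked to the demuxer, or NULL. */
static GstPad *find_linked_group_sinkpad (gpointer group);

static GstPadProbeReturn
unlinked_group_probe (GstPad *sinkpad, GstPadProbeInfo *info, gpointer group)
{
  GstPad *proxy;
  gboolean handled = FALSE;

  /* If the pad still has a peer, the event/query can travel upstream
   * normally and we stay out of the way. */
  if (gst_pad_is_linked (sinkpad))
    return GST_PAD_PROBE_OK;

  proxy = find_linked_group_sinkpad (group);
  if (proxy == NULL)
    return GST_PAD_PROBE_OK;

  if (GST_PAD_PROBE_INFO_TYPE (info) & GST_PAD_PROBE_TYPE_QUERY_UPSTREAM) {
    /* Run the query against the upstream peer of the linked pad; the
     * answer ends up in the caller's query structure. */
    handled = gst_pad_peer_query (proxy, GST_PAD_PROBE_INFO_QUERY (info));
  } else if (GST_PAD_PROBE_INFO_TYPE (info) & GST_PAD_PROBE_TYPE_EVENT_UPSTREAM) {
    GstEvent *event = GST_PAD_PROBE_INFO_EVENT (info);

    /* Push a reference through the linked pad; on success drop the
     * original, since GST_PAD_PROBE_HANDLED makes the probe own it. */
    handled = gst_pad_push_event (proxy, gst_event_ref (event));
    if (handled)
      gst_event_unref (event);
  }

  gst_object_unref (proxy);

  /* HANDLED reports success to whoever sent the event/query. */
  return handled ? GST_PAD_PROBE_HANDLED : GST_PAD_PROBE_OK;
}
```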
With the above patch, the various queries/events do end up in the upstream elements even if the group isn't linked. The next phase I'm working on is ensuring some action is taken on downstream events, such as on FLUSH_START where we drain/remove all non-last groups and switch the last group to be the active one.
Created attachment 302846 [details] [review]
decodebin2: Forward event/queries for unlinked groups

When upstream events/queries reach sinkpads of unlinked groups (i.e. no longer linked to the upstream demuxer), this patch attempts to find the linked group and forward it upstream of that group.

This is done by adding upstream event/query probes on new group sinkpads and then:
* Checking if the pad is linked or not (has a peer or not)
* If there is a peer, just let the event/query follow through normally
* If there is no peer, we find a pad to which to proxy it and return GST_PROBE_HANDLED if it succeeded (allowing the event/query to be properly returned to the initial called)

Note that this is definitely not thread-safe for the time being
Created attachment 302847 [details] [review]
decodebin2: Handle flushing with multiple decode groups

When an upstream element wants to flush downstream, we need to take all chains/groups into consideration.

To that effect, when a FLUSH_START event is seen, after having it sent downstream we mark all those chains/groups as "drained" (as if they had seen a EOS event on the endpads).

When a FLUSH_STOP event is received, we check if we need to switch groups. This is done by checking if there are next groups. If so, we will switch over to the latest next_group. The actual switch will be done when that group is blocked.
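As a very loose sketch of the control flow described in the patch (the structures and helpers below only mimic decodebin2 internals and are hypothetical):

```c
#include <gst/gst.h>

/* Hypothetical stand-ins for decodebin2 internals: */
typedef struct { GList *next_groups; } DecodeBinState;
static void mark_all_groups_drained  (DecodeBinState *state);
static void schedule_switch_to_group (DecodeBinState *state, gpointer group);

static void
handle_flush_event (DecodeBinState *state, GstEvent *event)
{
  switch (GST_EVENT_TYPE (event)) {
    case GST_EVENT_FLUSH_START:
      /* After the event has gone downstream, treat every chain/group as if
       * its endpads had seen EOS so nothing keeps waiting for more data. */
      mark_all_groups_drained (state);
      break;
    case GST_EVENT_FLUSH_STOP:
      /* If newer groups were prepared while flushing, switch to the most
       * recent one; the actual switch happens once that group is blocked. */
      if (state->next_groups != NULL)
        schedule_switch_to_group (state, g_list_last (state->next_groups)->data);
      break;
    default:
      break;
  }
}
```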
Created attachment 302962 [details] [review]
decodebin2: Forward event/queries for unlinked groups

When upstream events/queries reach sinkpads of unlinked groups (i.e. no longer linked to the upstream demuxer), this patch attempts to find the linked group and forward it upstream of that group.

This is done by adding upstream event/query probes on new group sinkpads and then:
* Checking if the pad is linked or not (has a peer or not)
* If there is a peer, just let the event/query follow through normally
* If there is no peer, we find a pad to which to proxy it and return GST_PROBE_HANDLED if it succeeded (allowing the event/query to be properly returned to the initial called)

Note that this is definitely not thread-safe for the time being
Created attachment 302963 [details] [review]
decodebin2: Handle flushing with multiple decode groups

When an upstream element wants to flush downstream, we need to take all chains/groups into consideration.

To that effect, when a FLUSH_START event is seen, after having it sent downstream we mark all those chains/groups as "drained" (as if they had seen a EOS event on the endpads).

When a FLUSH_STOP event is received, we check if we need to switch groups. This is done by checking if there are next groups. If so, we will switch over to the latest next_group. The actual switch will be done when that group is blocked.
Created attachment 309329 [details] [review]
decodebin2: Forward event/queries for unlinked groups

When upstream events/queries reach sinkpads of unlinked groups (i.e. no longer linked to the upstream demuxer), this patch attempts to find the linked group and forward it upstream of that group.

This is done by adding upstream event/query probes on new group sinkpads and then:
* Checking if the pad is linked or not (has a peer or not)
* If there is a peer, just let the event/query follow through normally
* If there is no peer, we find a pad to which to proxy it and return GST_PROBE_HANDLED if it succeeded (allowing the event/query to be properly returned to the initial called)

Note that this is definitely not thread-safe for the time being
Created attachment 309330 [details] [review]
decodebin2: Handle flushing with multiple decode groups

When an upstream element wants to flush downstream, we need to take all chains/groups into consideration.

To that effect, when a FLUSH_START event is seen, after having it sent downstream we mark all those chains/groups as "drained" (as if they had seen a EOS event on the endpads).

When a FLUSH_STOP event is received, we check if we need to switch groups. This is done by checking if there are next groups. If so, we will switch over to the latest next_group. The actual switch will be done when that group is blocked.
commit eaf9ca90c7000f5c7156d13969b9135f7273fa79
Author: Edward Hervey <edward@centricular.com>
Date: Mon May 4 11:19:28 2015 +0200

decodebin2: Handle flushing with multiple decode groups

When an upstream element wants to flush downstream, we need to take all chains/groups into consideration.

To that effect, when a FLUSH_START event is seen, after having it sent downstream we mark all those chains/groups as "drained" (as if they had seen a EOS event on the endpads).

When a FLUSH_STOP event is received, we check if we need to switch groups. This is done by checking if there are next groups. If so, we will switch over to the latest next_group. The actual switch will be done when that group is blocked.

https://bugzilla.gnome.org/show_bug.cgi?id=606382

commit 2d3743e37dd13d2211e4ec2c596c5a420e944d18
Author: Edward Hervey <edward@centricular.com>
Date: Wed Apr 29 15:56:39 2015 +0200

decodebin2: Forward event/queries for unlinked groups

When upstream events/queries reach sinkpads of unlinked groups (i.e. no longer linked to the upstream demuxer), this patch attempts to find the linked group and forward it upstream of that group.

This is done by adding upstream event/query probes on new group sinkpads and then:
* Checking if the pad is linked or not (has a peer or not)
* If there is a peer, just let the event/query follow through normally
* If there is no peer, we find a pad to which to proxy it and return GST_PROBE_HANDLED if it succeeded (allowing the event/query to be properly returned to the initial called)

Note that this is definitely not thread-safe for the time being

https://bugzilla.gnome.org/show_bug.cgi?id=606382
This is still a problem, you can easily reproduce it with the HLS seek tests in gst-validate:

> validate.hls.playback.seek_with_stop.hls_bibbop
> validate.hls.playback.seek_forward.hls_bibbop
> validate.hls.playback.seek_backward.hls_bibbop

The seek events are returning FALSE, although the seek handler in hlsdemux returns TRUE.
The problem here is that hlsdemux switches to a new bitrate at the same time as we send a seek. So the seek event goes to hlsdemux, returns TRUE, goes up to playbin. Playbin sees that elements changed, does a gst_iterator_resync() and sends the event again. But this time the streamsynchronizer pad at playsink is not linked because the bitrate switch is currently happening.
Related to this is also

commit 53f135cec7ac63b8951797568b9e8e153bd6ad2e
Author: Sebastian Dröge <sebastian@centricular.com>
Date: Fri Oct 23 12:02:28 2015 +0300

playbin: Send upstream events directly to playsink

Send event directly to playsink instead of letting GstBin iterate over all sink elements. The latter might send the event multiple times in case the SEEK causes a reconfiguration of the pipeline, as can easily happen with adaptive streaming demuxers.

What would then happen is that the iterator would be reset, we send the event again, and on the second time it will fail in the majority of cases because the pipeline is still being reconfigured

Doesn't fix it, but makes it better.
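The principle behind that change, sketched from the application side rather than inside playbin (function and parameter names below are made up for the example): hand the seek to one well-known element instead of letting GstBin's default send_event iterate over all sinks and possibly resend it during a reconfiguration.

```c
#include <gst/gst.h>

/* Illustration of the "send directly to one element" idea; not the playbin
 * code. gst_element_send_event() takes ownership of the event. */
static gboolean
seek_element_directly (GstElement *target, gdouble rate, gint64 position)
{
  GstEvent *seek = gst_event_new_seek (rate, GST_FORMAT_TIME,
      GST_SEEK_FLAG_FLUSH | GST_SEEK_FLAG_ACCURATE,
      GST_SEEK_TYPE_SET, position,
      GST_SEEK_TYPE_NONE, GST_CLOCK_TIME_NONE);

  /* The event goes only to "target", so a pipeline reconfiguration cannot
   * cause it to be delivered twice via a resynced sink iterator. */
  return gst_element_send_event (target, seek);
}
```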
This fixes things even more, enough to enable the HLS seeking tests in gst-validate again.

commit 36b80edb7263118467dfcaee3923f7c964ae6bc8
Author: Sebastian Dröge <sebastian@centricular.com>
Date: Tue Oct 27 15:44:06 2015 +0200

decodebin: Send SEEK events directly to adaptive streaming demuxers

This makes sure that they will always get SEEK events, even if we're currently in the middle of a group switch (i.e. switching to another representation/bitrate/etc).

https://bugzilla.gnome.org/show_bug.cgi?id=606382
Created attachment 315338 [details]
logs showing hang on seek

I'm currently seeing a problem where gst_element_seek_simple(decodebin) is not returning.

log.2 shows that hlsdemux receives the seek event fine, but that flush-start is not propagated to any groups within the decodebin. This is because hlsdemux has just switched bitrates, and the src pad probe has yet to be put in place by hlsdemux. A quick hack to work around this is for hlsdemux to ensure the streaming thread is stopped before sending flush-start downstream.

With the quick hack in place, I can then produce the scenario in log.3. Search for WARNING and CRITICAL. In this case, flush-start is making its way into a group which is in the process of being created, and causing grief.

I'm not sure why either scenario causes a hang, but in any case it's obviously wrong. Any suggestions?
I don't think this can be fixed in any meaningful and reliable way in the existing decodebin design. There are already ideas how to solve it with a new design, see Edward's talk from the GStreamer Conference this year :)
Bummer...

I've put in a bit of a hack to try and avoid the issue. Seems to be working so far.

What I try to do is avoid sending a seek in whilst decodebin is creating a new group as a result of a bitrate switch. The way I've done this is to add a new signal to decodebin, which I've called pad-activated. It fires after gst_decode_pad_activate() is called. I know how many pads I expect to be 'activated' (it's the same as the number of src pads created by the tsdemux), so I can tell when the new group has been built by counting the number of times pad-activated has fired.

hlsdemux emits a message to indicate when it's switching to a new playlist. So, between these two things, I have a slightly fragile mechanism for detecting when decodebin is busy creating a new decode group. There is still a race with hlsdemux - the streaming thread needs to be paused before sending the seek event so we're not racing with the beginning of a bitrate switch.

This is a bit horrible, but is the best I can think of, seeing as not allowing seeks isn't an option for me. Another avenue I haven't investigated is to determine why a seek request can hang at the moment. If it instead failed cleanly, that's something I could deal with.

If anyone has any suggestions for a nicer workaround, I'd be grateful ;)
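For concreteness, a sketch of the application-side bookkeeping for that workaround. It assumes the custom, non-upstream "pad-activated" signal described above (with a guessed (decodebin, pad) signature), plus knowledge of how many pads the demuxer will expose; all names are hypothetical. It would be wired up with g_signal_connect (decodebin, "pad-activated", G_CALLBACK (on_pad_activated), &tracker), and the application defers its seek calls while group_building is TRUE.

```c
#include <gst/gst.h>

/* Hypothetical tracker for the workaround described above. */
typedef struct {
  guint expected_pads;     /* src pads the demuxer will create for the group */
  guint activated_pads;    /* "pad-activated" signals seen for this group */
  gboolean group_building; /* TRUE while decodebin builds the new group */
} SwitchTracker;

/* Assumed signature of the custom signal patched into decodebin. */
static void
on_pad_activated (GstElement *decodebin, GstPad *pad, SwitchTracker *t)
{
  t->activated_pads++;
  if (t->activated_pads >= t->expected_pads)
    t->group_building = FALSE;   /* new decode group complete, seeking is safe */
}

/* Called when hlsdemux posts its "switching playlist" message. */
static void
on_playlist_switch (SwitchTracker *t)
{
  t->activated_pads = 0;
  t->group_building = TRUE;      /* hold back seeks until the group is built */
}
```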
The hang might happen because we're flushing a decode group that is currently draining. With that, it would never go EOS anymore and we would wait forever for it to drain before switching to the next group. There's a pad probe that should prevent it, but maybe that's broken.
And this bug can nicely die now that we have decodebin3 which never ends up in that situation.