After an evaluation, GNOME has moved from Bugzilla to GitLab. Learn more about GitLab.
No new issues can be reported in GNOME Bugzilla anymore.
To report an issue in a GNOME project, go to GNOME GitLab.
Do not go to GNOME Gitlab for: Bluefish, Doxygen, GnuCash, GStreamer, java-gnome, LDTP, NetworkManager, Tomboy.
Bug 796382 - flvmux: (too) easily produces backwards flv-timestamps
flvmux: (too) easily produces backwards flv-timestamps
Status: RESOLVED OBSOLETE
Product: GStreamer
Classification: Platform
Component: gst-plugins-good
git master
Other All
: Normal blocker
: git master
Assigned To: GStreamer Maintainers
GStreamer Maintainers
: 796379 796380 797044 (view as bug list)
Depends on:
Blocks:
 
 
Reported: 2018-05-24 10:11 UTC by Håvard Graff (hgr)
Modified: 2018-11-03 15:30 UTC
See Also:
GNOME target: ---
GNOME version: ---


Attachments
test (5.58 KB, patch)
2018-05-24 10:11 UTC, Håvard Graff (hgr)
none Details | Review
flvmux: warn and override backwards timestamps (1.45 KB, patch)
2018-05-24 13:42 UTC, Håvard Graff (hgr)
none Details | Review
flvmux: always wait for both an audio and a video buffer before processing (12.97 KB, patch)
2018-05-29 09:51 UTC, Håvard Graff (hgr)
none Details | Review
flvmux: Force timestamps to always be increasing (1.85 KB, patch)
2018-10-27 18:28 UTC, Olivier Crête
none Details | Review
flvmux: Test that timestamps are always increasing (4.87 KB, patch)
2018-10-27 18:28 UTC, Olivier Crête
none Details | Review

Description Håvard Graff (hgr) 2018-05-24 10:11:49 UTC
Created attachment 372380 [details] [review]
test

Running the new GstAggregator-based flvmuxer in our system gave some weird failures.

Turns out it was caused by the flv-timestamps going backwards, and with RTMP generating delta timestamps, these would go negative and cause the timestamps
to overflow, producing timestamps thousands of hours into the future!

With a bit of hassle I was able to produce a test that can be run both with the
old and new flvmuxer, and that will produce a "correct" sequence with the old implmentation, and a "backwards-timestamp" sequence with the new, basing it on
the real data I saw in our tests.

As for possible fixes I think first it would be a good idea to determine if
producing backwards timestamps in the flv-stream is something we want to avoid
at all costs, or if this can be expected to happen.

Anyway, some sort of sanity internally in the flvmuxer to at least notify the
user of backwards timestamps (or perhaps a property to not allow it), would probably be a good idea, as I expect many people upgrading the muxer would experience similar things to what we did.
Comment 1 Nicolas Dufresne (ndufresne) 2018-05-24 12:16:06 UTC
*** Bug 796380 has been marked as a duplicate of this bug. ***
Comment 2 Nicolas Dufresne (ndufresne) 2018-05-24 12:16:17 UTC
*** Bug 796379 has been marked as a duplicate of this bug. ***
Comment 3 Håvard Graff (hgr) 2018-05-24 12:17:50 UTC
It is worth adding that if you add some latency using the latency-property (in the test) the test *will* pass.

Hence I think there exists ample opportunities to work around this issue by using a bit more latency, but I still think it critical for it to be possible to produce a negative timestamp-delta.
Comment 4 Håvard Graff (hgr) 2018-05-24 12:20:17 UTC
(In reply to Nicolas Dufresne (ndufresne) from comment #1)
> Can you clarify, FLV format is delta based, so backward timestamp are not
> possible. Do you mean the timestamps on the buffers are going backward ? Do
> you mean PTS or DTS ?

I mean the flv-timestamp (24 bits) in the flv-header are going backwards, hence the blocker. (because, yes, deltas are calculated for rtmp)
Comment 5 Nicolas Dufresne (ndufresne) 2018-05-24 12:22:24 UTC
Review of attachment 372380 [details] [review]:

Indeed, there is no good reason this would not be going forward.

::: tests/check/elements/flvmux.c
@@ +645,3 @@
+    guint32 pts;
+    gst_buffer_map (buf, &map, GST_MAP_READ);
+    pts = GST_READ_UINT24_BE (map.data + 4);

You should maybe add a comment on why you can ignore the TimestampExtented 8 bit.
Comment 6 Nicolas Dufresne (ndufresne) 2018-05-24 12:23:41 UTC
* backward
Comment 7 Nicolas Dufresne (ndufresne) 2018-05-24 12:24:42 UTC
(In reply to Håvard Graff (hgr) from comment #3)
> It is worth adding that if you add some latency using the latency-property
> (in the test) the test *will* pass.
> 
> Hence I think there exists ample opportunities to work around this issue by
> using a bit more latency, but I still think it critical for it to be
> possible to produce a negative timestamp-delta.

Very interesting. So whenever data is dropped, there is this bad side effect. I guess if the source isn't live, it's also ok ?
Comment 8 Håvard Graff (hgr) 2018-05-24 13:42:19 UTC
Created attachment 372387 [details] [review]
flvmux: warn and override backwards timestamps

Here is a possible patch for "fixing" this. Not great stuff but does the job.
Comment 9 Nicolas Dufresne (ndufresne) 2018-05-25 12:58:36 UTC
Review of attachment 372387 [details] [review]:

This is not really a fix. I think we get in this situation because the mixer let data go, on timeout, and the some "late" data comes in, and get pushed. If we want to use that model, late is late, data need to be dropped then. When you add latency, you move the timeout later, allowing a certain delay between a/v. So, weigher we follow the paradigm, and massively drop, or just disable timeout mode, but hacking the timestamp after, I don't agree.
Comment 10 Håvard Graff (hgr) 2018-05-28 10:00:34 UTC
(In reply to Nicolas Dufresne (ndufresne) from comment #9)
> Review of attachment 372387 [details] [review] [review]:
> 
> This is not really a fix. I think we get in this situation because the mixer
> let data go, on timeout, and the some "late" data comes in, and get pushed.
> If we want to use that model, late is late, data need to be dropped then.
> When you add latency, you move the timeout later, allowing a certain delay
> between a/v. So, weigher we follow the paradigm, and massively drop, or just
> disable timeout mode, but hacking the timestamp after, I don't agree.

I agree. This causes more problems than it solves. Not great stuff indeed.

Now, picture the following scenario:

We have requested an audio and a video pad on flvmux. However, due to some various application/device delays, the audio stream starts flowing 2 seconds before the video can get going. What should the muxer do now?

If we have queues in front of the flvmuxer it won't start doing anything with the audio until the video-caps arrives, and now we will get into huge muxing-trouble (with todays implementation) with the timestamp jumping back and forth between the buffer audio-stream (starting from dts 0) and the video stream (starting from dts 2000 (ms)).

There are a few options available here:

1. Drop all the audio leading up the the point where the video starts.
This will ensure sync from that point onward but at the cost of some missing audio.

2. Start processing the audio only first for 2 seconds, effectively introducing a 2 second delay on the stream.

Obviously option 1 would work better for "live" scenarios, and 2 for recording type scenarios.

Our use-case calls for 1, but I am not sure how this is best implemented.

As for 2 I think this is supposed to work, but testing shows that a video-buffer might be nominated in gst_flv_mux_find_best_pad if a audio-buffer does not reach the flv-sinkpad in time (racy).

I will write a simple test demonstrating this and then we can discuss the best way forward.
Comment 11 Håvard Graff (hgr) 2018-05-28 11:00:48 UTC
To elaborate on why 2. is not always working.

When calling gst_aggregator_wait_and_check, this will return FALSE if not all pads are ready, but in the case of a timeout (which is very likely given that the times that are being process (old audio) are in the past), wait_and_check will return TRUE, even if not all pads have data yet (the gstqueue in front is basically in the middle of a push), making ->aggregate() being called with only a video-buffer, advancing the "flv-pts" to the current time and making all the other old audio very problematic (old timestamps etc).
Comment 12 Håvard Graff (hgr) 2018-05-28 12:53:05 UTC
I pulled a bit more on this thread to find out why flvmux was asking for a timeout at all, and I think basically the logic in the get_times implementation are wrong. I will create a test and suggested fix.
Comment 13 Håvard Graff (hgr) 2018-05-29 09:51:40 UTC
Created attachment 372461 [details] [review]
flvmux: always wait for both an audio and a video buffer  before processing

his makes the behavior almost identical to the old flvmuxer, and removes
quite a few racy behaviors that exists today.

The most interesting one we have been seeing has been when audio
comes first into the muxer (say 2 seconds worth), and then video
starts flowing after that.

The audio *should* then always be processed first (since it has lower
timestamps), but due to the wait the waiting for buffers is working,
sometimes a video-buffer would sneak in (because the audio-buffer did
not reached the flvmuxpad in time from the queue it was residing in,
and then only a video-buffer was available when finding the best pad) and
then you get havoc with backwards timestamps etc etc.

Effectively this means the flvmuxer will now never wait on the clock
at all, but rather wait for at least one buffer on each pad.

Whether we would like a "live-mode" that would wait on the clock at some
point could probably be discussed, but then more care need to be given
about what to do with late buffer etc (probably dropping?).

AFAICT this patch makes the "new" flvmuxer act almost identical with the
"old" one, and that is probably a good thing for now.
Comment 14 Nicolas Dufresne (ndufresne) 2018-05-29 11:07:09 UTC
(In reply to Håvard Graff (hgr) from comment #13)
> Whether we would like a "live-mode" that would wait on the clock at some
> point could probably be discussed, but then more care need to be given
> about what to do with late buffer etc (probably dropping?).

Dropping and, for video, requesting keyframes, hoping there is an encoder upstream.

> 
> AFAICT this patch makes the "new" flvmuxer act almost identical with the
> "old" one, and that is probably a good thing for now.

Indeed, the implementation is not better then using collect pad with that patch. It probably also breaks the reason it was ported in the first place.
Comment 15 Håvard Graff (hgr) 2018-05-29 11:22:14 UTC
> > AFAICT this patch makes the "new" flvmuxer act almost identical with the
> > "old" one, and that is probably a good thing for now.
> 
> Indeed, the implementation is not better then using collect pad with that
> patch. It probably also breaks the reason it was ported in the first place.

But it is arguably better than reverting the GstAggregator-port all-together?
Comment 16 Nicolas Dufresne (ndufresne) 2018-05-29 12:46:22 UTC
Assuming the FLV spec forbids a certain delay debtween audio / video, which is what makes it looks like backward timestamp. This as been simply assumed based so far. It's also bad, since the idea behind using aggregator is now completely broken. I'd like Olivier's input, this patch seems backward, and a little too focus on one test case. More information need to be gathered.
Comment 17 Håvard Graff (hgr) 2018-05-29 13:30:08 UTC
(In reply to Nicolas Dufresne (ndufresne) from comment #16)
> Assuming the FLV spec forbids a certain delay debtween audio / video, which
> is what makes it looks like backward timestamp. This as been simply assumed
> based so far. It's also bad, since the idea behind using aggregator is now
> completely broken. I'd like Olivier's input, this patch seems backward, and
> a little too focus on one test case. More information need to be gathered.

No one is assuming anything. I am very familiar with the specs, and they *do* allow backwards timestamps, because of seeking: ref: http://wwwimages.adobe.com/www.adobe.com/content/dam/acom/en/devnet/rtmp/pdf/rtmp_specification_1.0.pdf

5.3.1.2.1. Type 0
 Type 0 chunk headers are 11 bytes long. This type MUST be used at
 the start of a chunk stream, and whenever the stream timestamp goes
 backward (e.g., because of a backward seek).

But what the specs says and what implementations out there are doing is of course two very different things (as you would know).

What I am pointing out is a very large behavioral difference that will cause chaos for many users. It certainly did for us, all our RTMP related tests went blood red when upgrading to 1.18.

And it is *not* just the fact that timestamps will sometimes go backwards that is the problem here (as you would have understood having read all my previous comments). The fact that a video buffer "from the future" can sneak in to a stream of audio-buffer being processed is very bad, and a side-effect of timing out a wait for a time in the past.

Keep in mind this is not flvmux2, this is good old flvmux and hence I would expect the behavior to stay the same. Moving to GstAggregator is great because collectpads are very prone do deadlocking, and aggregator are much better in that respect.

Now, if we were to add a new "live-feature" to the flvmux that would be great, and the aggregator would be perfect for that, but that is *new* functionality, and should be treated as such (ex. opt in/out with a property) which is different from a change that on the surface should *not* be functionally different.

Reverting to the old flvmux would of course be an option, but I think there are good things in the future for a more live-based flvmux and would definitely like to see this moving forwards, and also removing collectpads from the equation is a very good thing. (they have been our single biggest source of deadlocks)
Comment 18 Håvard Graff (hgr) 2018-05-29 14:00:23 UTC
I think a possible fix for a "live-mode" would be to, in the case of a missing buffer on the audio or video pad, to wait for (now + latency), and not what the current _get_next_time_for_segment is doing. That way you are delaying as long as you are allowed to waiting for another buffer to arrive, and you won't get waits that are in the past, causing the racy behavior we are seeing. And of course, anything that comes in late (ts + latency < now) needs to be dropped.
Comment 19 Nicolas Dufresne (ndufresne) 2018-05-29 14:29:55 UTC
This is already what aggregator is supposedly doing. I haven't review the get_times() function, it's normally all handled by the base class. To demonstrate that, you have set a latency on the muxer, and that "fixed" it, since the timeout is calculated as:

    start = gst_aggregator_get_next_time (self);
    time = base_time + start;
    time += latency;

Where latency is (peer_latency_min + latency + sub_latency_min). Can you check if the latency query did work in your case ? (somehow it must work, since otherwise the aggregator would not have gone live, and then there would be no timeout).
Comment 20 Olivier Crête 2018-10-27 18:28:38 UTC
Created attachment 374070 [details] [review]
flvmux: Force timestamps to always be increasing
Comment 21 Olivier Crête 2018-10-27 18:28:44 UTC
Created attachment 374071 [details] [review]
flvmux: Test that timestamps are always increasing

Decreasing timestamps break rtmpsink.

With contributions from Olivier Crête.
Comment 22 Olivier Crête 2018-10-27 18:37:50 UTC
@hgr Here is a simpler patch that just prevent the DTS from going backwards. But I'm worried it may break streams where PTS != DTS with B-frames. Although in theory it should work. Please let me know if this works for your use-cases ?
Comment 23 Olivier Crête 2018-10-27 18:38:45 UTC
*** Bug 797044 has been marked as a duplicate of this bug. ***
Comment 24 GStreamer system administrator 2018-11-03 15:30:07 UTC
-- GitLab Migration Automatic Message --

This bug has been migrated to freedesktop.org's GitLab instance and has been closed from further activity.

You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.freedesktop.org/gstreamer/gst-plugins-good/issues/474.