GNOME Bugzilla – Bug 796382
flvmux: (too) easily produces backwards flv-timestamps
Last modified: 2018-11-03 15:30:07 UTC
Created attachment 372380 [details] [review] test Running the new GstAggregator-based flvmuxer in our system gave some weird failures. Turns out it was caused by the flv-timestamps going backwards, and with RTMP generating delta timestamps, these would go negative and cause the timestamps to overflow, producing timestamps thousands of hours into the future! With a bit of hassle I was able to produce a test that can be run both with the old and new flvmuxer, and that will produce a "correct" sequence with the old implmentation, and a "backwards-timestamp" sequence with the new, basing it on the real data I saw in our tests. As for possible fixes I think first it would be a good idea to determine if producing backwards timestamps in the flv-stream is something we want to avoid at all costs, or if this can be expected to happen. Anyway, some sort of sanity internally in the flvmuxer to at least notify the user of backwards timestamps (or perhaps a property to not allow it), would probably be a good idea, as I expect many people upgrading the muxer would experience similar things to what we did.
*** Bug 796380 has been marked as a duplicate of this bug. ***
*** Bug 796379 has been marked as a duplicate of this bug. ***
It is worth adding that if you add some latency using the latency-property (in the test) the test *will* pass. Hence I think there exists ample opportunities to work around this issue by using a bit more latency, but I still think it critical for it to be possible to produce a negative timestamp-delta.
(In reply to Nicolas Dufresne (ndufresne) from comment #1) > Can you clarify, FLV format is delta based, so backward timestamp are not > possible. Do you mean the timestamps on the buffers are going backward ? Do > you mean PTS or DTS ? I mean the flv-timestamp (24 bits) in the flv-header are going backwards, hence the blocker. (because, yes, deltas are calculated for rtmp)
Review of attachment 372380 [details] [review]: Indeed, there is no good reason this would not be going forward. ::: tests/check/elements/flvmux.c @@ +645,3 @@ + guint32 pts; + gst_buffer_map (buf, &map, GST_MAP_READ); + pts = GST_READ_UINT24_BE (map.data + 4); You should maybe add a comment on why you can ignore the TimestampExtented 8 bit.
* backward
(In reply to Håvard Graff (hgr) from comment #3) > It is worth adding that if you add some latency using the latency-property > (in the test) the test *will* pass. > > Hence I think there exists ample opportunities to work around this issue by > using a bit more latency, but I still think it critical for it to be > possible to produce a negative timestamp-delta. Very interesting. So whenever data is dropped, there is this bad side effect. I guess if the source isn't live, it's also ok ?
Created attachment 372387 [details] [review] flvmux: warn and override backwards timestamps Here is a possible patch for "fixing" this. Not great stuff but does the job.
Review of attachment 372387 [details] [review]: This is not really a fix. I think we get in this situation because the mixer let data go, on timeout, and the some "late" data comes in, and get pushed. If we want to use that model, late is late, data need to be dropped then. When you add latency, you move the timeout later, allowing a certain delay between a/v. So, weigher we follow the paradigm, and massively drop, or just disable timeout mode, but hacking the timestamp after, I don't agree.
(In reply to Nicolas Dufresne (ndufresne) from comment #9) > Review of attachment 372387 [details] [review] [review]: > > This is not really a fix. I think we get in this situation because the mixer > let data go, on timeout, and the some "late" data comes in, and get pushed. > If we want to use that model, late is late, data need to be dropped then. > When you add latency, you move the timeout later, allowing a certain delay > between a/v. So, weigher we follow the paradigm, and massively drop, or just > disable timeout mode, but hacking the timestamp after, I don't agree. I agree. This causes more problems than it solves. Not great stuff indeed. Now, picture the following scenario: We have requested an audio and a video pad on flvmux. However, due to some various application/device delays, the audio stream starts flowing 2 seconds before the video can get going. What should the muxer do now? If we have queues in front of the flvmuxer it won't start doing anything with the audio until the video-caps arrives, and now we will get into huge muxing-trouble (with todays implementation) with the timestamp jumping back and forth between the buffer audio-stream (starting from dts 0) and the video stream (starting from dts 2000 (ms)). There are a few options available here: 1. Drop all the audio leading up the the point where the video starts. This will ensure sync from that point onward but at the cost of some missing audio. 2. Start processing the audio only first for 2 seconds, effectively introducing a 2 second delay on the stream. Obviously option 1 would work better for "live" scenarios, and 2 for recording type scenarios. Our use-case calls for 1, but I am not sure how this is best implemented. As for 2 I think this is supposed to work, but testing shows that a video-buffer might be nominated in gst_flv_mux_find_best_pad if a audio-buffer does not reach the flv-sinkpad in time (racy). I will write a simple test demonstrating this and then we can discuss the best way forward.
To elaborate on why 2. is not always working. When calling gst_aggregator_wait_and_check, this will return FALSE if not all pads are ready, but in the case of a timeout (which is very likely given that the times that are being process (old audio) are in the past), wait_and_check will return TRUE, even if not all pads have data yet (the gstqueue in front is basically in the middle of a push), making ->aggregate() being called with only a video-buffer, advancing the "flv-pts" to the current time and making all the other old audio very problematic (old timestamps etc).
I pulled a bit more on this thread to find out why flvmux was asking for a timeout at all, and I think basically the logic in the get_times implementation are wrong. I will create a test and suggested fix.
Created attachment 372461 [details] [review] flvmux: always wait for both an audio and a video buffer before processing his makes the behavior almost identical to the old flvmuxer, and removes quite a few racy behaviors that exists today. The most interesting one we have been seeing has been when audio comes first into the muxer (say 2 seconds worth), and then video starts flowing after that. The audio *should* then always be processed first (since it has lower timestamps), but due to the wait the waiting for buffers is working, sometimes a video-buffer would sneak in (because the audio-buffer did not reached the flvmuxpad in time from the queue it was residing in, and then only a video-buffer was available when finding the best pad) and then you get havoc with backwards timestamps etc etc. Effectively this means the flvmuxer will now never wait on the clock at all, but rather wait for at least one buffer on each pad. Whether we would like a "live-mode" that would wait on the clock at some point could probably be discussed, but then more care need to be given about what to do with late buffer etc (probably dropping?). AFAICT this patch makes the "new" flvmuxer act almost identical with the "old" one, and that is probably a good thing for now.
(In reply to Håvard Graff (hgr) from comment #13) > Whether we would like a "live-mode" that would wait on the clock at some > point could probably be discussed, but then more care need to be given > about what to do with late buffer etc (probably dropping?). Dropping and, for video, requesting keyframes, hoping there is an encoder upstream. > > AFAICT this patch makes the "new" flvmuxer act almost identical with the > "old" one, and that is probably a good thing for now. Indeed, the implementation is not better then using collect pad with that patch. It probably also breaks the reason it was ported in the first place.
> > AFAICT this patch makes the "new" flvmuxer act almost identical with the > > "old" one, and that is probably a good thing for now. > > Indeed, the implementation is not better then using collect pad with that > patch. It probably also breaks the reason it was ported in the first place. But it is arguably better than reverting the GstAggregator-port all-together?
Assuming the FLV spec forbids a certain delay debtween audio / video, which is what makes it looks like backward timestamp. This as been simply assumed based so far. It's also bad, since the idea behind using aggregator is now completely broken. I'd like Olivier's input, this patch seems backward, and a little too focus on one test case. More information need to be gathered.
(In reply to Nicolas Dufresne (ndufresne) from comment #16) > Assuming the FLV spec forbids a certain delay debtween audio / video, which > is what makes it looks like backward timestamp. This as been simply assumed > based so far. It's also bad, since the idea behind using aggregator is now > completely broken. I'd like Olivier's input, this patch seems backward, and > a little too focus on one test case. More information need to be gathered. No one is assuming anything. I am very familiar with the specs, and they *do* allow backwards timestamps, because of seeking: ref: http://wwwimages.adobe.com/www.adobe.com/content/dam/acom/en/devnet/rtmp/pdf/rtmp_specification_1.0.pdf 5.3.1.2.1. Type 0 Type 0 chunk headers are 11 bytes long. This type MUST be used at the start of a chunk stream, and whenever the stream timestamp goes backward (e.g., because of a backward seek). But what the specs says and what implementations out there are doing is of course two very different things (as you would know). What I am pointing out is a very large behavioral difference that will cause chaos for many users. It certainly did for us, all our RTMP related tests went blood red when upgrading to 1.18. And it is *not* just the fact that timestamps will sometimes go backwards that is the problem here (as you would have understood having read all my previous comments). The fact that a video buffer "from the future" can sneak in to a stream of audio-buffer being processed is very bad, and a side-effect of timing out a wait for a time in the past. Keep in mind this is not flvmux2, this is good old flvmux and hence I would expect the behavior to stay the same. Moving to GstAggregator is great because collectpads are very prone do deadlocking, and aggregator are much better in that respect. Now, if we were to add a new "live-feature" to the flvmux that would be great, and the aggregator would be perfect for that, but that is *new* functionality, and should be treated as such (ex. opt in/out with a property) which is different from a change that on the surface should *not* be functionally different. Reverting to the old flvmux would of course be an option, but I think there are good things in the future for a more live-based flvmux and would definitely like to see this moving forwards, and also removing collectpads from the equation is a very good thing. (they have been our single biggest source of deadlocks)
I think a possible fix for a "live-mode" would be to, in the case of a missing buffer on the audio or video pad, to wait for (now + latency), and not what the current _get_next_time_for_segment is doing. That way you are delaying as long as you are allowed to waiting for another buffer to arrive, and you won't get waits that are in the past, causing the racy behavior we are seeing. And of course, anything that comes in late (ts + latency < now) needs to be dropped.
This is already what aggregator is supposedly doing. I haven't review the get_times() function, it's normally all handled by the base class. To demonstrate that, you have set a latency on the muxer, and that "fixed" it, since the timeout is calculated as: start = gst_aggregator_get_next_time (self); time = base_time + start; time += latency; Where latency is (peer_latency_min + latency + sub_latency_min). Can you check if the latency query did work in your case ? (somehow it must work, since otherwise the aggregator would not have gone live, and then there would be no timeout).
Created attachment 374070 [details] [review] flvmux: Force timestamps to always be increasing
Created attachment 374071 [details] [review] flvmux: Test that timestamps are always increasing Decreasing timestamps break rtmpsink. With contributions from Olivier Crête.
@hgr Here is a simpler patch that just prevent the DTS from going backwards. But I'm worried it may break streams where PTS != DTS with B-frames. Although in theory it should work. Please let me know if this works for your use-cases ?
*** Bug 797044 has been marked as a duplicate of this bug. ***
-- GitLab Migration Automatic Message -- This bug has been migrated to freedesktop.org's GitLab instance and has been closed from further activity. You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.freedesktop.org/gstreamer/gst-plugins-good/issues/474.