GNOME Bugzilla – Bug 754230
qtdemux: support sparse time ranges in qtdemux without needing a seek for MSE
Last modified: 2017-03-22 17:50:04 UTC
This is a follow-up of https://bugzilla.gnome.org/show_bug.cgi?id=754222#c3

We're using qtdemux in WebKit to process data coming from Media Source Extensions[1] appends. At appsrc push time, we don't have any info about what time range the data we're appending belongs to; we rely on what qtdemux detects. When our tests and use cases append data linearly from the beginning, qtdemux keeps timing info internally and generates the proper timestamps without problems.

However, there are problems for sparse ranges. If the user seeks forward and the upper layer (out of our control) decides to append [2:00, 2:30] right after a former append of [0:00, 1:00], then the internal demuxer state keeps generating PTSs in the range [1:00, 1:30]. Note that in our layer we don't really know the time range being appended, so we can't hint appsrc in any way.

We have a patch[2] for qtdemux_parse_trun() that uses decode_ts (which comes from the TFDT atom) if it's significantly higher than the time in the last sample. This rule allows us to blindly feed sparse ranges and have qtdemux generate the right times.

Is it wrong to rely on the TFDT as much as possible? What alternatives do we have to support sparse ranges in the appended data?

[1] https://w3c.github.io/media-source/
[2] http://pastebin.com/612D3kyU

Sebastian Dröge (slomo) asked: Why don't you know the timestamps on your side, i.e. that you are appending [2:00, 2:30] now instead of [1:00, 1:30]? How would you distinguish these two cases if you don't know the timestamps?
We don't know the timestamps because the MSE append()[1] API only supports binary data. This API is ultimately used from JavaScript by the web page, completely out of WebKit's control.

This is a brief summary of how the spec expects things to work: you append some raw binary data one or more times. That raw binary data can carry an "initialization segment" (headers) or a "media segment" (actual audio/video samples). The only information that the user of append() can provide is the stream duration ([0, infinity] by default). The "segment parser" (i.e. qtdemux in our case) must be able to generate accurate timestamps based on the info provided by the headers. As a result, some buffered ranges are generated, indicating what collection of samples has been detected and is now owned by the MSE layer, ready to be used for playback if needed.

Some particular cases:

- If you append the same raw data twice, the buffered ranges don't change, because the same samples (PTS) are generated the second time and they overwrite the existing ones.
- You should be able to append raw data corresponding to [0, 1:00] and [2:00, 2:30] and MSE should generate buffered ranges accordingly by itself.

Other implementations not based on GStreamer (e.g. Chromium) have their own "ChunkDemuxer" that analyzes the headers and reports the times stored there. In the case of GStreamer, this is a task for qtdemux. I hope to have clarified the use case a little bit.

[1] https://w3c.github.io/media-source/#widl-SourceBuffer-appendBuffer-void-ArrayBuffer-data
Basically you want a seek to happen without telling GStreamer. You want every segment, regardless of what it is, to simply play one after the other. You most likely need code in your application to detect the induced discontinuity and adapt the segment to hide it (so that the continuous timing is preserved). Then you'll probably need more work to figure out how to translate the position query. It's not really a task for qtdemux, but for your application wrapper.

When I first looked at MSE and WebKit, it was clear to me that to be correct, one would need to parse/demux the received fragments in order to gather the list of elementary streams (and reduce the delay for stream switches), and do that separately from playback. This elementary stream list will give you all the data needed to decide how to feed and set up the playback block. Blindly using appsrc before playbin for such a specialized use case does not seem like a robust approach to me.
I have a patch in WebKit Bugzilla that parses the media and would make it possible to know the actual positions (and whether something is a keyframe or not, etc.). I agree with Nicolas that this is something that should happen at the application layer, so let's close this here. If you do special things from the application, you need to do special things from the application to make that work. https://bugs.webkit.org/show_bug.cgi?id=140078
Thank you both for your suggestions. Just to let you know: I've been working for some time on top of the WebKit code pointed to by Sebastian, mainly making the YouTube conformance tests [1] pass, with a great deal of success. I even got YouTube TV [2] working... until I bumped into the seek feature and realized some limitations of the design I was using. I always tried to perform appends and playback in the same pipeline and suffered the consequences of sharing the pipeline for operations that should be completely independent.

Also, both the MSE spec and the MSE framework laid out by Apple in WebCore suggest that the samples (after the demuxer) should be encapsulated and provided to the upper MSE layer by the lower GStreamer MSE implementation layer. In practice, I implemented this by using a probe on the demuxer src pads to "steal" the GstBuffers. Later, when the upper MSE layer wanted to perform the playback, I pushed the buffers into the pads again. Needless to say, I had lots of problems trying to convince the seeks and flushes to play well with all this. :-/

Now I'm changing the approach and splitting the original pipeline: N append pipelines (one per SourceBuffer) and one common playback pipeline. Each append pipeline holds a demuxer and reports the received samples to the upper MSE layer. The playback pipeline, still based on playbin, performs the playback in a simpler way than the original pipeline in the current upstream WebKit code. I hope this new design will simplify things and ease the debugging of the remaining issues I still have.

I'll be around in Dublin at the next GStreamer Conference. If you're interested in the topic, I'll be glad to give you more details about my experience implementing all this and discuss the advantages and drawbacks of the different designs.

[1] http://yt-dash-mse-test.commondatastorage.googleapis.com/unit-tests/2015.html
[2] https://www.youtube.com/tv
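For reference, a minimal sketch of one such per-SourceBuffer append pipeline, assuming the layout described above (appsrc feeding qtdemux, with an appsink handing every demuxed sample back to the MSE layer); the wiring and callback names are illustrative, not actual WebKit code:

#include <gst/gst.h>

/* Called for every demuxed sample; the MSE layer would record its
 * PTS/duration here to build the "buffered ranges". */
static GstFlowReturn
on_new_sample (GstElement * appsink, gpointer user_data)
{
  GstSample *sample = NULL;

  g_signal_emit_by_name (appsink, "pull-sample", &sample);
  if (sample == NULL)
    return GST_FLOW_ERROR;
  /* ... hand the contained GstBuffer to the upper MSE layer ... */
  gst_sample_unref (sample);
  return GST_FLOW_OK;
}

static GstElement *
build_append_pipeline (void)
{
  GstElement *pipeline = gst_parse_launch (
      "appsrc name=src format=bytes ! qtdemux ! "
      "appsink name=sink emit-signals=true", NULL);
  GstElement *sink = gst_bin_get_by_name (GST_BIN (pipeline), "sink");

  g_signal_connect (sink, "new-sample", G_CALLBACK (on_new_sample), NULL);
  gst_object_unref (sink);
  return pipeline;
}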
Using separate "preprocessing" pipelines, and then a common playback pipeline for the final buffers, seems to make sense to me and might indeed make everything a bit simpler. When I wrote the above-mentioned patch, I didn't really notice the roundtrip from the GStreamer layer to the upper layer and back to the GStreamer layer. Let's talk about the details at GstConf :)
Enrique, can you add the summary of our discussion at the conference here?
Sure! I talked to both of you (Nicolas and Sebastian) and commented that it would be a pity for Samsung and Igalia to have to resort to ChunkDemuxer (Chromium's standalone demuxer) when GStreamer can actually demux the appended raw data for us, or at least it should.

I explained to Nicolas that the MSE API[1] in general, and our implementation in WebKit[2] in particular, splits the concept of appending data to get "buffered ranges" from the concept of playing, pausing, stopping and seeking the previously buffered data. Raw data is appended to a SourceBuffer at any time, and the MSE layer has no idea of what's been appended until it processes (demuxes) the data, after which "buffered ranges" are created. The player can then play back what's buffered, or even portions not yet buffered (in which case it first waits for the needed buffered ranges to appear and then continues). Same for seeks.

As I said in previous posts, there's no problem when we append data linearly. The first append usually contains the "MSE init segment" (headers describing the data that will follow) and no real samples, but it's very important to let qtdemux configure itself properly, create src pads, etc. The second append usually contains the samples. Our append pipeline captures those samples with an appsink and returns them to the upper MSE layer so it can track the "buffered ranges". However, when the JavaScript code in the webpage seeks to a future/past position out of the buffered ranges, the most probable (but not guaranteed) consequence is that the raw data appended from that moment on belongs to some distant position compared to what was appended last time. Qtdemux doesn't know this and just keeps generating timestamps (PTS) contiguous to the last ones.

It was suggested that some sort of heuristic rule could be used, so that when a seek is performed in the playback pipeline, the qtdemux(es) in the append pipeline(s) are reset. There's no guarantee that such a heuristic would work, and in any case it would be very weak and error prone. So, for now, we have to assume that we have no hints about what's being appended and that the demuxer(s) must be able to produce absolute time information on their own.

To try to understand what was happening during seeks, I saved the raw data to disk as it was appended and analyzed it offline using MP4Box.js[3]. What I found was that the TFDT atom contained the absolute position of the "MSE segment" being appended. I considered that this would be a good hint and coded the patch I linked in the description of this bug[4]. The patch enforces honoring the TFDT atom when it can help us detect a gap. The side effect of this patch is that it breaks the current qtdemux behaviour (i.e. breaks some tests), so it's apparently not desirable to have it upstream.

After talking to Nicolas, he thought that one possible solution would be to add a new boolean property to qtdemux ("prioritizeTfdt"? any better name?) that would default to false (so the tests keep passing). If set to true, it would honor the TFDT atom and make the MSE append use case work. Sebastian commented that in any case the pastebin patch should be improved, because it only works for "forward discontinuities/jumps" due to the way the comparison is done. It should be fixed to also work for "backward jumps".

Please let me know if I forgot something or can clarify anything more.
[1] https://w3c.github.io/media-source/#sourcebuffer-segment-parser-loop
[2] Slides 14-16 in http://www.slideshare.net/philn2/gstreamer-support-in-webkit-whats-new
[3] http://download.tsi.telecom-paristech.fr/gpac/mp4box.js/filereader.html
[4] http://pastebin.com/612D3kyU
It seems I confused people's names. Please, s/nicolas/thiago/.
Hi, I might be completely wrong, so be easy on me :) with the following: after a seek to a position not in the buffered range, we could seek "again" to the closest position between what is requested and where it is possible to go in the next received data. But maybe this is actually what you are trying to achieve and I've just started to understand it :) Would a container parser help with this? Like tsparse (for video/mpegts) but for video/quicktime.
Relying on the TFDT is the only option I can think of, as there is not enough information when the 'append data' call happens to know whether it should be discontinuous with the previously appended data or not. A DISCONT would help qtdemux stop generating continuous timestamps and just read the TFDT again. As this is not possible, relying on the TFDT is all we can do. The problem is that some slightly broken streams exist out there, so we have a check for major jumps in the TFDT timestamp; when that happens, qtdemux ignores the TFDT and just uses continuous timestamps.
Apparently I was confused by a patch that wasn't merged. qtdemux doesn't do these tfdt fragment checks, so it should already be relying on the tfdt to some extent. What version have you been using, and do you see the debug line

GST_DEBUG_OBJECT (qtdemux, "decode time %" G_GINT64_FORMAT " (%" GST_TIME_FORMAT ")", decode_time, GST_TIME_ARGS (decode_time_ts));

being printed?
To be fair, I'm not sure why the chunker design is a shame for you. When I first looked at this API (about 2 years ago), my idea was that all incoming buffers would be "chunked", which means demuxed and parsed, and the result stored in a proper data structure, with the buffer group segment, layers and all. Doing so lets the source handle the segment base accumulation, making all streams look like they come one after the other. This way you become completely format agnostic, and if someone allowed switching from ISOMP4 to TS, you'd already support that. Another advantage of this approach is that it is really responsive to changes (seeks or changes in bit rate). MSE is very similar to editing; this idea comes from gnlcomposition, now called nlecomposition.
(In reply to Nicolas Dufresne (stormer) from comment #12)
> MSE is very similar to editing; this idea comes from gnlcomposition, now called nlecomposition.

Great point, actually. Could we just re-use some of the GES libs http://cgit.freedesktop.org/gstreamer/gst-editing-services/tree/ges and/or some GES elements http://cgit.freedesktop.org/gstreamer/gst-editing-services/tree/plugins/nle to build a GstMediaSrc (bin) element? Maybe everything is there already and we just need to connect the pieces together. Should I create a new bug, "Implement new element GstMediaSrc based on GES"? Mathieu and Thibault, it would be great if we could have your thoughts about it. (Just in case you do not know what MSE (Media Source Extensions) is, just ping me on IRC and I'll quickly explain.) Thx!

(Btw, I am marking this bug as reopened, since more input has been provided after it was marked as closed.)
Created attachment 314923 [details] [review]
Use the tfdt decode time when it's significantly different than the time in the last sample, if always-honor-tfdt is enabled

This patch solves the problem for my particular case, while keeping the default behaviour in the general case. It adds a new "always-honor-tfdt" property (defaulting to false). If set to true when creating the qtdemux element, qtdemux uses the time reported by the TFDT atom. I've tested it by supplying linear data from the beginning, data from far away in the future and data from far away in the past, and the presentation timestamps seem to be generated correctly according to what the TFDT says.
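For illustration, this is how the proposed property would be used when building the append pipeline by hand; note that "always-honor-tfdt" is the property added by the attached patch, not part of upstream qtdemux:

#include <gst/gst.h>

/* Create a qtdemux that trusts the tfdt for timestamping, using the
 * property introduced by the attached (non-upstream) patch. */
static GstElement *
make_mse_qtdemux (void)
{
  GstElement *demux = gst_element_factory_make ("qtdemux", NULL);

  g_object_set (demux, "always-honor-tfdt", TRUE, NULL);
  return demux;
}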
The problem with a property here is that qtdemux is usually autoplugged and somewhere hidden inside decodebin. Ideally all this should somehow work automatically.
(In reply to Sebastian Dröge (slomo) from comment #15)
> The problem with a property here is that qtdemux is usually autoplugged and
> somewhere hidden inside decodebin.

Not in our case. We have an "append pipeline" (several, actually, one per SourceBuffer) constructed by hand. It would be nice to make it work automatically in the general case, that's true, but I don't know how to do that without breaking the standard behaviour of qtdemux, so I don't want to risk it.
Do you know when a discontinuity happens or do you really just get arbitrary data and have to figure out things yourself?
Unfortunately, the MSE and GStreamer layers don't know when a discontinuity can happen. The JavaScript in the web page knows (in theory), but that's completely out of our control. The use case works like this: JavaScript calls SourceBuffer.append() and provides raw binary data without any indication of what time it belongs to. Then it's the job of the "parser"[1] to figure out what timestamps and actual samples (roughly equivalent to GstBuffers) are generated from that data. Chromium's ChunkDemuxer[2] & MP4 parser[3] and Apple's AVStreamDataParser[4] are able to "blindly" generate the proper timestamps from the raw data provided in appends, without any context.

[1] https://w3c.github.io/media-source/#sourcebuffer-segment-parser-loop
[2] https://chromium.googlesource.com/chromium/src/media/+/master/filters/chunk_demuxer.cc
[3] https://chromium.googlesource.com/chromium/src/media/+/master/formats/mp4/mp4_stream_parser.cc
[4] https://github.com/WebKit/webkit/blob/master/Source/WebCore/platform/graphics/avfoundation/objc/SourceBufferPrivateAVFObjC.mm#L775
What a useless API :) I guess we could in theory require a DISCONT flag to be set on buffers where it should resync the timestamping based on what it finds in the stream. And you would just always set it because you don't know anything at all.
(In reply to Sebastian Dröge (slomo) from comment #19)
> What a useless API :)

Agreed.

> I guess we could in theory require a DISCONT flag to be set on buffers where
> it should resync the timestamping based on what it finds in the stream. And
> you would just always set it because you don't know anything at all.

Our use case works in push mode. I see in [1] that there's already code in place to propagate the DISCONT flag from incoming buffers to their corresponding internal qtdemux streams. However, the only place where that stream flag is used for something is in [2], where it's propagated to the outgoing buffer. I guess I should modify the condition in qtdemux_parse_trun()[3] to check for that flag in the stream instead of checking for always_honor_tfdt, right? Should I keep the "significant difference" condition too? I hope these changes don't interfere with the other legitimate use of the DISCONT flag, in gst_qtdemux_move_stream()[4]. I'll try to implement these changes when I have some time, still leaving the "always_honor_tfdt" flag in my local branch, so I can compare both behaviours and check that there are no regressions. Thanks a lot for your advice!

[1] https://cgit.freedesktop.org/gstreamer/gst-plugins-good/tree/gst/isomp4/qtdemux.c#n5644
[2] https://cgit.freedesktop.org/gstreamer/gst-plugins-good/tree/gst/isomp4/qtdemux.c#n4880
[3] https://bugzilla.gnome.org/attachment.cgi?id=314923&action=diff#a/gst/isomp4/qtdemux.c_sec5
[4] https://cgit.freedesktop.org/gstreamer/gst-plugins-good/tree/gst/isomp4/qtdemux.c#n1210
Thiago, what do you think? Does this make sense? Enrique, my thinking was that whenever the DISCONT flag is set you would resync the timestamping logic. That is, take the tfdt or whatever else we have (buffer PTS/DTS?) available as a new base and interpolate based on that until the next DISCONT.
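A hypothetical sketch of that resync idea (it was never implemented this way; the helper and flag below are illustrative, not qtdemux internals):

#include <gst/gst.h>

static gboolean resync_pending = FALSE;

/* Chain path: remember that upstream signalled a discontinuity. */
static void
note_discont (GstBuffer * inbuf)
{
  if (GST_BUFFER_FLAG_IS_SET (inbuf, GST_BUFFER_FLAG_DISCONT))
    resync_pending = TRUE;
}

/* trun parsing: consume the pending resync and re-base on the tfdt. */
static guint64
pick_base_time (guint64 decode_ts, guint64 interpolated_ts)
{
  if (resync_pending) {
    resync_pending = FALSE;
    return decode_ts;           /* new base from the tfdt */
  }
  return interpolated_ts;       /* keep interpolating as before */
}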
Enrique, your use case might also be solved by https://bugzilla.gnome.org/show_bug.cgi?id=767354 Can you confirm?
(In reply to Sebastian Dröge (slomo) from comment #22)
> Enrique, your use case might also be solved by
> https://bugzilla.gnome.org/show_bug.cgi?id=767354
>
> Can you confirm?

As we discussed in person during GUADEC, I can confirm this does not work.
Yeah, and we also discussed a solution :) My proposal would be to use Enrique's patch, but only do that if upstream sends a BYTE segment (that is: it knows *nothing* about timestamps and qtdemux needs to make up timestamps from what is inside the stream). This should not break any use cases; adaptivedemux, for example, sends a TIME segment and wants it to be honored, which is exactly the case why we don't always use the tfdt currently.

To repeat my understanding of the problem: you just get a stream of bytes from the browser and have no information about which byte offsets, which times, etc. it corresponds to. The only thing you know is that you get a "useable" stream; that is, for fragmented MP4 you would always get complete fragments, but they might be in arbitrary order. You will never get the first half of one fragment followed by some other fragment.

Enrique, can you confirm that this would solve your problem?
First of all, thank you very much, both you and Calvaris, for moving this forward. I couldn't test the patch from https://bugzilla.gnome.org/show_bug.cgi?id=767354 myself because adapting GStreamer 1.9.1 to work with our patched buildroot environment for the Raspberry Pi turned out to be harder than I thought.

About the solution you're proposing now: yes, it should work. We append the data using gst_app_src_push_buffer(), and those buffers are created using gst_buffer_fill(), with no PTS/DTS/duration info on them. I guess that generates a byte stream.

Regarding weird appends, we do have the case you describe (appending the first half of one fragment followed by some other fragment). However, when that happens, our client code must call abort() before feeding the second segment. We handle aborts by resetting the pipeline (changing it to READY), so it shouldn't be a problem.

Once again, thanks a lot for your help moving this forward.
(In reply to Enrique Ocaña González from comment #25)
> First of all, thank you very much, both you and Calvaris, for moving this
> forward. I couldn't test the patch from
> https://bugzilla.gnome.org/show_bug.cgi?id=767354 myself because adapting
> GStreamer 1.9.1 to work with our patched buildroot environment for the
> Raspberry Pi turned out to be harder than I thought.

It's also not going to help; it solves a different problem, one that you don't have :)

> About the solution you're proposing now: yes, it should work. We append the
> data using gst_app_src_push_buffer(), and those buffers are created using
> gst_buffer_fill(), with no PTS/DTS/duration info on them. I guess that
> generates a byte stream.

appsrc has a "format" property. If you set that to GST_FORMAT_BYTES, then that's what it would do. PTS/DTS/duration are somewhat independent of that (but only really make sense for GST_FORMAT_TIME).

> Regarding weird appends, we do have the case you describe (appending the
> first half of one fragment followed by some other fragment). However, when
> that happens, our client code must call abort() before feeding the second
> segment. We handle aborts by resetting the pipeline (changing it to READY),
> so it shouldn't be a problem.

That wouldn't be a problem as long as we know when to flush and expect discontiguous data.

> Once again, thanks a lot for your help moving this forward.

Would you be able to update your patch accordingly and check whether it solves all your problems? There is already other code in qtdemux making that distinction (search for upstream_format_is_time).
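A minimal sketch of that configuration, assuming a hand-built append pipeline as described above (function names are illustrative):

#include <gst/gst.h>
#include <gst/app/gstappsrc.h>

/* Configure appsrc once, before starting the pipeline: GST_FORMAT_BYTES
 * makes it send a BYTE segment, telling qtdemux that upstream knows
 * nothing about timing. */
static void
configure_appsrc (GstAppSrc * appsrc)
{
  g_object_set (appsrc, "format", GST_FORMAT_BYTES, NULL);
}

/* Push one MSE append worth of raw bytes; no PTS/DTS/duration are set. */
static GstFlowReturn
append_raw_data (GstAppSrc * appsrc, const guint8 * data, gsize size)
{
  GstBuffer *buf = gst_buffer_new_allocate (NULL, size, NULL);

  gst_buffer_fill (buf, 0, data, size);
  return gst_app_src_push_buffer (appsrc, buf); /* takes ownership */
}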
Created attachment 334238 [details] [review]
Use the tfdt decode time on byte streams when it's significantly different than the time in the last sample

Here's the updated patch following your comments. I've tested it on our Raspberry Pi port (rebased to GStreamer 1.8.2, but I'm submitting the patch for master) and it works fine. Thanks a lot for your help and suggestions.
Review of attachment 334238 [details] [review]:

::: gst/isomp4/qtdemux.c
@@ +3123,3 @@
+   * difference between decode_ts and timestamp, prefer the former */
+  if (!qtdemux->upstream_format_is_time
+      && abs (decode_ts - timestamp) >

That's not safe to do, as both values are unsigned (also use ABS()).

@@ +3124,3 @@
+  if (!qtdemux->upstream_format_is_time
+      && abs (decode_ts - timestamp) >
+      stream->samples[stream->n_samples - 1].duration) {

Why take the duration of the sample as the limit? Why not e.g. 1 second, 10 seconds, ...? :) The difference between decode_ts and timestamp can easily be higher than the duration because of frame reordering already, no?
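To spell out the reviewer's first point with a tiny standalone illustration (not qtdemux code): with unsigned operands, the subtraction wraps around before any abs() runs, so the threshold check misfires; abs() also takes an int, which truncates 64-bit values.

#include <glib.h>
#include <stdio.h>

int
main (void)
{
  guint64 decode_ts = 100, timestamp = 200;

  /* Prints 18446744073709551516: the difference wrapped around. */
  printf ("%" G_GUINT64_FORMAT "\n", decode_ts - timestamp);
  return 0;
}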
After trying to find a combination of existing functions/macros together with type conversions, nothing convinces me, so I'm going to use my own:

#define ABSDIFF(x, y) ( x > y ? (x - y) : (y - x) )
...
if (!qtdemux->upstream_format_is_time
    && ABSDIFF (decode_ts, timestamp) >
    stream->samples[stream->n_samples - 1].duration) {

I had forgotten about frame reordering. Is there a limit on the maximum out-of-order distance that a {P,B}-frame can have, or is using an arbitrary limit the only option? I've used GSTTIME_TO_QTSTREAMTIME (stream, GST_SECOND) and it seems to work, but the larger the value, the higher the probability of ignoring jumps (which require a timestamp reset to the TFDT) by mistake.
(In reply to Enrique Ocaña González from comment #29)
> I had forgotten about frame reordering. Is there a limit on the maximum
> out-of-order distance that a {P,B}-frame can have, or is using an arbitrary
> limit the only option?

In general, no, there is no theoretical limit.

> I've used GSTTIME_TO_QTSTREAMTIME (stream, GST_SECOND) and it seems to work,
> but the larger the value, the higher the probability of ignoring jumps
> (which require a timestamp reset to the TFDT) by mistake.

How big are the jumps you are interested in, usually? A fragment duration? Maybe we can make the limit half a fragment duration then?
Enrique?
Sorry for the delay. I'm on it. I'm going to look for a way to specify half the fragment duration.
Created attachment 334904 [details] [review]
Updated patch

After looking for info about the fragment duration in the moof, mfhd, traf, tfhd(*), trun(*) and other headers, I haven't found a simple way to compute the fragment duration at qtdemux_parse_trun() time. Therefore, for now I'd prefer to use a fixed time value to decide what a "significant difference" is. Even if the YouTube TV seek granularity is 5 seconds, I'd prefer something shorter, such as 1 second, just in case some other UIs are pickier.

(*) tfhd has a default_sample_duration field, but it's optional. trun has a sample count, but there can be several truns in a fragment.

I'm attaching an updated patch version.
(In reply to Enrique Ocaña González from comment #33)
> After looking for info about the fragment duration in the moof, mfhd, traf,
> tfhd(*), trun(*) and other headers, I haven't found a simple way to compute
> the fragment duration at qtdemux_parse_trun() time. Therefore, for now I'd
> prefer to use a fixed time value to decide what a "significant difference"
> is.

Why don't you know the sample duration at this point? You have the moov (and with it, via the trex, all the default sample values), and you have the moof (and with it the whole table of samples with their durations). You could e.g. sum them all up (but I'm sure the code does that already somewhere) :)
Created attachment 337244 [details] [review]
Updated patch

After some debugging, it looks like the trex atoms I get (in the videos used by the tests, at least) specify a zero duration, so they aren't useful for getting the fragment duration. I wanted to get the fragment duration from stream->duration_moof, but it's only available after all the samples have been processed, when it's too late to be useful, and it's also reset at the start of sample processing. I had to store the value from the previous iteration (duration_last_moof) and use it in the comparison. With all these changes, the patch remains useful for my use case.
Review of attachment 337244 [details] [review]:

::: gst/isomp4/qtdemux.c
@@ +96,3 @@
 #define STREAM_IS_EOS(s) (s->time_position == GST_CLOCK_TIME_NONE)
+#define ABSDIFF(x, y) ( x > y ? (x - y) : (y - x) )

Needs some more parentheses: (x) and (y) everywhere.

@@ +3130,3 @@
+  if (!qtdemux->upstream_format_is_time
+      && ABSDIFF (decode_ts, timestamp) >
+      (stream->duration_last_moof / 2)) {

This always triggers if stream->duration_last_moof == 0 (i.e. for the first fragment). Maybe make this > MAX(stream->duration_last_moof / 2, GST_SECOND) or something like that?

And yes, we have a chicken-and-egg problem here indeed :) We only know the duration from parsing the samples (in the worst case), and the sample durations depend on the timestamps, which depend on the lines below.
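For reference, the fully parenthesized form the review asks for would look like this (a sketch of the requested fix):

/* Parenthesizing each argument use keeps expansions such as
 * ABSDIFF (a - b, c) correct. */
#define ABSDIFF(x, y) ((x) > (y) ? ((x) - (y)) : ((y) - (x)))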
Created attachment 338363 [details] [review]
Updated patch

I thought that in the "first fragment" case the flow was already going into the "n_samples == 0" if branch[1], but maybe you're right and having a fixed minimum threshold of 1 second is the safest approach. Here's a new version of the patch, which is also more readable because it uses the macros to convert from GStreamer time to stream time and vice versa.

[1] https://github.com/GStreamer/gst-plugins-good/blob/master/gst/isomp4/qtdemux.c#L3097
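Putting the pieces of the discussion together, the heart of the final check is roughly the following (a paraphrased fragment, not the verbatim diff; ABSDIFF, GSTTIME_TO_QTSTREAMTIME and duration_last_moof are the names used in the patch):

/* In qtdemux_parse_trun(): prefer the tfdt decode time over the
 * interpolated timestamp when upstream only provided bytes (no TIME
 * segment) and the jump is significant, i.e. larger than half the
 * previous fragment's duration, with a one-second floor that also
 * covers the first fragment (duration_last_moof == 0). */
if (!qtdemux->upstream_format_is_time
    && ABSDIFF (decode_ts, timestamp) >
    MAX (stream->duration_last_moof / 2,
        GSTTIME_TO_QTSTREAMTIME (stream, GST_SECOND))) {
  timestamp = decode_ts;        /* re-base on the tfdt */
}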
Review of attachment 338363 [details] [review]:

Looks reasonable, thanks :)
commit 69fc488392b1d089d9054b6ba77bd19328aaeefb
Author: Enrique Ocaña González <eocanha@igalia.com>
Date:   Mon Oct 24 16:56:31 2016 +0000

    qtdemux: Use the tfdt decode time on byte streams when it's significantly different than the time in the last sample

    We consider there's a significant difference when it's larger than one second or than half the duration of the last processed fragment in case the latter is larger.

    https://bugzilla.gnome.org/show_bug.cgi?id=754230
(In reply to Sebastian Dröge (slomo) from comment #40)
> commit 69fc488392b1d089d9054b6ba77bd19328aaeefb
> Author: Enrique Ocaña González <eocanha@igalia.com>
> Date:   Mon Oct 24 16:56:31 2016 +0000
>
>     qtdemux: Use the tfdt decode time on byte streams when it's
>     significantly different than the time in the last sample
>
>     We consider there's a significant difference when it's larger than one
>     second or than half the duration of the last processed fragment in case
>     the latter is larger.
>
>     https://bugzilla.gnome.org/show_bug.cgi?id=754230

This patch causes problems with fragmented files that don't have tfdt atoms, because decode_ts is always 0 (the default) in that case, and output timestamps are reset at each fragment start. As it turns out, qtmux doesn't write a tfdt for fragmented streams...
Created attachment 343081 [details] [review]
qtdemux: Don't reset output timestamps when no tfdt

If a fragmented stream doesn't have a tfdt, don't reset the output timestamps at each fragment boundary by erroneously using the default value of 0. Introduced by commit 69fc48.
From the spec:

"If the time expressed in the track fragment decode time (‘tfdt’) box exceeds the sum of the durations of the samples in the preceding movie and movie fragments, then the duration of the last sample preceding this track fragment is extended such that the sum now equals the time given in this box."

It seems like if there's a tfdt, we should unconditionally use it, unless there's a specific counter-example?
There are DASH streams (from what Thiago told me, IIRC) that have invalid tfdt values, while the correct ones are easily obtained from the manifest. In those cases we prefer to use the buffer timestamps in qtdemux (coming from dashdemux) over the tfdt.
(In reply to Jan Schmidt from comment #42)
> Created attachment 343081 [details] [review]
> qtdemux: Don't reset output timestamps when no tfdt
>
> If a fragmented stream doesn't have a tfdt, don't reset the output
> timestamps at each fragment boundary by erroneously using the default
> value of 0. Introduced by commit 69fc48

This patch caused a regression in our use case (YouTube TV MSE Conformance Tests 2016, test 40, AppendOutOfOrder). It's an audio stream whose data is appended out of order (as per the test) in these batches (ranges are in seconds): [10, 20], [0, 10], [30, 40], [20, 30]. All the fragments have TFDT atoms, but the new condition added in your patch misinterprets a legitimate "0" value from a TFDT atom as "no TFDT atom at all".

I've thought about two possible solutions:

A) Add an extra boolean parameter to qtdemux_parse_trun() hinting whether the tfdt_node was found in qtdemux_parse_moof(), and use that info in the condition you added instead of checking decode_ts != 0. I've coded this one locally and it works fine.

B) Initialize decode_time to GST_CLOCK_TIME_NONE in qtdemux_parse_moof() and add a couple of extra cases in qtdemux_parse_trun() to use 0 when GST_CLOCK_TIME_NONE is passed, except in the decode_ts != 0 condition, of course, which would now become decode_ts != GST_CLOCK_TIME_NONE. (A sketch of this option follows below.)

Which one do you prefer me to submit as a patch?
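A self-contained sketch of option B's key idea; the helper below is illustrative, not the actual qtdemux code, and it reuses the ABSDIFF macro from the earlier patch:

#include <gst/gst.h>

#define ABSDIFF(x, y) ((x) > (y) ? ((x) - (y)) : ((y) - (x)))

/* decode_ts == GST_CLOCK_TIME_NONE means "the fragment had no tfdt";
 * a literal 0 is a legitimate decode time and must still be honored. */
static guint64
choose_timestamp (guint64 decode_ts, guint64 interpolated_ts,
    guint64 threshold)
{
  if (decode_ts != GST_CLOCK_TIME_NONE
      && ABSDIFF (decode_ts, interpolated_ts) > threshold)
    return decode_ts;           /* significant jump: trust the tfdt */
  return interpolated_ts;       /* no tfdt, or no significant jump */
}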
Please open a new bug for that :) Thanks for finding this!
The new bug is https://bugzilla.gnome.org/show_bug.cgi?id=780410