GNOME Bugzilla – Bug 655804
Play position advancing but frames not rendered
Last modified: 2011-11-28 13:59:40 UTC
Hi all, When playing an MPEG TS stream multicast over UDP I occasionally see GStreamer stop rendering frames while the play position is normally progressing. On the last occasion this happened it actually managed to untangle itself after a while and started playing (rendering) again. Unfortunately I do not know whether it would always be case since previously I did not wait this long. To start getting more specific... I do not know when exactly it stopped rendering, but I know when I first saw it in that state which was at 14:14 today. It was streaming for roughly two hours up to that point. Last interesting logged thing before that state was a QOS event at 14:06. Then at 14:21 it started to render again and reported another QOS event. In between the application was querying play position every ten seconds which showed progress. If you look at this annotated log: ... <video state unknown> 14:06:15.006 PLAY POSITION 6615867197374L 14:06:18.408 QOS @ 6619179855556L Frames: 291654/704; Jitter: 88265288 14:14 <spotted video not playing (not rendering)> 14:20:59.484 PLAY POSITION 7500344289211L 14:21:08.142 QOS @ 6915779855556L Frames: 306475/705; Jitter: 593222818788L <video started playing again> Play position is what query_position() on playbin2 element returns (documentation says this is stream time). And times logged in QOS events come from stream-time message field, so it should be matched up in that respect. This second event is reporting that a frame was almost ten minutes late, but that there was only one dropped frame in the period on roughly a correct number of rendered frames (for 25 fps video) during the period. Which is of course wrong because nothing was rendered. Application is written in python and uses playbin2 with a custom bin set as video-sink which is queue ! deinterlace ! queue ! xvimagesink. Audio-sink is set to a bin containing alsasink only.
Do you have a test program we can reproduce this with?
Created attachment 193132 [details] PyGst test program
Hi, Thanks for looking at this! I will attach something I just hacked up which is basically the same GStreamer usage wise as the program I initially saw the problem. You will see that the code is pretty simple. It may have bugs since I am quite new to GStreamer and PyGst though. Reproducing the problem may be difficult since I suspect it depends on the content and length of play. We found this when stress testing GStreamer (PyGst) playing DVB-T rebroadcast over Ethernet which is one of our main use cases. I mostly test with BBC 1 and the problem shows after a couple of hours of streaming. Not every time though, there are different failure modes this bug is only one of (655790 is the other one I entered so far). I have also seen it when playing MPEG TS rebroadcast from localhost (UDP unicast to 127.0.0.1), unfortunately program which does rebroadcasting is proprietary. And with regards to Ethernet multicast those are proprietary hardware appliances. I did not manage to reproduce using GStreamer only (dvbbasebin I think is the plugin name) since that crashes typically quite soon. Which reminds me I could raise a bug about that as well. Versions wise, this is with Fedora's 15 GStreamer packages plus rpmfusion.org bad and ugly plugins. Former recompiled to include dvbsuboverlay for DVB-T subtitles support. Regards, Tvrtko
Additional comment - it is possible that having a less than perfect DVB-T reception helps in triggering this. But I am not at all sure about this.
Created attachment 193419 [details] GST_DEBUG=3 log of playback freezing for 14 seconds.
I have managed to capture a 15Mb transport stream which exhibits similar behaviour. I say similar because I am not 100% sure it the same - there is one possibly crucial difference in the behaviour. With this stream playback freezes approx. two seconds into the play time and then resumes after approx. 14 seconds. All the time play position is reported as advancing and there are no QOS messages until ~17s into playing. Then playback resumes and a bunch of "old" QOS messages get sent. Difference from the original report is that this time QOS messages are not reporting the number of rendered frames which would suggest frames were actually displayed in those ~14 seconds, but just 75 frames rendered when first message arrives. This stream is also of a quite low quality (poor signal), much worse than with what I was seeing a similar problem previously. However given how players like mplayer and vlc do play this stream much better I suspect there is scope for improvement here. If nothing maybe send those QOS messages earlier? I am attaching a GST_DEBUG=3 log of the above.
Can that possible-repro sample be shared ?
Created attachment 195204 [details] [review] mpegdemux: catch smaller PCR dicontinuities In a test stream, I get one buffer with a PCR of about 15 seconds in the future compared to the previous one, and next buffers with timestamps continuing where the original ones left off. This caused the sink to wait 15 seconds to display the frame while more frames queued up, and then dump all the subsequent frames as they "arrived too late". Maybe that threshold should be made configurable, but for now, make it more smaller to catch more of these.
That fixes the issue for the test stream in question. Now, the threshold is 10 seconds, but if some stream jumps by 9 seconds and back, there will still be a 9 seconds pause. A way to fix that could be to, when PCR is above something like a half second or so, keep the frame in a temp buffer and not send it, and wait for the next frame to see if it goes back in time or not. If it does, retimestamp the buffered frame and send both. Seems hacky though. I'm nowhere near enough knowledgeable about MPEG to know if such timestamp jumps are somehow OK, or indicative of a broken stream.
Where have you made in smaller in patch from comment #8? I only see you pulled the constant out, unless I am missing something. #define DISCONT_THRESHOLD (GST_SECOND * 10 * 60) Is that ten minutes?
In theory, at least, we can assume that PCR won't change by more than 100 milliseconds between PCR packets. Reasoning from the spec (references to ISO/IEC 13818-1:2007): Section 2.4.2.1 (System clock frequency) tells us that the PCR is derived from a stable 27MHz clock. Section 2.4.2.2 says maximum encoding error of the PCR in the stream is 500 nanoseconds as compared to the "real" PCR value. Section 2.4.3.5 defines a discontinuity_indicator in the adaptation field to tell us that the PCR is in the process of jumping. When it is set, packets before the packet with discontinuity_indicator set use the old PCR for their timestamps, packets after the one with discontinuity_indicator set use the new PCR. Finally, section 2.7.2 tells us that the PCR must be coded in the stream at least once every 100 milliseconds, or the stream is not a valid transport stream. Putting all these bits together, we know that for any PCR packet, if the discontinuity_indicator is not set (indicating that the timebase is changing, and we need to do future timestamping against the new PCR), any jump of more than 100 milliseconds is indicative of a broken stream, and we should ignore this PCR packet and continue to free-run until we pick up a valid PCR again - we should probably increase the permitted timestamp jump by 100 milliseconds for each faulty PCR packet we've seen, so we gracefully degrade and recover if several PCRs in a row are faulty followed by good PCRs again. There's also the transport_error_indicator in transport packets to hint that this packet is corrupt; our DVB-T cards do appear to set this whenever there are uncorrected errors in the packet. This could also be used as a hint that the PCR is probably wrong if it's a large jump.
Hah, I'd set it to 10 seconds, but mistakenly re-used the original 10 minutes value when cleaning up before posting, you're right. Since the standard mandates 100 ms maximum delay between PCRs, I'll change the code with that in mind. I'm not sure about several PCRs in a row though. If we get two, it would seem that the more likely is an error in the discontinuity indicator instead, than an error in PCR, no ?
Actually, it's not the PCR that's jumping, it's the PTS - which I understand to be "presentation timestamp". I'm not sure what is the relation between PTS and PCR.
(In reply to comment #13) > Actually, it's not the PCR that's jumping, it's the PTS - which I understand to > be "presentation timestamp". I'm not sure what is the relation between PTS and > PCR. In a hardware decoder, PCR is used to synchronise your local 27MHz clock to the broadcaster's 27MHz clock. PTS is the presentation timestamp, measured against that 27MHz clock. Packets are removed from transport buffering at their decode time, which is DTS if present, PTS if no DTS present (see section 2.4.3.7's description of PTS and DTS). I would not personally expect decode times to go backwards, and certainly not by a large amount - this could be used to detect a probable discontinuity.
Created attachment 195294 [details] [review] mpegdemux: catch smaller PTS dicontinuities In a test stream, I get one buffer with a PTS of about 15 seconds in the future compared to the previous one, and next buffers with timestamps continuing where the original ones left off. This caused the sink to wait 15 seconds to display the frame while more frames queued up, and then dump all the subsequent frames as they "arrived too late". Maybe that threshold should be made configurable, but for now, make it more smaller to catch more of these.
I can confirm the patch works around the issue in the test stream. However since the originally reported render pause was longer than ten minutes, I will leave streaming on for a day or so on one box to see how it holds long term.
Unfortunately this did not fix the original problem. I left the test running overnight streaming BBC ONE from DVB-T and saw several instances of the bug - the one ongoing as we speak is longer than observed before. First I had a couple of shorter instances, exactly like in the initial report. 2011-09-01 03:50:21.786 DEBUG QOS[xvimagesink0] @ 27934.877578s rendered/dropped=1394780/123; jitter=87ms 2011-09-01 03:50:28.874 DEBUG playPosition 27940535676389L 2011-09-01 03:50:38.874 DEBUG playPosition 27948599676389L This goes on for some time, during which I am pretty sure nothing was rendered because it ends with: 2011-09-01 03:55:48.875 DEBUG playPosition 28196567676389L 2011-09-01 03:55:58.874 DEBUG playPosition 28204535676389L 2011-09-01 03:56:05.462 DEBUG QOS[xvimagesink0] @ 27936.537578s rendered/dropped=1394843/124; jitter=342103ms A QOS message reporting jitter of roughly the period since last QOS message. Second instance happened early morning, at 5:35 - which is 100% because I have a clock showing that time on a frozen frame on screen right now. And again GStreamer is reporting that play position is advancing and there are no QOS messages. Log looks like this: 2011-09-01 05:35:01.821 DEBUG QOS[xvimagesink0] @ 5935.305789s rendered/dropped=295874/44; jitter=84ms 2011-09-01 05:35:08.874 DEBUG playPosition 5942442919718L 2011-09-01 05:35:18.874 DEBUG playPosition 5952443305672L 2011-09-01 05:35:28.874 DEBUG playPosition 5962443057111L ... 2011-09-01 08:53:28.874 DEBUG playPosition 16541955894445L 2011-09-01 08:53:38.874 DEBUG playPosition 16550019894445L 2011-09-01 08:53:48.874 DEBUG playPosition 16557987894445L
Could you get check if the basesink (debug tag "basesink") emits the message "possibly waiting for clock to reach" with a timestamp that's suddenly higher than what it should be, and more or less the delay which video wedges for ? Run with GST_DEBUG=basesink:4. Better even if to run with GST_DEBUG=5 and save the log for later, I might ask you to grep in it and send me excerpts (the full log for a day will be *huge*, so redirect to a partition that's got space).
Since it usually takes half day or more for this to happen do you think one or both debug levels would fit in a 1.5GB? (Which is how much I have available on the box).
basesink:4 probably would. For the other, not a snowball's chance. I guess you can run with basesink:4, and look at the log size after 15 minutes, then interpolate, so you'll know in advance if you risk running out. You can also do something like this (untested): export GST_DEBUG=basesink:4 run_player 2>&1 | bzip -c9 > log.bz2 GStreamer logs compress well.
Created attachment 195372 [details] GST_DEBUG=basesink:4 log of stuck video but sound playing I got something new which may be related just as I started capturing debug. This time video got stuck but audio is playing fine. It happened shortly after starting playback and it has been in this state for more than 15 minutes now. Does it make sense to keep capturing this log in case it unwedges itself or try again with GST_DEBUG=5 in hope something unusual happens again?
This is odd, both sinks seem to have gone to PAUSED 20 seconds into playback, then back to PLAYING after 12 seconds. But yes, the logs do show the image sink waits for a large timestamp (more than a minute into the the future) at some point, so that would pause video for that amount of time at least. Presumably lots of frames dropped after that as, when timestamps go back down, they'd all be considered to be too late. That "high timestamp" goes on for a couple dozen frames, so it's not a single corrupt timestamp. What's even more worrying is that, after these couple dozen frames, timestamps go back to *before* the original timestamp series. So the reason for the video pause seems to be the same (large timestamp causing the sink to wait for that timestamp to happen before showing the frame), but caused by a different issue. Maybe a bug in timestamp calculation, or some offset that we don't take into account. (mostly typing that out for my own memory once I get back to investigating this) Is the issue repeatable, if you saved that stream ? If so, could I have it ?
(In reply to comment #22) > This is odd, both sinks seem to have gone to PAUSED 20 seconds into playback, > then back to PLAYING after 12 seconds. Ah ignore this please, this is our application restarting playback because of too many dropped frames. > But yes, the logs do show the image sink waits for a large timestamp (more than > a minute into the the future) at some point, so that would pause video for that > amount of time at least. Presumably lots of frames dropped after that as, when > timestamps go back down, they'd all be considered to be too late. > > That "high timestamp" goes on for a couple dozen frames, so it's not a single > corrupt timestamp. What's even more worrying is that, after these couple dozen > frames, timestamps go back to *before* the original timestamp series. > > So the reason for the video pause seems to be the same (large timestamp causing > the sink to wait for that timestamp to happen before showing the frame), but > caused by a different issue. Maybe a bug in timestamp calculation, or some > offset that we don't take into account. > > (mostly typing that out for my own memory once I get back to investigating > this) > > Is the issue repeatable, if you saved that stream ? If so, could I have it ? Unfortunately I don't have this stream because I was totally not expecting it to replicate so quickly. Going forward I will start capturing so there is hope in taking out a slice which causes this. Test player has gotten unstuck in the meantime, ~50 minutes after it got stuck and indeed an appropriate jitter was reported in the QOS message. Compressed log is 10Mb if you are interested I can upload it to our server.
As it only contains the basesink information, it's not so useful, I just wanted to know if the sink was going to wait on a large timestamp, which it did, without generating huge amounts of logs. If you can get a smallish captured video that reproduces it, I'll be able to generate my own fuller logs. Thanks
I left the test running capturing gst debug log and also the stream but unfortunately due to a power cut last night I lost my test logs - meaning I don't know if the problem triggered. Is there a way to see if it happened from the GST_DEBUG log (which is 6.9GB in size)?
Yes: grep "basesink.*xv.*ossibly.waiting" log | less -SR (You might not have "xv", replace with a unique string of your video sink, it is here to discard references to the audio base sink). Then look at the timings, and check if there's a sudden large jump. If you see one, look at the full log at that point, and check if the next "clock returned" for that sink happens a long time later. If so, you can try to cut a portion of the video around that time and hope it's reproducible.
Created attachment 195470 [details] Another log file showing the problem Please can you look at this look file - I think this could be it and I have a sample stream, but with a small issue that the hang didn't repro although time stamps from this log and from the stream are the same. There is a glitch in the stream just about on the mark where it hung over night. I will upload the stream so you can have a look.
Created attachment 195569 [details] [review] mpegdemux: fix some more video freezing There was a second threshold, which apparently needs to be smaller than the first, though I'm not certain of it as I don't understand yet this nest of wtf that is the mpeg demuxer timing logic. Fixes video freezing on one (corrupted) MPEG sample. It would previously never think it was out of the discontinuity, and would push buffers with no timestamp. Now this took me more than a day's poking at the thing, for just one constant change, and I'm scared to have to touch this again :S
Created attachment 197168 [details] [review] mpegdemux: catch smaller PTS dicontinuities In a test stream, I get one buffer with a PTS of about 15 seconds in the future compared to the previous one, and next buffers with timestamps continuing where the original ones left off. This caused the sink to wait 15 seconds to display the frame while more frames queued up, and then dump all the subsequent frames as they "arrived too late". Maybe that threshold should be made configurable, but for now, make it more smaller to catch more of these. https://bugzilla.gnome.org/show_bug.cgi?id=655804 Non AV streams keep using the larger threshold (10 minutes), as subtitles may arrive only every so often.
This update ensures we don't spuriously detect a discontinuity on subtitles streams after a gap of 2 seconds. Fixes DVB subtitles on one of the sample streams.
Looks good in testing, thank you!
commit 918d2e710fc12ba9f489d8344549deb511f85ac9 Author: Vincent Penquerc'h <vincent.penquerch@collabora.co.uk> Date: Sat Sep 3 14:37:40 2011 +0100 mpegdemux: fix some more video freezing There was a second threshold, which apparently needs to be smaller than the first, though I'm not certain of it as I don't understand yet this nest of wtf that is the mpeg demuxer timing logic. Fixes video freezing on one (corrupted) MPEG sample. It would previously never think it was out of the discontinuity, and would push buffers with no timestamp. Now this took me more than a day's poking at the thing, for just one constant change, and I'm scared to have to touch this again :S https://bugzilla.gnome.org/show_bug.cgi?id=655804 commit 7960280afbb2fdc436d8e1fcf89a414b37f6e135 Author: Vincent Penquerc'h <vincent.penquerch@collabora.co.uk> Date: Tue Aug 30 14:50:52 2011 +0100 mpegdemux: catch smaller PTS dicontinuities In a test stream, I get one buffer with a PTS of about 15 seconds in the future compared to the previous one, and next buffers with timestamps continuing where the original ones left off. This caused the sink to wait 15 seconds to display the frame while more frames queued up, and then dump all the subsequent frames as they "arrived too late". Maybe that threshold should be made configurable, but for now, make it more smaller to catch more of these. https://bugzilla.gnome.org/show_bug.cgi?id=655804 Non AV streams keep using the larger threshold (10 minutes), as subtitles may arrive only every so often.