GNOME Bugzilla – Bug 772646
rtpjitterbuffer: fix lost-event using dts instead of pts
Last modified: 2016-12-05 09:18:01 UTC
Created attachment 337270 [details] [review]
Tests and fix

The lost-event was using a different time-domain (dts) than the outgoing buffers (pts). Given certain network conditions these two would become sufficiently different that the lost-event contained a timestamp/duration that was plainly wrong. As an example, GstAudioDecoder could produce a stream that jumps back and forth in time after receiving a lost-event.

The previous behavior calculated the pts (based on the rtptime) inside the rtp_jitter_buffer_insert function, but this functionality has now been refactored into a new function, rtp_jitter_buffer_calculate_pts, that is called much earlier in the _chain function, making pts available to the various calculations that previously (and wrongly) used dts, like the lost-event.

There are however two calculations where using dts is the right thing to do: the receive-jitter and the rtx round-trip-time, where the arrival time of the buffer from the network is the right metric (and that is what dts is today).

The patch also adds two tests regarding B-frames, or the "rtptime-going-backwards" scenario, since there were some concerns that this patch might break that behavior (which the tests show it does not).
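To illustrate the time-domain point, a minimal sketch (not the actual patch; the helper name is made up, and the custom-event structure fields are given as an assumption):

#include <gst/gst.h>

/* Sketch only: a lost-event must carry a timestamp in the same time-domain
 * as the outgoing buffers, i.e. a pts derived from the RTP timestamp, not
 * the arrival time (dts) of some packet from the network. The structure
 * name/fields follow the known "GstRTPPacketLost" custom event, but treat
 * them as an assumption here. */
static GstEvent *
make_lost_event (guint16 seqnum, GstClockTime pts, GstClockTime duration)
{
  return gst_event_new_custom (GST_EVENT_CUSTOM_DOWNSTREAM,
      gst_structure_new ("GstRTPPacketLost",
          "seqnum", G_TYPE_UINT, (guint) seqnum,
          "timestamp", G_TYPE_UINT64, pts,        /* pts, not dts */
          "duration", G_TYPE_UINT64, duration, NULL));
}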
Created attachment 337275 [details] [review]
updated test and fix

We found that the pts was calculated a bit too early in the last patch, so the calculation now waits until it is actually needed.
Review of attachment 337275 [details] [review]:

Just some comments for now

::: gst/rtpmanager/gstrtpjitterbuffer.c
@@ +2342,2 @@
      /* calculate expected arrival time of the next seqnum */
+     expected = pts + priv->packet_spacing;

Isn't the arrival time dependent on the dts (aka the arrival time at this point)? While the packet spacing would depend on the pts, I guess.

@@ +2408,3 @@
 static void
 calculate_expected (GstRtpJitterBuffer * jitterbuffer, guint32 expected,
+    guint16 seqnum, GstClockTime pts, gint gap)

Same here, shouldn't this also consider the actual arrival times (dts) of previous packets?

@@ +2832,3 @@
+  /* calculate a pts based on rtptime and arrival time (dts) */
+  pts = rtp_jitter_buffer_calculate_pts (priv->jbuf, dts, rtptime,

Calculating the PTS changes jbuf's state... is this going to cause problems if we later drop this specific packet? Previously we would only change the state once the packet is definitely inserted.

@@ +2963,3 @@
      * possible moment to push this buffer, maybe we get an earlier seqnum
      * while we wait */
+    set_timer (jitterbuffer, TIMER_TYPE_DEADLINE, seqnum, pts);

And timers also seem like something that is related to dts (arrival times).
(In reply to Sebastian Dröge (slomo) from comment #2)
> Review of attachment 337275 [details] [review] [review]:
>
> Just some comments for now
>
> ::: gst/rtpmanager/gstrtpjitterbuffer.c
> @@ +2342,2 @@
>      /* calculate expected arrival time of the next seqnum */
> +     expected = pts + priv->packet_spacing;
>
> Isn't the arrival time dependent on the dts (aka the arrival time at this
> point)? While the packet spacing would depend on the pts, I guess.

So this is a very good question. I was sure using dts was right for a very long time, but I have now changed my mind and believe pts is the right answer. Let me try to explain why:

Expected (rtx) timers are all about sending an rtx-request for a packet that has not arrived (yet). If you imagine a "network" (upstream) and a "decoder" (downstream) side of the jitterbuffer (JB), the idea is that the network is chaotic, and the job of the JB is to create order for the decoder. Now, complete chaos in when we receive packets should not really affect when the decoder needs those packets. Extreme jitter and reordering of packets on the network side should not change the timing of when those buffers need to be presented to the decoder. Hence, the expected time of those packets should be driven by when they need to be presented, not by when they happen to arrive from the wire.

> @@ +2832,3 @@
>
> +  /* calculate a pts based on rtptime and arrival time (dts) */
> +  pts = rtp_jitter_buffer_calculate_pts (priv->jbuf, dts, rtptime,
>
> Calculating the PTS changes jbuf's state... is this going to cause problems
> if we later drop this specific packet? Previously we would only change the
> state once the packet is definitely inserted.

Very good question. I was worried about this too, but I have not found any cases (practical or theoretical) where it is a problem. All existing tests are passing, but I am very happy to help write up a new test-case to prove me wrong, though.
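To spell out the distinction in the two answers above, a tiny sketch (hypothetical names, not the element's code):

#include <gst/gst.h>

/* The expected time of the next seqnum, and thus when an rtx timer should
 * fire, follows the presentation timeline of the stream (pts), not the
 * possibly chaotic arrival times. */
static GstClockTime
expected_time_of_next_packet (GstClockTime last_pts, GstClockTime packet_spacing)
{
  return last_pts + packet_spacing;
}

/* The arrival time (dts) is still the right metric for measuring how long
 * a retransmission actually took. */
static GstClockTime
rtx_round_trip_time (GstClockTime rtx_request_sent, GstClockTime rtx_packet_dts)
{
  return rtx_packet_dts - rtx_request_sent;
}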
(In reply to Håvard Graff (hgr) from comment #3)
> > @@ +2832,3 @@
> >
> > +  /* calculate a pts based on rtptime and arrival time (dts) */
> > +  pts = rtp_jitter_buffer_calculate_pts (priv->jbuf, dts, rtptime,
> >
> > Calculating the PTS changes jbuf's state... is this going to cause problems
> > if we later drop this specific packet? Previously we would only change the
> > state once the packet is definitely inserted.
>
> Very good question. I was worried about this too, but I have not found any
> cases (practical or theoretical) where it is a problem. All existing tests
> are passing, but I am very happy to help write up a new test-case to prove
> me wrong, though.

It will break in all cases where you get a packet with a completely different seqnum (and timestamp!) but ignore it because there are not enough consecutive seqnums following (see the gap code below your PTS calculation, the cases where it does not go into reset). So, for example, a spurious packet arriving here.
(In reply to Sebastian Dröge (slomo) from comment #4)
> It will break in all cases where you get a packet with a completely
> different seqnum (and timestamp!) but ignore it because there are not
> enough consecutive seqnums following (see the gap code below your PTS
> calculation, the cases where it does not go into reset). So, for example,
> a spurious packet arriving here.

Sorry, I am not quite following. Why don't we write some tests to check old vs. new behavior instead of theorizing about it? If you can describe the behavior you want to make sure is not broken with the new patch, I am happy to try and write it up.

A steady flow of buffers interrupted by a buffer with a completely unrelated seqnum and timestamp, and then going back to the normal stream? What is the expected behavior? Why would we encounter such a buffer? Given that this is a stream with the same SSRC, we would likely be facing either a faulty RTP implementation (those exist!) or a malicious attack (those exist too!).

I can think of three scenarios:

1. The single bad packet: 1, 2, 3, 4, 60000, 5, 6, 7, 8, etc.
2. Restarting the stream: 1, 2, 3, 4, 60000, 60001, 60002, 60003, etc.
3. Interleaved madness: 1, 2, 3, 4, 60000, 5, 60001, 6, 60002, 7, etc.

Maybe we should first decide what the desired behavior is in each of those cases, and then check how that matches reality for the old and the new implementation? Sounds good?
(In reply to Håvard Graff (hgr) from comment #5)
> 1. The single bad packet: 1, 2, 3, 4, 60000, 5, 6, 7, 8, etc.

That packet is queued up, but as the following ones are not a continuation it is dropped again (we wait for, IIRC, 5 consecutive packets).

> 2. Restarting the stream: 1, 2, 3, 4, 60000, 60001, 60002, 60003, etc.

We queue up the new packets, and after 5 consecutive ones they are all output after resetting the jitterbuffer.

> 3. Interleaved madness: 1, 2, 3, 4, 60000, 5, 60001, 6, 60002, 7, etc.

Chaos :) We would output the 1, 2, 3... stream and queue up the 6000x packets just to drop them again on the next packet. Whatever comes first wins, until there are 5 consecutive packets of something new.

Whether this is the desired behaviour, I don't know. It seems sensible, though, and it is what is currently implemented.

The problem now is that with your patch, cases 1) and 3) update jitterbuffer state for the 6000x packets even though they are thrown away. As the jitterbuffer stores the ext timestamp, recv times and send times, this seems undesirable.
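A hypothetical, simplified sketch of that behaviour (the threshold and all names are assumptions, not the actual gstrtpjitterbuffer code):

#include <glib.h>

#define NUM_CONSECUTIVE 5       /* assumed "5 consecutive packets" threshold */

typedef struct
{
  guint16 next_seqnum;          /* next expected seqnum of the current stream */
  guint16 last_gap_seqnum;      /* last queued seqnum of a possible new stream */
  guint num_gap;                /* how many consecutive "new" seqnums we hold */
} GapState;

/* Returns TRUE if the packet is accepted on the current stream. *do_reset is
 * set when enough consecutive packets of something new have arrived.
 * Simplified: no wraparound or reordering handling. */
static gboolean
handle_seqnum (GapState * s, guint16 seqnum, gboolean * do_reset)
{
  *do_reset = FALSE;

  if (seqnum == s->next_seqnum) {
    /* the current stream continues: any queued outliers are dropped again */
    s->num_gap = 0;
    s->next_seqnum = seqnum + 1;
    return TRUE;
  }

  if (s->num_gap > 0 && seqnum != (guint16) (s->last_gap_seqnum + 1))
    s->num_gap = 0;             /* not a continuation of the queued packets */

  s->last_gap_seqnum = seqnum;
  s->num_gap++;

  if (s->num_gap == NUM_CONSECUTIVE) {
    /* a plausible new stream: reset and follow it from here */
    *do_reset = TRUE;
    s->next_seqnum = seqnum + 1;
    s->num_gap = 0;
  }
  return FALSE;
}

With this sketch, case 1 queues the single outlier and drops it again when seqnum 5 arrives, case 2 resets on the fifth consecutive new seqnum, and case 3 keeps dropping the queued outliers because they never become consecutive.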
Ok, so how about this as a plan:

1. I will write up a test for case 1 and try to craft it so that it will fail with the suggested patch.
2. I will move the gap-handling code around a bit (maybe refactor and split it into functions where possible) to do the "reset" cases prior to calculating pts, which should give exactly the same behavior as before.

Sounds good?
Sounds good, yes. Thanks :)
So I have experimented a bit over the last few days, and I can't make it fail... Here is the test I ended up writing:

GST_START_TEST (test_single_bad_packet_in_stream)
{
  GstHarness *h = gst_harness_new ("rtpjitterbuffer");
  GstClockTime now = 0;
  const gint num_consecutive = 5;
  gint i;

  g_object_set (h->element, "latency", 80, NULL);
  gst_harness_set_src_caps (h, generate_caps ());

  for (i = 0; i < num_consecutive; i++) {
    gst_harness_set_time (h, now);
    fail_unless_equals_int (GST_FLOW_OK,
        gst_harness_push (h,
            generate_test_buffer_full (now, TRUE, i, i * PCMU_RTP_TS_DURATION)));
    now += PCMU_BUF_DURATION;
  }

  fail_unless (gst_harness_crank_single_clock_wait (h));

  for (i = 0; i < num_consecutive; i++) {
    GstBuffer *buf = gst_harness_pull (h);
    fail_unless_equals_int (i, get_rtp_seq_num (buf));
    fail_unless_equals_uint64 (i * PCMU_BUF_DURATION, GST_BUFFER_PTS (buf));
    gst_buffer_unref (buf);
  }

  /* push one packet with a completely wrong sequence number and rtp-timestamp */
  fail_unless_equals_int (GST_FLOW_OK,
      gst_harness_push (h,
          generate_test_buffer_full (now, TRUE, 1000, 22222 * PCMU_RTP_TS_DURATION)));

  /* and then continue the same flow of buffers */
  for (i = num_consecutive; i < num_consecutive * 2; i++) {
    gst_harness_set_time (h, now);
    fail_unless_equals_int (GST_FLOW_OK,
        gst_harness_push (h,
            generate_test_buffer_full (now, TRUE, i, i * PCMU_RTP_TS_DURATION)));
    now += PCMU_BUF_DURATION;
  }

  for (i = num_consecutive; i < num_consecutive * 2; i++) {
    GstBuffer *buf = gst_harness_pull (h);
    fail_unless_equals_int (i, get_rtp_seq_num (buf));
    fail_unless_equals_uint64 (i * PCMU_BUF_DURATION, GST_BUFFER_PTS (buf));
    gst_buffer_unref (buf);
  }

  gst_harness_teardown (h);
}

GST_END_TEST;

As you can see, I verify that the pts is correct on all outgoing buffers. For the one "bad" packet in the middle I am able to trigger a few different code-paths (some of which look like bugs) by changing the seqnum and rtp-timestamp to various things, but I am unable to make the stream fail. Maybe you could experiment a bit with that test and see if you can get a failing scenario I could work with?
I don't have time for that for at least the next two weeks, and lots of other bugs are already piling up on my list, sorry. But even if you can't actually cause problems here, you surely agree that changing the jitterbuffer state based on such broken packets, which we also drop afterwards, is less than ideal. It is going to bite us sooner or later, and then we will have a difficult-to-debug problem somewhere.

But maybe the fact that you can't make it fail now is an indication that the existing jitterbuffer code is broken. Does your wrong packet go into the skew calculation code, does it cause the times in the jitterbuffer to be updated, and are those times then used for the following packets? It might not have any effect here because we only calculate the skew from the packets with the lowest jitter (the window of packets there), and your packet is always at the very top?
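To make that last point concrete, a self-contained sketch of the idea (not the actual calculate_skew code; the window size and structure are assumptions):

#include <string.h>
#include <gst/gst.h>

#define WINDOW_SIZE 32          /* assumed window length */

typedef struct
{
  GstClockTimeDiff window[WINDOW_SIZE];
  guint len;
} SkewEstimator;

/* The skew estimate follows the minimum (send-time, recv-time) delay inside
 * a sliding window. A spurious packet shows up with a huge delay, sits at
 * the top of the window and never becomes the minimum, so it does not move
 * the estimate. */
static GstClockTimeDiff
skew_update (SkewEstimator * s, GstClockTime send_time, GstClockTime recv_time)
{
  GstClockTimeDiff delay = GST_CLOCK_DIFF (send_time, recv_time);
  GstClockTimeDiff min;
  guint i;

  if (s->len < WINDOW_SIZE) {
    s->window[s->len++] = delay;
  } else {
    /* drop the oldest sample and append the new one */
    memmove (s->window, s->window + 1,
        (WINDOW_SIZE - 1) * sizeof (s->window[0]));
    s->window[WINDOW_SIZE - 1] = delay;
  }

  min = s->window[0];
  for (i = 1; i < s->len; i++)
    min = MIN (min, s->window[i]);

  return min;                   /* the current skew estimate */
}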
Created attachment 338991 [details] [review]
refactor the gap-handling

I have tried to make the gap-handling code a bit clearer, and added the pts-calculation in the two places where it is now actually needed.
Created attachment 339053 [details] [review]
updated patch

Found a small difference in behavior: the lost-timer > max_dropout handling would also trigger without gap > 0, so I changed that to be identical to the old behavior.
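For clarity, a tiny hedged sketch of that detail (illustrative names and condition, not the patch itself):

#include <gst/gst.h>

/* The max_dropout handling for an expected/lost timer should only apply to
 * packets that actually create a gap (gap > 0), matching the old behavior. */
static gboolean
should_treat_as_dropout (gint gap, GstClockTime time_since_last,
    GstClockTime max_dropout_time)
{
  return gap > 0 && time_since_last > max_dropout_time;
}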
commit fb9c75db360ff1aa61710804b08788b6c35f44df
Author: Havard Graff <havard.graff@gmail.com>
Date:   Sun Oct 9 15:59:05 2016 +0200

    rtpjitterbuffer: fix lost-event using dts instead of pts

    The lost-event was using a different time-domain (dts) than the outgoing
    buffers (pts). Given certain network conditions these two would become
    sufficiently different that the lost-event contained a timestamp/duration
    that was plainly wrong. As an example, GstAudioDecoder could produce a
    stream that jumps back and forth in time after receiving a lost-event.

    The previous behavior calculated the pts (based on the rtptime) inside
    the rtp_jitter_buffer_insert function, but this functionality has now
    been refactored into a new function, rtp_jitter_buffer_calculate_pts,
    that is called much earlier in the _chain function, making pts available
    to the various calculations that previously (and wrongly) used dts, like
    the lost-event.

    There are however two calculations where using dts is the right thing to
    do: the receive-jitter and the rtx round-trip-time, where the arrival
    time of the buffer from the network is the right metric (and that is
    what dts is today).

    The patch also adds two tests regarding B-frames, or the
    "rtptime-going-backwards" scenario, since there were some concerns that
    this patch might break that behavior (which the tests show it does not).