Bug 770934 – rtsp: Make RTSP TIMEOUT in RTP/RTSP/TCP transport deterministic

After an evaluation, GNOME has moved from Bugzilla to GitLab. Learn more about GitLab.
No new issues can be reported in GNOME Bugzilla anymore.
To report an issue in a GNOME project, go to GNOME GitLab.
Do not go to GNOME Gitlab for: Bluefish, Doxygen, GnuCash, GStreamer, java-gnome, LDTP, NetworkManager, Tomboy.

Bug 770934 - rtsp: Make RTSP TIMEOUT in RTP/RTSP/TCP transport deterministic


Summary:	rtsp: Make RTSP TIMEOUT in RTP/RTSP/TCP transport deterministic


Status:	RESOLVED OBSOLETE

Product:	GStreamer
Classification:	Platform
Component:	gst-plugins-base
Version:	git master
Hardware:	Other Linux

Importance:	Normal enhancement
Target Milestone:	git master
Assigned To:	GStreamer Maintainers
QA Contact:	GStreamer Maintainers

URL:
Whiteboard:

Depends on:
Blocks:

Reported:	2016-09-06 09:22 UTC by Dag Gullberg
Modified:	2018-11-03 11:49 UTC

See Also:
GNOME target:	---
GNOME version:	---

Attachments
Fix BACKLOG delayed TIMEOUT trigger (1012 bytes, patch) 2016-09-06 11:51 UTC, Dag Gullberg	needs-work	Details \| Review
Updated patch including check test modification (1.82 KB, patch) 2016-09-23 09:35 UTC, Dag Gullberg	none	Details \| Review
Update (2.77 KB, patch) 2016-09-30 14:18 UTC, Dag Gullberg	none	Details \| Review
New solution plugins_base (7.01 KB, patch) 2017-01-27 09:45 UTC, Dag Gullberg	none	Details \| Review
New solution rtsp-server (7.55 KB, patch) 2017-01-27 09:51 UTC, Dag Gullberg	none	Details \| Review
New solution plugins base (8.76 KB, patch) 2017-02-03 12:13 UTC, Dag Gullberg	none	Details \| Review
New solution rtsp server (10.27 KB, patch) 2017-02-03 12:14 UTC, Dag Gullberg	reviewed	Details \| Review
New solution plugins base - update 1 (4.72 KB, patch) 2017-02-27 10:44 UTC, Dag Gullberg	reviewed	Details \| Review

Description Dag Gullberg 2016-09-06 09:22:34 UTC

When streaming over RTP/RTSP/TCP, and loosing network connection, there is a RTSP TIMEOUT triggered to run.

When there is congestion or otherwise some problem with the network, a watch BACKLOG is used to store messages intended for send. In the case of a disconnect, the queue will fill up, but it take a few moments (queue size set to e.g. 100). 

The issue is that during the use of the BACKLOG queue, the keep-alive is still handled, although there are no connection. This means that the RTSP TIMEOUT (60+5s) is touched and reset at the keep-alive and not allowed to run until after up to several seconds have passed, and the queue is full. 

It would be natural that the RTSP TIMEOUT is allowed to run at the first possible moment there is an indication of LOSS, like the RESOURCE TEMPORARILY UNAVAILABLE indication from the non-blocking write socket when it receives EWOULDBLOCK in the function write_bytes().

The impact is that functionality depending on deterministic values of the time delay between the last frame of the lost stream and the timeout expiration, will have a hard time bridging that gap. (e.g. fail over file save locally)

This manifest itself only in TCP not in UDP.

A patch has been prepared that addresses this issue.

Comment 1 Dag Gullberg 2016-09-06 11:51:18 UTC

Created attachment 334897 [details] [review]
Fix BACKLOG delayed TIMEOUT trigger

The fix is simple, as to allow for a default error indication to pass to upper layers, where the send_func is evaluated for its response to trigger keep-alive and ultimately reset the timeout when it should not. Do not force OK if it is not. The BACKLOG activity will still be ensured.

Comment 2 Sebastian Dröge (slomo) 2016-09-07 08:04:13 UTC

Comment on attachment 334897 [details] [review]
Fix BACKLOG delayed TIMEOUT trigger

Please include the whole explanation what this fixes and how in the commit message.


But basically the only thing this does is to default to ERROR (so nothing happens => ERROR), not override it with OK and propagate upwards whatever writing data has returned. That seems reasonable but I'm not sure why that fixes this specific problem.

Comment 3 Dag Gullberg 2016-09-07 09:00:04 UTC

Thanks for quick feedback.

gst_rtsp_stream_transport_send_rtp() in rtsp-stream-transport.c have a construct like:

  if (priv->send_rtp)
    res =
        priv->send_rtp (buffer, priv->transport->interleaved.min,
        priv->user_data);

  if (res)
    gst_rtsp_stream_transport_keep_alive (trans);


The gst_rtsp_stream_transport_keep_alive function touches the timeout, resetting it. This is what we would like to avoid if we don't actually have a connection, momentary or not.

The send_rtp CB is set in rtsp-client handle_setup_request()

  if (ct->lower_transport == GST_RTSP_LOWER_TRANS_TCP) {
    /* our callbacks to send data on this TCP connection */
    gst_rtsp_stream_transport_set_callbacks (trans,
        (GstRTSPSendFunc) do_send_data,
        (GstRTSPSendFunc) do_send_data, client, NULL);

When streaming the following call chain is used:

rtsp-stream-transport::gst_rtsp_stream_transport_send_rtp()
	rtsp-client<cb>::do_send_data()
		rtsp-client<cb>::do_send_message()
			gstrtspconnection::gst_rtsp_watch_send_message()
				gstrtspconnection::gst_rtsp_watch_write_data()
					gstrtspconnection::write_bytes()

It is in the rtsp-client.c function do_send_data() where the boolean return value used to do keep-alive or not is set by

  return res == GST_RTSP_OK;

If gst_rtsp_watch_write_data() returns GST_RTSP_ERROR or GST_RSTP_EINTR, the only effect, as far as have been understood, is that the do_keep_alive function in gst_rtsp_stream_transport_send_rtp() will not be executed, which is what we want. It only touches the rtsp session timeout.

And we want it so, because the BACKLOG constitutes a UNKNOWN delay in the timeout. This unknown makes it hard to secure e.g. the failover use case, saving a file locally.We can guess it, but it is not optimal. If we go on preset, deterministic timeout values, we get a stream GAP. This is seen only in TCP, not UDP.

It is reasonable that the RTSP timeout starts at the first possible moment where it is seen that the connection may be down. 

Cheers.

Comment 4 Dag Gullberg 2016-09-15 07:48:40 UTC

Weekly PING. /Cheers

Comment 5 Dag Gullberg 2016-09-22 07:44:31 UTC

Weekly PING. /Cheers

Comment 6 Dag Gullberg 2016-09-23 09:35:08 UTC

Created attachment 336147 [details] [review]
Updated patch including check test modification

Same as before, but also fixing the broken unit test.

Comment 7 Dag Gullberg 2016-09-29 12:14:29 UTC

Hi. Weekly ping. I know u guys are busy with release work, but it would be peachy if u could find time to review a couple of my patches. /Cheers

Comment 8 Nicolas Dufresne (ndufresne) 2016-09-29 14:15:23 UTC

Review of attachment 336147 [details] [review]:

The commit message should be improved, it's neither clear or precise enough.

::: gst-libs/gst/rtsp/gstrtspconnection.c
@@ +3780,3 @@
     guint size, guint * id)
 {
+  GstRTSPResult res = GST_RTSP_ERROR;

How is this related to timeout ?

Comment 9 Dag Gullberg 2016-09-30 14:18:24 UTC

Created attachment 336676 [details] [review]
Update

Added comment in the patch

Comment 10 Dag Gullberg 2016-09-30 14:29:38 UTC

(In reply to Dag Gullberg from comment #9)
> Created attachment 336676 [details] [review] [review]
> Update
> 
> Added comment in the patch

(In reply to Nicolas Dufresne (stormer) from comment #8)
> Review of attachment 336147 [details] [review] [review]:
> 
> The commit message should be improved, it's neither clear or precise enough.
> 
> ::: gst-libs/gst/rtsp/gstrtspconnection.c
> @@ +3780,3 @@
>      guint size, guint * id)
>  {
> +  GstRTSPResult res = GST_RTSP_ERROR;
> 
> How is this related to timeout ?

I have now written a detailed commit message that clearly point out the benefit of having the gst_rtsp_watch_write_data() return something else than GST_RTSP_OK when the non-blocking write socket buffer is full. It has through a double layer of CBs the benefit of not resetting the TIMEOUT timer when infact there are problems sending.

I have checked the call chain and there are no other dependencies. Note that this patch then allow for a even more elaborate improvent to be done, checkeing when the TCP stack notice the NW problem. See https://bugzilla.gnome.org/show_bug.cgi?id=770935 . Both these fixes together makes the timeout behave like UDP/RTP using RTCP, where there is no unduly delay in the starting of the timeout timer.

Ultimately this is beneficial to use cases concerned with NW going down, e.g. fail-over local file save. As the other way to remedy this is to increase buffer sizes to cover the delay, a hard to define quantity, it eats from a high priced resource from the system, memory.

Comment 11 Dag Gullberg 2016-10-06 07:58:06 UTC

Just a remainder for review /Cheers

Comment 12 Dag Gullberg 2016-10-17 08:06:49 UTC

Please review

Comment 13 Dag Gullberg 2016-10-26 08:25:16 UTC

Weekly remainder. Please review.

Comment 14 Dag Gullberg 2016-11-03 08:42:20 UTC

Please review

Comment 15 Dag Gullberg 2016-11-18 09:05:35 UTC

Please review this patch

Comment 16 Dag Gullberg 2016-11-29 07:49:51 UTC

Please review this patch

Comment 17 Tim-Philipp Müller 2016-12-06 20:45:54 UTC

I see what you're trying to achieve here, and I think it makes sense.

But I don't really like the way this is achieved here :) We're basically changing the return value of a public functions to return ERROR where it returned OK before, for something that might happen under normal operation. I don't think that's great, and we don't know if there are any other users or not.

I think it would be great if we could come up with another way to do this that doesn't change behaviour like this unexpectedly.

Brainstorming: maybe we could make it an opt-in by setting some flag on the connection, or we can add a function to check if there's pending backlog or not, or similar?

Comment 18 Tim-Philipp Müller 2016-12-07 07:44:27 UTC

Taking into account also bug #770935 I wonder whether some kind of gst_rtsp_watch_{get,check}_status() might make sense here?

For this bug, it might be enough for the caller to take into account the returned id value as well, no? (if return is OK and ID is >0 then there's backlog)

Comment 19 Dag Gullberg 2017-01-27 09:45:37 UTC

Created attachment 344382 [details] [review]
New solution plugins_base

Patch 1(2) The other patch is for gst-rtsp-server

Comment 20 Dag Gullberg 2017-01-27 09:51:32 UTC

Created attachment 344383 [details] [review]
New solution rtsp-server

Comment 21 Dag Gullberg 2017-02-03 12:13:19 UTC

Created attachment 344846 [details] [review]
New solution plugins base

Comment 22 Dag Gullberg 2017-02-03 12:14:23 UTC

Created attachment 344847 [details] [review]
New solution rtsp server

Comment 23 Dag Gullberg 2017-02-03 12:15:55 UTC

Have made the feature configurable. It is now setup as a property of the rtsp-client.

Plz review.

Comment 24 Tim-Philipp Müller 2017-02-23 11:18:50 UTC

Comment on attachment 344846 [details] [review]
New solution plugins base

>--- a/gst-libs/gst/rtsp/gstrtspconnection.h
>+++ b/gst-libs/gst/rtsp/gstrtspconnection.h
>@@ -192,6 +195,10 @@ typedef struct {
>                                              GstRTSPMessage *request,
>                                              GstRTSPMessage *response,
>                                              gpointer user_data);
>+  void              (*connloss_status)  (GstRTSPWatch *watch,
>+                                         GstRTSPResult result,
>+                                         GSocket *write_socket,
>+                                         gpointer user_data);
> 
>   /*< private >*/
>   gpointer _gst_reserved[GST_PADDING-1];

No opinion on the new vfunc or the rest of the patch yet, but you need to change the padding here to GST_PADDING-2, that's what it's for :)

>diff --git a/tests/check/libs/struct_arm.h b/tests/check/libs/struct_arm.h
>index d46b37b..b7b3e6e 100644
>--- a/tests/check/libs/struct_arm.h
>+++ b/tests/check/libs/struct_arm.h
>@@ -56,7 +56,7 @@ GstCheckABIStruct list[] = {
>   {"GstRTSPTimeRange", sizeof (GstRTSPTimeRange), 40},
>   {"GstRTSPTransport", sizeof (GstRTSPTransport), 76},
>   {"GstRTSPUrl", sizeof (GstRTSPUrl), 32},
>-  {"GstRTSPWatchFuncs", sizeof (GstRTSPWatchFuncs), 40},
>+  {"GstRTSPWatchFuncs", sizeof (GstRTSPWatchFuncs), 44},

And then you also don't need to fix up these structure sizes, which are supposed to stay constant to maintain ABI stability.

Comment 25 Dag Gullberg 2017-02-27 10:44:51 UTC

Created attachment 346810 [details] [review]
New solution plugins base - update 1

Changed to not include ABI related structures, instead modifying struct padding

Comment 26 Dag Gullberg 2017-03-20 08:19:52 UTC

Plz review

Comment 27 Dag Gullberg 2017-04-18 11:11:23 UTC

Please review this patch

Comment 28 Tim-Philipp Müller 2017-04-18 11:22:39 UTC

Too late now for 1.12, has to wait until after, sorry :)

Comment 29 Dag Gullberg 2017-05-30 08:43:06 UTC

Review?

Comment 30 Tim-Philipp Müller 2017-09-14 18:21:24 UTC

Comment on attachment 344847 [details] [review]
New solution rtsp server

>From 4ffbd89b7451efb40e6c8b583dd5af6bbc55f636 Mon Sep 17 00:00:00 2001
>From: Dag Gullberg <dagg@axis.com>
>Date: Fri, 3 Feb 2017 13:05:22 +0100
>Subject: [PATCH 1/1] RTSP TIMEOUT in RTP/RTSP/TCP transport not deterministic
>
>In using TCP as transport, the resetting of the RTSP session
>timeout is handled via call to do_keepalive() in
>gst_rtsp_stream_transport_send_rtp(). The do_keep_alive function
>does only reset timestamps for the timeout timer.
>
>When write_bytes() in gstrtspconnection.c ends up in
>RESOURCE_TEMPORARILY_UNAVAILABLE - typically caused by the socket
>buffer being full - while doing its non-blocking writes, it starts
>to fill the BACKLOG queue. At this point it would be good to have
>the TIMEOUT started, and not reset until conditions are OK.
>
>Currently, the TIMEOUT starts when the BACKLOG is FULL, which may
>add seconds to the actual TIMEOUT.
>
>The solution is to use the watch to propagate the true value of
>the write operaion. A connloss flag is defined and propagated via
>callbacks to the transport layer.

Why are BACKLOG and TIMEOUT capitalised here? :)

It sounds like you're saying we want to start the timeout when the backlog queue (in GstRTSPConnection) is starting to fill up, but I was under the impression the whole point of this patchset was to start the timeout when the socket buffer starts building up beyond some normal limit (CA_LOSS), i.e. even earlier ?


 
>+#include <netinet/tcp.h>
>+#include <gio/gnetworking.h>

I think these includes belong into the other patch that does the tcp socket stuff, and first one should probably be guarded by G_OS_UNIX? Is the first needed given the second? Do we need to bump glib req for the second?


>@@ -90,6 +92,8 @@ struct _GstRTSPClientPrivate
>+  gboolean connloss;
>+  gboolean tcp_connloss_detect;

Please add a comment for each boolean describing what they represent, one is the property, the other one is 'TRUE if we have detected a potential loss of the connection' or somesuch.


>+  g_object_class_install_property (gobject_class, PROP_TCP_CONNLOSS_DETECT,
>+      g_param_spec_boolean ("tcp-connloss-detect", "TCP Connection Loss Detect",
>+          "Early detection of connection loss to exactly time the RTSP timeout",
>+          DEFAULT_TCP_CONNLOSS_DETECT,
>+          G_PARAM_READWRITE | G_PARAM_STATIC_STRINGS));

Should we mention that it might cost some performance?

There should be a gtk-doc blurb for the property as well, that copies the description and also includes a 'Since: 1.14' marker (check gstreamer/gst/gstpipeline.c for example).




>@@ -729,6 +740,9 @@ gst_rtsp_client_get_property (GObject * object, guint propid,
>+    case PROP_TCP_CONNLOSS_DETECT:
>+      g_value_set_boolean (value, priv->tcp_connloss_detect);
>+      break;

Why no locking here, but locking in the setter below? Should both take lock then imho.


 
>+static void
>+connloss_status (GstRTSPWatch * watch, GstRTSPResult result,
>+    GSocket * write_socket, gpointer user_data)
>+{
>+  GstRTSPClient *client = GST_RTSP_CLIENT (user_data);
>+  GstRTSPClientPrivate *priv = client->priv;
>+
>+  priv->connloss = (result == GST_RTSP_EINTR);
>+
>+  GST_LOG ("client %p: connloss_status %s", client,
>+      (priv->connloss ? "CONNECTION LOST" : "OK"));
>+}

That's quite a strong message ("connection lost") - is it really accurate here? (Also taking into account the follow-up patch that checks the socket). It might just indicate a potential problem, could be bandwidth too low or a short-lived networking issue or something, no?

> static gboolean
> handle_tunnel (GstRTSPClient * client)
> {
>@@ -4279,7 +4325,8 @@ static GstRTSPWatchFuncs watch_funcs = {
>   tunnel_post,
>   error_full,
>   tunnel_lost,
>-  tunnel_http_response
>+  tunnel_http_response,
>+  connloss_status
> };
> 
> static void
>@@ -4330,6 +4377,7 @@ gst_rtsp_client_attach (GstRTSPClient * client, GMainContext * context)
>   /* create watch for the connection and attach */
>   priv->watch = gst_rtsp_watch_new (priv->connection, &watch_funcs,
>       g_object_ref (client), (GDestroyNotify) client_watch_notify);
>+  gst_rtsp_watch_set_connloss_detect (priv->watch, priv->tcp_connloss_detect);
>   gst_rtsp_client_set_send_func (client, do_send_message, priv->watch,
>       (GDestroyNotify) gst_rtsp_watch_unref);


>--- a/gst/rtsp-server/rtsp-stream-transport.c
>+++ b/gst/rtsp-server/rtsp-stream-transport.c
>@@ -52,6 +52,7 @@ struct _GstRTSPStreamTransportPrivate
> 
>   GstRTSPSendFunc send_rtp;
>   GstRTSPSendFunc send_rtcp;
>+  GstRTSPSendConnLossFunc send_connloss;
>   gpointer user_data;
>   GDestroyNotify notify;

See comment about name below please.



>+/**
>+ * gst_rtsp_stream_transport_set_callbacks:
>+ * @trans: a #GstRTSPStreamTransport
>+ * @send_connloss: a callback called to check lates low level connloss status
>+ *
>+ * Install callbacks that will be called when data for a stream should be sent
>+ * to a client. This is usually used when sending RTP/RTCP over TCP.
>+ */

Wrong name -> gst_rtsp_stream_transport_set_connloss_callback

Typo: -> "the latest"

Add 'Since: 1.14' here please.


>+void
>+gst_rtsp_stream_transport_set_connloss_callback (GstRTSPStreamTransport * trans,
>+    GstRTSPSendConnLossFunc send_connloss)
>+{
>+  g_return_if_fail (GST_IS_RTSP_STREAM_TRANSPORT (trans));
>+  trans->priv->send_connloss = send_connloss;
>+}

Why is this called *send*_conloss? I find this confusing seeing what send_rtp/send_rtcp do. We're not sending anything here, we're checking something or getting some status, no?



>@@ -460,7 +491,7 @@ gst_rtsp_stream_transport_send_rtp (GstRTSPStreamTransport * trans,
>
>-  if (res)
>+  if (res && !gst_rtsp_stream_low_level_connloss (trans))
>     gst_rtsp_stream_transport_keep_alive (trans);
> 
>@@ -489,7 +520,7 @@ gst_rtsp_stream_transport_send_rtcp (GstRTSPStreamTransport * trans,
> 
>-  if (res)
>+  if (res && !gst_rtsp_stream_low_level_connloss (trans))
>     gst_rtsp_stream_transport_keep_alive (trans);

Would be nice to have a short comment here (something like that from the patch in the other bug), e.g. "Reset timeout whenever we managed to successfully send some data, but don't reset it if we have a send backlog building up in the tcp stack, which might indicate a problem."



>--- a/gst/rtsp-server/rtsp-stream-transport.h
>+++ b/gst/rtsp-server/rtsp-stream-transport.h

> /**
>+ * GstRTSPSendConnLoss:

Which one is it - GstRTSPSendConnLossFunc or GstRTSPSendConnLoss ? :)

>+ * @user_data: user data
>+ *
>+ * Function registered with gst_rtsp_stream_transport_set_callbacks() and called
>+ * when data is sent to check low level connection loss.
>+ */
>+typedef gboolean (*GstRTSPSendConnLossFunc) (gpointer user_data);

Please add a 'Since: 1.14' marker to gtk-doc chunk, and document what the boolean return value indicates.


>+void                     gst_rtsp_stream_transport_set_connloss_callback (GstRTSPStreamTransport *trans,
>+                                                                  GstRTSPSendConnLossFunc send_connloss);

Needs to be updated for git master (needs a GST_EXPORT decorator in front now). Please also run make update-exports so it gets added to the .def file.

Comment 31 Tim-Philipp Müller 2017-09-14 18:26:00 UTC

Comment on attachment 346810 [details] [review]
New solution plugins base - update 1

>--- a/gst-libs/gst/rtsp/gstrtspconnection.c
>+++ b/gst-libs/gst/rtsp/gstrtspconnection.c

>+#include <netinet/tcp.h>

Is this include still needed?

 
>+/**
>+ * gst_rtsp_watch_set_connloss_detect:
>+ * @watch: a #GstRTSPWatch
>+ * @use_connloss_detect: use conncloss detection
>+ *
>+ * When @use_connloss_detect is %TRUE, an early connection loss detection
>+ * is used. This is needed to exactly time the RTSP timeout in higher layers.
>+ *
>+ * Since: 1.11
>+ */

Please update to 1.14 :)



>diff --git a/gst-libs/gst/rtsp/gstrtspconnection.h b/gst-libs/gst/rtsp/gstrtspconnection.h
>index 43479ae..b0e116c 100644
>--- a/gst-libs/gst/rtsp/gstrtspconnection.h
>+++ b/gst-libs/gst/rtsp/gstrtspconnection.h
>@@ -171,6 +171,9 @@ typedef struct _GstRTSPWatch GstRTSPWatch;
>  * @tunnel_http_response: callback when an HTTP response to the GET request
>  *   is about to be sent for a tunneled connection. The response can be
>  *   modified in the callback. Since 1.4.
>+ * @connloss_status: Callback when sending rtp/rtcp over TCP to send back the
>+ *   result of the socket send operation, indication actual loss of connection,
>+ *   temporary or otherwise.

Perhaps 'send back the result of the socket send operation' -> 'check the low-level state of the socket to detect...' or have I got this wrong conceptually?

Also please add a Since marker: "...or otherwise. (Since: 1.14)"



>+
>+void               gst_rtsp_watch_set_connloss_detect (GstRTSPWatch * watch,
>+                                                      gboolean use_connloss_detect);

I'm afraid this needs to be updated for git master, needs to have a GST_EXPORT decorator, and should also be in the .def (make upate-exports) and in the docs (docs/libs/...-sections.txt).

Comment 32 Joakim Johansson 2018-02-12 12:53:03 UTC

I think that this ticket and the related ticket 770935 shall be closed.

Reason:
The changes in the patch changed the default return code for gst_rtsp_watch_write_data to ERROR. The result of this change is that there can be unwanted (faulty) session timeouts when the client (or network) is slow in reading.

The change to return ERROR when an item is added to the backlog queue is causing the timeout variables only to be reset when there are no items in the backlog queue.

Comment 33 GStreamer system administrator 2018-11-03 11:49:25 UTC

-- GitLab Migration Automatic Message --

This bug has been migrated to freedesktop.org's GitLab instance and has been closed from further activity.

You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.freedesktop.org/gstreamer/gst-plugins-base/issues/288.