Bug 770935 - RTP/RTSP/TCP transport RTSP TIMEOUT should start at once when tcpi_ca_state is TCP_CA_Loss
Status: RESOLVED OBSOLETE
Product: GStreamer
Classification: Platform
Component: gst-plugins-base
Version: git master
OS: Other Linux
Priority/Severity: Normal major
Target Milestone: git master
Assigned To: Tim-Philipp Müller
QA Contact: GStreamer Maintainers
Depends on:
Blocks:

Reported: 2016-09-06 09:38 UTC by Dag Gullberg
Modified: 2018-11-03 11:49 UTC
See Also:
GNOME target: ---
GNOME version: ---

Attachments

RTP/RTSP/TCP transport RTSP TIMEOUT should start at once when tcpi_ca_state is TCP_CA_Loss (1.81 KB, patch)
2016-09-06 12:15 UTC, Dag Gullberg (needs-work)

Updated version with ifdefs and better handling (1.82 KB, patch)
2016-09-07 11:33 UTC, Dag Gullberg (none)

Due to changes in 770934, this patch changed too. (1.76 KB, patch)
2017-02-13 09:03 UTC, Dag Gullberg (reviewed)

Description Dag Gullberg 2016-09-06 09:38:29 UTC
This is a bug report related to bug 770934.

The issue is getting a deterministic handle on the timeout on the platform. This works fine over UDP, but with TCP tunneling the timing is at the mercy of the persistent nature of TCP.

It is clear that a non-blocking write socket will continue to act normally (as it has a buffer) even across a network disconnect. Retransmit schemes, their related timeouts, the number of retries, and the timeout back-off scheme possibly add to this problem as well.

The net result is a delay of several seconds between the actual tcpi_ca_state in the TCP stack being set to TCP_CA_Loss (the first retransmit) and the user of the socket seeing RESOURCE TEMPORARILY UNAVAILABLE (EWOULDBLOCK). In actual tests it was more than 10 s (although those were DEBUG-mode runs).

The impact is that functionality depending on deterministic values of the time between the last frame of the lost stream and the timeout expiration (e.g. failing over to a local file save) will have a hard time bridging that gap.

A patch has been prepared that addresses this issue.
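
For reference, here is a minimal sketch of how that state can be queried on Linux through getsockopt(TCP_INFO); the helper name socket_in_ca_loss is made up for illustration, and the attached patch implements the equivalent check inside gstrtspconnection.c instead.

#include <netinet/in.h>    /* IPPROTO_TCP */
#include <netinet/tcp.h>   /* struct tcp_info, TCP_CA_Loss (Linux) */
#include <sys/socket.h>    /* getsockopt */

/* Illustrative helper: returns 1 if the kernel reports the connection in
 * the TCP_CA_Loss congestion state, i.e. a retransmission timeout fired. */
static int
socket_in_ca_loss (int fd)
{
  struct tcp_info info;
  socklen_t optlen = sizeof (info);

  if (getsockopt (fd, IPPROTO_TCP, TCP_INFO, &info, &optlen) != 0)
    return 0;   /* query failed: assume no loss detected */

  return info.tcpi_ca_state == TCP_CA_Loss;
}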
Comment 1 Dag Gullberg 2016-09-06 12:15:18 UTC
Created attachment 334899 [details] [review]
RTP/RTSP/TCP transport RTSP TIMEOUT should start at once when tcpi_ca_state is TCP_CA_Loss

This patch depends on the patch in https://bugzilla.gnome.org/show_bug.cgi?id=770934

Understanding that this is likely a Linux-specific fix, I am still putting it up for review. It should perhaps go behind a platform-independent API. Suggestions are most welcome.

The idea is to make sure that we capture the first possible moment of loss to trigger the RTSP TIMEOUT.

The TCP stack indicating TCP_CA_Loss would be that moment.

With this fix, together with the previous one in 770934, the time gap is reduced in one case from 14 s to 0 s, making TCP tunneling behave like the UDP case TIMEOUT-wise.

A desirable feature.
Comment 2 Sebastian Dröge (slomo) 2016-09-07 08:31:52 UTC
Review of attachment 334899 [details] [review]:

::: gst-libs/gst/rtsp/gstrtspconnection.c
@@ +3763,3 @@
+  getsockopt(g_socket_get_fd (sock),
+                      IPPROTO_TCP, TCP_INFO, &tcp_info, &optlen);
+  res = (tcp_info.tcpi_ca_state == TCP_CA_Loss);

This all should be conditional in #ifdefs :)

@@ +3820,3 @@
+       treat same as WOULD_BLOCK. Need this to avoid delay. */
+    if (check_tcp_ca_state_loss(watch->conn->write_socket))
+      res = GST_RTSP_EINTR;

The main difference between WOULD_BLOCK and this, however, is that WOULD_BLOCK means that no data was written, whereas here all data was written (to the OS-internal buffer). Is this handled correctly?

Also, this can happen basically all the time without there being an actual problem. TCP packets getting lost tell you nothing other than that the link is not ideal.
Comment 3 Dag Gullberg 2016-09-07 09:14:28 UTC
Yes, you are right. 

The intent is simply to leave things as they are, but let the RTSP TIMEOUT start at the earliest possible moment.

But true, if we set it to GST_RTSP_EINTR, then the BACKLOG will be exercised, and we do not want that: if the result of the send to the socket was OK, then the data is either in the window or in the socket buffer, and should be treated as *handled*.

It worked out for my disconnect test scenarios for the obvious reason.

I will contemplate these comments and get back with an updated patch.

cheers.
Comment 4 Dag Gullberg 2016-09-07 11:33:41 UTC
Created attachment 334978 [details] [review]
Updated version with ifdefs and better handling

I used G_OS_UNIX ifdefs and placed the check so that it is only used if the send to the socket was OK; in that case the call still terminates as if everything were OK, but returns a generic error if tcpi_ca_state == TCP_CA_Loss.

Tested and shown to work the same for the disconnect/failover case I am working with.

On the reasoning about whether we want to check that state flag: the RTSP timeout is 60 s by default, the TCP timeout is set to 30-35 seconds, and the retransmit timeout is small in the first rounds. If, over the span of 60 s (65 s as it turns out), you never see a state other than Loss, then I think it is reasonable to trigger that timeout.

The ambition is only to handle the start of the RTSP timeout more precisely, not to otherwise change the handling of sending data.
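
To illustrate the intended control flow (a sketch only; send_ok and the GST_RTSP_ERROR return code are assumptions made for this example, while check_tcp_ca_state_loss mirrors the attached patch):

#ifdef G_OS_UNIX
  /* Only consulted when the non-blocking send itself succeeded: the data
   * is treated as handled either way, but a detected TCP_CA_Loss state is
   * reported as a generic error so the RTSP session timeout is not reset. */
  if (send_ok && check_tcp_ca_state_loss (watch->conn->write_socket))
    return GST_RTSP_ERROR;
#endif
  return GST_RTSP_OK;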
Comment 5 Dag Gullberg 2016-09-07 11:36:42 UTC
Just adding a note to my timeout rant. Obviously the socket buffer (even if you add the BACKLOG) will be full long before those 60 s have passed.
Comment 6 Dag Gullberg 2016-09-15 07:49:23 UTC
Weekly PING on this one as well. /Cheers
Comment 7 Dag Gullberg 2016-09-22 07:44:53 UTC
Weekly PING
Comment 8 Dag Gullberg 2016-09-29 12:14:11 UTC
Hi. Weekly ping. I know you guys are busy with release work, but it would be peachy if you could find time to review a couple of my patches. /Cheers
Comment 9 Dag Gullberg 2016-10-06 07:59:10 UTC
Just a reminder for review /Cheers
Comment 10 Dag Gullberg 2016-10-17 08:06:18 UTC
Please review
Comment 11 Dag Gullberg 2016-10-26 08:25:40 UTC
Weekly reminder. Please review.
Comment 12 Dag Gullberg 2016-11-03 08:42:43 UTC
Please review
Comment 13 Dag Gullberg 2016-11-18 09:05:56 UTC
Please review this patch
Comment 14 Dag Gullberg 2016-11-29 07:50:17 UTC
Please review this patch
Comment 15 Tim-Philipp Müller 2016-12-06 21:20:02 UTC
So if I understand correctly, this would call the new check_tcp_ca_state_loss(), doing a getsockopt() for every single chunk of data sent successfully?

I wonder if this has any impact on performance?

As for the new ERROR return, the same comments apply as in the other bug :)
Comment 16 Dag Gullberg 2016-12-12 10:43:08 UTC
It is an external call, but as far as I understand it is just a memory copy of a struct. I doubt it is more expensive than the socket send. To lessen the impact, I could have a counter that makes the call less frequent, like:

#define N 2  /* inverted frequency of getsockopt calls to check TCP_CA_Loss */
static guint cnt;

/* ... */

cnt++;

if (cnt % N == 0) {
  check_tcp_ca_state_loss (watch->conn->write_socket);
}

I will try to do some timing checks with the call enabled and disabled.
Comment 17 Dag Gullberg 2017-02-13 09:03:20 UTC
Created attachment 345606 [details] [review]
Due to changes in 770934, this patch changed too.

Note that the patch in this report depends on the patch in 770934, and cannot be applied separately.
Comment 18 Dag Gullberg 2017-03-20 08:20:18 UTC
Please review
Comment 19 Dag Gullberg 2017-04-18 11:10:57 UTC
Please review this patch
Comment 20 Dag Gullberg 2017-05-30 08:43:48 UTC
review?
Comment 21 Tim-Philipp Müller 2017-09-14 17:39:34 UTC
Comment on attachment 345606 [details] [review]
Due to changes in 770934, this patch changed too.

Was this patch supposed to be attached to bug #770934 instead?

>+#ifdef G_OS_UNIX
>+static gboolean
>+check_tcp_ca_state_loss (GSocket * sock)
>+{
>+  gboolean res;
>+  struct tcp_info tcp_info;
>+  socklen_t optlen = sizeof (tcp_info);
>+
>+  getsockopt (g_socket_get_fd (sock),
>+      IPPROTO_TCP, TCP_INFO, &tcp_info, &optlen);
>+  res = (tcp_info.tcpi_ca_state == TCP_CA_Loss);
>+
>+  GST_DEBUG ("tcpi_ca_state=%u, check_tcp_ca_state_loss=%d",
>+      tcp_info.tcpi_ca_state, res);
>+
>+  return res;
>+}
>+#endif

Will this work on all common Unices? macOS? *BSD? If not, then we need to check what's available.

Also, the includes needed for this are in the other patch in the other bug (although I think they are not needed there).


> static void
> connloss_status (GstRTSPWatch * watch, GstRTSPResult result,
>     GSocket * write_socket, gpointer user_data)
> {
>   GstRTSPClient *client = GST_RTSP_CLIENT (user_data);
>   GstRTSPClientPrivate *priv = client->priv;
>+  gboolean is_eintr = (result == GST_RTSP_EINTR);
> 
>-  priv->connloss = (result == GST_RTSP_EINTR);
>+  priv->connloss = is_eintr;
>+
>+#ifdef G_OS_UNIX
>+  /* If TCP stack state indicate loss, don't reset session TIMEOUT.
>+   * Need this to avoid delay. Return generic error.
>+   */

This comment sounds like it belongs elsewhere, in the other patch that makes use of priv->connloss to skip the timeout reset. I don't understand the bit about 'Return generic error'; what does it mean here? The comment here should be more descriptive as to what we're trying to do and why, e.g. 'Indicate potential connection loss as soon as we detect a write backlog, so that we stop resetting the timeout from the moment any backlog is building up. This makes sure we can honour the actually requested timeout more accurately.' or somesuch.


>+  if (!is_eintr && write_socket)
>+    priv->connloss = check_tcp_ca_state_loss (write_socket);
>+#endif
> 
>   GST_LOG ("client %p: connloss_status %s", client,
>       (priv->connloss ? "CONNECTION LOST" : "OK"));
>-- 
>2.1.4
>
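
On the portability question above, a compile-time fallback could look roughly like the sketch below (an assumption-laden illustration, not the attached patch; whether equivalent state is queryable on macOS or the BSDs would still need checking):

#if defined (__linux__) && defined (TCP_INFO)
/* Linux: struct tcp_info and TCP_CA_Loss come from <netinet/tcp.h>. */
static gboolean
check_tcp_ca_state_loss (GSocket * sock)
{
  struct tcp_info tcp_info;
  socklen_t optlen = sizeof (tcp_info);

  if (getsockopt (g_socket_get_fd (sock),
          IPPROTO_TCP, TCP_INFO, &tcp_info, &optlen) != 0)
    return FALSE;

  return tcp_info.tcpi_ca_state == TCP_CA_Loss;
}
#else
/* No equivalent query known here, so never report loss on other platforms. */
static gboolean
check_tcp_ca_state_loss (GSocket * sock)
{
  return FALSE;
}
#endif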
Comment 22 Joakim Johansson 2018-02-12 12:52:16 UTC
I think that this ticket and the related ticket 770934 should be closed.

Reason:
The changes in the patch changed the default return code of gst_rtsp_watch_write_data to ERROR. The result of this change is that there can be unwanted (spurious) session timeouts when the client (or the network) is slow to read.

The change to return ERROR when an item is added to the backlog queue causes the timeout variables to be reset only when there are no items in the backlog queue.
Comment 23 GStreamer system administrator 2018-11-03 11:49:35 UTC
-- GitLab Migration Automatic Message --

This bug has been migrated to freedesktop.org's GitLab instance and has been closed from further activity.

You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.freedesktop.org/gstreamer/gst-plugins-base/issues/289.