GNOME Bugzilla – Bug 768249
rtspsrc lockup on gst_rtspsrc_stop
Last modified: 2016-07-25 10:27:01 UTC
I am getting a situation that is quite hard to reproduce which essentially locks up a pipeline making it unrecoverable.
+ Trace 236400
Thread 2 (Thread 27966.27968)
It seems that the socket is being created as blocking with an infinite timeout.
There are two things here: a) the reading happens without timeout (there should be one, GstRTSPConnection has one and that should've been configured there), b) the reading happens without a GCancellable too (it's 0x0) and ideally there should be one which would be triggered exactly before gst_rtspsrc_stop() takes the mutex so that everything just shuts down asap.
Looking at the backtrace g_socket_input_stream_read() doesn't take into account any timeout, it calls g_socket_receive_with_blocking() which calls g_socket_receive_with_timeout() with timeout=-1 always. I have noticed that gstreamer sometimes sets canceable to false, probably from build_next() in gstrtspconnection.c
I think I provided some wip patch addressing this in another bug no?
That would be https://bugzilla.gnome.org/show_bug.cgi?id=744209 I guess? That seems to target a different problem, but the general problem here really is the same (GstRTSPConnection needing some bigger refactoring to prevent all such problems and make the code simpler).
Thanks Nicolas, I will soon try out that patch but first I would like to understand why this is happening. Why is conn->may_cancel being set to FALSE in build_next() what consequence does removing this have? It seems like it could resolve the issue or not setting the socket to block forever.
Created attachment 330857 [details] [review] Fix for hanging GstRTSPConnection If conn->may_cancel isn't reset to TRUE the next entrance in build_next may hang if there is no data received through read_bytes since the socket is blocking with an infinite timeout.
I have been using this patch for several days now and it seems to resolve the issue.
Review of attachment 330857 [details] [review]: ::: gst-libs/gst/rtsp/gstrtspconnection.c @@ +2162,3 @@ } done: + conn->may_cancel = TRUE; This seems to defeat the purpose of this may_cancel flag. If you end up here because STATE_START and then STATE_DATA_HEADER/READ_LINES, and then no more data is there until the next iteration, it would go out here. The intention of the flag seems to be to not allow interruptions in the middle of RTSP messages (why?!) and at the end of the message it will go to STATE_END where may_cancel is set to TRUE again. Where does it end up for you, what's the events that lead to reading being called without timeout/cancellable? @@ +2168,3 @@ invalid_body_len: { + conn->may_cancel = TRUE; This one probably makes sense, but after this the connection is unusable anyway, right?
Hi Sebastian, thank you for your help. It looks like the may_cancel flag is used only inside the while(true) loop in build_next() and NOT re-establishing it's state on exit looks odd. If an error inside the loop occurs (maybe because of lost data, lost synchronisation because the remote endpoint sent invalid data, ...) it seems important to re-establish the cancelability, i.e. not doing that only upon receiving a whole, valid rtsp message. If the remote endpoint randomly disappears and somehow the read functions return an error the next build_next() re-entrance read of the first byte of the header hangs forever when it should return 0 and let gst_rtspsrc_loop_udp() call gst_rtsp_conninfo_reconnect(). In that case you can't change the state of the rtspsrc element to NULL for instance, resulting in a deadlock.
But isn't this while loop just running until all data was read from the connection? That is, isn't it stopping when currently no new data is available in the middle of a message... and would then later continue from the middle of the message if data becomes available again?
Yes it seems to try and complete a whole message. But if the rtsp server looses it's state and responds with invalid data that triggers an error on our side, and after that stops responding won't try and establish a new rtsp connection. If this makes any sense.
Yes if state is lost, things should reset. However you would now also make reading in the middle of a message interruptible if reading a single messages is split over multiple calls. Or not?
I don't think so because fill_raw_bytes() calls g_input_stream_read() and handles it "NULL" if conn->may_cancel is FALSE. This calls g_socket_input_stream_read() which calls g_socket_receive_with_blocking() with blocking=TRUE BUT with a NULL cancellable, which ultimately means the rtspsrc holds the task mutex until data arrives, which might be never.
Right, so in that case problems are even worse and we would block there potentially forever :) I guess the patch is ok then, just that there are far more problems.
commit 75f43da4c8f1531f714f17c1674dde21d09fc43d Author: Sergio Torres Soldado <torres.soldado@gmail.com> Date: Mon Jul 4 17:19:08 2016 +0100 rtspconnection: Fix potential deadlock caused by blocking read forever Reset the connection "may_cancel" property to avoid a potential deadlock if there is no data to read and the socket stays blocked forever. https://bugzilla.gnome.org/show_bug.cgi?id=768249
cool, thank you Sebastian :)