GNOME Bugzilla – Bug 764744
Crashes when multiple udpsrc are created for each client on a shared media, misses tracking and cleanup
Last modified: 2016-07-19 14:02:22 UTC
In our application we are seeing a couple of issues when using rtsp-server in TLS mode with a media-factory that is set as 'shared'. These issues can be reproduced using a modified version of the test-video example code. In our tests, we enable the WITH_AUTH and WITH_TLS flags and set the media factory to shared ( gst_rtsp_media_factory_set_shared(factory, TRUE); ). Issue 1: When streaming multiple rtsp streams from the same factory, occasionally a stream will connect but not be able to decode the SRTP packets. Steps to reproduce : - Run test-video (with TLS, Auth, and shared factory) - Connect a client using gst-launch-1.0 rtspsrc location=.... - Wait around 6 minutes. - Connect another client using gst-launch-1.0 --gst-debug=srtpdec:5 rtspsrc location=... . (Now there will be 2 clients connected to the same session) - Observe : The second client will connect with server and will not be able to decode the SRTP packets. srtpdec will post a warnings "Unable to unprotect buffer (unprotect failed code 7)" and "Error authentication packet, dropping". It seems as if there is some timeout, or mismanagement of the the key used between the SRTP elements. This behavior was observed for both gstreamer version 1.6.3 and 1.8.0. Issue 2: When streaming multiple rtsp streams from the same factory, the example program will crash when stopping all streams. Steps to reproduce : - Run test-video (with TLS, Auth, and shared factory) - Connect a client using gst-launch-1.0 rtspsrc location=.... - Connect another client using gst-launch-1.0 rtspsrc location=... . (Now there will be 2 clients connected to the same session) - Stop both clients. - Observe : The test-video program will abort with the error "g_mutex_clear() called on uninitialised or locked mutex" This behavior is only seen in gstreamer 1.8.0. I could not reproduce this in version 1.6.3.
Created attachment 325557 [details] Stack Trace for crash
Can you get a backtrace of all threads when it crashes, and the output printed to the terminal from that assertion?
Here is the console output. I'll post the backtrace once I get it. $ ./test-video stream ready at rtsps://127.0.0.1:8554/test (lt-test-video:6682): GStreamer-CRITICAL **: Trying to dispose element udpsrc23, but it is in PLAYING (locked) instead of the NULL state. You need to explicitly set elements to the NULL state before dropping the final reference, to allow them to clean up. This problem may also be caused by a refcounting bug in the application or some element. g_mutex_clear() called on uninitialised or locked mutex Aborted (core dumped)
Created attachment 325558 [details] Stack trace of all threads
Can you check why the udpsrc is not set to NULL state and destroyed in gst_rtsp_stream_leave_bin() in rtsp-stream.c? Is that function ever called for you?
Thanks for providing some direction. I've been debugging the gst_rtsp_stream_leave_bin() function. It is being called, and I have verified that it is actually setting the elements to the NULL state. It looks like it may be a udpsrc not managed within rtsp-stream. In my debug test, I can see: - udpsrc4 --> STATE_NULL - udpsrc5 --> STATE_NULL - udpsrc6 --> STATE_NULL - udpsrc7 --> STATE_NULL Then, after the leave_bin function has been called I get the crash which provides the error that udpsrc3 was not in the NULL state when unreffing. Any idea where this other udpsrc is managed?
It should be managed exactly there, that's already a useful information :) Can you get a GST_DEBUG=6 debug log, and for the exactly same run the exact text of the assertion (i.e. what the name of the problematic udpsrc is)?
Created attachment 325791 [details] gst logs Here are the gstreamer logs for the issue. The Gstreamer-Critical is contained in the logs, and in this case "udpsrc3" was the problem element.
Sebastian, I think I may have narrowed in on the problem. When connecting the client to the server, the gst_rtsp_stream_allocate_udp_sockets() function is called which in our case allocates two udpsrc elements in stream->priv->udpsrc_v4[0] and stream->priv->udpsrc_v4[1]. When another client connects to the server and we're dealing with a media configured to be shared, gst_rtsp_stream_allocate_udp_sockets() is called with the same stream. Thus, two new udpsrc elements are created and the pointers stored in stream->priv->udpsrc_v4[0] and stream->priv->udpsrc_v4[1]. The previous pointers were overwritten and now it looks like we have leaked the old udpsrc_v4[0] and the old udpsrc_v4[1] is now managed/owned by a bin element. Furthermore, it looks like udpsrc_v4[1] is configured to be locked with the function gst_element_set_locked_state(), so now when the bin is set to NULL state this element will remain in its PLAYING state which eventually becomes a problem when unreffing the bin.
That's probably related to bug #757488 then. Even if it was not caused by that, that commit refactored all the relevant code :) Not sure how this is supposed to work: for shared media you use the same multiudpsink, but probably want to have new udpsrcs and funnel them into the single rtpbin. I don't think this did the right thing before.
Yea, it seems like it was incorrect before as well. I'm able to get around the crash by removing line rtsp-stream.c:1120 (gst_element_set_locked_state()). I would expect this to have some repercussions, but the streams have been playing fine. Do you know why the element is set to locked? I think the correct solution would be to have rtsp-stream maintain a handle to all of the udpsrcs created for the shared media and have them properly cleaned up when a client disconnects. In the current implementation, it is possible to have a lot of udpsrcs needlessly hanging around if a shared media has been open for a long time with multiple clients connecting and disconnecting. It looks like the sending side of the pipeline is cleaned up in the update_transport() function where the "remove" signal is called on the multiudpsink. Should we not clean up the udpsrcs here too? I have confirmed this crash is not related to the TLS configuration. Looking back at my first comment, I originally thought the two issues were related but that appears to be incorrect. Should I create another bug for the SRTP issue?
Yes, please open a new bug for the SRTP issue. And you're right, it should keep around all udpsrc of all clients and remove them as needed when clients disconnect. All udpsrc should go to the funnel. Do you want to try to write a patch for this?
Sure, I'll work on the patch.
Great, let me know if you need some help :) Thanks!
Created attachment 326051 [details] [review] initial patch Sebastian, here is an initial patch with the changes. I've provided some notes below and there are a couple sections I would like your input on. Notes: -- I've assumed that there can exist multiple unicast udpsrcs but only a single set of multicast udpsrcs (one for IPV4 and one for IPV6). - The multicast udpsrcs remain the same. - The unicast udpsrcs are now maintained in a hash table that uses the RTSPTransport as a key. It didn't look like there was any special logic required for handling IPV4 v. IPV6 except for the creation of the sockets, so the hash table can contain both types. - When unicast sockets are allocated, the udpsrcs are stored in the hash table. See the gst_rtsp_stream_allocate_udp_sockets() function. - When a client disconnects and the connection is unicast, we now cleanup the udpsrcs related to the RTSPTransport. See update_transport() function. - In the gst_rtsp_stream_leave_bin() function, all udpsrcs are now cleaned up. -- I've made some fundamental changes to the gst_rtsp_stream_allocate_udp_sockets() function that I think are correct, but you should verify. They include: - IPV6 unicast sockets were only allowed to be allocated one time, in the same manner as multicast. This is different for IPV4 unicast. I assumed that we wanted to allow multiple IPV6 unicast connections, so I removed the 'have_ipv6' check that would skip allocation. - I changed the logic for calculating the return 'result'. Before, it performed 'result = have_ipv4 || have_ipv4_mcast || have_ipv6 || have_ipv6_mcast'. This would return TRUE if a client was trying to create an IPV4 unicast connection and it failed to allocate the unicast udpsrcs, but a client had previously connected with IPV6 multicast successfully. -- There are two sections of the code where the code has been commented out with '#ifdef NOTHING'. Please review these portions. I think they are no longer needed since we now postpone the creation of udpsrc elements to after the join_bin function, but there may be a scenario I am missing. Please review the patch and make recommendations. I'll make changes as needed.
(In reply to Jake Foytik from comment #15) > -- I've assumed that there can exist multiple unicast udpsrcs but only a > single set of multicast udpsrcs (one for IPV4 and one for IPV6). > - The multicast udpsrcs remain the same. Correct > - The unicast udpsrcs are now maintained in a hash table that uses the > RTSPTransport as a key. It didn't look like there was any special logic > required for handling IPV4 v. IPV6 except for the creation of the sockets, > so the hash table can contain both types. No special handling, correct > - When unicast sockets are allocated, the udpsrcs are stored in the hash > table. See the gst_rtsp_stream_allocate_udp_sockets() function. > - When a client disconnects and the connection is unicast, we now cleanup > the udpsrcs related to the RTSPTransport. See update_transport() function. > - In the gst_rtsp_stream_leave_bin() function, all udpsrcs are now > cleaned up. Makes sense > -- I've made some fundamental changes to the > gst_rtsp_stream_allocate_udp_sockets() function that I think are correct, > but you should verify. They include: > - IPV6 unicast sockets were only allowed to be allocated one time, in > the same manner as multicast. This is different for IPV4 unicast. I assumed > that we wanted to allow multiple IPV6 unicast connections, so I removed the > 'have_ipv6' check that would skip allocation. Makes sense and is probably just an overlook. IPv4/6 should behave exactly the same. > - I changed the logic for calculating the return 'result'. Before, it > performed 'result = have_ipv4 || have_ipv4_mcast || have_ipv6 || > have_ipv6_mcast'. This would return TRUE if a client was trying to create an > IPV4 unicast connection and it failed to allocate the unicast udpsrcs, but a > client had previously connected with IPV6 multicast successfully. That part looked weird to me last time I looked too, so that fix is probably fine :) > Please review the patch and make recommendations. I'll make changes as > needed. A unit test would be nice to have
Review of attachment 326051 [details] [review]: For a unit test, adding/removing clients and checking if things still work afterwards would be a good test I guess... also checking if all but the first client properly work Check the existing tests for ideas :) ::: gst/rtsp-server/rtsp-stream.c @@ +264,3 @@ priv->ptmap = g_hash_table_new_full (NULL, NULL, NULL, (GDestroyNotify) gst_caps_unref); + priv->udpsrcs = g_hash_table_new (g_direct_hash, g_direct_equal); You probably want to create the hash table with a free function, so that you don't have to manually free the items when they get removed @@ +1467,3 @@ + g_slice_new0 (GstRTSPStreamUDPSrcs); + transport_udpsrcs->udpsrc[0] = NULL; + transport_udpsrcs->udpsrc[1] = NULL; They are already NULL because you allocate them with new0() @@ +1480,3 @@ + /* TODO: Seems like this should be the same as the IPV4 unicast case, so + * I've commented out this check that prevented multiple IPV6 unicast udpsrcs. + * This could be incorrect. */ It seems correct, they should behave the same @@ +1494,3 @@ + * Otherwise, add it to the hash table */ + if (transport_udpsrcs->udpsrc[0] == NULL + && transport_udpsrcs->udpsrc[1] == NULL) I think udpsrc[1] could be NULL if RTCP is disabled, or not? @@ +2517,3 @@ + * postponed. + * See https://bugzilla.gnome.org/show_bug.cgi?id=757488 */ +#ifdef NOTHING I think you're right, also for the other one below. This code can disappear, it's now done in play_udpsources_one_family() @@ +3305,3 @@ +/* Properly dispose udpsrcs that were created for a given transport. */ +/* Must be called with a lock. */ Which lock? :) Better to be explicit here in the comments
*** Bug 765132 has been marked as a duplicate of this bug. ***
Created attachment 326247 [details] [review] Updated patch with recommended changes This is the original patch with the following changes: - Added hash table free function. - Removed unneeded initialization of udpsrcs. - Removed commented code. - Better comments. This addresses all of the suggestions in the review except for: > I think udpsrc[1] could be NULL if RTCP is disabled, or not? That is fine if RTCP is disabled. No udpsrcs will be created and an entry will not be stored in the hash table.
Created attachment 326288 [details] [review] Unit tests Here is a separate patch for the changes to the Unit tests. I can squash this commit with the other changes if you would prefer. Since I changed the return logic of gst_rtsp_stream_allocate_udp_sockets(), two of the unit tests were now failing. The problem seems to be with allocating sockets for Multicast IPV6. I think this issue was hidden in there before. The problematic sections of the test are commented with '#ifdef ENABLE_FAILING_TESTS'. I also added a test to rtsp-server to add and remove clients as suggested.
Review of attachment 326247 [details] [review]: Generally looks good, thanks! Just some minor things How does the mcast ipv6 test fail exactly? ::: gst/rtsp-server/rtsp-stream.c @@ +314,3 @@ + /* We expect all udpsrcs to be cleaned up by this point. */ + if (g_hash_table_size (priv->udpsrcs) > 0) + GST_ERROR ("Unreffing udpsrcs hash table that contains elements."); This should probably be more like a g_critical() :) @@ +1492,3 @@ + if (transport_udpsrcs->udpsrc[0] == NULL + && transport_udpsrcs->udpsrc[1] == NULL) + g_slice_free (GstRTSPStreamUDPSrcs, transport_udpsrcs); The problem here is that you free the transport if udpsrc[1] == NULL. No matter if udpsrc[0] is NULL or not. Doesn't this fail for streams without RTCP now because you will free the data and not keep track of it? @@ +2704,3 @@ } +/* Must be called with a lock. */ Which lock? :)
Review of attachment 326288 [details] [review]: ::: tests/check/gst/stream.c @@ +141,3 @@ pool = gst_rtsp_address_pool_new (); fail_unless (gst_rtsp_address_pool_add_range (pool, "192.168.1.1", + "192.168.1.1", 6000, 6001, 0)); I assume this will also fail if you don't have an interface in the network when binding the socket? @@ +325,3 @@ +/* Having an issue with g_socket_bind() for the IPV6 multicast address. With the + * return logic in gst_rtsp_stream_allocate_udp_sockets(), this portion now + * fails. */ Is it https://bugzilla.gnome.org/show_bug.cgi?id=764679 ?
> @@ +1492,3 @@ > + if (transport_udpsrcs->udpsrc[0] == NULL > + && transport_udpsrcs->udpsrc[1] == NULL) > + g_slice_free (GstRTSPStreamUDPSrcs, transport_udpsrcs); > > The problem here is that you free the transport if udpsrc[1] == NULL. No matter > if udpsrc[0] is NULL or not. Doesn't this fail for streams without RTCP now > because you will free the data and not keep track of it? The transport_udpsrcs struct will only be freed if both udpsrc[0] AND udpsrc[1] are NULL. If they are both NULL, there is no need to add an empty struct to the hash table. If only one udpsrc is NULL, then the struct is added to the hash table. This can be changed to the following if you think it makes it clearer: if (transport_udpsrcs->udpsrc[0] != NULL || transport_udpsrcs->udpsrc[1] != NULL) g_hash_table_insert (priv->udpsrcs, ct, transport_udpsrcs); else g_slice_free (GstRTSPStreamUDPSrcs, transport_udpsrcs); I'll create an updated path to include the recommended changes.
Created attachment 326483 [details] [review] Updated patch Made the recommended changes and combined the changes into a single patch.
Review of attachment 326483 [details] [review]: Generally looks good now, we just need to solve the unit test failure :) ::: tests/check/gst/stream.c @@ +326,3 @@ + * return logic in gst_rtsp_stream_allocate_udp_sockets(), this portion now + * fails. */ +#ifdef ENABLE_FAILING_TESTS Are these here now the same as https://bugzilla.gnome.org/show_bug.cgi?id=764679 ?
> Are these here now the same as https://bugzilla.gnome.org/show_bug.cgi?id=764679 ? It seems like it is the same. I can get past the unit test errors if I bind to ANY for IPV6 multicast (as it is currently done for Windows only). See rtsp-stream.c:1294.
Thanks, I'll try to get all the patches in together then. This one here is definitely good to go, we only have to find a proper solution for the UDP problem in the other bug first :)
Comment on attachment 326483 [details] [review] Updated patch Can you attach a new patch that a) doesn't have the changes in "common", and b) keeps the failing tests enabled? Thanks :)
Created attachment 326687 [details] [review] Final Patch Removed the changes in common and removed the #ifdef statements around failing tests.
commit fe5f8077c1523206147c746cc40364ea16da669f Author: Jake Foytik <jake.foytik@ipconfigure.com> Date: Mon Apr 25 08:55:25 2016 -0400 rtsp-stream: Fix crash on cleanup with shared media and multiple udpsrc - Unicast udpsrcs are now managed in a hash table. This allows for proper cleanup in with shared streams and fixes a memory leak. - Unicast udpsrcs are now properly cleaned up when shared connections exit. See the update_transport() function. - Create unit test for shared media. https://bugzilla.gnome.org/show_bug.cgi?id=764744
Actually I think this is the wrong fix, we shouldn't have more than one udpsrc in the first place. That's another regression introduced by: commit f62a9a7eb9fda096fb88c013d6787c28090037a2 Author: Patricia Muscalu <patricia@axis.com> Date: Tue Feb 23 14:59:32 2016 +0100 rtsp-stream: postpone UDP socket allocation until SETUP Postpone the allocation of the UDP sockets until we know what transport has been chosen by the client. Both unicast and multicast UDP sources are created in one function. https://bugzilla.gnome.org/show_bug.cgi?id=757488
I'm working in a proper fix as part of bug #766612