GNOME Bugzilla – Bug 797251
webrtcbin: might leak resources when changing to NULL
Last modified: 2018-10-08 17:41:21 UTC
If a pipeline that contains a webrtcbin element is set to PLAYING, the "on-negotiation-needed" is emitted and in the callback we call "create-offer" which enqueues an internal task (an idle source). Then, while the task is executing we set the pipeline to NULL. This might not allow the idle source to complete an all the resources associated to it are lost. The same might be true for any other webrtcbin tasks (e.g. set-remote-description). I have an app that creates and destroys webrtcbin and I'm running an stress test to it. After a while, a bunch of resources (eventfd and network sockets as well) are leaked. I don't really have a backtrace or anything I can share right now, but I have added a bunch of print statements to debug this and understand what's happening.\. It might also be that there is something wrong in my application, something I'm still checking but reading at the code I think something weird is going on. A bit more about my app. It's an app that might create multiple pipelines (in my test currently 2) and each of those have a webrtcbin. Then, we might set the pipelines to NULL at any point, unref them and create new ones, and the same over and over.
Created attachment 373849 [details] [review] avoid leaks
Here's a proposal patch. Not sure if this is the way to fix it. But it seems to be working well so far. I'm going to leave my test overnight and see how it goes.
Review of attachment 373849 [details] [review]: I'm not sure this is needed or what it changes for you? Execution of tasks are already blocked on webtrtc->priv->is_closed which is set to TRUE when the peerconnection thread is shutdown on _dispose(). Can you check that gst_webrtc_bin_dispose() is being called for you? If not, that indicates a leak of the element itself. ::: ext/webrtc/gstwebrtcbin.c @@ +627,3 @@ PC_LOCK (op->webrtc); + if (!op->webrtc->priv->running) { + GST_DEBUG_OBJECT (op->webrtc, "Element is not running, aborting execution"); Do the unit tests pass with this addition? A lot of the media tests perform the entire test in the NULL state which seems like this would prevent. @@ +3784,3 @@ case GST_STATE_CHANGE_PAUSED_TO_READY: webrtc->priv->running = FALSE; + PC_LOCK (webrtc); This lock should probably one line up protecting the running = FALSE.
(In reply to Matthew Waters (ystreet00) from comment #3) > Review of attachment 373849 [details] [review] [review]: > > I'm not sure this is needed or what it changes for you? > So far, the patch fixes the leaks. > Execution of tasks are already blocked on webtrtc->priv->is_closed which is > set to TRUE when the peerconnection thread is shutdown on _dispose(). Can > you check that gst_webrtc_bin_dispose() is being called for you? If not, > that indicates a leak of the element itself. > The problem is that the idle sources (whatever is executed in their callbacks) still keep references to webrtcbin when the pipeline is being set to NULL and unrefed. So, the element is never disposed therefore the thread doesn't terminate and is_closed keeps being TRUE. That is, nothing prevents the idle sources to keep executing when the element changes state to NULL. In my app I have a ref to the pipeline and a ref to the webrtcbin element. I set the pipeline to NULL and I unref it next. At this point I would expect GST_OBJECT_REFCOUNT_VALUE for the webrtcbin element to always be 1 (my copy should be the last one), but it's not. > ::: ext/webrtc/gstwebrtcbin.c > @@ +627,3 @@ > PC_LOCK (op->webrtc); > + if (!op->webrtc->priv->running) { > + GST_DEBUG_OBJECT (op->webrtc, "Element is not running, aborting > execution"); > > Do the unit tests pass with this addition? > I'm trying now and just ran into a weird problem... fatal: making test-suite.log: failed to create webrtcbin.trs fatal: making test-suite.log: failed to create webrtcbin.log First time I see this. Checking... > A lot of the media tests perform the entire test in the NULL state which > seems like this would prevent. > > @@ +3784,3 @@ > case GST_STATE_CHANGE_PAUSED_TO_READY: > webrtc->priv->running = FALSE; > + PC_LOCK (webrtc); > > This lock should probably one line up protecting the running = FALSE. Ooops. Fixed.
Created attachment 373850 [details] [review] avoid leaks fix up 1
(In reply to Aleix Conchillo Flaqué from comment #4) > That is, nothing prevents the idle sources to keep executing when the > element changes state to NULL. In my app I have a ref to the pipeline and a > ref to the webrtcbin element. I set the pipeline to NULL and I unref it > next. At this point I would expect GST_OBJECT_REFCOUNT_VALUE for the > webrtcbin element to always be 1 (my copy should be the last one), but it's > not. > Also, my app has it's own main loop where I run the gstreamer pipelines. After setting the pipelines to NULL I stop my main loop. At this point I have no idea what happens with the webrtcbin idle sources. Since webrtcbin has its own main loop they should still be executing. With own of my prints I have seen an idle source blocking here: _create_transport_channel (...) { ... gst_element_sync_state_with_parent (GST_ELEMENT (ret->send_bin)); gst_element_sync_state_with_parent (GST_ELEMENT (ret->receive_bin)); My guess is that once the pipeline is set to NULL (and the idle source still running), this will just be blocked somewhere. But since it's an idle source and it's running on its own main context it will not affect the main application.
If this is a hang, a backtrace is most useful. A debug log would also be useful. (In reply to Aleix Conchillo Flaqué from comment #6) > (In reply to Aleix Conchillo Flaqué from comment #4) > > > That is, nothing prevents the idle sources to keep executing when the > > element changes state to NULL. In my app I have a ref to the pipeline and a > > ref to the webrtcbin element. I set the pipeline to NULL and I unref it > > next. At this point I would expect GST_OBJECT_REFCOUNT_VALUE for the > > webrtcbin element to always be 1 (my copy should be the last one), but it's > > not. Nothing guarantees that your reference is the last one so you can't use that as a metric. > Also, my app has it's own main loop where I run the gstreamer pipelines. > After setting the pipelines to NULL I stop my main loop. At this point I > have no idea what happens with the webrtcbin idle sources. Since webrtcbin > has its own main loop they should still be executing. Correct, they would execute normally. > With own of my prints I have seen an idle source blocking here: > > _create_transport_channel (...) > { > ... > gst_element_sync_state_with_parent (GST_ELEMENT (ret->send_bin)); > gst_element_sync_state_with_parent (GST_ELEMENT (ret->receive_bin)); > > My guess is that once the pipeline is set to NULL (and the idle source still > running), this will just be blocked somewhere. But since it's an idle source > and it's running on its own main context it will not affect the main > application. It should not be blocked though and a backtrace of all the threads and a debug log is definitely needed to figure out what else might be going on.
Btw, I forgot to thank you for this webrtc implementation. I think it's fantastic and very well done and clean. So, thanks for that! (In reply to Matthew Waters (ystreet00) from comment #7) > If this is a hang, a backtrace is most useful. A debug log would also be > useful. > > (In reply to Aleix Conchillo Flaqué from comment #6) > > (In reply to Aleix Conchillo Flaqué from comment #4) > > > > > That is, nothing prevents the idle sources to keep executing when the > > > element changes state to NULL. In my app I have a ref to the pipeline and a > > > ref to the webrtcbin element. I set the pipeline to NULL and I unref it > > > next. At this point I would expect GST_OBJECT_REFCOUNT_VALUE for the > > > webrtcbin element to always be 1 (my copy should be the last one), but it's > > > not. > > Nothing guarantees that your reference is the last one so you can't use that > as a metric. > I thought that setting a pipeline to NULL guarantees that since all elements should free their resources. Why would we want the tasks to keep executing after the pipeline is NULL? > > Also, my app has it's own main loop where I run the gstreamer pipelines. > > After setting the pipelines to NULL I stop my main loop. At this point I > > have no idea what happens with the webrtcbin idle sources. Since webrtcbin > > has its own main loop they should still be executing. > > Correct, they would execute normally. > Same as above. Why would we want this? > > With own of my prints I have seen an idle source blocking here: > > > > _create_transport_channel (...) > > { > > ... > > gst_element_sync_state_with_parent (GST_ELEMENT (ret->send_bin)); > > gst_element_sync_state_with_parent (GST_ELEMENT (ret->receive_bin)); > > > > My guess is that once the pipeline is set to NULL (and the idle source still > > running), this will just be blocked somewhere. But since it's an idle source > > and it's running on its own main context it will not affect the main > > application. > > It should not be blocked though and a backtrace of all the threads and a > debug log is definitely needed to figure out what else might be going on. Very good point! I wrote that before going to sleep and I didn't realize I could verify my guess. So, with my patch and after 10,000 iterations there are almost no more leaks, except for one case where my ref counter is 2 after setting the pipeline to NULL (I have an explanation below). So, I went to gdb and below is the thread that is hanging. The main loop doesn't quit. In this case, I think the issue is that the idle source doesn't hold a reference to webrtcbin (op only stores a pointer) so if in my app I unref webrtcbin, then g_signal_emit will unref the last reference which means the idle source will be waiting for the main loop to finish, but it can't finish because the idle source is still running. Does that make any sense? Also, considering my patch, in my app the webrtc refcount is 2 because _on_ice_candidate_task calls PC_UNBLOCK but then emits the signal. This means that the state will go to NULL while the signal is being emitted.
+ Trace 238694
Thread 8 (Thread 0x7f474b5af700 (LWP 30664))
Hi Matthew, yes thx for this great gstwebrtc, just curious why did you decide to put webrtcbin's thread creation/destruction in _init/_dispose instead of state_change_NULL_TO_READY/state_change_READY_TO_NULL, i.e. like glimagesink does for the gl thread ?
Without any patch, these are some of the threads that freeze. I believe the problem with all of them (as I mentioned in comment 8) is that webrtcbin is trying to be disposed from inside _execute_op.
+ Trace 238695
Thread 9 (Thread 0x7f8f2b7fe700 (LWP 28229))
Note that comment 10 has two types of traces, but bugzilla didn't separate them in the comment.
Created attachment 373855 [details] [review] start/stop thread when changing state This is a different patch following Julien's suggestion. I actually like it better than the initial one and it also allows the idle sources to finish. So far, now leaks either. Still need to try the UT.
Tests pass on patch from comment 12 but not from patch from comment 5 (marking that as obsolete). ----- The problem with the tests is that I was using make check TESTS=webrtcbin This give these errors: fatal: making test-suite.log: failed to create webrtcbin.trs fatal: making test-suite.log: failed to create webrtcbin.log Makefile:3258: recipe for target 'test-suite.log' failed make[4]: *** [test-suite.log] Error 1 But the same is true for another tests. For example "make check TESTS=srtp" I'm not using meson, probably no one pays attention to autotools anymore.
(In reply to Aleix Conchillo Flaqué from comment #8) > ... In this case, I think the issue is that the idle source > doesn't hold a reference to webrtcbin (op only stores a pointer) so > Just to clarify: the idle source shouldn't hold a reference to webrtcbin, otherwise we might end up in the deadlocks from the stack traces above.
This patch has been working without issues for more than 30,000 iterations starting and stopping webrtcbin pipelines, and keeps going. At this point I would say it looks pretty robust. Also, note that currently in my test I'm only trying sending video form the webrtcbin. I will try the receiving side next. In my app, sending and receiving happens in different webrtcbin pipelines.
This one is better. Tests pass :). commit c4fe52395b21b54fd6ee6b9a5010737404889242 Author: Aleix Conchillo Flaqué <aleix@oblong.com> Date: Fri Oct 5 12:10:06 2018 -0700 webrtcbin: start and stop thread when changing state It might be possible that if we set webrtcbin to the NULL state some tasks (idle sources) are still executed and they might even freeze. The freeze is caused because the webrtcbin tasks don't hold a reference to webrtcbin and if it's last unref inside the idle source itself this will not allow the main loop to finish because the main loop is waiting on the idle source to finish. We now start and stop webrtcbin thread when changing states. This will allow the idle sources to finish properly. https://bugzilla.gnome.org/show_bug.cgi?id=797251
(In reply to Aleix Conchillo Flaqué from comment #13) > The problem with the tests is that I was using > > make check TESTS=webrtcbin > > This give these errors: > > fatal: making test-suite.log: failed to create webrtcbin.trs > fatal: making test-suite.log: failed to create webrtcbin.log > Makefile:3258: recipe for target 'test-suite.log' failed > make[4]: *** [test-suite.log] Error 1 > > But the same is true for another tests. For example "make check TESTS=srtp" run 'make help' in tests/check for a full list of options for running tests. e.g. GST_CHECKS=test_audio_video make -C tests/check elements/webrtcbin.check
(In reply to Matthew Waters (ystreet00) from comment #17) > > run 'make help' in tests/check for a full list of options for running tests. > > e.g. GST_CHECKS=test_audio_video make -C tests/check elements/webrtcbin.check Thanks! As you can tell I haven't ran many tests on the past :-).