After an evaluation, GNOME has moved from Bugzilla to GitLab. Learn more about GitLab.
No new issues can be reported in GNOME Bugzilla anymore.
To report an issue in a GNOME project, go to GNOME GitLab.
Do not go to GNOME Gitlab for: Bluefish, Doxygen, GnuCash, GStreamer, java-gnome, LDTP, NetworkManager, Tomboy.
Bug 797251 - webrtcbin: might leak resources when changing to NULL
webrtcbin: might leak resources when changing to NULL
Status: RESOLVED FIXED
Product: GStreamer
Classification: Platform
Component: gst-plugins-bad
unspecified
Other Mac OS
: Normal normal
: 1.15.1
Assigned To: GStreamer Maintainers
GStreamer Maintainers
Depends on:
Blocks:
 
 
Reported: 2018-10-05 04:38 UTC by Aleix Conchillo Flaqué
Modified: 2018-10-08 17:41 UTC
See Also:
GNOME target: ---
GNOME version: ---


Attachments
avoid leaks (1.40 KB, patch)
2018-10-05 05:28 UTC, Aleix Conchillo Flaqué
none Details | Review
avoid leaks fix up 1 (1.44 KB, patch)
2018-10-05 07:25 UTC, Aleix Conchillo Flaqué
none Details | Review
start/stop thread when changing state (2.12 KB, patch)
2018-10-05 19:20 UTC, Aleix Conchillo Flaqué
committed Details | Review

Description Aleix Conchillo Flaqué 2018-10-05 04:38:22 UTC
If a pipeline that contains a webrtcbin element is set to PLAYING, the "on-negotiation-needed" is emitted and in the callback we call "create-offer" which enqueues an internal task (an idle source). Then, while the task is executing we set the pipeline to NULL. This might not allow the idle source to complete an all the resources associated to it are lost. The same might be true for any other webrtcbin tasks (e.g. set-remote-description).

I have an app that creates and destroys webrtcbin and I'm running an stress test to it. After a while, a bunch of resources (eventfd and network sockets as well) are leaked.

I don't really have a backtrace or anything I can share right now, but I have added a bunch of print statements to debug this and understand what's happening.\.

It might also be that there is something wrong in my application, something I'm still checking but reading at the code I think something weird is going on.

A bit more about my app. It's an app that might create multiple pipelines (in my test currently 2) and each of those have a webrtcbin. Then, we might set the pipelines to NULL at any point, unref them and create new ones, and the same over and over.
Comment 1 Aleix Conchillo Flaqué 2018-10-05 05:28:56 UTC
Created attachment 373849 [details] [review]
avoid leaks
Comment 2 Aleix Conchillo Flaqué 2018-10-05 05:29:51 UTC
Here's a proposal patch. Not sure if this is the way to fix it. But it seems to be working well so far. I'm going to leave my test overnight and see how it goes.
Comment 3 Matthew Waters (ystreet00) 2018-10-05 06:14:24 UTC
Review of attachment 373849 [details] [review]:

I'm not sure this is needed or what it changes for you?

Execution of tasks are already blocked on webtrtc->priv->is_closed which is set to TRUE when the peerconnection thread is shutdown on _dispose().  Can you check that gst_webrtc_bin_dispose() is being called for you?  If not, that indicates a leak of the element itself.

::: ext/webrtc/gstwebrtcbin.c
@@ +627,3 @@
   PC_LOCK (op->webrtc);
+  if (!op->webrtc->priv->running) {
+    GST_DEBUG_OBJECT (op->webrtc, "Element is not running, aborting execution");

Do the unit tests pass with this addition?

A lot of the media tests perform the entire test in the NULL state which seems like this would prevent.

@@ +3784,3 @@
     case GST_STATE_CHANGE_PAUSED_TO_READY:
       webrtc->priv->running = FALSE;
+      PC_LOCK (webrtc);

This lock should probably one line up protecting the running = FALSE.
Comment 4 Aleix Conchillo Flaqué 2018-10-05 07:18:52 UTC
(In reply to Matthew Waters (ystreet00) from comment #3)
> Review of attachment 373849 [details] [review] [review]:
> 
> I'm not sure this is needed or what it changes for you?
> 

So far, the patch fixes the leaks.

> Execution of tasks are already blocked on webtrtc->priv->is_closed which is
> set to TRUE when the peerconnection thread is shutdown on _dispose().  Can
> you check that gst_webrtc_bin_dispose() is being called for you?  If not,
> that indicates a leak of the element itself.
> 

The problem is that the idle sources (whatever is executed in their callbacks) still keep references to webrtcbin when the pipeline is being set to NULL and unrefed. So, the element is never disposed therefore the thread doesn't terminate and is_closed keeps being TRUE.

That is, nothing prevents the idle sources to keep executing when the element changes state to NULL. In my app I have a ref to the pipeline and a ref to the webrtcbin element. I set the pipeline to NULL and I unref it next. At this point I would expect GST_OBJECT_REFCOUNT_VALUE for the webrtcbin element to always be 1 (my copy should be the last one), but it's not.

> ::: ext/webrtc/gstwebrtcbin.c
> @@ +627,3 @@
>    PC_LOCK (op->webrtc);
> +  if (!op->webrtc->priv->running) {
> +    GST_DEBUG_OBJECT (op->webrtc, "Element is not running, aborting
> execution");
> 
> Do the unit tests pass with this addition?
> 

I'm trying now and just ran into a weird problem...

fatal: making test-suite.log: failed to create webrtcbin.trs
fatal: making test-suite.log: failed to create webrtcbin.log

First time I see this. Checking...

> A lot of the media tests perform the entire test in the NULL state which
> seems like this would prevent.
> 
> @@ +3784,3 @@
>      case GST_STATE_CHANGE_PAUSED_TO_READY:
>        webrtc->priv->running = FALSE;
> +      PC_LOCK (webrtc);
> 
> This lock should probably one line up protecting the running = FALSE.

Ooops. Fixed.
Comment 5 Aleix Conchillo Flaqué 2018-10-05 07:25:44 UTC
Created attachment 373850 [details] [review]
avoid leaks fix up 1
Comment 6 Aleix Conchillo Flaqué 2018-10-05 07:34:34 UTC
(In reply to Aleix Conchillo Flaqué from comment #4)

> That is, nothing prevents the idle sources to keep executing when the
> element changes state to NULL. In my app I have a ref to the pipeline and a
> ref to the webrtcbin element. I set the pipeline to NULL and I unref it
> next. At this point I would expect GST_OBJECT_REFCOUNT_VALUE for the
> webrtcbin element to always be 1 (my copy should be the last one), but it's
> not.
>

Also, my app has it's own main loop where I run the gstreamer pipelines. After setting the pipelines to NULL I stop my main loop. At this point I have no idea what happens with the webrtcbin idle sources. Since webrtcbin has its own main loop they should still be executing.

With own of my prints I have seen an idle source blocking here:

_create_transport_channel (...)
{
  ...
  gst_element_sync_state_with_parent (GST_ELEMENT (ret->send_bin));
  gst_element_sync_state_with_parent (GST_ELEMENT (ret->receive_bin));

My guess is that once the pipeline is set to NULL (and the idle source still running), this will just be blocked somewhere. But since it's an idle source and it's running on its own main context it will not affect the main application.
Comment 7 Matthew Waters (ystreet00) 2018-10-05 12:02:59 UTC
If this is a hang, a backtrace is most useful. A debug log would also be useful.

(In reply to Aleix Conchillo Flaqué from comment #6)
> (In reply to Aleix Conchillo Flaqué from comment #4)
> 
> > That is, nothing prevents the idle sources to keep executing when the
> > element changes state to NULL. In my app I have a ref to the pipeline and a
> > ref to the webrtcbin element. I set the pipeline to NULL and I unref it
> > next. At this point I would expect GST_OBJECT_REFCOUNT_VALUE for the
> > webrtcbin element to always be 1 (my copy should be the last one), but it's
> > not.

Nothing guarantees that your reference is the last one so you can't use that as a metric.

> Also, my app has it's own main loop where I run the gstreamer pipelines.
> After setting the pipelines to NULL I stop my main loop. At this point I
> have no idea what happens with the webrtcbin idle sources. Since webrtcbin
> has its own main loop they should still be executing.

Correct, they would execute normally.

> With own of my prints I have seen an idle source blocking here:
> 
> _create_transport_channel (...)
> {
>   ...
>   gst_element_sync_state_with_parent (GST_ELEMENT (ret->send_bin));
>   gst_element_sync_state_with_parent (GST_ELEMENT (ret->receive_bin));
> 
> My guess is that once the pipeline is set to NULL (and the idle source still
> running), this will just be blocked somewhere. But since it's an idle source
> and it's running on its own main context it will not affect the main
> application.

It should not be blocked though and a backtrace of all the threads and a debug log is definitely needed to figure out what else might be going on.
Comment 8 Aleix Conchillo Flaqué 2018-10-05 16:05:05 UTC
Btw, I forgot to thank you for this webrtc implementation. I think it's fantastic and very well done and clean. So, thanks for that!

(In reply to Matthew Waters (ystreet00) from comment #7)
> If this is a hang, a backtrace is most useful. A debug log would also be
> useful.
> 
> (In reply to Aleix Conchillo Flaqué from comment #6)
> > (In reply to Aleix Conchillo Flaqué from comment #4)
> > 
> > > That is, nothing prevents the idle sources to keep executing when the
> > > element changes state to NULL. In my app I have a ref to the pipeline and a
> > > ref to the webrtcbin element. I set the pipeline to NULL and I unref it
> > > next. At this point I would expect GST_OBJECT_REFCOUNT_VALUE for the
> > > webrtcbin element to always be 1 (my copy should be the last one), but it's
> > > not.
> 
> Nothing guarantees that your reference is the last one so you can't use that
> as a metric.
> 

I thought that setting a pipeline to NULL guarantees that since all elements should free their resources. Why would we want the tasks to keep executing after the pipeline is NULL?

> > Also, my app has it's own main loop where I run the gstreamer pipelines.
> > After setting the pipelines to NULL I stop my main loop. At this point I
> > have no idea what happens with the webrtcbin idle sources. Since webrtcbin
> > has its own main loop they should still be executing.
> 
> Correct, they would execute normally.
> 

Same as above. Why would we want this?

> > With own of my prints I have seen an idle source blocking here:
> > 
> > _create_transport_channel (...)
> > {
> >   ...
> >   gst_element_sync_state_with_parent (GST_ELEMENT (ret->send_bin));
> >   gst_element_sync_state_with_parent (GST_ELEMENT (ret->receive_bin));
> > 
> > My guess is that once the pipeline is set to NULL (and the idle source still
> > running), this will just be blocked somewhere. But since it's an idle source
> > and it's running on its own main context it will not affect the main
> > application.
> 
> It should not be blocked though and a backtrace of all the threads and a
> debug log is definitely needed to figure out what else might be going on.

Very good point! I wrote that before going to sleep and I didn't realize I could verify my guess. So, with my patch and after 10,000 iterations there are almost no more leaks, except for one case where my ref counter is 2 after setting the pipeline to NULL (I have an explanation below).

So, I went to gdb and below is the thread that is hanging. The main loop doesn't quit. In this case, I think the issue is that the idle source doesn't hold a reference to webrtcbin (op only stores a pointer) so if in my app I unref webrtcbin, then g_signal_emit will unref the last reference which means the idle source will be waiting for the main loop to finish, but it can't finish because the idle source is still running. Does that make any sense?

Also, considering my patch, in my app the webrtc refcount is 2 because _on_ice_candidate_task calls PC_UNBLOCK but then emits the signal. This means that the state will go to NULL while the signal is being emitted.

Thread 8 (Thread 0x7f474b5af700 (LWP 30664))

  • #0 syscall
    at ../sysdeps/unix/sysv/linux/x86_64/syscall.S line 38
  • #1 g_cond_wait
    at gthread-posix.c line 1402
  • #2 _stop_thread
    at gstwebrtcbin.c line 618
  • #3 gst_webrtc_bin_dispose
    at gstwebrtcbin.c line 3967
  • #4 g_object_unref
    at gobject.c line 3309
  • #5 g_value_unset
    at gvalue.c line 275
  • #6 g_signal_emit_valist
    at gsignal.c line 3421
  • #7 g_signal_emit
    at gsignal.c line 3447
  • #8 _on_ice_candidate_task
    at gstwebrtcbin.c line 3166
  • #9 _execute_op
    at gstwebrtcbin.c line 638
  • #10 g_main_dispatch
    at gmain.c line 3182
  • #11 g_main_context_dispatch
    at gmain.c line 3847
  • #12 g_main_context_iterate
    at gmain.c line 3920
  • #13 g_main_loop_run
    at gmain.c line 4116
  • #14 _gst_pc_thread
    at gstwebrtcbin.c line 585
  • #15 g_thread_proxy
    at gthread.c line 784
  • #16 start_thread
    at pthread_create.c line 333
  • #17 clone
    at ../sysdeps/unix/sysv/linux/x86_64/clone.S line 109

Comment 9 Julien Isorce 2018-10-05 17:35:13 UTC
Hi Matthew, yes thx for this great gstwebrtc, just curious why did you decide to put webrtcbin's thread creation/destruction in _init/_dispose instead of state_change_NULL_TO_READY/state_change_READY_TO_NULL, i.e. like glimagesink does for the gl thread ?
Comment 10 Aleix Conchillo Flaqué 2018-10-05 18:03:34 UTC
Without any patch, these are some of the threads that freeze. I believe the problem  with all of them (as I mentioned in comment 8) is that webrtcbin is trying to be disposed from inside _execute_op.

Thread 9 (Thread 0x7f8f2b7fe700 (LWP 28229))

  • #0 syscall
    at ../sysdeps/unix/sysv/linux/x86_64/syscall.S line 38
  • #1 g_cond_wait
    at gthread-posix.c line 1402
  • #2 _stop_thread
    at gstwebrtcbin.c line 618
  • #3 gst_webrtc_bin_dispose
    at gstwebrtcbin.c line 3961
  • #4 g_object_unref
    at gobject.c line 3309
  • #5 g_value_unset
    at gvalue.c line 275
  • #6 g_signal_emit_valist
    at gsignal.c line 3421
  • #7 g_signal_emit
    at gsignal.c line 3447
  • #8 _on_ice_candidate_task
    at gstwebrtcbin.c line 3162
  • #9 _execute_op
    at gstwebrtcbin.c line 634
  • #10 g_main_dispatch
    at gmain.c line 3182
  • #11 g_main_context_dispatch
    at gmain.c line 3847
  • #12 g_main_context_iterate
    at gmain.c line 3920
  • #13 g_main_loop_run
    at gmain.c line 4116
  • #14 _gst_pc_thread
    at gstwebrtcbin.c line 585
  • #15 g_thread_proxy
    at gthread.c line 784
  • #16 start_thread
    at pthread_create.c line 333
  • #17 clone
    at ../sysdeps/unix/sysv/linux/x86_64/clone.S line 109

Comment 11 Aleix Conchillo Flaqué 2018-10-05 18:05:14 UTC
Note that comment 10 has two types of traces, but bugzilla didn't separate them in the comment.
Comment 12 Aleix Conchillo Flaqué 2018-10-05 19:20:51 UTC
Created attachment 373855 [details] [review]
start/stop thread when changing state

This is a different patch following Julien's suggestion. I actually like it better than the initial one and it also allows the idle sources to finish. So far, now leaks either. Still need to try the UT.
Comment 13 Aleix Conchillo Flaqué 2018-10-05 19:41:51 UTC
Tests pass on patch from comment 12 but not from patch from comment 5 (marking that as obsolete).

-----

The problem with the tests is that I was using

  make check TESTS=webrtcbin

This give these errors:

fatal: making test-suite.log: failed to create webrtcbin.trs
fatal: making test-suite.log: failed to create webrtcbin.log
Makefile:3258: recipe for target 'test-suite.log' failed
make[4]: *** [test-suite.log] Error 1

But the same is true for another tests. For example "make check TESTS=srtp"

I'm not using meson, probably no one pays attention to autotools anymore.
Comment 14 Aleix Conchillo Flaqué 2018-10-05 21:01:42 UTC
(In reply to Aleix Conchillo Flaqué from comment #8)
 
> ... In this case, I think the issue is that the idle source
> doesn't hold a reference to webrtcbin (op only stores a pointer) so 
>

Just to clarify: the idle source shouldn't hold a reference to webrtcbin, otherwise we might end up in the deadlocks from the stack traces above.
Comment 15 Aleix Conchillo Flaqué 2018-10-08 01:13:24 UTC
This patch has been working without issues for more than 30,000 iterations starting and stopping webrtcbin pipelines, and keeps going. At this point I would say it looks pretty robust.

Also, note that currently in my test I'm only trying sending video form the webrtcbin. I will try the receiving side next. In my app, sending and receiving happens in different webrtcbin pipelines.
Comment 16 Matthew Waters (ystreet00) 2018-10-08 03:35:10 UTC
This one is better.  Tests pass :).

commit c4fe52395b21b54fd6ee6b9a5010737404889242
Author: Aleix Conchillo Flaqué <aleix@oblong.com>
Date:   Fri Oct 5 12:10:06 2018 -0700

    webrtcbin: start and stop thread when changing state
    
    It might be possible that if we set webrtcbin to the NULL state some
    tasks (idle sources) are still executed and they might even freeze. The freeze
    is caused because the webrtcbin tasks don't hold a reference to webrtcbin and
    if it's last unref inside the idle source itself this will not allow the main
    loop to finish because the main loop is waiting on the idle source to finish.
    
    We now start and stop webrtcbin thread when changing states. This will allow
    the idle sources to finish properly.
    
    https://bugzilla.gnome.org/show_bug.cgi?id=797251
Comment 17 Matthew Waters (ystreet00) 2018-10-08 03:37:44 UTC
(In reply to Aleix Conchillo Flaqué from comment #13)
> The problem with the tests is that I was using
> 
>   make check TESTS=webrtcbin
> 
> This give these errors:
> 
> fatal: making test-suite.log: failed to create webrtcbin.trs
> fatal: making test-suite.log: failed to create webrtcbin.log
> Makefile:3258: recipe for target 'test-suite.log' failed
> make[4]: *** [test-suite.log] Error 1
> 
> But the same is true for another tests. For example "make check TESTS=srtp"

run 'make help' in tests/check for a full list of options for running tests.

e.g. GST_CHECKS=test_audio_video make -C tests/check elements/webrtcbin.check
Comment 18 Aleix Conchillo Flaqué 2018-10-08 17:41:21 UTC
(In reply to Matthew Waters (ystreet00) from comment #17)
>
> run 'make help' in tests/check for a full list of options for running tests.
> 
> e.g. GST_CHECKS=test_audio_video make -C tests/check elements/webrtcbin.check

Thanks! As you can tell I haven't ran many tests on the past :-).