Bug 797203 - shmsrc: race condition when stopping
Status: RESOLVED FIXED
Product: GStreamer
Classification: Platform
Component: gst-plugins-bad
Version: 1.14.x
OS: Other Linux
Priority: Normal
Severity: normal
Target Milestone: 1.15.1
Assigned To: GStreamer Maintainers
QA Contact: GStreamer Maintainers
Depends on:
Blocks:
Reported: 2018-09-25 21:50 UTC by Aleix Conchillo Flaqué
Modified: 2018-10-10 17:51 UTC
See Also:
GNOME target: ---
GNOME version: ---


Attachments
fix race condition (3.92 KB, patch)
2018-09-25 21:56 UTC, Aleix Conchillo Flaqué
none
fix race condition cleanup (5.01 KB, patch)
2018-09-26 04:29 UTC, Aleix Conchillo Flaqué
none
Add unit test to reproduce the issue (2.86 KB, patch)
2018-10-09 15:17 UTC, Josep Torra Valles
committed
Fixes the race condition with respect to the pipe access (5.56 KB, patch)
2018-10-09 15:19 UTC, Josep Torra Valles
committed
Fixes the race condition with respect to the pollfd access (1.69 KB, patch)
2018-10-09 15:20 UTC, Josep Torra Valles
committed

Description Aleix Conchillo Flaqué 2018-09-25 21:50:59 UTC
There seems to be a race condition when stopping a shmsrc element. It is possible for "self->pipe" to be set to NULL (by gst_shm_src_stop_reading, called when setting the pipeline to NULL) while gst_shm_src_create is still using it in the streaming thread.
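
For illustration, the racy pattern looks roughly like this (a simplified sketch, not the verbatim sys/shm/gstshmsrc.c code):

/* State-change thread: runs when the pipeline is set to NULL. */
static void
gst_shm_src_stop_reading (GstShmSrc * self)
{
  gst_shm_pipe_dec (self->pipe);
  self->pipe = NULL;            /* streaming thread may still be reading this */
}

/* Streaming thread: can run concurrently with the above. */
static GstFlowReturn
gst_shm_src_create (GstPushSrc * psrc, GstBuffer ** outbuf)
{
  GstShmSrc *self = GST_SHM_SRC (psrc);
  struct GstShmBuffer *gsb = g_slice_new0 (struct GstShmBuffer);

  gsb->pipe = self->pipe;       /* can observe NULL here: the race */
  /* ... wrap the shm data in *outbuf and return ... */
  return GST_FLOW_OK;
}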
Comment 1 Aleix Conchillo Flaqué 2018-09-25 21:56:10 UTC
Created attachment 373760 [details] [review]
fix race condition
Comment 2 Olivier Crête 2018-09-25 22:20:06 UTC
Review of attachment 373760 [details] [review]:

::: sys/shm/gstshmsrc.c
@@ +278,2 @@
     gst_shm_pipe_dec (self->pipe);
+    GST_OBJECT_LOCK (self);

Are you sure this unlock+relock is not racy?

@@ +382,3 @@
   gsb = g_slice_new0 (struct GstShmBuffer);
   gsb->buf = buf;
   gsb->pipe = self->pipe;

There is a problem here: if self->pipe can be NULL here, we'd get an invalid GstShmBuffer. My guess is that it shouldn't UNLOCK at the end of the while loop if buf != NULL... but in that case it's super ugly.

I think it should be something like:

LOCK ();
do {
  UNLOCK ();
  poll_wait (...);
  LOCK ();

  if (test for error condition 1)
    goto error_condition_1;

} while (...);

...
gsb->pipe = self->pipe;
UNLOCK ();

return GST_FLOW_OK;

error_condition_1:
  UNLOCK ();
  GST_ELEMENT_ERROR (...);
  return GST_FLOW_...;
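
For concreteness, a fleshed-out (untested) rendering of that pattern could look like the following; the labels and error conditions are placeholders, not the actual gstshmsrc.c logic:

GST_OBJECT_LOCK (self);
do {
  GST_OBJECT_UNLOCK (self);
  rv = gst_poll_wait (self->poll, GST_CLOCK_TIME_NONE);
  GST_OBJECT_LOCK (self);

  if (self->unlocked)
    goto flushing;
  if (rv < 0)
    goto poll_error;
} while (buf == NULL);

gsb->pipe = self->pipe;
GST_OBJECT_UNLOCK (self);

return GST_FLOW_OK;

flushing:
  GST_OBJECT_UNLOCK (self);
  return GST_FLOW_FLUSHING;

poll_error:
  GST_OBJECT_UNLOCK (self);
  GST_ELEMENT_ERROR (self, RESOURCE, READ, (NULL),
      ("gst_poll_wait failed"));
  return GST_FLOW_ERROR;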

@@ +431,2 @@
   self->unlocked = TRUE;
+  GST_OBJECT_UNLOCK (self);

You should probably hold the lock across the set_flushing call, for consistency.

@@ +445,2 @@
   self->unlocked = FALSE;
+  GST_OBJECT_UNLOCK (self);

You should probably hold the lock across the set_flushing call here too, for consistency.
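
In other words, something like this (a sketch; it assumes the element keeps its GstPoll in self->poll, as the hunks above suggest):

static gboolean
gst_shm_src_unlock (GstBaseSrc * bsrc)
{
  GstShmSrc *self = GST_SHM_SRC (bsrc);

  GST_OBJECT_LOCK (self);
  self->unlocked = TRUE;
  /* Keep the object lock held across the set_flushing call. */
  gst_poll_set_flushing (self->poll, TRUE);
  GST_OBJECT_UNLOCK (self);

  return TRUE;
}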
Comment 3 Aleix Conchillo Flaqué 2018-09-26 04:29:37 UTC
Created attachment 373765 [details] [review]
fix race condition cleanup
Comment 4 Aleix Conchillo Flaqué 2018-09-26 04:41:04 UTC
(In reply to Olivier Crête from comment #2)

Thanks for the comments. I've started doing some GStreamer again lately; I'm rusty! :-)

> Review of attachment 373760 [details] [review]:
> 
> ::: sys/shm/gstshmsrc.c
> @@ +278,2 @@
>      gst_shm_pipe_dec (self->pipe);
> +    GST_OBJECT_LOCK (self);
> 
> Are you sure this unlock+relock is not racy?
> 

Yes, it looks racy, but I think it should be fine since self->pipe is only NULL when the element is initialized and is only set to NULL in this function. So I don't think there's a way to shoot yourself in the foot. I think...?

Fixed all the other comments.

I was talking to Josep Torra about this. He might have something else to say; let's wait until he jumps in.
Comment 5 Aleix Conchillo Flaqué 2018-09-26 20:06:57 UTC
Just an important note: this happens when shmsrc has is-live=TRUE.
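
For reference, is-live is a plain GObject property, so a reproduction enables it with something like the following (the socket path here is illustrative):

GstElement *src;

src = gst_element_factory_make ("shmsrc", NULL);
g_object_set (src, "socket-path", "/tmp/shm-unit-test", "is-live", TRUE, NULL);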
Comment 6 Josep Torra Valles 2018-09-28 12:18:10 UTC
I've written a test.

https://cgit.freedesktop.org/~adn770/gst-plugins-bad/commit/?h=shm-fixes&id=7a94096efd9effd0731ae12398e4c07e6a45c68d

And a fix for the issue.

https://cgit.freedesktop.org/~adn770/gst-plugins-bad/commit/?h=shm-fixes&id=fb8c6686813f1535f1be33d2b28b9b54b07d2f8c

The problem is that the test also shows there's a racy deadlock situation, and I'm not sure how to fix it. I have the feeling that the issue comes mostly from shmsink, but I'm not completely sure.

Olivier, do you have any idea what should be done to fix the race?

To reproduce it, just run the test multiple times.
Comment 7 Aleix Conchillo Flaqué 2018-10-08 20:11:47 UTC
I've checked Josep's patch and test. Here's what I found.

Without Josep's patch, the new test already hangs. This is because the client's main socket is opened without O_NONBLOCK. I have filed bug 797258 with a patch.

Then, with the O_NONBLOCK fix, there is still another freeze (with or without Josep's patch). This deadlock only occurs when wait-for-connection=TRUE; with wait-for-connection=FALSE everything works great (with the non-blocking fix).
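
For reference, the usual idiom for putting an already-open fd into non-blocking mode looks like this (just a sketch; the actual fix is the patch attached to bug 797258):

#include <fcntl.h>
#include <glib.h>

/* Make an existing fd non-blocking; returns FALSE on failure. */
static gboolean
set_nonblocking (int fd)
{
  int flags = fcntl (fd, F_GETFL, 0);

  if (flags < 0)
    return FALSE;
  return fcntl (fd, F_SETFL, flags | O_NONBLOCK) == 0;
}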

Here's the stack trace:

Thread 3 (Thread 0x7ff799ed2700 (LWP 30791))

  #0  syscall at ../sysdeps/unix/sysv/linux/x86_64/syscall.S line 38
  #1  g_cond_wait at gthread-posix.c line 1402
  #2  gst_shm_sink_render at gstshmsink.c line 674
  #3  gst_base_sink_chain_unlocked at gstbasesink.c line 3546
  #4  gst_base_sink_chain_main at gstbasesink.c line 3672
  #5  gst_base_sink_chain at gstbasesink.c line 3701
  #6  gst_pad_chain_data_unchecked at gstpad.c line 4322
  #7  gst_pad_push_data at gstpad.c line 4578
  #8  gst_pad_push at gstpad.c line 4697
  #9  gst_base_src_loop at gstbasesrc.c line 2957
  #10 gst_task_func at gsttask.c line 332
  #11 g_thread_pool_thread_proxy at gthreadpool.c line 307
  #12 g_thread_proxy at gthread.c line 784
  #13 start_thread at pthread_create.c line 333
  #14 clone at ../sysdeps/unix/sysv/linux/x86_64/clone.S line 109

Thread 1 (Thread 0x7ff7a0e33700 (LWP 30789))

  #0  syscall at ../sysdeps/unix/sysv/linux/x86_64/syscall.S line 38
  #1  g_mutex_lock_slowpath at gthread-posix.c line 1320
  #2  g_mutex_lock at gthread-posix.c line 1344
  #3  gst_base_sink_change_state at gstbasesink.c line 5161
  #4  gst_element_change_state at gstelement.c line 2952
  #5  gst_element_set_state_func at gstelement.c line 2906
  #6  gst_bin_element_set_state at gstbin.c line 2604
  #7  gst_bin_change_state_func at gstbin.c line 2946
  #8  gst_pipeline_change_state at gstpipeline.c line 508
  #9  gst_element_change_state at gstelement.c line 2952
  #10 gst_element_set_state_func at gstelement.c line 2906
  #11 test_shm_live at elements/shm.c line 218
  #12 tcase_run_tfun_fork at check_run.c line 465
  #13 srunner_iterate_tcase_tfuns at check_run.c line 237
  #14 srunner_run_tcase at check_run.c line 377
  #15 srunner_iterate_suites at check_run.c line 205
  #16 srunner_run_tagged at check_run.c line 740
  #17 srunner_run at check_run.c line 754
  #18 srunner_run_all at check_run.c line 692
  #19 gst_check_run_suite at gstcheck.c line 1067
  #20 main at elements/shm.c line 249

Comment 8 Aleix Conchillo Flaqué 2018-10-08 21:02:53 UTC
I just filed bug 797260 with a patch that solves the issues mentioned in comment 7.

Regarding the unit test, I would add a couple of things:

- Do the same test with wait-for-connection=FALSE (see the sketch after this list).

- Remove the socket-path file generated during the test; otherwise this will pollute the file system with shm-unit-test, shm-unit-test.0, shm-unit-test.1, shm-unit-test.2...
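
A minimal sketch of the first point, assuming the test keeps a handle to the shmsink element (the variable name is mine, not the committed test's):

/* Same pipeline as the live test, but without waiting for a client. */
g_object_set (sink, "socket-path", "shm-unit-test",
    "wait-for-connection", FALSE, NULL);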

I will now stress test Josep's patch.
Comment 9 Aleix Conchillo Flaqué 2018-10-08 21:35:27 UTC
(In reply to Aleix Conchillo Flaqué from comment #8)

> 
> - Remove the socket-path file generated during the test, otherwise this will
> pollute the file system with shm-unit-test, shm-unit-test.0,
> shm-unit-test.1, shm-unit-test.2...
> 
> 

Let me take this back. This is already done by the element; the leftover files were just ones that hadn't been deleted because of the freezes. All good!
Comment 10 Josep Torra Valles 2018-10-09 15:17:25 UTC
Created attachment 373875 [details] [review]
Add unit test to reproduce the issue

Adding a unit test to reproduce the problem.
Comment 11 Josep Torra Valles 2018-10-09 15:19:18 UTC
Created attachment 373876 [details] [review]
Fixes the race condition with respect to the pipe access

This change fixes the issue of the pipe being accessed from both the _create and state-change threads.
Comment 12 Josep Torra Valles 2018-10-09 15:20:34 UTC
Created attachment 373877 [details] [review]
Fixes the race condition with respect to the pollfd access

This fixes the race condition on the pollfd access between the _create and state-change threads.
Comment 13 Olivier Crête 2018-10-10 17:18:51 UTC
Review of attachment 373875 [details] [review]:

::: tests/check/elements/shm.c
@@ +212,3 @@
+  fail_unless (state_res == GST_STATE_CHANGE_SUCCESS);
+
+  g_usleep (10000);

Is there a way we can test this without a sleep? What are we waiting for?
Comment 14 Olivier Crête 2018-10-10 17:22:40 UTC
The two other patches look good; the only thing left to fix is the sleep in the test, which is guaranteed to be racy.
Comment 15 Josep Torra Valles 2018-10-10 17:27:38 UTC
It's just waiting for some data flow to happen in the pipelines. It doesn't really matter how much data flows, and it shouldn't introduce any racy behavior in the test itself, but it is needed to surface any racy behavior that should be gracefully handled by the elements (which was the goal of the patches fixing those race condition issues).

(In reply to Olivier Crête from comment #13)
> Review of attachment 373875 [details] [review]:
> 
> ::: tests/check/elements/shm.c
> @@ +212,3 @@
> +  fail_unless (state_res == GST_STATE_CHANGE_SUCCESS);
> +
> +  g_usleep (10000);
> 
> Is there a way we can test this without a sleep? What are we waiting for?
Comment 16 Olivier Crête 2018-10-10 17:50:27 UTC
I changed the test to replace fakesink with appsink to pull one buffer, and I can still reproduce the problem from bug 797258 with this.
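
For the record, pulling a single buffer through appsink instead of sleeping looks roughly like this (a sketch; "sink" is assumed to be the appsink element):

#include <gst/app/gstappsink.h>

/* Blocks until one sample is available; returns NULL on EOS or flush. */
GstSample *sample = gst_app_sink_pull_sample (GST_APP_SINK (sink));
fail_unless (sample != NULL);
gst_sample_unref (sample);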