GNOME Bugzilla – Bug 797203
shmsrc: race condition when stopping
Last modified: 2018-10-10 17:51:03 UTC
There seems to be a race condition when stopping a shmsrc element. It is possible that "self->pipe" is set to NULL (calling gst_shm_src_stop_reading when setting the pipeline to NULL) while gst_shm_src_create is still being used in the streaming thread.
Created attachment 373760 [details] [review] fix race condition
Review of attachment 373760 [details] [review]: ::: sys/shm/gstshmsrc.c @@ +278,2 @@ gst_shm_pipe_dec (self->pipe); + GST_OBJECT_LOCK (self); Are you sure this unlock+relock is not racy? @@ +382,3 @@ gsb = g_slice_new0 (struct GstShmBuffer); gsb->buf = buf; gsb->pipe = self->pipe; There is a problem here, if the self->pipe can be NULL here, we'd get an invalid GstShmBuffer.. My guess is that it shouldn't UNLOCK it at the end of the while loop if buf != NULL... But in that case it's super ugly. I think it should something like: LOCK(); do { UNLOCK(); poll_wait..() LOCK() if (test for error condition 1) goto error_condition_1; } while (); ... gsb->pipe = self->pipe; UNLOCK; return GST_FLOW_OK; error_condition_1: UNLOCK(); GST_ELEMENT_ERROR(...); return GST_FLOW_...; @@ +431,2 @@ self->unlocked = TRUE; + GST_OBJECT_UNLOCK (self); should probably hold the lock across the set flushing for consistency. @@ +445,2 @@ self->unlocked = FALSE; + GST_OBJECT_UNLOCK (self); should probably hold the lock across the set flushing for consistency here too.
Created attachment 373765 [details] [review] fix race condition cleanup
(In reply to Olivier Crête from comment #2) Thanks for the comments. Started to do some GStreamer again lately, I'm rusty! :-) > Review of attachment 373760 [details] [review] [review]: > > ::: sys/shm/gstshmsrc.c > @@ +278,2 @@ > gst_shm_pipe_dec (self->pipe); > + GST_OBJECT_LOCK (self); > > Are you sure this unlock+relock is not racy? > Yes, it looks racy but I think it should be fine since self->pipe is only NULL when the element is initialized and only set NULL in this function. So, I don't think there's a way to shoot yourself in the foot. I think... ? Fixed all the other comments. I was talking to Josep Torra about this. He might have something else to say, let's wait until he jumps in.
Just an important note. This happens when shmsrc is-live=TRUE.
I'd wrote a test. https://cgit.freedesktop.org/~adn770/gst-plugins-bad/commit/?h=shm-fixes&id=7a94096efd9effd0731ae12398e4c07e6a45c68d And a fix for the issue. https://cgit.freedesktop.org/~adn770/gst-plugins-bad/commit/?h=shm-fixes&id=fb8c6686813f1535f1be33d2b28b9b54b07d2f8c The problem is that the test also exhibits that there's a racy deadlock situation and I'm not sure how to fix it. I have the feeling that the issue comes from mostly from shmsink but not completely sure. Olivier do you have any idea on what should be done to fix the race? To reproduce it just run the test multiple times.
I've checked Josep patch and test. Here's what I found. Without Josep patch the new test already hangs. This is because the client main socket is opened without O_NONBLOCK. I have filed bug 797258 with a patch. Then, with the O_NONBLOCK fix, there still another freeze (with or without Josep's patch). This deadlock only occurs when wait-for-connection=TRUE, with wait-for-connection=FALSE everything works great (with the non-blocking fix). Here's the stack trace:
+ Trace 238697
Thread 3 (Thread 0x7ff799ed2700 (LWP 30791))
Thread 1 (Thread 0x7ff7a0e33700 (LWP 30789))
I just filed bug 797260 and a patch which solves the issues mentioned in comment 7. Regarding the UT, I would add a couple of things: - Do the same test with wait-for-connection=FALSE. - Remove the socket-path file generated during the test, otherwise this will pollute the file system with shm-unit-test, shm-unit-test.0, shm-unit-test.1, shm-unit-test.2... I will now stress test Josep's patch.
(In reply to Aleix Conchillo Flaqué from comment #8) > > - Remove the socket-path file generated during the test, otherwise this will > pollute the file system with shm-unit-test, shm-unit-test.0, > shm-unit-test.1, shm-unit-test.2... > > Let me take this back. This is already done by the element, I just had the files that were not deleted by the freezes. All good!
Created attachment 373875 [details] [review] Add unit test to reproduce the issue Adding a unit test to reproduce the problem.
Created attachment 373876 [details] [review] Fixes the race condition with respect the pipe access This change fixes the issue when accessing the pipe from _create and state change thread.
Created attachment 373877 [details] [review] Fixes the race condition with respect the pollfd access This fixes the race condition with respect the pollfd access between _create and state change thread.
Review of attachment 373875 [details] [review]: ::: tests/check/elements/shm.c @@ +212,3 @@ + fail_unless (state_res == GST_STATE_CHANGE_SUCCESS); + + g_usleep (10000); Is there a way we can test this without a sleep? What are we waiting for?
The two other patches look good, the only thing to fix is the sleep in the test that is guaranteed to be racy.
Is just waiting for some data flow to happen in the pipelines. It doesn't really matter how much data flows and shouldn't really introduce any racy behavior in the test itseld but is needed to make emerge any racy behavior that should be gracefully handled by the elements(which was the goal on the patches to fix those race condition issues). (In reply to Olivier Crête from comment #13) > Review of attachment 373875 [details] [review] [review]: > > ::: tests/check/elements/shm.c > @@ +212,3 @@ > + fail_unless (state_res == GST_STATE_CHANGE_SUCCESS); > + > + g_usleep (10000); > > Is there a way we can test this without a sleep? What are we waiting for?
I changed the test to replace fakesink with appsink to pull one buffer and I can still reproduce the problem from #797258 with this.