GNOME Bugzilla – Bug 742605
Race condition when stopping a pipeline
Last modified: 2015-03-03 09:59:46 UTC
I'm seeing a race condition within vaapidecode when I try to stop pipelines: both threads want decode->decoder_mutex from gstvaapidecode.c. Backtraces from gdb:
+ Trace 234531
Thread 1 (Thread 0x7f34c0595740 (LWP 5185))
It looks like the change from PAUSED to READY is not guaranteed to release the decoder thread when we're shutting down, despite gst_vaapidecode_change_state kicking the condvar. I'm going to see if I can work out why this races.
Created attachment 294174 [details] [review] [PATCH] Hack around the vaapidecode race condition gst/vaapi/gstvaapidecode.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-)
The attached patch isn't a proper fix; it's a hack intended to hint at where the problem really lies. With it in place, however, I no longer see occasional lock-ups when stopping the pipeline. Someone with a better understanding of the code than I have needs to look at the infinite loop in gst_vaapidecode_decode_frame and work out how you're supposed to drop out of that loop during a state change from PAUSED to READY, given that the src pad task will have stopped and frames are therefore no longer being pulled from the decoder and sent downstream.
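For reference, a paraphrased sketch of the loop in question (simplified and hedged: lock and field names are approximate, see gstvaapidecode.c for the real code):

static GstFlowReturn
decode_frame_sketch (GstVaapiDecode * decode, GstVideoCodecFrame * frame)
{
  GstVideoDecoder *const vdec = GST_VIDEO_DECODER (decode);
  GstVaapiDecoderStatus status;

  for (;;) {
    status = gst_vaapi_decoder_decode (decode->decoder, frame);
    if (status == GST_VAAPI_DECODER_STATUS_ERROR_NO_SURFACE) {
      /* No free surface: wait for gst_vaapidecode_release() to signal
       * that downstream has returned one. Nothing guarantees that the
       * signal arrives while we are actually waiting. */
      GST_VIDEO_DECODER_STREAM_UNLOCK (vdec);
      g_mutex_lock (&decode->decoder_mutex);
      g_cond_wait (&decode->decoder_ready, &decode->decoder_mutex);
      g_mutex_unlock (&decode->decoder_mutex);
      GST_VIDEO_DECODER_STREAM_LOCK (vdec);
      continue;                 /* retry the decode */
    }
    break;
  }
  return status == GST_VAAPI_DECODER_STATUS_SUCCESS ?
      GST_FLOW_OK : GST_FLOW_ERROR;
}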
Note that this patch isn't enough in and of itself to fix the lockups; it addresses one cause, after which I get a deadlock between:
+ Trace 234545
and the thread running change_state.
I've been able to confirm that this is a regression since git revision 8edcc8f8. I'm going to try reverting the patches from https://bugzilla.gnome.org/show_bug.cgi?id=734616 because they're touching the locks that give me trouble.
Reverting those changes didn't help.
So, now that I've dug deep, I'm confused by the code. Why do we decode in a separate thread? Specifically, why can't gst_vaapidecode_decode_loop be inlined into gst_vaapidecode_decode_frame so that we decode and push frames from the same thread, rather than having a pad_task on the src pad just to run gst_vaapidecode_decode_loop? After all, a queue can be used to push downstream in a separate thread if buffering is needed. I'm going to experiment and see what happens if I try to remove the pad task completely.
We have to decode frames as fast as possible, so we do the decoding in one thread, handing frames to the hardware. The second thread just checks whether any decoded frames in the queue are ready for postprocessing/rendering and pushes them downstream. Please also check the logic used for surface_proxy releasing: once we have pushed a surface_proxy downstream, we don't know when it will lose its reference, which is why we set gst_vaapidecode_release() as the destroy notification function of the surface proxy. In a single-thread implementation, .... ! vaapidecode ! videosink ..., you are not guaranteed to always have a free surface, which can lead to deadlock. The initial implementation of gstreamer-vaapi used a single thread (with a timeout and so on), but later we decided to use multiple threads; you can check the commit history if needed. You can also see a similar kind of locking mechanism in upstream gst-omx, which uses a separate pad_task. BTW, does the patch provided in https://bugzilla.gnome.org/show_bug.cgi?id=740645 have any impact on your problem?
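To illustrate the release mechanism described here, a sketch only (field names such as decoder_ready are approximate; the real code lives in gstvaapidecode.c):

/* When the last reference to the surface proxy is dropped downstream,
 * the destroy notify fires and wakes up a decode thread waiting for a
 * free surface. */
static void
gst_vaapidecode_release (GstVaapiDecode * decode)
{
  g_mutex_lock (&decode->decoder_mutex);
  g_cond_signal (&decode->decoder_ready);
  g_mutex_unlock (&decode->decoder_mutex);
  gst_object_unref (decode);    /* drop the ref taken when registering */
}

/* ...and when pushing a decoded frame downstream... */
gst_vaapi_surface_proxy_set_destroy_notify (proxy,
    (GDestroyNotify) gst_vaapidecode_release, gst_object_ref (decode));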
(In reply to comment #7)
> In a single-thread implementation, .... ! vaapidecode ! videosink ..., you are not guaranteed to always have a free surface, which can lead to deadlock.
I understand why you might need a queue between vaapidecode and videosink; however, why can't you require a queue to avoid deadlocks, as is done by other elements (such as tsdemux)?
> BTW, does the patch provided in https://bugzilla.gnome.org/show_bug.cgi?id=740645 have any impact on your problem?
That patch doesn't help.
(In reply to comment #7)
> In a single-thread implementation, .... ! vaapidecode ! videosink ..., you are not guaranteed to always have a free surface, which can lead to deadlock.
Assuming no bugs in videosink, the surface you've handed to the videosink will come back once its PTS has passed, and you will unblock. Further, playbin automatically plugs a queue between decoder and sink, so the majority of applications have a queue in place that allows vaapidecode to run ahead of vaapisink.
> The initial implementation of gstreamer-vaapi used a single thread (with a timeout and so on), but later we decided to use multiple threads; you can check the commit history if needed.
I've looked at the commit history, and I can't see any justification for multiple threads; they came in at the very beginning of vaapidecode, AFAICT, and were partially removed because they caused issues. The decoder is introduced by:

commit a203d19a35f2d5f64f4f239e1c6519fb21424094
Author: gb <gb@5584edef-b1fe-4b99-b61b-dd2bab72e969>
Date: Fri Apr 23 16:05:58 2010 +0000

    Add initial (multithreaded) decoder based on FFmpeg.

which already has multiple threads and is the first implementation of vaapidecode in the repository.

commit a4d201aaf93816fd23ed1eeffae6786cec883020
Author: gb <gb@5584edef-b1fe-4b99-b61b-dd2bab72e969>
Date: Fri Apr 30 15:37:28 2010 +0000

    Drop excessive threading that over-complicates synchronisation.
    MPEG-2 & H.264 videos now play but there are other problems (timestamps).

reduces the number of threads in use, because the threading was causing problems.
(In reply to comment #7)
> You can also see a similar kind of locking mechanism in upstream gst-omx, which uses a separate pad_task.
I've looked at gst-omx, and I don't see the similarity beyond using a pad_task and having locking between the pad_task and the rest of the decoder element. I can't, for example, find the VA-API equivalent of set_flushing.
-- First of all, there are use cases other than playbin, and we don't want to force the user to use a queue.
-- IIRC, the 0.3.x, 0.4.x and earlier 0.5.x releases used a single thread (http://www.freedesktop.org/software/vaapi/releases/gstreamer-vaapi/).
-- gst-omx is just an example of pad_task usage, that's it :)
BTW, how can I reproduce the issue?
(In reply to comment #11)
> -- IIRC, the 0.3.x, 0.4.x and earlier 0.5.x releases used a single thread (http://www.freedesktop.org/software/vaapi/releases/gstreamer-vaapi/).
I've just looked at the 0.3 branch. It was multithreaded back then too, just with a free-standing thread rather than a pad_task. That changes during 0.5.x from a free thread to a pad_task, but the thread has been there from day one.
> -- gst-omx is just an example of pad_task usage, that's it :)
> BTW, how can I reproduce the issue?
I'll attach my pipeline graph and source media. To reproduce, get the pipeline running, then set the pipeline's state to NULL while the pipeline is running.
Created attachment 296102 [details] Pipeline graph when the media is playing correctly
From code (my application has extra command-and-control logic and is written using gst-python), set up this pipeline.
Media is at http://90.155.96.198/sfarnsworth/movie.mp4 - too big to attach. Play this media, then once it's been playing for a little while (random times in my testing), set the pipeline state to GST_STATE_NULL. If you hit the race, the pipeline deadlocks.
(In reply to comment #12)
> I've just looked at the 0.3 branch. It was multithreaded back then too, just with a free-standing thread rather than a pad_task. That changes during 0.5.x from a free thread to a pad_task, but the thread has been there from day one.
Please show me the code that creates a thread in those older versions!
> I'll attach my pipeline graph and source media. To reproduce, get the pipeline running, then set the pipeline's state to NULL while the pipeline is running.
Sorry, I don't have time to write sample code to test it now.
(In reply to comment #15)
> Please show me the code that creates a thread in those older versions!
According to the commit log, which *you* told me to look at, it came as part of:

commit a203d19a35f2d5f64f4f239e1c6519fb21424094
Author: gb <gb@5584edef-b1fe-4b99-b61b-dd2bab72e969>
Date: Fri Apr 23 16:05:58 2010 +0000

    Add initial (multithreaded) decoder based on FFmpeg.
(In reply to comment #16)
> According to the commit log, which *you* told me to look at, it came as part of:
> commit a203d19a35f2d5f64f4f239e1c6519fb21424094 "Add initial (multithreaded) decoder based on FFmpeg."
-- Check out the 0.3 branch.
-- Look at the source code of gst/vaapi/gstvaapidecode.c from line 205 onwards; there you can see the single-thread implementation.
Hi Simon,
Please try to provide some sample code that does the same thing as your pipeline. You can easily create a pipeline like this:

pipeline = gst_parse_launch ("filesrc location=movie.mp4 ! typefind ! qtdemux ! multiqueue ! vaapiparse_h264 ! h264parse ! vaapidecode ! queue ! vaapipostproc ! vaapisink", NULL);

and set whatever states you need at regular intervals. For example, you can add a timeout function like

g_timeout_add (5000, (GSourceFunc) callback_fun_to_change_state, NULL);

to change the state of the pipeline every five seconds. Let me know if there are issues. Also make sure that the pipeline runs fine with a software decoder (avdec_h264), software converter (videoconvert) and software sink (xvimagesink) - just to make sure that we are not messing up anything in the vaapi elements.
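A minimal, self-contained version of the suggested test program could look like this (a hypothetical sketch: the pipeline string is the one above, and toggling between PLAYING and NULL every five seconds is just one way to exercise the state changes):

#include <gst/gst.h>

static GstElement *pipeline;

static gboolean
change_state_cb (gpointer user_data)
{
  GstState state;

  /* Toggle between PLAYING and NULL to try to provoke the race. */
  gst_element_get_state (pipeline, &state, NULL, 0);
  gst_element_set_state (pipeline,
      state == GST_STATE_PLAYING ? GST_STATE_NULL : GST_STATE_PLAYING);
  return TRUE;                  /* keep the timeout running */
}

int
main (int argc, char *argv[])
{
  GMainLoop *loop;

  gst_init (&argc, &argv);
  pipeline = gst_parse_launch ("filesrc location=movie.mp4 ! typefind ! "
      "qtdemux ! multiqueue ! vaapiparse_h264 ! h264parse ! vaapidecode ! "
      "queue ! vaapipostproc ! vaapisink", NULL);
  gst_element_set_state (pipeline, GST_STATE_PLAYING);

  g_timeout_add (5000, change_state_cb, NULL);

  loop = g_main_loop_new (NULL, FALSE);
  g_main_loop_run (loop);
  return 0;
}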
(In reply to sreerenj from comment #18)
> Please try to provide some sample code that does the same thing as your pipeline.
> [...]
> to change the state of the pipeline every five seconds.
I've tried this - I can't work out the exact set of callbacks I need to trigger outside my system to get the race to hit. It's trivial to hit within my system (simply have 16 separate GStreamer processes playing the same movie simultaneously - at least one of them will race within 15 minutes).
> Let me know if there are issues. Also make sure that the pipeline runs fine with a software decoder (avdec_h264), software converter (videoconvert) and software sink (xvimagesink).
Switching out the decoder is enough to make it work. I've not tried changing the sink, as the backtraces clearly show that the deadlock is within the decoder.
Created attachment 296771 [details] [review] [PATCH] vaapidecode: Remove the src pad task Because the decoder uses the thread from handle_frame() to decode a frame, the src pad task creates an unsolveable AB-BA deadlock between handle_frame() waiting for a free surface and decode_loop() pushing decoded frames out. Instead, have handle_frame() take responsibility for pushing surfaces, and remove the deadlock completely. If you need a separate thread downstream, you can insert a queue between vaapidecode and its downstream to get one. Signed-off-by: Simon Farnsworth <simon.farnsworth@onelan.co.uk> --- gst/vaapi/gstvaapidecode.c | 245 ++++++++++++++++----------------------------- gst/vaapi/gstvaapidecode.h | 7 +- 2 files changed, 87 insertions(+), 165 deletions(-)
(In reply to Simon Farnsworth from comment #20)
> [PATCH] vaapidecode: Remove the src pad task
With this patch applied, I've tested the following pipelines (all succeed):

First, the various pipelines my code uses internally, including the one that triggered this bug report.

Second, three different sinks, without queues - this should trigger any deadlocks caused by the sink holding proxy surfaces:

filesrc location=/data/data/media/default/movie.mpg ! qtdemux ! vaapidecode ! vaapisink
filesrc location=/data/data/media/default/movie.mpg ! qtdemux ! vaapidecode ! xvimagesink
filesrc location=/data/data/media/default/movie.mpg ! qtdemux ! vaapidecode ! glimagesink

A transcode pipeline, and I viewed the output file to confirm that it really had transcoded:

filesrc location=/data/data/media/default/movie.mpg ! qtdemux ! vaapidecode ! vaapiencode_mpeg2 ! mpegvideoparse ! mpegtsmux ! filesink location=test.mpg

A live pipeline. This is slightly more complex, because I need queues with software decode, too. With software decode, I use:

udpsrc buffer-size=131072 uri=udp://239.192.133.203:5000 ! queue max-size-bytes=0 max-size-buffers=0 max-size-time=0 ! tsdemux ! mpegvideoparse ! mpeg2dec ! xvimagesink

Removing buffer-size or the queue causes problems with the above pipeline, as data is dropped before tsdemux. With a patched vaapidecode:

udpsrc buffer-size=131072 uri=udp://239.192.133.203:5000 ! queue max-size-bytes=0 max-size-buffers=0 max-size-time=0 ! tsdemux ! mpegvideoparse ! vaapidecode ! vaapisink
udpsrc buffer-size=131072 uri=udp://239.192.133.203:5000 ! queue max-size-bytes=0 max-size-buffers=0 max-size-time=0 ! tsdemux ! mpegvideoparse ! vaapidecode ! xvimagesink

works (swapping mpeg2dec for vaapidecode, and using one of two sinks to show that it doesn't deadlock in the sink while live).

Finally, I've tested playbin, and confirmed that playbin was using vaapidecode successfully.
Is there no way to reproduce the issue with sample code? I want to see and understand the root cause of the issue rather than fixing it some other way. Are you experiencing the deadlock while doing finish() operations? Also, IMHO it is not so nice to always ask the user to use a queue. We have customers who use pipelines without a queue, so they might come up with bug reports if I push this.
(In reply to sreerenj from comment #22)
> Is there no way to reproduce the issue with sample code? I want to see and understand the root cause of the issue rather than fixing it some other way.
It's a challenging race to hit, and I can't work out how to hit it without a full system around me - it depends on quite a bit of luck, too.
As you'll see if you've read my testing report, the only case where I needed a queue to make it work was the case where software decode also needs a queue; I have not been able to find a case where a queue is needed for vaapidecode with the thread removed but was not also needed when the thread was present. Do you have any example pipelines I can test where the thread is required to work around the lack of a queue?
Maybe some more complicated pipelines, where downstream elements after the decoder require more time to do expensive video postprocessing operations on high-resolution videos, so that the surfaces won't get released as expected in the normal case. Did you manage to figure out where the deadlocks occur? Does that occur only when changing state from PAUSED to READY, or do other cases like seeking also cause the issue? It would be helpful if you could provide some more explanation.
(In reply to sreerenj from comment #22)
> Is there no way to reproduce the issue with sample code? I want to see and understand the root cause of the issue rather than fixing it some other way. Are you experiencing the deadlock while doing finish() operations?
The backtraces in https://bugzilla.gnome.org/page.cgi?id=traceparser/trace.html&trace_id=234531 should make the root cause clear; they should be read with your working copy at 8bf8f110. Before the backtraces were taken, I called gst_element_set_state (my_pipeline, GST_STATE_NULL) while the pipeline was in GST_STATE_PLAYING.
The underlying issue is that you cannot know the state of the GST_VIDEO_DECODER_STREAM lock as you enter this function; there are different routes through GstVideoDecoder, which means you can't know how many calls to GST_VIDEO_DECODER_STREAM_UNLOCK() are needed to unblock the other threads that take the GST_VIDEO_DECODER_STREAM lock.
(In reply to sreerenj from comment #24)
> Maybe some more complicated pipelines, where downstream elements after the decoder require more time to do expensive video postprocessing operations on high-resolution videos, so that the surfaces won't get released as expected in the normal case.
Can you give me an example? Adding vaapipostproc set to do advanced (motion compensated) deinterlacing isn't enough to make this happen, and I cannot envisage a situation where the thread helps, as it only pushes surfaces (it does not offload the decode - decode is done directly in decode_frame's thread).

Specifically, gst_vaapi_decoder_decode uses handle_frame()'s thread to do the decode work (this, and gst_vaapi_decoder_flush, are the only functions in gstvaapidecode.c that can end up calling vaEndPicture), and thus there is no way for a buffer to end up queued for output after gst_vaapi_decoder_decode returns but before the next call to gst_vaapi_decoder_decode. vaapidecode provides a parse() method to GstVideoDecoder, which ensures that handle_frame() is only called with at most one frame's worth of data at a time. As a result, gst_vaapi_decoder_decode can only return GST_VAAPI_DECODER_STATUS_ERROR_NO_SURFACE if either (a) there are insufficient surfaces available to decode this stream and the decoder is holding onto all of them (e.g. as reference surfaces), or (b) at least one surface has left the decoder and we're waiting for it to come back.

The thread does not help with case (a); the decoder holds all the surfaces and is not letting any of them out. In case (b), as long as we ensure that all surfaces are pushed out immediately after gst_vaapi_decoder_decode returns, we will eventually see a call to gst_vaapidecode_release that unblocks us. If I've missed a case in the above analysis, please point it out. Note that all that occurs in the src pad task is that completed frames are pulled from the output queue of the decoder; no decoding takes place in the src pad task.

> Did you manage to figure out where the deadlocks occur? Does that occur only when changing state from PAUSED to READY, or do other cases like seeking also cause the issue?
You need a very specific interleaving of threads to make it happen. Specifically, you need the scheduler to fail to schedule the src pad task while the following events happen:

1. Precondition: gst_vaapidecode_decode_frame has been called enough times to fill all the surfaces in the pool, and the src pad task has not yet been scheduled to push these surfaces from the decoder to downstream. As a result, gst_vaapidecode_release will not be called, as the surfaces are sitting in the decoder output queue.

2. gst_vaapidecode_decode_frame enters the if (status == GST_VAAPI_DECODER_STATUS_ERROR_NO_SURFACE) block, and is then scheduled away before it does the cond_wait.

3. Another thread sets the pipeline state to GST_STATE_NULL, and gst_vaapidecode_change_state reaches the return GST_ELEMENT_CLASS(gst_vaapidecode_parent_class)->change_state(element, transition); statement. Note that because gst_vaapidecode_decode_frame has not yet done the cond_wait, the efforts made to unblock gst_vaapidecode_decode_frame before stopping the src pad task don't succeed.

4. gst_vaapidecode_decode_frame now executes far enough to enter the cond_wait, and can never return.
Note that even if you unblock the cond_wait, gst_vaapidecode_decode_frame will re-enter the cond_wait, as the decoder is still blocked waiting for the src pad task to run, but the src pad task cannot run because it's been stopped in step 3.
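The failure mode in steps 2-4 is a classic lost wake-up: a signal delivered between the check and the wait is simply dropped, because a GCond carries no memory. A minimal, self-contained GLib illustration of the pattern (nothing vaapi-specific, purely to show why a signal that fires before the wait is lost):

#include <glib.h>

static GMutex mutex;
static GCond cond;
static gboolean ready;          /* the predicate a correct waiter would use */

static gpointer
signaller (gpointer data)
{
  g_mutex_lock (&mutex);
  ready = TRUE;
  g_cond_signal (&cond);        /* may fire before anyone is waiting */
  g_mutex_unlock (&mutex);
  return NULL;
}

int
main (void)
{
  GThread *t = g_thread_new ("signaller", signaller, NULL);
  g_thread_join (t);            /* guarantee the signal has already fired */

  g_mutex_lock (&mutex);
  /* Broken: a bare wait with no predicate would block forever here,
   * because the signal has already come and gone. */
  /* g_cond_wait (&cond, &mutex); */

  /* Correct: wait in a loop on a predicate. (In the vaapidecode case even
   * this would not help once the pad task is stopped, because nothing
   * remaining can ever make the predicate true.) */
  while (!ready)
    g_cond_wait (&cond, &mutex);
  g_mutex_unlock (&mutex);
  return 0;
}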
A further note: since applying this patch, the random crashes I've been seeing when using vaapidecode (which I had intended to investigate next) have gone away.
I think that all of Simon's points are valid. However, it would be interesting to align the architecture with the GStreamer "reference" decoder for hwaccel nowadays, which tends to be omx. There are indeed too many points of locking in the gst-vaapi implementation, and deadlocks are thus possible. Would it help if we could hand over lock functions to GstVaapiDecoder? Or add additional machinery there to help develop a correct decoder with a dedicated task for handle_frame()? More analysis to come later. Thanks.
(In reply to Simon Farnsworth from comment #26)
> Specifically, gst_vaapi_decoder_decode uses handle_frame()'s thread to do the decode work (this, and gst_vaapi_decoder_flush, are the only functions in gstvaapidecode.c that can end up calling vaEndPicture), and thus there is no way for a buffer to end up queued for output after gst_vaapi_decoder_decode returns but before the next call to gst_vaapi_decoder_decode.
But if we handle decode_frame + push_frame_downstream in a single thread (with no queue downstream either), then the streaming thread (handle_frame()) needs to wait for the completion of all postprocessing of a decoded frame before accepting the next frame to start decoding. Instead, if we handle push_frame_downstream in a separate thread, handle_frame() can accept a new frame just after finishing the decoding of each frame. WDYT?
(In reply to Gwenole Beauchesne from comment #28)
> I think that all of Simon's points are valid. However, it would be interesting to align the architecture with the GStreamer "reference" decoder for hwaccel nowadays, which tends to be omx.
The key difference between VA-API and OpenMAX IL is that OpenMAX IL (OMX) is equivalent in power to GStreamer itself and provides a pipeline of its own internally; VA-API provides the implementation backend for a single OMX component or GStreamer element.

This leads to a significant difference between the behaviour of any OMX component wrapper and a VA-API wrapper. In OMX, to use a single component on its own, I call OMX_EmptyThisBuffer to hand a buffer of data to OMX, and later call OMX_FillThisBuffer to get data out of OMX. This is effectively a FIFO from your code to a hidden OMX thread, and a second FIFO from the hidden OMX thread back to you. In contrast, VA-API behaves a lot more like libav; I call vaBeginPicture and vaEndPicture as a pair to supply data to VA-API, and immediately (modulo processing time) get back data to queue elsewhere - there's no vaOutputPicture call required to get the output back from VA-API.

> There are indeed too many points of locking in the gst-vaapi implementation, and deadlocks are thus possible. Would it help if we could hand over lock functions to GstVaapiDecoder? Or add additional machinery there to help develop a correct decoder with a dedicated task for handle_frame()?
Note that you already have parallelism on Intel, implied by the Linux GPU interfaces. When the kernel returns to you after submitting the decode batch to the GPU, the GPU may still be busy working on the decode. You don't stall waiting for the GPU until you attempt to inspect the decoder's output.
(In reply to sreerenj from comment #29)
> But if we handle decode_frame + push_frame_downstream in a single thread (with no queue downstream either), then the streaming thread (handle_frame()) needs to wait for the completion of all postprocessing of a decoded frame before accepting the next frame to start decoding. Instead, if we handle push_frame_downstream in a separate thread, handle_frame() can accept a new frame just after finishing the decoding of each frame. WDYT?
If you do everything in one thread, then yes, you lose some opportunities for parallelism between your postprocessing and your decoder. You don't completely lose on Intel GPUs, however, as the submission of work to the GPU returns immediately, and you do not wait for the GPU to complete decoding until you attempt to inspect the result. However, this is what you expect when using GStreamer; it's how software decoders behave, for example.
(In reply to Simon Farnsworth from comment #31)
> If you do everything in one thread, then yes, you lose some opportunities for parallelism between your postprocessing and your decoder. You don't completely lose on Intel GPUs, however, as the submission of work to the GPU returns immediately, and you do not wait for the GPU to complete decoding until you attempt to inspect the result.
There are many other use cases that do software postprocessing. For example, we have to allow upstream GStreamer VPP elements (software) to connect with vaapidecode; OpenCV elements would be a handy example of expensive operations.
(In reply to sreerenj from comment #32)
> There are many other use cases that do software postprocessing. For example, we have to allow upstream GStreamer VPP elements (software) to connect with vaapidecode; OpenCV elements would be a handy example of expensive operations.
And in those cases, I would expect the thread to (at best) behave equivalently to inserting a queue max-size-time=0 max-size-bytes=0 max-size-buffers=1 between the decoder and the postprocessing elements. Note, however, that even without the thread there are no correctness implications; the thread merely provides a small performance increase, which you can easily obtain outside the decoder element by adding a queue.

Also, this bug is reproducible with a lot of luck and gst-launch-1.0; if you hit Ctrl-C at just the right moment with the following pipeline running, you will see the deadlock occur:

gst-launch-1.0 filesrc location=movie.mpg ! qtdemux ! vaapidecode ! vaapisink

However, getting the timing right to hit the bug reliably is challenging.
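For illustration, the same single-buffer queue can be created programmatically (a sketch; the helper name is hypothetical, the property values match those in the comment above, and gst_init() is assumed to have been called):

#include <gst/gst.h>

static GstElement *
make_single_buffer_queue (void)
{
  GstElement *queue = gst_element_factory_make ("queue", NULL);

  /* Equivalent to: queue max-size-time=0 max-size-bytes=0 max-size-buffers=1 */
  g_object_set (queue,
      "max-size-time", (guint64) 0,
      "max-size-bytes", (guint) 0,
      "max-size-buffers", (guint) 1,
      NULL);
  return queue;
}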
In a single-thread implementation (without a queue), who is going to provide the wake-up signal to the same thread blocked in g_cond_wait()? I agree that adding a queue will fix the issue, with the advantage of simpler code to maintain but the disadvantage of no parallelism. I remember an old proposal from Gwenole to use a "vaapibin"; maybe it is time to implement it :)
(In reply to sreerenj from comment #34)
> I agree that adding a queue will fix the issue, with the advantage of simpler code to maintain but the disadvantage of no parallelism.
err: no parallelism -> no parallelism in the vaapi element itself
(In reply to sreerenj from comment #34)
> In a single-thread implementation (without a queue), who is going to provide the wake-up signal to the same thread blocked in g_cond_wait()?
I think I can convince you that the src pad task can't reliably provide the wake-up signal either. Please poke holes in the following, and I'll try to fill them.

Preconditions:
* We treat the entire downstream chain as one unit, even if it consists of multiple elements.
* The purpose of the condition wait is to block the decoder until a buffer is available from the buffer pool.
* The signal is provided in gst_vaapidecode_release when downstream releases its final reference to the buffer, placing it back into the buffer pool.
* The locking provided by GstVideoDecoder is assumed to be correct when a derived element does not start any tasks or threads.
* Correct operation of a given pipeline requires that there are sufficient buffers (and hence vaapi surfaces) in the buffer pool for vaapidecode and all elements downstream of vaapidecode. The pool size must take into account the worst-case buffer usage of every element.

Notes: if there are insufficient buffers then the pipeline will fail eventually; the choice is between non-deterministic failure and deterministic failure. This must be fixed by changing the design to use a larger buffer pool. We only enter the cond_wait when the decoder cannot allocate a buffer from the buffer pool. Once we have entered the cond_wait, we require the elements that hold buffers from the buffer pool to return a buffer before the decoder can provide a new frame, regardless of whether we have a src pad task or not.

We're going to imagine a pipeline of the form upstream ! vaapidecode ! downstream, where upstream and downstream can be complex bins if required. There are three cases of interest (where the buffer pool is empty):

1. Downstream provides a thread on which it eventually unrefs video frames. In this case, when downstream unrefs a buffer, gst_vaapidecode_release will be called from downstream's thread and provides the wake-up signal; this happens with or without a src pad task. We'll ignore this case for later analysis - it's trivially the same with or without the src pad task.
2. Downstream does not return control back to gst_video_decoder_finish_frame until it has unrefed the incoming video frame, thus returning a buffer to the buffer pool.
3. Downstream stores incoming video frames, and will unref some once it has received enough frames to do a burst of processing.

In case 2, there are three ways it can go with a src pad task:

a. handle_frame enters the cond_wait before the src pad task has sent all the frames downstream. Downstream will use the src pad task's thread to release the buffer, causing handle_frame to be signalled and continue.
b. The src pad task sends all the frames downstream and starts waiting for a new frame to leave the decoder before handle_frame is called again by upstream. There are enough buffers in the pool, and thus handle_frame does not enter the cond_wait. When handle_frame is called, it pushes new data into the decoder, enabling the src pad task to send more frames downstream.
c. The src pad task sends all the frames downstream and starts waiting for a new frame to leave the decoder before handle_frame is called again by upstream. The buffer pool is empty, and thus handle_frame enters the cond_wait at this point (while the src pad task is waiting for a frame from the decoder before it calls downstream again).
In case 2.c, the cond_wait will never be signalled, as the only time it can be signalled is if the src pad task calls downstream, but the src pad task is waiting for handle_frame to feed new data into the decoder. The result is deadlock.

The removal of the src pad task eliminates case 2.a completely; my patch ensures that we are always in case 2.b if the buffer pool is big enough, but will deterministically enter case 2.c when the buffer pool is too small. Because failure is deterministic, diagnosing the bug is simple (even in the face of cool technology like Turbo Boost) and the solution should be clear. With the src pad task, if the buffer pool is big enough, we're nondeterministically in case 2.a or case 2.b, while if the buffer pool is not big enough, we nondeterministically enter case 2.a or 2.c.

Case 3 is the same as case 2, but with the added wrinkle that we need to ensure that we have enough surfaces spare to cope with downstream storing frames. Again, the src pad task doesn't help us; if downstream has returned to decode_loop, and decode_loop is not going to call into downstream again until a new frame is output by the decoder, something other than downstream has to unblock handle_frame, but there's nothing else available to signal the condvar.

Thus, while I wouldn't be surprised to see more bug reports after dropping down to a single thread, they're all cases where we've gone from non-deterministic deadlocks to deterministic deadlocks - the increase in determinism should make it easier to diagnose the real problem.

> I agree that adding a queue will fix the issue, with the advantage of simpler code to maintain but the disadvantage of no parallelism.
And adding a queue, limited to a single buffer, allows you to recover the lost parallelism on the output. Further, you can use the queue to smooth out the decode work relative to the postprocessing work, if your postprocessing is uneven across frames; just make the queue larger as needed to ensure that there are frames waiting for the postprocessor. I can easily imagine a combined scene change detector and face recogniser being cheap when there are no scene changes, and expensive at cuts, for example.

I'd also note that there might be room for a slight optimization here, if GStreamer's parsers can be relied upon to provide correctly packetized buffers: you can tell GstVideoDecoder that the input is packetized if the sink caps show parsed=true and the rest of the caps indicate that it'll be packetized as VA-API expects (see the sketch after this comment). This would reduce overhead for decodebin and playbin uses.

> I remember an old proposal from Gwenole to use a "vaapibin"; maybe it is time to implement it :)
Might not be a bad thing, if it eases moving from gst-omx on nasty hardware to gst-vaapi :)
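Regarding the packetized-input suggestion above, a minimal sketch of what it could look like (hypothetical helper; the caps check is simplified and a real implementation would need per-codec care):

/* If the sink caps say the stream is already parsed, tell GstVideoDecoder
 * the input is packetized so it can skip the parse() path. */
static void
maybe_set_packetized (GstVideoDecoder * vdec, GstCaps * sinkcaps)
{
  GstStructure *const s = gst_caps_get_structure (sinkcaps, 0);
  gboolean parsed = FALSE;

  if (gst_structure_get_boolean (s, "parsed", &parsed) && parsed)
    gst_video_decoder_set_packetized (vdec, TRUE);
}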
As a further data point: I've now been running for 6 days with this patch and have seen no crashes, on a test case that previously had at least 4 crashes within 30 minutes. I've also not seen a single deadlock. Our QA team is assessing this patch too, and has found no problems yet.
Review of attachment 294174 [details] [review]: Rejecting the older "hack" patch
Simon, Could you please rebase it on top of current master?
Created attachment 298003 [details] [review] [PATCH] vaapidecode: Remove the src pad task Because the decoder uses the thread from handle_frame() to decode a frame, the src pad task creates an unsolveable AB-BA deadlock between handle_frame() waiting for a free surface and decode_loop() pushing decoded frames out. Instead, have handle_frame() take responsibility for pushing surfaces, and remove the deadlock completely. If you need a separate thread downstream, you can insert a queue between vaapidecode and its downstream to get one. Signed-off-by: Simon Farnsworth <simon.farnsworth@onelan.co.uk> --- ext/codecparsers | 2 +- gst/vaapi/gstvaapidecode.c | 275 ++++++++++++++++----------------------------- gst/vaapi/gstvaapidecode.h | 7 +- 3 files changed, 101 insertions(+), 183 deletions(-)
Review of attachment 296771 [details] [review]: Outdated
(In reply to Simon Farnsworth from comment #40)
> Created attachment 298003 [details] [review] [PATCH] vaapidecode: Remove the src pad task
This is rebased on current master. I could do with Victor giving it a quick look over, to make sure I've not lost anything important from his changes.
The attached patch changes the subproject commit id, which is wrong. You have to rebase it on top of a fresh copy of gstreamer-vaapi.
Created attachment 298071 [details] [review] [PATCH] vaapidecode: Remove the src pad task Because the decoder uses the thread from handle_frame() to decode a frame, the src pad task creates an unsolveable AB-BA deadlock between handle_frame() waiting for a free surface and decode_loop() pushing decoded frames out. Instead, have handle_frame() take responsibility for pushing surfaces, and remove the deadlock completely. If you need a separate thread downstream, you can insert a queue between vaapidecode and its downstream to get one. Signed-off-by: Simon Farnsworth <simon.farnsworth@onelan.co.uk> --- gst/vaapi/gstvaapidecode.c | 275 ++++++++++++++++----------------------------- gst/vaapi/gstvaapidecode.h | 7 +- 2 files changed, 100 insertions(+), 182 deletions(-)
(In reply to sreerenj from comment #44)
> The attached patch changes the subproject commit id, which is wrong. You have to rebase it on top of a fresh copy of gstreamer-vaapi.
Sorry about that - git submodules behave a bit differently from normal files and directories, and I'm not yet used to dealing with them. The new version is in attachment 298071 [details] [review].
Comment on attachment 298003 [details] [review] [PATCH] vaapidecode: Remove the src pad task obsoleting this patch, in order to use git bz ;)
Doing a smoke test: changing the playback rate or doing a seek operation throws the error

0:00:46.664861722 25207 0x7f86f4004850 ERROR vaapidecode gstvaapidecode.c:442:gst_vaapidecode_handle_frame: push loop error after decoding -2

but the operation seems to work fine.
(In reply to Víctor Manuel Jáquez Leal from comment #48)
> Doing a smoke test: changing the playback rate or doing a seek operation throws the error
> 0:00:46.664861722 25207 0x7f86f4004850 ERROR vaapidecode gstvaapidecode.c:442:gst_vaapidecode_handle_frame: push loop error after decoding -2
> but the operation seems to work fine.
This isn't a new error - it's just deterministically generated now, whereas before (with the src pad task) it could only be generated if the src pad task ran at just the right moment between the controlling thread triggering a seek or rate change and the decoder getting a new frame to decode.
I suspect that the GST_ERROR on line 442 can be removed completely now, with your changes to gst_vaapidecode_push_decoded_frame, as can the one in the error_push_all_decoded_frames: block. In both cases, I can't see a path through gst_vaapidecode_push_decoded_frame or gst_vaapidecode_push_all_decoded_frames that doesn't print an error of its own if something bad has happened. Would you prefer a follow-on patch to remove these GST_ERRORs, or an amended version of my original patch?
(In reply to Simon Farnsworth from comment #49) > Would you prefer a follow-on patch to remove these GST_ERRORs, or an amended > version of my original patch? I can try a follow-on patch this afternoon.
It would also be good to test playbin gapless playback (by using the about-to-finish signal).
(In reply to sreerenj from comment #51)
> It would also be good to test playbin gapless playback (by using the about-to-finish signal).
Running gst-play-1.0 --gapless I see a lot of error messages of these types:

0:00:00.190294338 26887 0x7f3140002590 ERROR vaapidecode gstvaapidecode.c:356:gst_vaapidecode_push_decoded_frame: video sink rejected the video buffer (error: not-negotiated [-4])

and

0:01:22.124027225 26887 0x7f3118007400 ERROR vaapidecode gstvaapidecode.c:356:gst_vaapidecode_push_decoded_frame: video sink rejected the video buffer (error: not-linked [-1])

mainly the second.
(In reply to Víctor Manuel Jáquez Leal from comment #52)
> Running gst-play-1.0 --gapless I see a lot of error messages of these types:
> gstvaapidecode.c:356:gst_vaapidecode_push_decoded_frame: video sink rejected the video buffer (error: not-negotiated [-4])
> gstvaapidecode.c:356:gst_vaapidecode_push_decoded_frame: video sink rejected the video buffer (error: not-linked [-1])
Do you see these messages without my patch, just less frequently, or do they only appear when you've applied my patch? If the former, that's expected - I chose one interleaving of the two threads that I knew could not deadlock when merging them together. If the latter, that's unexpected - I shouldn't cause new messages to appear, as it's just one interleaving of the two threads.
Regarding the error message during seek: there's no need to print an error message for a flushing pad. I will fix that locally.
Pushed, Thanks for the patch. commit aafa59f01ed5b15bc3628772d1658c63ac87a199 Author: Simon Farnsworth <simon.farnsworth@onelan.co.uk> Date: Mon Mar 2 14:46:38 2015 +0200 vaapidecode: Switch back to Single thread implementation Because the decoder uses the thread from handle_frame() to decode a frame, the src pad task creates an unsolveable AB-BA deadlock between handle_frame() waiting for a free surface and decode_loop() pushing decoded frames out. Instead, have handle_frame() take responsibility for pushing surfaces, and remove the deadlock completely. If you need a separate thread downstream, you can insert a queue between vaapidecode and its downstream to get one. Another justification for the single thread implementation is, there are two many point of locking in gstreamer-vaapi's current implementation which can lead to deadlocks. https://bugzilla.gnome.org/show_bug.cgi?id=742605 Signed-off-by: Simon Farnsworth <simon.farnsworth@onelan.co.uk> Signed-off-by: Víctor Manuel Jáquez Leal <victorx.jaquez@intel.com> Signed-off-by: Sreerenj Balachandran <sreerenj.balachandran@intel.com>
Simon, is it possible for you to do some testing with the "vaapidecodebin" in your specific environment?
(In reply to sreerenj from comment #56)
> Simon, is it possible for you to do some testing with the "vaapidecodebin" in your specific environment?
It should be possible later this week. I'll move vaapidecodebin stuff to bug 745216 unless I see major issues with it.
(In reply to Simon Farnsworth from comment #57)
> It should be possible later this week. I'll move vaapidecodebin stuff to bug 745216 unless I see major issues with it.
Thanks :) Let's close this.