GNOME Bugzilla – Bug 755169
dashdemux: can we have multiple seek events at the same time?
Last modified: 2015-10-29 11:23:37 UTC
I don't know if 2 seek events can be received at the same time from 2 different src pads. I see no reason why not. If it can happen, there is a problem. Stopping of stream tasks is done by gst_adaptive_demux_stop_tasks and is done without the GST_MANIFEST_LOCK taken. The gst_adaptive_demux_stop_tasks is called when a seek event is received on a src pad. If 2 seek events are received at the same time, 2 threads will execute gst_adaptive_demux_stop_tasks at the same time, both trying to delete the same list of tasks. I believe event handling should be serialised using GST_MANIFEST_LOCK, but this will deadlock because during seek adaptive demux will flush all src pads, so it will try to flush the element from which the second seek is coming. That element would probably hold some locks and want to wait to finish the seek operation first, so both elements will deadlock. Can we have 2 seeks at the same time?
Yes
Created attachment 311765 [details] [review] lock access to _stop_tasks This seems to work when sending several seek requests at once (from different threads). However, the original also works with that test seeking code (ie, not deadlock now, but also no errors/crashes before). I have the feeling that after seeking, the stream tends to temporarily pause more from time to time, which may hint at some desync. Also happens with the original code, so unlileky to be linked to this patch.
Review of attachment 311765 [details] [review]: ::: gst-libs/gst/adaptivedemux/gstadaptivedemux.c @@ +2282,3 @@ gst_task_stop (stream->download_task); if (gst_adaptive_demux_combine_flows (demux) == GST_FLOW_EOS) { + GST_MANIFEST_LOCK (demux); this is not needed, the GST_MANIFEST_LOCK is already taken at this moment. The whole "switch (ret) {" is protected by the lock.
Created attachment 311778 [details] [review] lock access to _stop_tasks Indeed, fixed. That's some unexpected mutex use.
(In reply to Vincent Penquerc'h from comment #4) > Created attachment 311778 [details] [review] [review] > lock access to _stop_tasks > > Indeed, fixed. That's some unexpected mutex use. This solution solves the problem of stopping tasks from 2 different threads. But this is not enough. The manifest lock must be taken at a higher level. The gst_adaptive_demux_stop_tasks is just a step in a sequence of steps that must all be guarded by the manifest lock. For example, during a seek event handling (gst_adaptive_demux_src_event) the manifest lock is taken for the call of gst_adaptive_demux_can_seek. But is not taken for calls to gst_adaptive_demux_is_live, gst_adaptive_demux_get_live_seek_range or even worse gst_segment_copy_into. The gst_adaptive_demux_src_event function makes several accesses to demux object (using the above mentioned functions). In my opinion, the manifest lock should guard the whole "case GST_EVENT_SEEK:" block.
I agree that it seems wrong to lock gst_adaptive_demux_can_seek and not lock the next gst_adaptive_demux_is_live and other manifest using code, given how the former is implemented. I'm less sure about the other object access (ie, the segment use you mentioned above). Some uses of the segment are already protected by the manifest lock, but there are others which do not seem to be (eg, gst_adaptive_demux_stream_push_buffer), and I'd expect the object lock to be used for such, though the code doesn't seem to use it in that way, it seems only used for cancelled, last_ret, and streams, though not always. So I guess I'll try to use the manifest lock for this and test...
(In reply to Vincent Penquerc'h from comment #6) > I agree that it seems wrong to lock gst_adaptive_demux_can_seek and not lock > the next gst_adaptive_demux_is_live and other manifest using code, given how > the former is implemented. I'm less sure about the other object access (ie, > the segment use you mentioned above). Some uses of the segment are already > protected by the manifest lock, but there are others which do not seem to be > (eg, gst_adaptive_demux_stream_push_buffer), and I'd expect the object lock > to be used for such, though the code doesn't seem to use it in that way, it > seems only used for cancelled, last_ret, and streams, though not always. So > I guess I'll try to use the manifest lock for this and test... I would use the manifest lock to create critical sections for a set of steps that try to reconfigure the demux object and must be performed together. For example configuration of streams when a manifest update is received or a period end is detected, or the reconfiguration of demux object when a seek is requested, or a flush event is handled What is not clear to me is when the GST_OBJECT_LOCK (demux) should be used. It seems to guard the demux->cancelled member, but not all accesses to it are done with the GST_OBJECT_LOCK taken, so probably this is another bug.
Created attachment 311848 [details] [review] lock most seeking event code with the manifest lock This seems to work fine. The manifest lock is temporarily dropped in some cases, see details in the commit message. This works fine with the python test case from Alex Ashley. This leaves a number of other uses elsewhere that may need locking.
Review of attachment 311848 [details] [review]: ::: gst-libs/gst/adaptivedemux/gstadaptivedemux.c @@ +1359,3 @@ } + GST_MANIFEST_UNLOCK (demux); this is wrong. One of the reasons of holding the lock while handling the seek event was to make sure only one thread performs streams cleanup. If you release the lock here, you access the demux->streams without a lock taken. You need to release the manifest lock right before the gst_task_join call and take it back after the call. This will keep accesses to demux->streams protected by the lock and will allow the gst_adaptive_demux_stream_download_loop to get the lock if needed in its way to exit. But now you have allowed another thread to perform a second seek operation while the first one is still in progress. In order to avoid this, you need a new lock to be taken and hold for the duration of the whole seek processing. Basically, an API level lock to serialize API calls.
Hmm, you're right. I should really have seen the stream access after the join, that wasn't very clever :D I'll post an update soon.
Created attachment 311852 [details] [review] ock most seeking event code with the manifest lock Note that I duplicated the loop joining since I'm not sure the streams iterator will always still be valid after temporarily dropping the manifest lock.
Review of attachment 311852 [details] [review]: ::: gst-libs/gst/adaptivedemux/gstadaptivedemux.c @@ +1371,3 @@ } + GST_MANIFEST_UNLOCK (demux); you cannot do this because you still access demux->streams without a lock taken. For a multithreaded application you simply are not allowed to access a shared variable if you do not hold a lock that is used by all accesses to it. For example, imagine that you release the lock and while you are getting the iter to point to a stream, another thread comes and erases the whole demux->streams list. Next, your thread does GstAdaptiveDemuxStream *stream = iter->data; stream->download_error_count = 0; Bang. Illegal memory access because stream points to a memory no longer valid.
Sure, but we're getting into another problem here. I don't think it's good to wait to have fixed everything that can be connected to this. We'll end up with one gigantic patch in a few months' time. So while you're right, that demux->streams use was broken (in the locking sense) in the first place, so I wouldn't care about that here, and then go on to fix the demux->streams locking separately. Because it's likely fixing demux->streams will also link to other missing locking elsewhere, etc. Unless you know it's not if you already looked at demux->streams locking ?
A possible way to fix this might be to use a copy of the list (made with the lock). We'd have to make _GstAdaptiveDemuxStream refcounted though, as it is not currently. Then do the _join and reset (and a new unref) after unlock.
(In reply to Vincent Penquerc'h from comment #14) > A possible way to fix this might be to use a copy of the list (made with the > lock). We'd have to make _GstAdaptiveDemuxStream refcounted though, as it is > not currently. Then do the _join and reset (and a new unref) after unlock. That's not a solution. What kind of copy will you make? Shallow copy? Deep copy? The _GstAdaptiveDemuxStream holds pointers. How would you copy those? It becomes too complicated because you would need to refcount everything.
I was thinking of copying the list only, not the _GstAdaptiveDemuxStream objects themselves. The refcounting is here to delay the freeing, in case _reset or _expose_pads happens in the time where we released the manifest lock.
(In reply to Vincent Penquerc'h from comment #13) > Sure, but we're getting into another problem here. I don't think it's good > to wait to have fixed everything that can be connected to this. We'll end up > with one gigantic patch in a few months' time. So while you're right, that > demux->streams use was broken (in the locking sense) in the first place, so > I wouldn't care about that here, and then go on to fix the demux->streams > locking separately. Because it's likely fixing demux->streams will also link > to other missing locking elsewhere, etc. Unless you know it's not if you > already looked at demux->streams locking ? I propose that in this ticket we fix the concurrent seek problem. For this, adding a new lock and holding it for the whole duration of handling seek event should be enough. I've raised several bugs related to locks in adaptive demux. In my opinion: 1. we need to review and reduce the usage of demux->cancelled flag, because it is broken. See https://bugzilla.gnome.org/show_bug.cgi?id=755230 2. Add a new lock for API level serialization. We initially use it for seek events, and then reuse it for any event handling (manifest lock was supposed to be used for this, but because it is used in the download_loop we cannot use it) 3. Document all the existing locks and their purpose. Document what data is shared and between which threads. Currently we have 3 categories: a) _src_chain handling streaming; b) download_loop handling stream configuration; c) external calls handling events. Each of those have differnet needs and I don't have a clear picture about where those needs intersect. 4. Review the code to make sure that each shared variable is properly locked before being used
(In reply to Vincent Penquerc'h from comment #16) > I was thinking of copying the list only, not the _GstAdaptiveDemuxStream > objects themselves. The refcounting is here to delay the freeing, in case > _reset or _expose_pads happens in the time where we released the manifest > lock. If you only copy the list, you will have a list containing as data pointers to streams that no longer exist. The stream->download_error_count = 0; that is called after gst_task_join will crash.
I was implying that the copy of the list would ref the streams it contains, though I did not say it. This is why I had suggested adding refcounting to that struct. They'd be unreffed at the end of the function. So stream would still be valid when stream->download_error_count = 0; is called (and then be freed just after that, in case demux->streams was destroyed in the meantime). In this case, "destroyed" changes from "remove pad and free" or "remove pad and schedule for free" to "unref". The remove pad and free/schedule for free would be moved into the dispose function, from where they are now (places which would now unref). It's not clear to me at the moment whether the remove pad should be kept at the same place, or deferred as well.
(In reply to Vincent Penquerc'h from comment #19) > I was implying that the copy of the list would ref the streams it contains, > though I did not say it. This is why I had suggested adding refcounting to > that struct. They'd be unreffed at the end of the function. So stream would > still be valid when stream->download_error_count = 0; is called (and then be > freed just after that, in case demux->streams was destroyed in the > meantime). In this case, "destroyed" changes from "remove pad and free" or > "remove pad and schedule for free" to "unref". The remove pad and > free/schedule for free would be moved into the dispose function, from where > they are now (places which would now unref). It's not clear to me at the > moment whether the remove pad should be kept at the same place, or deferred > as well. That would work, but you still have another problem: when releasing the lock, you allow a second seek thread to reach the gst_task_join. So you would have 2 threads, both trying to call gst_task_join (serialized using manifest lock, but this is not important here).
I'm not sure what you refer by "a second seek thread" in this context. If you mean a second seek request from another thread, this is taken care of by the new top level lock you proposed above. If you meant something else, what ? Though _stop_tasks is also called by _reset, which is itself called from a flush-stop event, or PAUSED->READY state change, so these would also have to be protected by that tpo level lock. But I think the rabbit hole stops after that.
Created attachment 311886 [details] [review] lock most seeking event code with the manifest lock
Review of attachment 311886 [details] [review]: ::: gst-libs/gst/adaptivedemux/gstadaptivedemux.c @@ +1387,3 @@ + } + + GST_MANIFEST_UNLOCK (demux); copying the list and refcounting the streams does not solve all problems and I don't think it could be a solution gst_adaptive_demux_stream_download_loop can call gst_adaptive_demux_advance_period which calls gst_adaptive_demux_expose_streams which sets old_streams = demux->streams; demux->streams = demux->next_streams; If you release the lock in gst_adaptive_demux_stop_tasks, when you join tasks you are killing streams that were moved to old_streams!
My solution is the following: 1. gst_adaptive_demux_src_event grabs the api_lock so that it is the only one doing cleanup 2. for each stream to stop, gst_adaptive_demux_stop_tasks grabs a new lock stream->lock and sets a new flag stream->cancelled to tell the download loop to abort 3. very important: download_loop grabs the stream->lock every time it wants to read or write the stream object. Which should be almost all its life, except when blocking in conditions 4. very important: every time download_loop grabs the stream->lock , it checks for the stream->cancelled flag and if set, it exits. The 3 and 4 steps allow us to safely ask the download_loop to exit and wait for it to do so. We still need to see how to protect write accesses done by download_loop on the demux object (things like demux->priv->old_streams = NULL; or gst_adaptive_demux_advance_period (demux); ) My feeling is that manifest lock or GST_OBJECT_LOCK (demux) should guard all accesses to demux object and stream->lock should guard all accesses to stream object.
(In reply to Florin Apostol from comment #23) > gst_adaptive_demux_stream_download_loop can call > gst_adaptive_demux_advance_period which calls > gst_adaptive_demux_expose_streams which sets old_streams = demux->streams; > demux->streams = demux->next_streams; > If you release the lock in gst_adaptive_demux_stop_tasks, when you join > tasks you are killing streams that were moved to old_streams! OK, I see how that can happen, because _download_loop earlies out on cancelled just before thw switch, while locking it with the object lock, but not the manifest lock, so there's an opening left. I'm not sure adding yet another lock is a good idea though. I understand your idea, but I'm concerned it will make future changes even harder. What do you think about the patch below (on top of my last one here), which takes care of the issue you mentioned I think, since the manifest lock is held all the time between checking cancelled and calling _expose_streams) and then changing the use of the object lock with the manifest lock (I've not tried it yet, but I think this'd be needed because I'm locking manifest-then-object below, which is the wrong order, and thus possibly racy, but I can't use the other order since I need to unlock object and keep manifest - something which would be solved by merging both and would also get rid of a lock, making things a bit simpler). diff --git a/gst-libs/gst/adaptivedemux/gstadaptivedemux.c b/gst-libs/gst/adaptivedemux/gstadaptivedemux.c index a3f708a..4237262 100644 --- a/gst-libs/gst/adaptivedemux/gstadaptivedemux.c +++ b/gst-libs/gst/adaptivedemux/gstadaptivedemux.c @@ -2296,14 +2296,16 @@ gst_adaptive_demux_stream_download_loop (GstAdaptiveDemuxStream * stream) next_download = g_get_monotonic_time (); ret = gst_adaptive_demux_stream_download_fragment (stream); + GST_MANIFEST_LOCK (demux); GST_OBJECT_LOCK (demux); if (demux->cancelled) { stream->last_ret = GST_FLOW_FLUSHING; - goto cancelled; + GST_OBJECT_UNLOCK (demux); + GST_MANIFEST_UNLOCK (demux); + goto cancelled_unlocked; } GST_OBJECT_UNLOCK (demux); - GST_MANIFEST_LOCK (demux); } else { stream->last_ret = ret; }
(In reply to Vincent Penquerc'h from comment #25) > I'm not sure adding yet another lock is a good idea though. I understand > your idea, but I'm concerned it will make future changes even harder. > We need to clarify the use of mutexes. For each shared variable (stream, demux) we should document what mutex protects it. Because now we protect the demux->cancelled sometimes with GST_OBJECT_LOCK (demux) sometimes with GST_MANIFEST_LOCK (demux). And this is not good. > What do you think about the patch below (on top of my last one here), Can you provide a set of patches that should be applied on mainline (or push your tree somewhere I can see it? It is hard for me to figure out the complete solution you propose. For example do you still intend to keep the copy of the streams list? If yes, why? Because I think we should not need it and it complicates things very much.
I pushed a patch for this to my tree to https://git.collabora.com/cgit/user/vincent/gst-plugins-bad/log/?h=dash. The object lock is now not used anymore. If you had got it earlier, I pushed -f as some earlier commits were amended after review. About the list copy, I believe the issue is still here so needs further changes.
Review of attachment 311886 [details] [review]: 1. for each function that must be called with the manifest lock taken, please add a comment to say so. Until now we have gst_adaptive_demux_stream_free, gst_adaptive_demux_stop_tasks, gst_adaptive_demux_expose_streams, gst_adaptive_demux_advance_period, gst_adaptive_demux_has_next_period, gst_adaptive_demux_combine_flows 2. GST_EVENT_RECONFIGURE in gst_adaptive_demux_src_event also needs to grab demux->priv->api_lock. It uses the demux->streams that can be changed by seek event. 3. gst_adaptive_demux_handle_message uses demux->streams. It needs to grab demux->priv->api_lock and the manifest lock 4. gst_adaptive_demux_push_src_event iterates demux->streams. But the callers of gst_adaptive_demux_push_src_event specifically unlock the manifest before calling it. And because the first call is done from the gst_adaptive_demux_src_event case GST_EVENT_SEEK before gst_adaptive_demux_stop_tasks, the download_loop is running and allowed to change the demux->streams, which could invalidate the iterator used in gst_adaptive_demux_push_src_event. I believe the first call of gst_adaptive_demux_push_src_event (when gst_event_new_flush_start is sent) should be done with the manifest lock taken (I know this is dangerous and will deadlock if the element receiving the flush start decides to call back into adaptivedemux, but I cannot see any reason for it to do that). Alternatively, the tasks should be stopped before sending the flush_start, but I can't evaluate the consequences of that. ::: gst-libs/gst/adaptivedemux/gstadaptivedemux.c @@ +410,2 @@ gst_adaptive_demux_reset (demux); + g_mutex_unlock (&demux->priv->api_lock); api_lock should protect the whole gst_adaptive_demux_change_state function (including the parent class call result = GST_ELEMENT_CLASS (parent_class)->change_state (element, transition);) If it does not, the parent class could use virtual functions to make changes in demux object, so you did not properly serialized 2 change_state calls. @@ +429,3 @@ switch (event->type) { case GST_EVENT_FLUSH_STOP: + g_mutex_lock (&demux->priv->api_lock); same as above, the whole function should be serialized There might be other similar places with the same problem. @@ +910,2 @@ stream = g_malloc0 (demux->stream_struct_size); + g_print ("XX Alloc %p\n", stream); change g_print to GST_DEBUG if you want to keep those messages. @@ +1391,3 @@ GstAdaptiveDemuxStream *stream = iter->data; gst_task_join (stream->download_task); I don't think the copy is needed. download_loop was instructed to exit (by demux->cancelled). As soon as download_loop will try to reacquire the manifest lock (needed for any access to demux->streams), it will check for cancelled flag and will exit. It will have no chance to change the demux->streams list. And if by some chance it manages to change it in its way to exit, copying the list introduces this problem: here you join the task. But later, when the stream is freed, the gst_mini_object_unref will call gst_adaptive_demux_stream_free which will also join the task. So you cannot allow the demux->streams list to change while the manifest lock is released here. Which means the copy is unnecessary. @@ +2184,1 @@ /* Need to unlock as it might post messages to the bus */ here the manifest is unlocked while the streams are unreffed. But gst_adaptive_demux_reset has the same code and there the manifest is not unlocked. Also in all other places where gst_mini_object_unref is used the manifest is not unlocked. Personally I don't see why we should unlock. So what if it post messages on the bus? Can you explain the reason for the copy?
(In reply to Vincent Penquerc'h from comment #27) > I pushed a patch for this to my tree to > https://git.collabora.com/cgit/user/vincent/gst-plugins-bad/log/?h=dash. The > object lock is now not used anymore. > > If you had got it earlier, I pushed -f as some earlier commits were amended > after review. > > About the list copy, I believe the issue is still here so needs further > changes. The top patch on that tree (87db55f5092e9d0d03d6c0eb644bfd92ce061c11) is not the same with the one you posted here. Why do you keep the manifest lock while calling gst_adaptive_demux_stream_download_wait? It will block any attempts to change the demux until the timer expires or the download finishes. gst_adaptive_demux_stream_free and gst_adaptive_demux_stop_tasks cannot signal the condition because they will need to get the manifest lock first!
Do you agree in principle with the merging of the object and manifest locks, as a prerequsite to building upon this change (that's the top patch in the tree, which I made a separate patch for now for clarity) ?
(In reply to Vincent Penquerc'h from comment #30) > Do you agree in principle with the merging of the object and manifest locks, > as a prerequsite to building upon this change (that's the top patch in the > tree, which I made a separate patch for now for clarity) ? It's hard for me to understand what is the set of patches you propose for this ticket. I assume they are the last 3 patches in your tree (17a4a09cc82a43c26f2bc95bd303ef420e2930ba, ec001f4e8511e57fdae7ed6eddf1f651fbf9ab78, 87db55f5092e9d0d03d6c0eb644bfd92ce061c11) I have just reviewed the last 2 and had some comments (see above comments 28 and 29). Sorry, but I don't understand what do you mean by "which I made a separate patch for now for clarity". For clarity, please propose a set of patches (marked for example 1/3, 2/3, 3/3) that should be applied on the current master. Thus everyone can see what changes you propose and in which order. Regarding "the merging of the object and manifest locks": I believe we should clearly document what is the purpose of each lock. At this moment, I can't tell when the demux object lock should be used. I understand the reason for api_lock, I would guess that manifest lock should be used whenever demux object is accessed, but I cannot say when demux object lock should be taken.
The penultimate (ec001f4e8511e57fdae7ed6eddf1f651fbf9ab78, call it 1/2) is the main patch for this bug, and is the one that is already posted here. The top patch (87db55f5092e9d0d03d6c0eb644bfd92ce061c11, call it 2/2) applies on top of 1/2, and is separate because it merges the manifest and object locks, as I mentioned, and I did not want to merge those two patches before you had had a look at whether you agreed with the apprach. I want to know this before working more on this, because I want to know whether to work on top of just 1/2, or on top of 1/2 + 2/2. If you agree with the idea (maybe not the details, I don't really care, as this can be changed later) of 2/2 then I will merge them. 17a4a09cc82a43c26f2bc95bd303ef420e2930ba was made for another bug, but it is relevant here, so I'll include it too. As you requested, I made a new branch on master with just those patches: https://git.collabora.com/cgit/user/vincent/gst-plugins-bad/log/?h=dash-locking
And by "which I made a separate patch for now for clarity", I meant "so the last patch only contains the changes I was asking about (the merging of the manifest and object locks)".
(In reply to Vincent Penquerc'h from comment #32) > The penultimate (ec001f4e8511e57fdae7ed6eddf1f651fbf9ab78, call it 1/2) is > the main patch for this bug, and is the one that is already posted here. The > top patch (87db55f5092e9d0d03d6c0eb644bfd92ce061c11, call it 2/2) applies on > top of 1/2, and is separate because it merges the manifest and object locks, > as I mentioned, and I did not want to merge those two patches before you had > had a look at whether you agreed with the apprach. I want to know this > before working more on this, because I want to know whether to work on top > of just 1/2, or on top of 1/2 + 2/2. If you agree with the idea (maybe not > the details, I don't really care, as this can be changed later) of 2/2 then > I will merge them. Now I get it. Yes, I agree with merging manifest lock and demux object lock and now I see that you removed the demux object lock completely and use only one lock (manifest lock) for locking the demux object. Which is good and a step into the right direction. Traditionally gstreamer accesses to an object are protected by taking that object lock, so I propose to not confuse other readers and use the same approach in adaptivedemux: use the demux object lock. So I propose you start from patch 2/2 and do a find and replace and change GST_MANIFEST_LOCK (demux) with GST_OBJECT_LOCK (demux). In this way we will have only one lock for the demux (its object lock) and the manifest lock will be unused and can be removed. Next we can start discussing the observations I had in comments 28 and 29. In terms of organizing patches, I don't care how many they are. I always apply the whole set on master and review the resulting code. So feel free to organize your changes how you want. But I suggest that each patch contains a set of changes that can be applied on master without breaking the product.
Great. I will keep working on top of master here. I agree with removing manifest lock entirely and using the object lock instead. I'll go through your earlier comments once I've done this change.
(In reply to Florin Apostol from comment #34) > (In reply to Vincent Penquerc'h from comment #32) > Traditionally gstreamer accesses to an object are protected by taking that > object lock, so I propose to not confuse other readers and use the same > approach in adaptivedemux: use the demux object lock. So I propose you start This unfortunately runs into a problem, as the object lock is taken by various other code, outwith adaptivedemux, leading to deadlocks when, eg, posting a message (the object lock is taken by gst_element_post_message_default). Thus, we either need to keep using the manifest lock here, or temporarily unlock the object lock. This also explains why there's a temp unlock where you said: > Personally I don't see why we should unlock. So what if it post messages on the bus? It would not be necessary to unlock if we were using the manifest lock (as I don't think any recursive calls into adaptivedemux by way of messages are expected).
(In reply to Vincent Penquerc'h from comment #36) > (In reply to Florin Apostol from comment #34) > > (In reply to Vincent Penquerc'h from comment #32) > > Traditionally gstreamer accesses to an object are protected by taking that > > object lock, so I propose to not confuse other readers and use the same > > approach in adaptivedemux: use the demux object lock. So I propose you start > > This unfortunately runs into a problem, as the object lock is taken by > various other code, outwith adaptivedemux, leading to deadlocks when, eg, > posting a message (the object lock is taken by > gst_element_post_message_default). > > Thus, we either need to keep using the manifest lock here, or temporarily > unlock the object lock. Using the manifest lock will not solve the problem. gst_adaptive_demux_handle_message will need to get the manifest lock to access the demux->streams (currently it doesn't and that's a bug), so you run into the same deadlock. The solution is to temporarily unlock the object lock, but for each of those cases we need to see what are the possible consequences. > > This also explains why there's a temp unlock where you said: > > Personally I don't see why we should unlock. So what if it post messages on the bus? > > It would not be necessary to unlock if we were using the manifest lock (as I > don't think any recursive calls into adaptivedemux by way of messages are > expected). It will be necessary because you need to protect gst_adaptive_demux_handle_message. Try to grab the manifest lock there and I think it will deadlock.
(In reply to Florin Apostol from comment #37) > Using the manifest lock will not solve the problem. > gst_adaptive_demux_handle_message will need to get the manifest lock to > access the demux->streams (currently it doesn't and that's a bug), so you It doesn't need to take the lock before bailing out if the message source is the demux itself, as AFAICT it doesn't need to send messages to itself, just listen to messages from the sources.
Created attachment 312064 [details] [review] lock most seeking event code with the manifest lock That addresses most of your comments, with the following exceptions: - the main lock is the manifest lock, rather than the object lock, but reason above. I think it's fine, unless you find another reason why it's not. - the list copy is still here. I think _expose_streams can be called still while the manifest lock is not held. I will look more into it. - why do you think _stream_free needs the manifest lock ?
I can get deadlocks with that last patch when repeatedly seeking.
Review of attachment 312064 [details] [review]: these ones found at a quick review. I'll start a more thorough review now. ::: gst-libs/gst/adaptivedemux/gstadaptivedemux.c @@ +127,3 @@ guint32 segment_seqnum; + + GRecMutex api_lock; why does it need to be recursive? @@ +561,1 @@ return gst_pad_event_default (pad, parent, event); api_lock should be released after gst_pad_event_default @@ +2182,3 @@ if (GST_CLOCK_TIME_IS_VALID (demux->segment.stop) && stream->segment.position >= stream->segment.stop) { + GST_MANIFEST_LOCK (demux); this should be unlock @@ +2190,3 @@ if (GST_CLOCK_TIME_IS_VALID (demux->segment.start) && stream->segment.position < stream->segment.start) { + GST_MANIFEST_LOCK (demux); this should be unlock
Created attachment 312124 [details] [review] lock most seeking event code with the manifest lock Bugfix, but still getting deadlocks.
The API lock needs to be recursive because we end up in two event handlers at one point, in the same thread. I can't recall which exactly, but I could rechange the code to reproduce the lock if you want. The api lock is released just before that line. I posted a fix for the lock/unlock mismatch above.
(In reply to Vincent Penquerc'h from comment #43) > The API lock needs to be recursive because we end up in two event handlers > at one point, in the same thread. I can't recall which exactly, but I could > rechange the code to reproduce the lock if you want. > A recursive mutex is dangerous. We really should understand why it is needed. I think it can be avoided. The original developer avoided it in one scenario using demux->priv->exposing. Maybe we can do this here too, for the particular scenario that caused you problems. > The api lock is released just before that line. I can't identify that place. Can you recall the scenario? Don't need to prove it, just describe it.
(In reply to Florin Apostol from comment #44) > (In reply to Vincent Penquerc'h from comment #43) > > The API lock needs to be recursive because we end up in two event handlers > > at one point, in the same thread. I can't recall which exactly, but I could > > rechange the code to reproduce the lock if you want. > > > A recursive mutex is dangerous. We really should understand why it is > needed. I think it can be avoided. The original developer avoided it in one > scenario using demux->priv->exposing. Maybe we can do this here too, for the > particular scenario that caused you problems. Something like this may be possible, I'll have a look. > > The api lock is released just before that line. > I can't identify that place. Can you recall the scenario? Don't need to > prove it, just describe it. I can't recall just now, but I'll reproduce it and post. That line about api lock was a reply to "api_lock should be released after gst_pad_event_default", though: g_rec_mutex_unlock (&demux->priv->api_lock); return gst_pad_event_default (pad, parent, event); } I may be slow to reply today, I seem to be not getting any bugzilla mail notifications.
(In reply to Florin Apostol from comment #44) > A recursive mutex is dangerous. We really should understand why it is > needed. I think it can be avoided. The original developer avoided it in one > scenario using demux->priv->exposing. Maybe we can do this here too, for the > particular scenario that caused you problems. That particular sequence of events leading this this reentrant call happens when starting playback: gst_adaptive_demux_sink_event is called with GST_EVENT_EOS, ends up calling _expose_streams, which adds pads. Pads get link, and a RECONFIGURE event gets emitted, calling gst_adaptive_demux_src_event. This happens when exposing, and the exposing flag is TRUE. Using that flag to avoid locking/unlocking happens to work.
Created attachment 312126 [details] [review] lock most seeking event code with the manifest lock Use a non recursive mutex, and avoid locking when exposing is TRUE instead. Deadlocks still present.
> Deadlocks still present. Are you having trouble identifying deadlocks? I have a patch that instruments the code and detects them.
I'm having trouble seeing what the reason is for the deadlock is. I can see which locks are taken in gdb (it's the manifest lock, and a wait in gst_adaptive_demux_stream_download_uri which does not end). What I'm missing is what should have happen differently when before that, which requires a working knowledge of how all the moving parts in this element interact.
Created attachment 312136 [details] [review] added suppoort for output indented per thread and deadlock detection apply this on gstreamer repository to get output indentation and thread deadlock detection
Created attachment 312137 [details] [review] adde trace support in adaptive demux add this in gst-plugins-bad to get increased verbosity for adaptive demux messages and more important, logs about when locks are being acquired and released.
(In reply to Vincent Penquerc'h from comment #49) > I'm having trouble seeing what the reason is for the deadlock is. I can see > which locks are taken in gdb (it's the manifest lock, and a wait in > gst_adaptive_demux_stream_download_uri which does not end). What I'm missing > is what should have happen differently when before that, which requires a > working knowledge of how all the moving parts in this element interact. I've added 2 patches. They should allow you to see the deadlock and what the threads were doing up to that point. Feel free to remove unnecessary warning messages and add your own.
Review of attachment 312126 [details] [review]: 1. comment in gst_adaptive_demux_stream_download_uri is wrong saying the function should be called with fragment_download_lock. It shouldn't because the function grabs it. ::: gst-libs/gst/adaptivedemux/gstadaptivedemux.c @@ +1069,3 @@ + if (!demux->priv->exposing) + g_mutex_lock (&demux->priv->api_lock); + this has 2 problems. First, the exposing condition should be used only for RECONFIGURE event. All others need to be serialized using the api_lock. Otherwise, while one thread is exposing, a second seek gets a free ticket to bypass the locks and mess the data :) Second, to avoid the free ticket problem also for reconfigure events, you need to determine if this is the thread holding the lock (and doing the exposure) or not. And to determine that, you need the mutex to be reentrant. I suggest to add a reentrant lock demux->private->exposing_lock. The gst_adaptive_demux_expose_streams function, while holding the api_lock and the manifest_lock will also hold exposing_lock for the duration between demux->priv->exposing = TRUE; and demux->priv->exposing = FALSE; The gst_adaptive_demux_src_event function, for GST_EVENT_RECONFIGURE will grab this exposing_lock, read the exposing value and release the lock. If the exposing was TRUE, it is the same thread and it can bypass getting the api_lock and manifest lock. If exposing was false, this is a random RECONFIGURE event received outside exposing, so it must get api_lock and manifest lock. @@ +2862,3 @@ } +/* with manifest_lock */ comment is wrong. Function is called with the lock not taken. But maybe it should be called with the lock taken and temporarily release it while sleeping Also the function should call g_cond_wait_until in a while loop and check for demux->cancelled. The code is similar to the one from gst_adaptive_demux_stream_download_loop (search for g_cond_wait_until) but there the demux->cancelled is checked.
Thanks, it was helpful. I understand the deadlock now: A seek event causes _stop_tasks to be called. The manifest lock is held during this. This causes the source elements to be stopped. The source's pad stream lock is taken when its task stops. In another thread, a buffer is being pushed to adaptivedemux from the source. gst_adaptive_demux_stream_push_buffer takes the manifest lock. This deadlocks. I added the manifest lock in gst_adaptive_demux_stream_push_buffer to protect calls to gst_adaptive_demux_stream_get_presentation_offset and gst_adaptive_demux_get_period_start_time, as well as uses of the demux object (segment field). This is complicated by the fact there is a second deadlock (so my original try did not work, but I did not realize at the time I had been hitting a second deadlock): A source pushes a buffer. This triggers gst_qtdemux_check_seekability to send a SEEKING query to check for seekability. This takes the api lock in gst_adaptive_demux_src_query Another thread seeks, which ends up taking the api lock in gst_adaptive_demux_src_event, then pauses the task that is busy pushing. Taking out both: - the api lock from gst_adaptive_demux_src_query - the manifest lock from gst_adaptive_demux_stream_push_buffer Fixes all deadlocks as far as I can tell. This seems like a good argument to switch to the object lock instead of the manifest lock, since we deadlock with the streaming thread here. This might still cause trouble with the api lock when querying though.
Review of attachment 312126 [details] [review]: 1. gst_adaptive_demux_stream_free function: g_cond_signal (&stream->fragment_download_cond); gst_task_stop (stream->download_task); The g_cond_signal call might have no effect if not issued with the lock taken and after setting a flag that the waiting thread will watch for. The waiting thread should sleep in a while loop that checks for the flag. See https://developer.gnome.org/glib/stable/glib-Threads.html#g-cond-wait-until for a reason and example. ::: gst-libs/gst/adaptivedemux/gstadaptivedemux.c @@ +1276,3 @@ + g_mutex_lock (&demux->priv->api_lock); + this deadlocks on the following scenario: manifest received, gst_adaptive_demux_expose_streams called. gst_adaptive_demux_expose_stream will create src pads. I have a callback registered so that when dash creates a src pad, I create an AppSink element and link them. During linking, gst_element_link will call gst_adaptive_demux_src_query which will try to grab api_lock. It deadlocks because the api_lock is still taken. So, queries can happen from the same tread, while creating pads. Need a reentrant mutex here too, similar to the exposing one
please replace g_mutex_lock (&demux->priv->api_lock) with some macro like GST_MANIFEST_LOCK. Something like GST_ADAPTIVE_DEMUX_API_LOCK (demux); It will make it easier to use my deadlock detection mechanism. Thanks.
Why not just have the api lock recursive as I had earlier ? That's already two exceptions, and the first one adding yet another lock. While it might mean we allow some other nested event/query instead of deadlocking, it doesn't seem like a large problem, and we avoid the extra complexity.
(In reply to Vincent Penquerc'h from comment #57) > Why not just have the api lock recursive as I had earlier ? That's already > two exceptions, and the first one adding yet another lock. While it might > mean we allow some other nested event/query instead of deadlocking, it > doesn't seem like a large problem, and we avoid the extra complexity. The api_lock is intended to make sure that the operation (eg seek) is serialised. The operations protected by this lock should be the ones that do multiple changes on demux object (eg reconfigure, stopping tasks, changing all the streams, etc) and you want to make sure other threads do not interrupt until all the steps are done. Basically, it guarantees the invariant at the begging and at the end of the operation. But if the lock is recursive, it will fail to detect situations when you end up calling seek from another seek. If this happen, a thread might start a sequence of changes while it was already in a sequence of changes. Because demux uses virtual functions, it is hard to prove this cannot happen. This is why I want to avoid recursive mutexes and use them only to detect if that thread is already performing a sequence of operations or not. The recursive mutex is used to determine if the api_lock is already hold or not.
(In reply to Florin Apostol from comment #55) > ::: gst-libs/gst/adaptivedemux/gstadaptivedemux.c > @@ +1276,3 @@ > > + g_mutex_lock (&demux->priv->api_lock); > + > > this deadlocks on the following scenario: > > manifest received, gst_adaptive_demux_expose_streams called. > gst_adaptive_demux_expose_stream will create src pads. I have a callback > registered so that when dash creates a src pad, I create an AppSink element > and link them. During linking, gst_element_link will call > gst_adaptive_demux_src_query which will try to grab api_lock. It deadlocks > because the api_lock is still taken. > So, queries can happen from the same tread, while creating pads. Need a > reentrant mutex here too, similar to the exposing one A different one ? Or using the exposing lock would do here, since it's called by _expose_streams, and thus exposing is TRUE ?
This can deadlock anyway: A seek event is received, taking the api lock, and stopping the tasks. This waits on the streaming thread to stop, but the streaming thread might just be in the process of pushing a buffer which causes a seekability query to be sent, which would try to get the api lock, deadlocking. I've just had this happen here spamming the seek button.
Created attachment 312313 [details] [review] Locking overhaul Here's the latest, with a similar locking trick using a recursive mutex for _src_query. This does not seem to deadlock in button spamming tests.
Review of attachment 312313 [details] [review]: ::: gst-libs/gst/adaptivedemux/gstadaptivedemux.c @@ +1321,3 @@ + GST_ADAPTIVE_DEMUX_SEEKING_UNLOCK (demux); + if (use_lock) + GST_ADAPTIVE_DEMUX_API_LOCK (demux); this deadlocks in my scenario (explained in comment 55). In that case the query was called from gst_adaptive_demux_sink_event case GST_EVENT_EOS. You added protection only for cases when query is called from seek. By the way, how do you intend to prove that the system doesn't deadlock (for example there aren't other scenarios except seek and sink_event in which query can be called from the same thread so it should not get the API_LOCK)?
Created attachment 312440 [details] [review] Locking overhaul
(In reply to Florin Apostol from comment #62) > this deadlocks in my scenario (explained in comment 55). In that case the > query was called from gst_adaptive_demux_sink_event case GST_EVENT_EOS. You > added protection only for cases when query is called from seek. I added a flag and lock for this too. > By the way, how do you intend to prove that the system doesn't deadlock (for > example there aren't other scenarios except seek and sink_event in which > query can be called from the same thread so it should not get the API_LOCK)? I don't intend to prove this (assuming it can be proved).
Review of attachment 312440 [details] [review]: 1. the gst_adaptive_demux_src_event function, case GST_EVENT_RECONFIGURE still uses demux->priv->exposing. It should use use_lock flag. 2. gst_adaptive_demux_reset sets demux->priv->exposing = FALSE; It should hold the GST_ADAPTIVE_DEMUX_EXPOSING_LOCK when doing that. ::: gst-libs/gst/adaptivedemux/gstadaptivedemux.c @@ +455,3 @@ GstAdaptiveDemuxClass *demux_class; + GST_ADAPTIVE_DEMUX_API_LOCK (demux); don't need to lock it for SEGMENT events. @@ +1339,3 @@ + if (use_lock) + GST_ADAPTIVE_DEMUX_API_LOCK (demux); + besides EOS and SEEK, gst_adaptive_demux_expose_streams can also be called from gst_adaptive_demux_advance_period (called from gst_adaptive_demux_stream_download_loop) or from gst_adaptive_demux_stream_advance_fragment_unlocked (which can be called from _src_event). In those situations the system will deadlock when it will try to create the pads which call the gst_adaptive_demux_src_query.
> > By the way, how do you intend to prove that the system doesn't deadlock (for > > example there aren't other scenarios except seek and sink_event in which > > query can be called from the same thread so it should not get the API_LOCK)? > > I don't intend to prove this (assuming it can be proved). Well, I found 2 new scenarios (see comment 65). I believe that with a careful design, it can be proved.
Created attachment 312489 [details] [review] Locking overhaul
Review of attachment 312489 [details] [review]: ::: gst-libs/gst/adaptivedemux/gstadaptivedemux.c @@ +599,3 @@ + GST_ADAPTIVE_DEMUX_EOS_UNLOCK (demux); + + GST_ADAPTIVE_DEMUX_API_UNLOCK (demux); second API_UNLOCK call. You have another one 3 lines above
Created attachment 313199 [details] [review] Locking overhaul
Review of attachment 313199 [details] [review]: I see a few problems with this design: 1. You introduced several recursive locks (exposing, seeking, eos, src_event, dlloop_advance) to cope with the fact that some API functions (gst_adaptive_demux_src_query was identified until now) can be called from other API functions while the API or MANIFEST lock is taken. The problem with introducing a separate lock for each scenario is that we can never be sure that we covered all the scenarios and so the design cannot guarantee a no hang behavior. 2. There are 2 main locks being used (API and manifest). It is not documented (and not clear to me) what are they guarding. For example, if a variable is shared between threads, the same lock must be acquired for all accesses. But gst_adaptive_demux_updates_loop does not acquire any lock, gst_adaptive_demux_update_manifest will call klass->update_manifest without any lock, etc. Basically, every access to demux object must be guarded by the manifest lock and this is not happening in all the places.
Created attachment 313290 [details] [review] proposed fix this is my proposed solution for this ticket. It uses only 2 locks (API and MANIFEST) and I made sure all accesses to demux object are guarded by the manifest lock. The desing is documented at the beginning of gstadaptivedemux.c file
Thanks. It seems fine to me, fwiw. Certainly a lot better than the mutex forest.
my fix also covers the following bugs: https://bugzilla.gnome.org/show_bug.cgi?id=755120 https://bugzilla.gnome.org/show_bug.cgi?id=755121 https://bugzilla.gnome.org/show_bug.cgi?id=755226
(In reply to Vincent Penquerc'h from comment #72) > Thanks. It seems fine to me, fwiw. Certainly a lot better than the mutex > forest. In order to not confuse reviewers, maybe you should mark your patch obsolete so that it is removed from the list of available patches.
commit eab158a669685f872a7d54877c60d9bc6a26a9b5 Author: Florin Apostol <florin.apostol@oregan.net> Date: Wed Sep 16 16:46:29 2015 +0100 adaptivedemux: fixed multithread support https://bugzilla.gnome.org/show_bug.cgi?id=755169
*** Bug 756556 has been marked as a duplicate of this bug. ***