GNOME Bugzilla – Bug 763491
typefind: behavior has changed on have-type signal, sets pad caps after signal handlers
Last modified: 2016-04-29 08:14:32 UTC
We have an application that queries the caps of the typefind source pad in the context of the "have-type" signal callback. This worked well on 0.10 and on 1.4.5. We are currently updrading to 1.6.3 and we see that the behavior of gst_pad_query_caps have changed in this case, since the bug 735896. Before this commit, it was possible to query the pad caps because the object method handler was called first. So the caps were already set when our signal handler was called. A possible fix in our application is to use the caps provided in the signal. But since the code querying the caps is not directly in the handler but in some more generic code, we will have to add a specific behavior for this. Looking at the description and commit of bug 735896 it seems that being able to query the pad caps in the signal handler is incompatible with the aim of setting them after the signal handlers have been called. Do you see any way to allow both ?
You mean from the have-type callback you call gst_pad_get_current_caps() on the pad, and it does not have the caps yet? Yes that indeed seems incompatible with that change :) Let's think about this
Created attachment 323710 [details] [review] typefind: Store caps on the pad before emitting have-type but send it downstream only in the default signal handler
Romain, can you check if this patch solves your problem? Thanks
Created attachment 323745 [details] [review] typefind: Store caps on the pad before emitting have-type but send it downstream only in the default signal handler
Created attachment 323746 [details] [review] typefind: Store caps on the pad before emitting have-type but send it downstream only in the default signal handler
Created attachment 323749 [details] [review] typefind: Store caps on the pad before emitting have-type but send it downstream only in the default signal handler
Attachment 323749 [details] pushed as 0835c3d - typefind: Store caps on the pad before emitting have-type but send it downstream only in the default signal handler
commit a15fca193485ef8f9ae71fd44b8f794ba4132588 Author: Sebastian Dröge <sebastian@centricular.com> Date: Sat Mar 12 12:56:28 2016 +0200 Revert "typefind: Store caps on the pad before emitting have-type but send it downstream only in the default signal handler" This reverts commit 0835c3d6569dde0ec9e5524436367c7678cc4a4a. It causes deadlocks in decodebin, which currently would deadlock if the caps are already on the pad in have-type and are forwarded while copying the sticky events (while holding the decodebin lock)... as that might cause the next element to expose pads, which then calls back into decodebin and takes the decodebin lock. This needs some more thoughts.
It's not clear why decodebin takes the EXPOSE_LOCK in type_found() for calling analyze_new_pad, while in other places it does not (pad_added_cb).
Created attachment 323752 [details] [review] decodebin: Don't hold EXPOSE_LOCK in type_found() analyze_new_pad() might call back into decodebin and take the EXPOSE_LOCK another time, resulting in a deadlock.
Because commit ee44337fc3e3030a5155d28b3561af157e6c6003 Author: Vincent Penquerc'h <vincent.penquerch@collabora.co.uk> Date: Thu Aug 20 17:29:36 2015 +0100 decodebin: lock the expose lock around decode_chain use Helps with a crash in decodebin when quickly switching states. https://bugzilla.gnome.org/show_bug.cgi?id=752651
That commit is inconsistent though as mentioned above: sometimes analyze_new_pad() is called with expose lock, sometimes not.
For reference, IRC discussion / monologue: <thaytan> slomo_, I don't understand what's happened there in decodebin <thaytan> I thought the bug report was that typefind used to expose caps before-hand, and this was just restoring that behaviour <slomo_> thaytan: and then v added a change to decodebin that expected typefind to give the caps later <slomo_> thaytan: specifically his change expects that no element linked after the typefind in decodebin will directly add sometimes pads when started inside decodebin <slomo_> thaytan: the call stack is type_found() taking the expose lock -> analyze_new_pad() -> connect_pad() -> gst_pad_sticky_events_foreach() -> pad_added_cb() taking the expose lock <slomo_> thaytan: with typefind sending caps before type_found() (because then the caps are part of the sticky events, otherwise not) <slomo_> thaytan: imho the decodebin code is wrong. especially because the locking is inconsistent and the above can also happen in other cases in theory
Backtrace of the playbin-complex test for completeness:
+ Trace 236065
ee44337fc3e3030a5155d28b3561af157e6c6003 also introduces a lock order problem: type_found has EXPOSE_LOCK before STREAM_LOCK, pad_added_cb the other way around.
Created attachment 323769 [details] [review] decodebin: Don't hold EXPOSE_LOCK in type_found() outside the stream lock In other places we lock it the other way around, leading to possible deadlocks. Also this will deadlock if analyze_pad() causes a new element to be autoplugged that adds new pads on itself when its state is changed.
Created attachment 323770 [details] [review] decodebin: Don't check twice if the decode chain is complete in pad_added_cb() expose_pad() already does the same.
Created attachment 323771 [details] [review] decodebin: expose_pad() is always called with lock==TRUE, simplify code This basically reverts ee44337fc3e3030a5155d28b3561af157e6c6003 .
These patches should solve the problems, but I'm not sure if it doesn't introduce bug #752651 again. OTOH as part of bug #755260, more related changes were done that should keep bug #752651 solved.
Well, presumable as part of fixing bug #752651, a unit test was added that would catch any regression ;) I'm not clear on why this decodebin deadlock is visible now. Isn't the implication of the bug that typefind used to set caps this way in the not-too-distant past? Has decodebin locking changed since then?
First typefind was changed to only set caps after application (aka decodebin) have-type handlers were called. Then ee44337fc3e3030a5155d28b3561af157e6c6003 in decodebin happened, which would deadlock with the old typefind behaviour. That is, ee44337fc3e3030a5155d28b3561af157e6c6003 depended on the typefind change to work correctly. (Independent of that ee44337fc3e3030a5155d28b3561af157e6c6003 also introduces lock order problems and other inconsistencies) No unit test was added obviously ;) It's a crash that happened when running decodebin in a loop starting and shutting down all the time for a few minutes.
Makes sense. Let's put these in.
With the four patches in this bug, I got the original test case from https://bugzilla.gnome.org/show_bug.cgi?id=752623 to crash after 6 minutes (so still a fair bit better than it used to be):
+ Trace 236078
Thread 1 (Thread 0x7f7addac1700 (LWP 19288))
commit 57bbd18471c7a9f82cf33e2b270e95c73d27910c Author: Sebastian Dröge <sebastian@centricular.com> Date: Fri Mar 11 14:17:13 2016 +0200 typefind: Store caps on the pad before emitting have-type but send it downstream only in the default signal handler https://bugzilla.gnome.org/show_bug.cgi?id=763491 commit 916746e73165ca95a061bc2f975354524c8e3a57 Author: Sebastian Dröge <sebastian@centricular.com> Date: Sat Mar 12 19:47:47 2016 +0200 decodebin: expose_pad() is always called with lock==TRUE, simplify code This basically reverts ee44337fc3e3030a5155d28b3561af157e6c6003 . https://bugzilla.gnome.org/show_bug.cgi?id=763491 commit 65d09c1495ec9093ab95b7539671b48b3c4c9305 Author: Sebastian Dröge <sebastian@centricular.com> Date: Sat Mar 12 19:46:44 2016 +0200 decodebin: Don't check twice if the decode chain is complete in pad_added_cb() expose_pad() already does the same. https://bugzilla.gnome.org/show_bug.cgi?id=763491 commit 001c7f04a06e261cebd191a541c1f23d13e01022 Author: Sebastian Dröge <sebastian@centricular.com> Date: Sat Mar 12 19:45:26 2016 +0200 decodebin: Don't hold EXPOSE_LOCK in type_found() outside the stream lock In other places we lock it the other way around, leading to possible deadlocks. Also this will deadlock if analyze_pad() causes a new element to be autoplugged that adds new pads on itself when its state is changed. https://bugzilla.gnome.org/show_bug.cgi?id=763491
Created attachment 323868 [details] [review] switch elements to NULL before changing anything else With this fix from slomo, the test case has been running for 45 minutes without problems so far (it'd usually take 5 minutes to break).
Review of attachment 323868 [details] [review]: This is not enough. Chains can contain groups which contain more chains which can contain groups ... :) And we need to shut down everything from the very bottom to the top. Let me send a new patch ::: gst/playback/gstdecodebin2.c @@ +5178,3 @@ + GstDecodeElement *delem = tmp->data; + GstElement *element = delem->element; + l = g_list_append (l, gst_object_ref (element)); prepend()
Let's continue this specific problem in https://bugzilla.gnome.org/show_bug.cgi?id=763625
Created attachment 324002 [details] [review] typefind: Use sticky caps event in caps query instead of the caps member.
Sebastian, The typefind patch did not work on our application because gst_type_find_handle_src_query uses the caps member, and it is not set yet when the application does a caps query from the "have-type" handler. attachment 324002 [details] [review] is an additional fix proposal that works for us.
Review of attachment 324002 [details] [review]: Thanks, can you update the patch? ::: plugins/elements/gsttypefindelement.c @@ +424,3 @@ + if (event) { + GstCaps *event_caps; + gst_event_parse_caps (event, &event_caps); You could just do gst_pad_get_current_caps() here instead
And please add more of a description in the commit message :)
ok. There is probably the same issue in gst_type_find_element_get_property. We can apply the same change here if needed. I also wonder if we need to remove the caps sticky event in start_typefinding and gst_type_find_element_change_state (case GST_STATE_CHANGE_READY_TO_NULL). I am not sure about whether these can be called several times during the lifetime of the object.
Or maybe just set ->caps before sending the caps event? I.e. setting ->caps at the same time as storing the sticky event? (In reply to Romain from comment #32) > I also wonder if we need to remove the caps sticky event in > start_typefinding and gst_type_find_element_change_state (case > GST_STATE_CHANGE_READY_TO_NULL). I am not sure about whether these can be > called several times during the lifetime of the object. The sticky event will disappear when the pad is deactivated
Created attachment 324026 [details] [review] typefind: Allow caps query in "have-type" signal handlers
Attachment 324026 [details] pushed as 20859af - typefind: Allow caps query in "have-type" signal handlers
We have found a regression in the two typefind patches in WebKitForWayland for http://yt-dash-mse-test.commondatastorage.googleapis.com/unit-tests/2016.html?enablewebm=false&command=run&test_type=progressive-test&loop=true&stoponfailure=true&tests=22 Test is already a bit flaky, but this makes the flakyness much more prone to be triggered.
Can you file a new bug about that, ideally we a testcase and/or debug log? And you say the test is flaky before already, maybe the test is just racy and the new behaviour makes it more racy?
(In reply to Sebastian Dröge (slomo) from comment #37) > Can you file a new bug about that, ideally we a testcase and/or debug log? > And you say the test is flaky before already, maybe the test is just racy > and the new behaviour makes it more racy? The current flakyness which I am already investigating is caused by some marshalling parameter marshalling in function calling in the WK2 layer, so I think that the possibility that it is unrelated is 90%. I wish I could reduce it to a mininum test case. I'll try it though and create a new bug as soon as I have more info ;)