After an evaluation, GNOME has moved from Bugzilla to GitLab. Learn more about GitLab.
No new issues can be reported in GNOME Bugzilla anymore.
To report an issue in a GNOME project, go to GNOME GitLab.
Do not go to GNOME Gitlab for: Bluefish, Doxygen, GnuCash, GStreamer, java-gnome, LDTP, NetworkManager, Tomboy.
Bug 763491 - typefind: behavior has changed on have-type signal, sets pad caps after signal handlers
typefind: behavior has changed on have-type signal, sets pad caps after signa...
Status: RESOLVED FIXED
Product: GStreamer
Classification: Platform
Component: gstreamer (core)
git master
Other All
: Normal blocker
: 1.6.4
Assigned To: GStreamer Maintainers
GStreamer Maintainers
1.6.4
Depends on: 735896
Blocks: 763625
 
 
Reported: 2016-03-11 10:29 UTC by Romain
Modified: 2016-04-29 08:14 UTC
See Also:
GNOME target: ---
GNOME version: ---


Attachments
typefind: Store caps on the pad before emitting have-type but send it downstream only in the default signal handler (3.29 KB, patch)
2016-03-11 14:01 UTC, Sebastian Dröge (slomo)
none Details | Review
typefind: Store caps on the pad before emitting have-type but send it downstream only in the default signal handler (3.46 KB, patch)
2016-03-12 08:14 UTC, Sebastian Dröge (slomo)
none Details | Review
typefind: Store caps on the pad before emitting have-type but send it downstream only in the default signal handler (3.36 KB, patch)
2016-03-12 08:43 UTC, Sebastian Dröge (slomo)
none Details | Review
typefind: Store caps on the pad before emitting have-type but send it downstream only in the default signal handler (3.97 KB, patch)
2016-03-12 09:50 UTC, Sebastian Dröge (slomo)
committed Details | Review
decodebin: Don't hold EXPOSE_LOCK in type_found() (1.08 KB, patch)
2016-03-12 11:12 UTC, Sebastian Dröge (slomo)
rejected Details | Review
decodebin: Don't hold EXPOSE_LOCK in type_found() outside the stream lock (1.60 KB, patch)
2016-03-12 17:50 UTC, Sebastian Dröge (slomo)
committed Details | Review
decodebin: Don't check twice if the decode chain is complete in pad_added_cb() (1.22 KB, patch)
2016-03-12 17:50 UTC, Sebastian Dröge (slomo)
committed Details | Review
decodebin: expose_pad() is always called with lock==TRUE, simplify code (3.65 KB, patch)
2016-03-12 17:50 UTC, Sebastian Dröge (slomo)
committed Details | Review
switch elements to NULL before changing anything else (1.89 KB, patch)
2016-03-14 13:23 UTC, Vincent Penquerc'h
needs-work Details | Review
typefind: Use sticky caps event in caps query instead of the caps member. (1.29 KB, patch)
2016-03-15 14:53 UTC, Romain
needs-work Details | Review
typefind: Allow caps query in "have-type" signal handlers (1.71 KB, patch)
2016-03-15 16:33 UTC, Romain
committed Details | Review

Description Romain 2016-03-11 10:29:06 UTC
We have an application that queries the caps of the typefind source pad in the context of the "have-type" signal callback.

This worked well on 0.10 and on 1.4.5. We are currently updrading to 1.6.3 and we see that the behavior of gst_pad_query_caps have changed in this case, since the bug 735896.

Before this commit, it was possible to query the pad caps because the object method handler was called first. So the caps were already set when our signal handler was called. 

A possible fix in our application is to use the caps provided in the signal. But since the code querying the caps is not directly in the handler but in some more generic code, we will have to add a specific behavior for this.

Looking at the description and commit of bug 735896 it seems that being able to query the pad caps in the signal handler is incompatible with the aim of setting them after the signal handlers have been called. Do you see any way to allow both ?
Comment 1 Sebastian Dröge (slomo) 2016-03-11 12:03:05 UTC
You mean from the have-type callback you call gst_pad_get_current_caps() on the pad, and it does not have the caps yet? Yes that indeed seems incompatible with that change :) Let's think about this
Comment 2 Sebastian Dröge (slomo) 2016-03-11 14:01:44 UTC
Created attachment 323710 [details] [review]
typefind: Store caps on the pad before emitting have-type but send it downstream only in the default signal handler
Comment 3 Sebastian Dröge (slomo) 2016-03-11 14:02:09 UTC
Romain, can you check if this patch solves your problem? Thanks
Comment 4 Sebastian Dröge (slomo) 2016-03-12 08:14:57 UTC
Created attachment 323745 [details] [review]
typefind: Store caps on the pad before emitting have-type but send it downstream only in the default signal handler
Comment 5 Sebastian Dröge (slomo) 2016-03-12 08:43:34 UTC
Created attachment 323746 [details] [review]
typefind: Store caps on the pad before emitting have-type but send it downstream only in the default signal handler
Comment 6 Sebastian Dröge (slomo) 2016-03-12 09:50:47 UTC
Created attachment 323749 [details] [review]
typefind: Store caps on the pad before emitting have-type but send it downstream only in the default signal handler
Comment 7 Sebastian Dröge (slomo) 2016-03-12 09:58:52 UTC
Attachment 323749 [details] pushed as 0835c3d - typefind: Store caps on the pad before emitting have-type but send it downstream only in the default signal handler
Comment 8 Sebastian Dröge (slomo) 2016-03-12 11:00:14 UTC
commit a15fca193485ef8f9ae71fd44b8f794ba4132588
Author: Sebastian Dröge <sebastian@centricular.com>
Date:   Sat Mar 12 12:56:28 2016 +0200

    Revert "typefind: Store caps on the pad before emitting have-type but send it downstream only in the default signal handler"
    
    This reverts commit 0835c3d6569dde0ec9e5524436367c7678cc4a4a.
    
    It causes deadlocks in decodebin, which currently would deadlock if the caps
    are already on the pad in have-type and are forwarded while copying the sticky
    events (while holding the decodebin lock)... as that might cause the next
    element to expose pads, which then calls back into decodebin and takes the
    decodebin lock.
    
    This needs some more thoughts.
Comment 9 Sebastian Dröge (slomo) 2016-03-12 11:10:08 UTC
It's not clear why decodebin takes the EXPOSE_LOCK in type_found() for calling analyze_new_pad, while in other places it does not (pad_added_cb).
Comment 10 Sebastian Dröge (slomo) 2016-03-12 11:12:26 UTC
Created attachment 323752 [details] [review]
decodebin: Don't hold EXPOSE_LOCK in type_found()

analyze_new_pad() might call back into decodebin and take the EXPOSE_LOCK
another time, resulting in a deadlock.
Comment 11 Sebastian Dröge (slomo) 2016-03-12 11:22:48 UTC
Because

commit ee44337fc3e3030a5155d28b3561af157e6c6003
Author: Vincent Penquerc'h <vincent.penquerch@collabora.co.uk>
Date:   Thu Aug 20 17:29:36 2015 +0100

    decodebin: lock the expose lock around decode_chain use
    
    Helps with a crash in decodebin when quickly switching states.
    
    https://bugzilla.gnome.org/show_bug.cgi?id=752651
Comment 12 Sebastian Dröge (slomo) 2016-03-12 11:24:00 UTC
That commit is inconsistent though as mentioned above: sometimes analyze_new_pad() is called with expose lock, sometimes not.
Comment 13 Sebastian Dröge (slomo) 2016-03-12 15:25:58 UTC
For reference, IRC discussion / monologue:

<thaytan> slomo_, I don't understand what's happened there in decodebin
<thaytan> I thought the bug report was that typefind used to expose caps before-hand, and this was just restoring that behaviour
<slomo_> thaytan: and then v added a change to decodebin that expected typefind to give the caps later
<slomo_> thaytan: specifically his change expects that no element linked after the typefind in decodebin will directly add sometimes pads when started inside decodebin
<slomo_> thaytan: the call stack is type_found() taking the expose lock -> analyze_new_pad() -> connect_pad() -> gst_pad_sticky_events_foreach() -> pad_added_cb() taking the expose lock
<slomo_> thaytan: with typefind sending caps before type_found() (because then the caps are part of the sticky events, otherwise not)
<slomo_> thaytan: imho the decodebin code is wrong. especially because the locking is inconsistent and the above can also happen in other cases in theory
Comment 14 Sebastian Dröge (slomo) 2016-03-12 15:43:35 UTC
Backtrace of the playbin-complex test for completeness:

  • #0 syscall
    at ../sysdeps/unix/sysv/linux/x86_64/syscall.S line 38
  • #1 g_mutex_lock_slowpath
    at /build/glib2.0-rHXeJh/glib2.0-2.47.6/./glib/gthread-posix.c line 1315
  • #2 g_mutex_lock
    at /build/glib2.0-rHXeJh/glib2.0-2.47.6/./glib/gthread-posix.c line 1339
  • #3 pad_added_cb
    at gstdecodebin2.c line 2936
  • #4 g_cclosure_marshal_VOID__OBJECTv
    at /build/glib2.0-rHXeJh/glib2.0-2.47.6/./gobject/gmarshal.c line 2102
  • #5 _g_closure_invoke_va
    at /build/glib2.0-rHXeJh/glib2.0-2.47.6/./gobject/gclosure.c line 867
  • #6 g_signal_emit_valist
    at /build/glib2.0-rHXeJh/glib2.0-2.47.6/./gobject/gsignal.c line 3294
  • #7 g_signal_emit
    at /build/glib2.0-rHXeJh/glib2.0-2.47.6/./gobject/gsignal.c line 3441
  • #8 gst_element_add_pad
    at gstelement.c line 704
  • #9 gst_codec_demuxer_setup_pad
    at elements/playbin-complex.c line 510
  • #10 gst_codec_demuxer_event
    at elements/playbin-complex.c line 548
  • #11 gst_codec_demuxer_event
    at elements/playbin-complex.c line 569
  • #12 gst_pad_send_event_unchecked
    at gstpad.c line 5554
  • #13 gst_pad_send_event
    at gstpad.c line 5724
  • #14 send_sticky_event
    at gstdecodebin2.c line 1977
  • #15 foreach_dispatch_function
    at gstpad.c line 5823
  • #16 events_foreach
    at gstpad.c line 598
  • #17 gst_pad_sticky_events_foreach
    at gstpad.c line 5854
  • #18 connect_pad
    at gstdecodebin2.c line 1992
  • #19 connect_pad
    at gstdecodebin2.c line 2492
  • #20 analyze_new_pad
    at gstdecodebin2.c line 1806
  • #21 type_found
    at gstdecodebin2.c line 2862

Comment 15 Sebastian Dröge (slomo) 2016-03-12 17:40:49 UTC
ee44337fc3e3030a5155d28b3561af157e6c6003 also introduces a lock order problem: type_found has EXPOSE_LOCK before STREAM_LOCK, pad_added_cb the other way around.
Comment 16 Sebastian Dröge (slomo) 2016-03-12 17:50:27 UTC
Created attachment 323769 [details] [review]
decodebin: Don't hold EXPOSE_LOCK in type_found() outside the stream lock

In other places we lock it the other way around, leading to possible
deadlocks. Also this will deadlock if analyze_pad() causes a new element to be
autoplugged that adds new pads on itself when its state is changed.
Comment 17 Sebastian Dröge (slomo) 2016-03-12 17:50:32 UTC
Created attachment 323770 [details] [review]
decodebin: Don't check twice if the decode chain is complete in pad_added_cb()

expose_pad() already does the same.
Comment 18 Sebastian Dröge (slomo) 2016-03-12 17:50:38 UTC
Created attachment 323771 [details] [review]
decodebin: expose_pad() is always called with lock==TRUE, simplify code

This basically reverts ee44337fc3e3030a5155d28b3561af157e6c6003 .
Comment 19 Sebastian Dröge (slomo) 2016-03-12 17:53:33 UTC
These patches should solve the problems, but I'm not sure if it doesn't introduce bug #752651 again. OTOH as part of bug #755260, more related changes were done that should keep bug #752651 solved.
Comment 20 Jan Schmidt 2016-03-14 11:12:45 UTC
Well, presumable as part of fixing bug #752651, a unit test was added that would catch any regression ;)

I'm not clear on why this decodebin deadlock is visible now. Isn't the implication of the bug that typefind used to set caps this way in the not-too-distant past? Has decodebin locking changed since then?
Comment 21 Sebastian Dröge (slomo) 2016-03-14 11:15:28 UTC
First typefind was changed to only set caps after application (aka decodebin) have-type handlers were called. Then ee44337fc3e3030a5155d28b3561af157e6c6003 in decodebin happened, which would deadlock with the old typefind behaviour. That is, ee44337fc3e3030a5155d28b3561af157e6c6003 depended on the typefind change to work correctly.

(Independent of that ee44337fc3e3030a5155d28b3561af157e6c6003 also introduces lock order problems and other inconsistencies)


No unit test was added obviously ;) It's a crash that happened when running decodebin in a loop starting and shutting down all the time for a few minutes.
Comment 22 Jan Schmidt 2016-03-14 11:24:18 UTC
Makes sense. Let's put these in.
Comment 23 Vincent Penquerc'h 2016-03-14 11:26:18 UTC
With the four patches in this bug, I got the original test case from https://bugzilla.gnome.org/show_bug.cgi?id=752623 to crash after 6 minutes (so still a fair bit better than it used to be):


Thread 1 (Thread 0x7f7addac1700 (LWP 19288))

  • #0 gst_element_get_request_pad
    at gstelement.c line 1009
  • #1 gst_decode_group_control_demuxer_pad
    at gstdecodebin2.c line 3825
  • #2 connect_pad
    at gstdecodebin2.c line 2112
  • #3 analyze_new_pad
    at gstdecodebin2.c line 1805
  • #4 pad_added_cb
    at gstdecodebin2.c line 2924
  • #5 g_cclosure_marshal_VOID__OBJECTv
    at gmarshal.c line 1312
  • #6 _g_closure_invoke_va
    at gclosure.c line 831
  • #7 g_signal_emit_valist
    at gsignal.c line 3215
  • #8 g_signal_emit
    at gsignal.c line 3363
  • #9 gst_element_add_pad
    at gstelement.c line 704
  • #10 gst_ogg_demux_activate_chain
    at gstoggdemux.c line 2948
  • #11 gst_ogg_demux_perform_seek_pull
    at gstoggdemux.c line 3607
  • #12 gst_ogg_demux_loop
    at gstoggdemux.c line 4782
  • #13 gst_task_func
    at gsttask.c line 332
  • #14 g_thread_pool_thread_proxy
    at gthreadpool.c line 307
  • #15 g_thread_proxy
    at gthread.c line 764
  • #16 start_thread
    at pthread_create.c line 308
  • #17 clone
    at ../sysdeps/unix/sysv/linux/x86_64/clone.S line 112
  • #18 ??

Comment 24 Sebastian Dröge (slomo) 2016-03-14 11:28:27 UTC
commit 57bbd18471c7a9f82cf33e2b270e95c73d27910c
Author: Sebastian Dröge <sebastian@centricular.com>
Date:   Fri Mar 11 14:17:13 2016 +0200

    typefind: Store caps on the pad before emitting have-type but send it downstream only in the default signal handler
    
    https://bugzilla.gnome.org/show_bug.cgi?id=763491


commit 916746e73165ca95a061bc2f975354524c8e3a57
Author: Sebastian Dröge <sebastian@centricular.com>
Date:   Sat Mar 12 19:47:47 2016 +0200

    decodebin: expose_pad() is always called with lock==TRUE, simplify code
    
    This basically reverts ee44337fc3e3030a5155d28b3561af157e6c6003 .
    
    https://bugzilla.gnome.org/show_bug.cgi?id=763491

commit 65d09c1495ec9093ab95b7539671b48b3c4c9305
Author: Sebastian Dröge <sebastian@centricular.com>
Date:   Sat Mar 12 19:46:44 2016 +0200

    decodebin: Don't check twice if the decode chain is complete in pad_added_cb()
    
    expose_pad() already does the same.
    
    https://bugzilla.gnome.org/show_bug.cgi?id=763491

commit 001c7f04a06e261cebd191a541c1f23d13e01022
Author: Sebastian Dröge <sebastian@centricular.com>
Date:   Sat Mar 12 19:45:26 2016 +0200

    decodebin: Don't hold EXPOSE_LOCK in type_found() outside the stream lock
    
    In other places we lock it the other way around, leading to possible
    deadlocks. Also this will deadlock if analyze_pad() causes a new element to be
    autoplugged that adds new pads on itself when its state is changed.
    
    https://bugzilla.gnome.org/show_bug.cgi?id=763491
Comment 25 Vincent Penquerc'h 2016-03-14 13:23:30 UTC
Created attachment 323868 [details] [review]
switch elements to NULL before changing anything else

With this fix from slomo, the test case has been running for 45 minutes without problems so far (it'd usually take 5 minutes to break).
Comment 26 Sebastian Dröge (slomo) 2016-03-14 14:45:36 UTC
Review of attachment 323868 [details] [review]:

This is not enough. Chains can contain groups which contain more chains which can contain groups ... :) And we need to shut down everything from the very bottom to the top. Let me send a new patch

::: gst/playback/gstdecodebin2.c
@@ +5178,3 @@
+      GstDecodeElement *delem = tmp->data;
+      GstElement *element = delem->element;
+      l = g_list_append (l, gst_object_ref (element));

prepend()
Comment 27 Sebastian Dröge (slomo) 2016-03-14 15:12:41 UTC
Let's continue this specific problem in https://bugzilla.gnome.org/show_bug.cgi?id=763625
Comment 28 Romain 2016-03-15 14:53:35 UTC
Created attachment 324002 [details] [review]
typefind: Use sticky caps event in caps query instead of the caps  member.
Comment 29 Romain 2016-03-15 14:57:12 UTC
Sebastian,

The typefind patch did not work on our application because gst_type_find_handle_src_query uses the caps member, and it is not set yet when the application does a caps query from the "have-type" handler.

attachment 324002 [details] [review] is an additional fix proposal that works for us.
Comment 30 Sebastian Dröge (slomo) 2016-03-15 15:05:25 UTC
Review of attachment 324002 [details] [review]:

Thanks, can you update the patch?

::: plugins/elements/gsttypefindelement.c
@@ +424,3 @@
+      if (event) {
+        GstCaps *event_caps;
+        gst_event_parse_caps (event, &event_caps);

You could just do gst_pad_get_current_caps() here instead
Comment 31 Jan Schmidt 2016-03-15 15:13:31 UTC
And please add more of a description in the commit message :)
Comment 32 Romain 2016-03-15 15:18:43 UTC
ok. 

There is probably the same issue in gst_type_find_element_get_property. 
We can apply the same change here if needed.

I also wonder if we need to remove the caps sticky event in start_typefinding and gst_type_find_element_change_state (case GST_STATE_CHANGE_READY_TO_NULL). I am not sure about whether these can be called several times during the lifetime of the object.
Comment 33 Sebastian Dröge (slomo) 2016-03-15 15:21:11 UTC
Or maybe just set ->caps before sending the caps event? I.e. setting ->caps at the same time as storing the sticky event?

(In reply to Romain from comment #32)

> I also wonder if we need to remove the caps sticky event in
> start_typefinding and gst_type_find_element_change_state (case
> GST_STATE_CHANGE_READY_TO_NULL). I am not sure about whether these can be
> called several times during the lifetime of the object.

The sticky event will disappear when the pad is deactivated
Comment 34 Romain 2016-03-15 16:33:39 UTC
Created attachment 324026 [details] [review]
typefind: Allow caps query in "have-type" signal handlers
Comment 35 Sebastian Dröge (slomo) 2016-03-15 17:00:28 UTC
Attachment 324026 [details] pushed as 20859af - typefind: Allow caps query in "have-type" signal handlers
Comment 36 Xabier Rodríguez Calvar 2016-04-28 15:08:55 UTC
We have found a regression in the two typefind patches in WebKitForWayland for

http://yt-dash-mse-test.commondatastorage.googleapis.com/unit-tests/2016.html?enablewebm=false&command=run&test_type=progressive-test&loop=true&stoponfailure=true&tests=22

Test is already a bit flaky, but this makes the flakyness much more prone to be triggered.
Comment 37 Sebastian Dröge (slomo) 2016-04-28 15:56:25 UTC
Can you file a new bug about that, ideally we a testcase and/or debug log? And you say the test is flaky before already, maybe the test is just racy and the new behaviour makes it more racy?
Comment 38 Xabier Rodríguez Calvar 2016-04-29 08:14:32 UTC
(In reply to Sebastian Dröge (slomo) from comment #37)
> Can you file a new bug about that, ideally we a testcase and/or debug log?
> And you say the test is flaky before already, maybe the test is just racy
> and the new behaviour makes it more racy?

The current flakyness which I am already investigating is caused by some marshalling parameter marshalling in function calling in the WK2 layer, so I think that the possibility that it is unrelated is 90%.

I wish I could reduce it to a mininum test case. I'll try it though and create a new bug as soon as I have more info ;)