Bug 788754 – gl: wayland: sometimes block pipeline at PREROLLED

After an evaluation, GNOME has moved from Bugzilla to GitLab. Learn more about GitLab.
No new issues can be reported in GNOME Bugzilla anymore.
To report an issue in a GNOME project, go to GNOME GitLab.
Do not go to GNOME Gitlab for: Bluefish, Doxygen, GnuCash, GStreamer, java-gnome, LDTP, NetworkManager, Tomboy.

Bug 788754 - gl: wayland: sometimes block pipeline at PREROLLED


Summary:	gl: wayland: sometimes block pipeline at PREROLLED


Status:	RESOLVED DUPLICATE of bug 758984

Product:	GStreamer
Classification:	Platform
Component:	gst-plugins-base
Version:	git master
Hardware:	Other Linux

Importance:	Normal normal
Target Milestone:	git master
Assigned To:	GStreamer Maintainers
QA Contact:	GStreamer Maintainers

URL:
Whiteboard:

Depends on:
Blocks:

Reported:	2017-10-10 05:57 UTC by Chenglin Ye
Modified:	2018-01-04 04:39 UTC

See Also:
GNOME target:	---
GNOME version:	---

Attachments
fix pipeline block at PREROLLED bug. (1.08 KB, patch) 2017-10-10 05:57 UTC, Chenglin Ye	rejected	Details \| Review

Description Chenglin Ye 2017-10-10 05:57:36 UTC

Created attachment 361214 [details] [review]
fix pipeline block at PREROLLED bug.

environment:
  GST_GL_WINDOW=wayland
  GST_GL_PLATFORM=egl

pipeline:
  gst-launch-1.0 playbin uri=file:///home/ye/test.MOV video-sink=glimagesink

issue:
  sometimes, pipeline block at PREROLLED, about one-twentieth probability.

try:
  I found that "done" value changed too late because has entered the next cycle, so I try to add little delay after dispatching queue, pipeline block issue disappeared. detail please ref to patch.
  it's a stupid solution since I'm clueless. Do you have any good ideal to fix this issue?

Comment 1 Nicolas Dufresne (ndufresne) 2017-10-30 19:52:28 UTC

Review of attachment 361214 [details] [review]:

This is just a hack. How do you reproduce this ?

Comment 2 Nicolas Dufresne (ndufresne) 2017-10-30 19:59:03 UTC

Actually, it just happened here locally ! According to the backtrace, it is like you described, not a deadlock, but a stall.

Comment 3 Nicolas Dufresne (ndufresne) 2017-10-30 20:12:51 UTC

I must admit, I'm a bit clueless too. I wonder why the roundtrip code has been made so horrible. If you compare with gst_wl_display_roundtrip() (same function in waylandsink), the GL one is so complicated. I could not reproduce this issue on waylandsink, so the extra complexity seems unjustified.

Comment 4 Nicolas Dufresne (ndufresne) 2017-10-30 20:18:30 UTC

I wonder why alt+tab "fixes" the issue.

Comment 5 Nicolas Dufresne (ndufresne) 2017-10-30 20:36:39 UTC

To my previous comment, because there was nothing queued to roundtrip with. In fact, if I remove the roundtrip completly, the problem goes away.

Why do we added this roundtrip in the first place ?

Comment 6 Chenglin Ye 2017-10-31 04:57:13 UTC

(In reply to Nicolas Dufresne (stormer) from comment #1)
> Review of attachment 361214 [details] [review] [review]:
> 
> This is just a hack. How do you reproduce this ?

It's just a try, but I also can not explain why this issue disappeared after a little delay, I am sorry for my clueless.

Comment 7 Chenglin Ye 2017-10-31 05:03:50 UTC

(In reply to Nicolas Dufresne (stormer) from comment #3)
> I must admit, I'm a bit clueless too. I wonder why the roundtrip code has
> been made so horrible. If you compare with gst_wl_display_roundtrip() (same
> function in waylandsink), the GL one is so complicated. I could not
> reproduce this issue on waylandsink, so the extra complexity seems
> unjustified.

I will reference with gst_wl_display_roundtrip() (same function in waylandsink). Thanks for your explaintion.

Comment 8 Matthew Waters (ystreet00) 2017-10-31 05:36:50 UTC

(In reply to Nicolas Dufresne (stormer) from comment #3)
> I must admit, I'm a bit clueless too. I wonder why the roundtrip code has
> been made so horrible. If you compare with gst_wl_display_roundtrip() (same
> function in waylandsink), the GL one is so complicated. I could not
> reproduce this issue on waylandsink, so the extra complexity seems
> unjustified.

The roundtrip is complicated becuase
1. it deals with both the default-wl_queue and separate wl_queue cases
2. It attempts to solve a race where setting the wl_proxy races with others reading the queue.

Comment 9 Matthew Waters (ystreet00) 2017-10-31 05:40:12 UTC

There is also a comment at the top relating to thread safety.  Is the blocking case breaking that case?  If queue == NULL, the roundtrip should only be called on the display thread.  if queue != NULL them the thread must be the GL thread.

Comment 10 Nicolas Dufresne (ndufresne) 2017-10-31 15:15:38 UTC

I'm a bit too clueless to answer your questions. Normally, one will do a roundtrip not to work around a race, but to ensure an asynchronous request get processed before continuing (that's what waylandsink does). It's not clear that there is anything really pending for sure when we do this round trip. Running alt+tab generate 1 roundtrip, hence un-blocking the call.

How do you reproduce the race this roundtrip was supposedly fixing ? I haven't had any issue yet removing the roundtrip completly. Another weird thing in the code, is that the roundtrip happens in _show, after create, but never after any other "create" calls. And the window create function is more like an "ensure" function, maybe it has something to do with when the window is created ?

Comment 11 Matthew Waters (ystreet00) 2017-11-01 02:20:51 UTC

(In reply to Nicolas Dufresne (stormer) from comment #10)
> I'm a bit too clueless to answer your questions. Normally, one will do a
> roundtrip not to work around a race, but to ensure an asynchronous request
> get processed before continuing (that's what waylandsink does). It's not
> clear that there is anything really pending for sure when we do this round
> trip. Running alt+tab generate 1 roundtrip, hence un-blocking the call.
> 
> How do you reproduce the race this roundtrip was supposedly fixing ?

The race I mention is in the roundtrip function of most other wayland roundtrip functions where setting the wl_queue of a wl_proxy can race with another thread emptying the event queue and thus calling the callback on the default wl_display queue rather than the meant to be set wl_queue.

> I haven't had any issue yet removing the roundtrip completly. 

Are you only trying with GStreamer?  The issues will come integrating with any other wayland using library like Gtk, Qt, etc where the reproduction was that GStreamer's surface was not displayed at all (until someone did a roundtrip with resize, window switching, keypress, etc).

> Another weird
> thing in the code, is that the roundtrip happens in _show, after create, but
> never after any other "create" calls. And the window create function is more
> like an "ensure" function, maybe it has something to do with when the window
> is created ?

In short the roundtrip is processing all the asynchronous requests, not solving a race.

Comment 12 Nicolas Dufresne (ndufresne) 2017-11-01 14:26:55 UTC

(In reply to Matthew Waters (ystreet00) from comment #11)
> > I haven't had any issue yet removing the roundtrip completly. 
> 
> Are you only trying with GStreamer?  The issues will come integrating with
> any other wayland using library like Gtk, Qt, etc where the reproduction was
> that GStreamer's surface was not displayed at all (until someone did a
> roundtrip with resize, window switching, keypress, etc).

Yes, with GStreamer. It's funny, since this is exactly the description of the bug we are facing here, which get fixed by removing the roundtrip.

Now, if this is racing, it means that external compoenent are also doing some stuff on the queue from other threads. This is miss-use of wayland queues really.

> 
> > Another weird
> > thing in the code, is that the roundtrip happens in _show, after create, but
> > never after any other "create" calls. And the window create function is more
> > like an "ensure" function, maybe it has something to do with when the window
> > is created ?
> 
> In short the roundtrip is processing all the asynchronous requests, not
> solving a race.

But it also waits if there is nothing to process, and that seems like the issue we are facing.

Comment 13 Nicolas Dufresne (ndufresne) 2017-11-01 14:29:16 UTC

Btw, if you can find back such an application that would show that a roundtrip inside _show() call is sometimes needed, that could unblock this issue. I think user-input in full app is what makes this case less visible, but could easily appear in kiosk or digital signage applications. This random roundtrip call looks generally harmful and random.

Comment 14 Matthew Waters (ystreet00) 2017-11-01 14:39:51 UTC

(In reply to Nicolas Dufresne (stormer) from comment #12)
> Now, if this is racing, it means that external compoenent are also doing
> some stuff on the queue from other threads. This is miss-use of wayland
> queues really.

Precisely.

> But it also waits if there is nothing to process, and that seems like the
> issue we are facing.

"nothing to process" is not quite correct. The wl_callback that is installed is "something to process".  It would only hang if the wl_callback is processed somewhere else.

> But it also waits if there is nothing to process, and that seems like the
> issue we are facing.

Which is why it's only safe to be called from the thread that will be reading events from the specified wl_queue (as mentioned in the comment above _roundtrip()) and thus my response in comment 9 asking for a backtrace if this was the case or not when the hang occurs and if this is a problem in GStreamer itself.

(In reply to Nicolas Dufresne (stormer) from comment #13)
> Btw, if you can find back such an application that would show that a
> roundtrip inside _show() call is sometimes needed, that could unblock this
> issue. I think user-input in full app is what makes this case less visible,
> but could easily appear in kiosk or digital signage applications. This
> random roundtrip call looks generally harmful and random.

IIRC, the gtk videooverlay examples in -bad exhibited the hang sporadically.

Comment 15 Matthew Waters (ystreet00) 2017-11-02 23:59:55 UTC

e.g. I can get the following stall on startup with -bad/gl/gtk/filtervideooverlay/filtervideooverlay which is becuse mesa's wayland GL handling doesn't take into account the race possible with setting a wl_proxy's wl_queue in https://cgit.freedesktop.org/mesa/mesa/tree/src/egl/drivers/dri2/platform_wayland.c

(gdb) t a a bt

+ Trace 238137

Thread 1 (Thread 0x7ffff7f98e00 (LWP 3612))

#0 syscall
#1 g_cond_wait
#2 gst_gl_window_default_send_message
at gstglwindow.c line 603
#3 gst_glimage_sink_redisplay
at gstglimagesink.c line 2377
#4 gst_gl_sink_bin_overlay_expose
at gstglsinkbin.c line 448
#5 expose_cb(GtkWidget*, cairo_t*, GstElement*)
at main.cpp line 105
#6 0x00007ffff76f4098 in
#7 0x00007ffff784457d in
#8 g_signal_emit_valist
#9 g_signal_emit
#10 0x00007ffff7851963 in
#11 gtk_container_propagate_draw
#12 0x00007ffff7626a33 in
#13 0x00007ffff785f9af in
#14 0x00007ffff7851730 in
#15 0x00007ffff785aee3 in
#16 gtk_main_do_event
#17 0x00007ffff71fc6f6 in
#18 0x00007ffff720d03b in
#19 0x00007ffff720e299 in
#20 0x00007ffff720e499 in
#21 g_closure_invoke
#22 0x00007ffff38790b0 in
#23 g_signal_emit_valist
#24 g_signal_emit
#25 0x00007ffff720600a in
#26 0x00007ffff71f07c3 in
#27 0x00007ffff3593cb3 in
#28 g_main_context_dispatch
#29 0x00007ffff3596f69 in
#30 g_main_loop_run
#31 gtk_main
#32 main(gint, gchar**)
at main.cpp line 292

Comment 16 Matthew Waters (ystreet00) 2017-11-06 13:13:08 UTC

So, I guess the question is, is this actually our fault?  As a result, a backtrace would be most helpful in determining where the blame lies.

Comment 17 Matthew Waters (ystreet00) 2018-01-04 04:39:03 UTC


*** This bug has been marked as a duplicate of bug 758984 ***