GNOME Bugzilla – Bug 788754
gl: wayland: sometimes block pipeline at PREROLLED
Last modified: 2018-01-04 04:39:03 UTC
Created attachment 361214
fix pipeline block at PREROLLED bug.

environment: GST_GL_WINDOW=wayland GST_GL_PLATFORM=egl

pipeline: gst-launch-1.0 playbin uri=file:///home/ye/test.MOV video-sink=glimagesink

issue: the pipeline sometimes blocks at PREROLLED, in roughly one run out of twenty.

try: I found that the "done" value changes too late, because execution has already entered the next cycle. When I add a small delay after dispatching the queue, the pipeline block disappears. For details, please refer to the patch. It's a stupid solution since I'm clueless. Do you have any good idea for fixing this issue?
Review of attachment 361214: This is just a hack. How do you reproduce this?
Actually, it just happened here locally! According to the backtrace, it is as you described: not a deadlock, but a stall.
I must admit, I'm a bit clueless too. I wonder why the roundtrip code has been made so horrible. If you compare with gst_wl_display_roundtrip() (same function in waylandsink), the GL one is so complicated. I could not reproduce this issue on waylandsink, so the extra complexity seems unjustified.
I wonder why alt+tab "fixes" the issue.
To follow up on my previous comment: because there was nothing queued to roundtrip with. In fact, if I remove the roundtrip completely, the problem goes away. Why did we add this roundtrip in the first place?
(In reply to Nicolas Dufresne (stormer) from comment #1)
> Review of attachment 361214:
>
> This is just a hack. How do you reproduce this?

It's just an attempt, and I also cannot explain why the issue disappears after a small delay. I am sorry for being clueless.
(In reply to Nicolas Dufresne (stormer) from comment #3)
> I must admit, I'm a bit clueless too. I wonder why the roundtrip code has
> been made so horrible. If you compare with gst_wl_display_roundtrip() (same
> function in waylandsink), the GL one is so complicated. I could not
> reproduce this issue on waylandsink, so the extra complexity seems
> unjustified.

I will use gst_wl_display_roundtrip() (the same function in waylandsink) as a reference. Thanks for your explanation.
(In reply to Nicolas Dufresne (stormer) from comment #3)
> I must admit, I'm a bit clueless too. I wonder why the roundtrip code has
> been made so horrible. If you compare with gst_wl_display_roundtrip() (same
> function in waylandsink), the GL one is so complicated. I could not
> reproduce this issue on waylandsink, so the extra complexity seems
> unjustified.

The roundtrip is complicated because:
1. It deals with both the default wl_queue and the separate wl_queue cases.
2. It attempts to solve a race where setting the wl_queue of a wl_proxy races with others reading the queue.
There is also a comment at the top relating to thread safety. Is the blocking case breaking that rule? If queue == NULL, the roundtrip should only be called on the display thread. If queue != NULL, then the calling thread must be the GL thread.
I'm a bit too clueless to answer your questions. Normally, one does a roundtrip not to work around a race, but to ensure that an asynchronous request gets processed before continuing (that's what waylandsink does). It's not clear that anything is actually pending when we do this roundtrip. Running alt+tab generates one roundtrip, hence unblocking the call.

How do you reproduce the race this roundtrip was supposedly fixing? I haven't had any issue yet after removing the roundtrip completely.

Another weird thing in the code is that the roundtrip happens in _show, after create, but never after any other "create" calls. And the window create function is more like an "ensure" function, so maybe it has something to do with when the window is created?
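To illustrate the usual purpose of a roundtrip mentioned above, here is a minimal toy model in C. It is only a sketch: every `rt_` name is hypothetical, nothing here is libwayland or GStreamer API, and it assumes a compositor that handles requests strictly in the order they were sent. Under that assumption, once the reply to the trailing sync request arrives, every earlier request has been handled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy model only: these names are hypothetical and are not libwayland API. */
#define RT_MAX_REQUESTS 8

enum rt_request { RT_CREATE_SURFACE, RT_COMMIT, RT_SYNC };

struct rt_connection {
    enum rt_request pending[RT_MAX_REQUESTS]; /* sent but not yet handled */
    size_t n_pending;
    size_t n_handled; /* non-sync requests the compositor has handled */
    bool sync_done;   /* set once the sync request has been answered */
};

static void rt_init(struct rt_connection *c)
{
    assert(c != NULL);
    c->n_pending = 0;
    c->n_handled = 0;
    c->sync_done = false;
}

/* Queue an asynchronous request; returns false if the toy buffer is full. */
static bool rt_send(struct rt_connection *c, enum rt_request req)
{
    if (c->n_pending == RT_MAX_REQUESTS)
        return false;
    c->pending[c->n_pending++] = req;
    return true;
}

/* The compositor handles the oldest request; false means nothing is pending. */
static bool rt_compositor_step(struct rt_connection *c)
{
    if (c->n_pending == 0)
        return false;
    enum rt_request req = c->pending[0];
    for (size_t i = 1; i < c->n_pending; i++)
        c->pending[i - 1] = c->pending[i];
    c->n_pending--;
    if (req == RT_SYNC)
        c->sync_done = true;
    else
        c->n_handled++;
    return true;
}

/* Roundtrip: send a sync request, then wait until it has been answered.
 * Returns false if the compositor runs out of work first (a stall). */
static bool rt_roundtrip(struct rt_connection *c)
{
    c->sync_done = false;
    if (!rt_send(c, RT_SYNC))
        return false;
    while (!c->sync_done) {
        if (!rt_compositor_step(c))
            return false;
    }
    return true;
}
```

In this model, sending RT_CREATE_SURFACE and RT_COMMIT and then calling rt_roundtrip() returns only after both earlier requests have been handled.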
(In reply to Nicolas Dufresne (stormer) from comment #10)
> I'm a bit too clueless to answer your questions. Normally, one will do a
> roundtrip not to work around a race, but to ensure an asynchronous request
> get processed before continuing (that's what waylandsink does). It's not
> clear that there is anything really pending for sure when we do this round
> trip. Running alt+tab generate 1 roundtrip, hence un-blocking the call.
>
> How do you reproduce the race this roundtrip was supposedly fixing ?

The race I mention exists in most other wayland roundtrip functions: setting the wl_queue of a wl_proxy can race with another thread emptying the event queue, so the callback ends up being called on the default wl_display queue rather than on the wl_queue that was meant to be set.

> I haven't had any issue yet removing the roundtrip completly.

Are you only trying with GStreamer? The issues appear when integrating with any other library that uses wayland, like Gtk, Qt, etc. The reproduction was that GStreamer's surface was not displayed at all (until someone triggered a roundtrip with a resize, window switch, keypress, etc).

> Another weird
> thing in the code, is that the roundtrip happens in _show, after create, but
> never after any other "create" calls. And the window create function is more
> like an "ensure" function, maybe it has something to do with when the window
> is created ?

In short, the roundtrip is processing all the asynchronous requests, not solving a race.
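The wl_queue race described in this comment can be sketched with a small deterministic toy model in C. All `toy_` names are hypothetical and none of this is libwayland API. The model assumes that an event is routed to whichever queue its proxy points at when the socket is read, so an early read by another thread leaves the "done" event on the default queue, where a roundtrip that dispatches only its private queue never sees it.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy model only: none of these types or functions are libwayland API. */
enum toy_queue_id { TOY_DEFAULT_QUEUE = 0, TOY_PRIVATE_QUEUE = 1 };

struct toy_proxy {
    enum toy_queue_id queue; /* queue that incoming events are routed to */
    bool *done;              /* flag set by the proxy's "done" callback */
};

struct toy_display {
    struct toy_proxy *wire;      /* event waiting on the socket, or NULL */
    struct toy_proxy *queued[2]; /* at most one queued event per queue */
};

/* Read the socket: the event goes to the queue its proxy has right now. */
static void toy_read_events(struct toy_display *d)
{
    if (d->wire != NULL) {
        d->queued[d->wire->queue] = d->wire;
        d->wire = NULL;
    }
}

/* Dispatch one queue; returns the number of callbacks that were run. */
static int toy_dispatch_queue(struct toy_display *d, enum toy_queue_id q)
{
    struct toy_proxy *p = d->queued[q];
    if (p == NULL)
        return 0;
    d->queued[q] = NULL;
    *p->done = true;
    return 1;
}

/* Returns whether a roundtrip that dispatches only the private queue
 * observes "done". If another thread reads the socket before the proxy is
 * moved to the private queue, the event is stranded on the default queue. */
static bool toy_roundtrip_completes(bool other_thread_reads_first)
{
    bool done = false;
    struct toy_proxy cb = { TOY_DEFAULT_QUEUE, &done };
    struct toy_display d = { &cb, { NULL, NULL } };

    if (other_thread_reads_first)
        toy_read_events(&d);      /* event lands on the default queue */
    cb.queue = TOY_PRIVATE_QUEUE; /* the queue assignment comes too late */
    toy_read_events(&d);
    toy_dispatch_queue(&d, TOY_PRIVATE_QUEUE);
    return done;
}
```

In a real client the final dispatch would block rather than return, which would match the stall (not deadlock) reported earlier in this thread; the model returns false instead so that the outcome can be checked.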
(In reply to Matthew Waters (ystreet00) from comment #11)
> > I haven't had any issue yet removing the roundtrip completly.
>
> Are you only trying with GStreamer? The issues will come integrating with
> any other wayland using library like Gtk, Qt, etc where the reproduction was
> that GStreamer's surface was not displayed at all (until someone did a
> roundtrip with resize, window switching, keypress, etc).

Yes, with GStreamer. It's funny, since this is exactly the description of the bug we are facing here, which gets fixed by removing the roundtrip. Now, if this is racing, it means that external components are also doing some stuff on the queue from other threads. This is really a misuse of wayland queues.

> > Another weird
> > thing in the code, is that the roundtrip happens in _show, after create, but
> > never after any other "create" calls. And the window create function is more
> > like an "ensure" function, maybe it has something to do with when the window
> > is created ?
>
> In short the roundtrip is processing all the asynchronous requests, not
> solving a race.

But it also waits if there is nothing to process, and that seems like the issue we are facing.
Btw, if you can track down an application showing that a roundtrip inside the _show() call is sometimes needed, that could unblock this issue. I think user input in a full app is what makes this case less visible, but it could easily appear in kiosk or digital signage applications. This seemingly arbitrary roundtrip call looks generally harmful.
(In reply to Nicolas Dufresne (stormer) from comment #12)
> Now, if this is racing, it means that external compoenent are also doing
> some stuff on the queue from other threads. This is miss-use of wayland
> queues really.

Precisely.

> But it also waits if there is nothing to process, and that seems like the
> issue we are facing.

"Nothing to process" is not quite correct. The wl_callback that is installed is "something to process". It would only hang if the wl_callback is processed somewhere else.

That is why it's only safe to call it from the thread that will be reading events from the specified wl_queue (as mentioned in the comment above _roundtrip()). Hence my response in comment 9, asking for a backtrace to show whether this was the case when the hang occurs, and whether this is a problem in GStreamer itself.

(In reply to Nicolas Dufresne (stormer) from comment #13)
> Btw, if you can find back such an application that would show that a
> roundtrip inside _show() call is sometimes needed, that could unblock this
> issue. I think user-input in full app is what makes this case less visible,
> but could easily appear in kiosk or digital signage applications. This
> random roundtrip call looks generally harmful and random.

IIRC, the gtk videooverlay examples in -bad exhibited the hang sporadically.
e.g. I can get the following stall on startup with -bad/gl/gtk/filtervideooverlay/filtervideooverlay, which is because mesa's wayland GL handling doesn't take into account the possible race when setting a wl_proxy's wl_queue in https://cgit.freedesktop.org/mesa/mesa/tree/src/egl/drivers/dri2/platform_wayland.c

(gdb) t a a bt
+ Trace 238137
Thread 1 (Thread 0x7ffff7f98e00 (LWP 3612))
So, I guess the question is, is this actually our fault? As a result, a backtrace would be most helpful in determining where the blame lies.
*** This bug has been marked as a duplicate of bug 758984 ***