After an evaluation, GNOME has moved from Bugzilla to GitLab. Learn more about GitLab.
No new issues can be reported in GNOME Bugzilla anymore.
To report an issue in a GNOME project, go to GNOME GitLab.
Do not go to GNOME Gitlab for: Bluefish, Doxygen, GnuCash, GStreamer, java-gnome, LDTP, NetworkManager, Tomboy.
Bug 760757 - systemclock: incorrect handling of wait requests
systemclock: incorrect handling of wait requests
Status: RESOLVED FIXED
Product: GStreamer
Classification: Platform
Component: gstreamer (core)
git master
Other Windows
: Normal normal
: 1.7.2
Assigned To: GStreamer Maintainers
GStreamer Maintainers
Depends on:
Blocks:
 
 
Reported: 2016-01-17 22:01 UTC by Florin Apostol
Modified: 2016-01-20 11:50 UTC
See Also:
GNOME target: ---
GNOME version: ---


Attachments
proposed patch (6.98 KB, patch)
2016-01-18 16:42 UTC, Florin Apostol
committed Details | Review
unit test to reproduce the problem (2.91 KB, patch)
2016-01-18 23:14 UTC, Florin Apostol
committed Details | Review

Description Florin Apostol 2016-01-17 22:01:09 UTC
When scheduling an alarm and inserting it at the beginning of the list gst_system_clock_id_wait_async will check if the current head entry is busy.If it is, the worker thread is already handling the head of the list and needs to be wake up to take a look at the new head. If the current head is not busy, the worker thread has not looked yet at the first entry, so inserting a new one before it is safe because the worker thread will start looking at the first entry from the list.

But gst_system_clock_async_thread will choose the first entry and drop the clock lock before calling gst_system_clock_id_wait_jitter_unlocked to set the chosen entry to busy. This creates a small window of opportunity in which a new entry is inserted by gst_system_clock_id_wait_async at the head of the list without informing the worker thread and the worker thread starting to handle the chosen entry which now is the second one in the list.

As a result, the new entry inserted at the beginning of the list will be ignored until the list becomes empty.
Comment 1 Sebastian Dröge (slomo) 2016-01-18 09:24:09 UTC
Do you want to provide a patch?
Comment 2 Florin Apostol 2016-01-18 10:32:50 UTC
Ok, I will
Comment 3 Florin Apostol 2016-01-18 16:42:58 UTC
Created attachment 319277 [details] [review]
proposed patch

My tests are passing.
make gst/gstsystemclock.check and make gst/gstclock.check are passing
Did not run other tests.
Comment 4 Nicolas Dufresne (ndufresne) 2016-01-18 17:37:49 UTC
Review of attachment 319277 [details] [review]:

::: gst/gstsystemclock.c
@@ +488,3 @@
+      /* mark the entry as busy but watch out for intermediate unscheduled
+       * statuses */
+    } while (G_UNLIKELY (!CAS_ENTRY_STATUS (entry, status, GST_CLOCK_BUSY)));

This looks like a lockless routine run with the object lock held. Sounds risky, are you sure this go well in real-world cases ?
Comment 5 Florin Apostol 2016-01-18 17:43:45 UTC
(In reply to Nicolas Dufresne (stormer) from comment #4)
> Review of attachment 319277 [details] [review] [review]:
> 
> ::: gst/gstsystemclock.c
> @@ +488,3 @@
> +      /* mark the entry as busy but watch out for intermediate unscheduled
> +       * statuses */
> +    } while (G_UNLIKELY (!CAS_ENTRY_STATUS (entry, status,
> GST_CLOCK_BUSY)));
> 
> This looks like a lockless routine run with the object lock held. Sounds
> risky, are you sure this go well in real-world cases ?

From what I know it is safe to perform atomic operations while holding a lock. 
Do you have any specific concerns?
Comment 6 Tim-Philipp Müller 2016-01-18 17:57:07 UTC
Also, do you think you might be able to come up with a unit test for this? It doesn't have to fail reliably, but if we can run something a thousand times in a loop and it fails now and then (before patch, but not after), that'd be nice.
Comment 7 Florin Apostol 2016-01-18 17:59:17 UTC
OK, I could do something similar to test_async_order
Comment 8 Florin Apostol 2016-01-18 23:14:50 UTC
Created attachment 319314 [details] [review]
unit test to reproduce the problem

this test (almost) reliable reproduces the problem. (9/10 runs)
After the fix is applied, the test never fails.

Increasing the number of iterations might increase the reproduction rate, but it depends on how busy the machine is and how threads are scheduled, so there is no guarantee. For my machine this was good enough to reproduce the problem without waiting too much on too many alarms.
Comment 9 Sebastian Dröge (slomo) 2016-01-20 11:48:48 UTC
commit e10a220a63b4eaa9f2c131da8f8161fb66dbe74a
Author: Florin Apostol <florin.apostol@oregan.net>
Date:   Mon Jan 18 22:45:58 2016 +0000

    systemclock: tests: added stress test for async order
    
    Keep inserting alarms at the beginning of the list. Due to
    https://bugzilla.gnome.org/show_bug.cgi?id=760757
    alarm thread will get confused and not serve them in order.

commit 4e2250e531df41f9795d1f6879e78110dcac9a6a
Author: Florin Apostol <florin.apostol@oregan.net>
Date:   Mon Jan 18 16:25:20 2016 +0000

    systemclock: fixed race condition in handling alarms
    
    When choosing the first entry from the list, gst_system_clock_async_thread
    must set the entry state to busy before releasing the clock lock. Otherwise
    a new entry could be added to the beginning of the list and
    gst_system_clock_async_thread will be unaware and keep waiting on the entry
    it has already chosen.
    
    Also improved messages about expected state and bumped them to ERROR level
    to detect unexpected state changes.
    
    https://bugzilla.gnome.org/show_bug.cgi?id=760757
Comment 10 Sebastian Dröge (slomo) 2016-01-20 11:50:19 UTC
Wim reviewed on IRC and said patches can be merged.