Bug 742684 – aggregator: Locking logic should be reviewed, cleaned up, and documented

After an evaluation, GNOME has moved from Bugzilla to GitLab. Learn more about GitLab.
No new issues can be reported in GNOME Bugzilla anymore.
To report an issue in a GNOME project, go to GNOME GitLab.
Do not go to GNOME Gitlab for: Bluefish, Doxygen, GnuCash, GStreamer, java-gnome, LDTP, NetworkManager, Tomboy.

Bug 742684 - aggregator: Locking logic should be reviewed, cleaned up, and documented


Summary:	aggregator: Locking logic should be reviewed, cleaned up, and documented


Status:	RESOLVED FIXED

Product:	GStreamer
Classification:	Platform
Component:	gst-plugins-bad
Version:	git master
Hardware:	Other Linux

Importance:	Normal normal
Target Milestone:	1.5.1
Assigned To:	GStreamer Maintainers
QA Contact:	GStreamer Maintainers

URL:
Whiteboard:

Depends on:
Blocks:	739010

Reported:	2015-01-10 02:16 UTC by Olivier Crête
Modified:	2015-02-20 00:42 UTC

See Also:
GNOME target:	---
GNOME version:	---

Attachments
aggregator: Protect data with the same mutex as GCond (5.60 KB, patch) 2015-01-22 22:45 UTC, Olivier Crête	committed	Details \| Review
aggregator: Replace event lock with pad's object lock (11.35 KB, patch) 2015-01-22 22:45 UTC, Olivier Crête	committed	Details \| Review
aggregator: Protect exported pad members with the pad's object lock (1.78 KB, patch) 2015-01-22 22:45 UTC, Olivier Crête	committed	Details \| Review
videoaggregator: Lock access to members of GstAggregatorPad (4.28 KB, patch) 2015-01-22 22:45 UTC, Olivier Crête	committed	Details \| Review
audiomixer: Don't reset caps on flush (1011 bytes, patch) 2015-01-22 22:45 UTC, Olivier Crête	committed	Details \| Review
aggregator: Consistently lock some members (8.76 KB, patch) 2015-01-22 22:45 UTC, Olivier Crête	committed	Details \| Review
audiomixer: Clear GstAudioInfo the the caps (908 bytes, patch) 2015-01-22 22:45 UTC, Olivier Crête	committed	Details \| Review
aggregator: Consistenly lock the flow_return state (3.30 KB, patch) 2015-01-22 22:45 UTC, Olivier Crête	committed	Details \| Review
audiomixer: Make flush start/stop test non-racy (3.17 KB, patch) 2015-01-22 22:45 UTC, Olivier Crête	none	Details \| Review
audiomixer: Avoid race in caps negotiation (3.60 KB, patch) 2015-01-22 22:45 UTC, Olivier Crête	committed	Details \| Review
aggregator: Protect the tags with the object lock (3.23 KB, patch) 2015-01-22 22:46 UTC, Olivier Crête	committed	Details \| Review
aggregator: Protect the srcpad caps negotiation with the stream lock (3.37 KB, patch) 2015-01-22 22:46 UTC, Olivier Crête	committed	Details \| Review
aggregator: Document locking for gst_aggregator_get_latency_unlocked() (3.50 KB, patch) 2015-01-22 22:46 UTC, Olivier Crête	committed	Details \| Review
aggregator: Protect all latency related members with the object lock (1.94 KB, patch) 2015-01-22 22:46 UTC, Olivier Crête	committed	Details \| Review
aggregator: Document how the segment is protected (824 bytes, patch) 2015-01-22 22:46 UTC, Olivier Crête	committed	Details \| Review
aggregator: Document locking of GstAggregatorPrivate members (2.18 KB, patch) 2015-01-22 22:46 UTC, Olivier Crête	committed	Details \| Review
audiomixer: Replace racy timeout based tested with drain query (5.81 KB, patch) 2015-01-22 22:46 UTC, Olivier Crête	committed	Details \| Review
audiomixer: Make flush start/stop test non-racy (2.58 KB, patch) 2015-01-22 23:25 UTC, Olivier Crête	committed	Details \| Review
aggregator: Hide GstAggregatorPad buffer and EOS fileds (9.45 KB, patch) 2015-01-26 10:40 UTC, Thibault Saunier	none	Details \| Review
aggregator: Make the PAD_LOCK private (4.11 KB, patch) 2015-01-26 10:40 UTC, Thibault Saunier	committed	Details \| Review
aggregator: More fixes around locking when accessing protected private fields (2.65 KB, patch) 2015-01-26 10:40 UTC, Thibault Saunier	committed	Details \| Review
this brings the aggregator a little closer back to the original design (996 bytes, patch) 2015-01-26 12:15 UTC, Mathieu Duponchelle	committed	Details \| Review
aggregator: Hide GstAggregatorPad buffer and EOS fileds (9.02 KB, patch) 2015-01-26 14:44 UTC, Thibault Saunier	committed	Details \| Review
aggregator: Cleanup locking around of AggregatorPad flush related fields (4.08 KB, patch) 2015-01-26 16:22 UTC, Thibault Saunier	none	Details \| Review
aggregator: Cleanup locking around AggregatorPad flush related fields (4.07 KB, patch) 2015-01-26 16:23 UTC, Thibault Saunier	committed	Details \| Review

Description Olivier Crête 2015-01-10 02:16:47 UTC

Playing around with the new GstAggregator code, I get deadlocks with the aggregator thread stuck on I see a macro that does:

g_mutex_lock()
g_cond_broadcast();
g_mutex_unlock();

This is always always always wrong. Because the g_cond_broadcast() may happen between the time when the actual condition is checked and the time when the g_cond_wait() is called, which results in the wait waiting forever.

Comment 1 Olivier Crête 2015-01-10 05:08:43 UTC

Actually, all of the locking around flushing and seeking is quite dodgy. There are a bunch of probably useless atomics, where the same variable is read somewhere else without the atomics. The stream lock is taken in the flush_start and released on a flush_stop.. But after a flush start, there is no guarantee you'll get a flush_stop, you could get a state change to NULL instead! I'm not sure I understand why the duplicated gst_aggregator_pad_steal_buffer() in flush_start ?

To make it fail:

GST_CHECKS=test_flush_start_flush_stop make elements/audiomixer.forever
GST_CHECKS=test_live_seeking make elements/audiomixer.forever

Comment 2 Thibault Saunier 2015-01-10 07:54:15 UTC

We have run those test for nights and have been very careful in the management of those locking issue so this is for sure a regression in recent code. Where in the code did you detect those issues?

Putting that bug as blocker.

(In reply to comment #1)
> Actually, all of the locking around flushing and seeking is quite dodgy. There
> are a bunch of probably useless atomics, where the same variable is read
> somewhere else without the atomics. The stream lock is taken in the flush_start
> and released on a flush_stop.. But after a flush start, there is no guarantee
> you'll get a flush_stop, you could get a state change to NULL instead!

This is the case only after a flushing seek... Please give more detailed comments because that "it looks wrong" without any detail is not that useful.


> I'm not  sure I understand why the duplicated
> gst_aggregator_pad_steal_buffer() in flush_start ?

To handle the case were we had a queued buffer and a thread blocked in _chain waiting for the first buffer to be unqueued.

> To make it fail:
> 
> GST_CHECKS=test_flush_start_flush_stop make elements/audiomixer.forever
> GST_CHECKS=test_live_seeking make elements/audiomixer.forever

Comment 3 Nicolas Dufresne (ndufresne) 2015-01-10 14:43:44 UTC

(In reply to comment #0)
> g_mutex_lock()
> g_cond_broadcast();
> g_mutex_unlock();
> 

Broadcasting this way is correct. Are you sure this is the example you wanted to show ?

Comment 4 Sebastian Dröge (slomo) 2015-01-10 15:16:46 UTC

(In reply to comment #3)
> (In reply to comment #0)
> > g_mutex_lock()
> > g_cond_broadcast();
> > g_mutex_unlock();
> > 
> 
> Broadcasting this way is correct. Are you sure this is the example you wanted
> to show ?

It's probably not correct. When using GCond you always need variables additional to the GCond to tell the waiter(s) what happened and that something has happened. And these variables also have to be protected by the same mutex.
Reason for that is that there can also be spurious wakeups without a signal()/broadcast() and also you usually have multiple conditions to check and decide on different code paths after waking up.

And g_cond_wait() that is not somehow in a loop like this is suspicious and probably wrong:

> g_mutex_lock();
> while (!conditions)
>   g_cond_wait();
> g_mutex_unlock()

Comment 5 Nicolas Dufresne (ndufresne) 2015-01-10 15:48:50 UTC

(In reply to comment #4) 
> > g_mutex_lock();
> > while (!conditions)
> >   g_cond_wait();
> > g_mutex_unlock()

This seems like a better way to pinpoint the problem (stating there is no loop around condition would have made this bug report nicer imho). There exist valid use case for the pattern Olivier used, sometimes you just want to make your waiter spin in order to recheck more complex conditions).

Comment 6 Olivier Crête 2015-01-10 19:35:08 UTC

Actually, I though the lack of a loop was the problem, but there is one, it's just well hidden by the fact that the aggregate function is called in a loop.

The problem is actually that a bunch of the member variables that are used in multiple threads are sometimes modified with some lock held and sometimes not. And every case where the SRC_STREAM_BROADCAST() macro is used is a case where the variable that is checked before waiting on the GCond has been updated without having the same mutex that is used by the GConf held and is therefore racy.

Someone really needs to go over every member variable in GstAggregator and document which mutex protects it (or why it doesn't need it!)

The only safe pattern to use a GCond is:

thread 1:
g_mutex_lock(mutex);
update variable ..; /* same mutex used to protect changes to the variable as used when reading it and for the GCond ! */
g_cond_broadcast (cond);
g_mutex_unlock(mutex);

thread 2:
g_mutex_lock (mutex);
while (variable) {
  g_cond_wait (cond, mutex);
  /* it's allow to unlock and relock here ! */
}
g_mutex_unlock (mutex)


Also, there are probably too many added mutexes in GstAggregator: the event_lock and "stream_lock" should the the Object lock of the pad, the setcaps lock should be the srcpad stream lock (actually should always be called from the streaming thread!).

Comment 7 Mathieu Duponchelle 2015-01-10 20:27:38 UTC

(In reply to comment #0)
> Playing around with the new GstAggregator code, I get deadlocks with the
> aggregator thread stuck on I see a macro that does:
> 

It would be useful to have a test / example that shows these deadlocks, as Thibault said we spent quite a lot of time working on thread safety in there and had very intensive test cases running forever. The code has substantially changed since then, and it would be wise that we rerun these specific tests to ensure there has been no regression, but please share failing cases.

> g_mutex_lock()
> g_cond_broadcast();
> g_mutex_unlock();
> 
> This is always always always wrong. Because the g_cond_broadcast() may happen
> between the time when the actual condition is checked and the time when the
> g_cond_wait() is called, which results in the wait waiting forever.

I'm not sure why you're saying this, when the doc specifies :

It is good practice to lock the same mutex as the waiting threads while calling this function, though not required.

Comment 8 Olivier Crête 2015-01-14 22:13:25 UTC

Made some more progress, it seems the audiomixer tests are racy as they expect the flush_stop to be synchronous, but it's not anymore. Also, the use case where you have a gst_element_release_pad() called in the middle of a flushing seek is not well handled now.

Example problematic use-case:
- upstream flushing seek and forward it upstream to sink_0 and sink_1
- get a flush start on pad sink_0
- get a flush start on pad sink_1
- get a flush stop pad sink_0
- release pad sink_1
- .. the streaming never restarts because the second flush stop never happens


Also, it seems that locking around gst_pad_remove() is not sufficient and then we get memory corruption and or that something ues the pads without taking the element's object lock and don't keep a reference. The investigation continues.

Comment 9 Olivier Crête 2015-01-22 22:44:16 UTC

I ended up going through all of the locking in GstAggregator and documenting the locking of every member. This resulted in a rather large number of patches which I'm attaching.

I also fixed some races in the audiomixer unit tests.

Comment 10 Olivier Crête 2015-01-22 22:45:06 UTC

Created attachment 295222 [details] [review]
aggregator: Protect data with the same mutex as GCond

Whenever a GCond is used, the safest paradigm is to protect
the variable which change is signalled by the GCond with the same
mutex that the GCond depends on.

Comment 11 Olivier Crête 2015-01-22 22:45:11 UTC

Created attachment 295223 [details] [review]
aggregator: Replace event lock with pad's object lock

Reduce the number of locks simplify code, what is protects
is exposed, but the lock was not.

Also means adding an _unlocked version of gst_aggregator_pad_steal_buffer().

Comment 12 Olivier Crête 2015-01-22 22:45:17 UTC

Created attachment 295224 [details] [review]
aggregator: Protect exported pad members with the pad's object lock

Comment 13 Olivier Crête 2015-01-22 22:45:23 UTC

Created attachment 295225 [details] [review]
videoaggregator: Lock access to members of GstAggregatorPad

Take the pad's object lock before accessing members of the
GstAggregatorPad structure.

Comment 14 Olivier Crête 2015-01-22 22:45:27 UTC

Created attachment 295226 [details] [review]
audiomixer: Don't reset caps on flush

A flush event doesn't invalidate the previous caps event.

Comment 15 Olivier Crête 2015-01-22 22:45:34 UTC

Created attachment 295227 [details] [review]
aggregator: Consistently lock some members

Some members sometimes used atomic access, sometimes where not locked at
all. Instead consistently use a mutex to protect them, also document
that.

Comment 16 Olivier Crête 2015-01-22 22:45:40 UTC

Created attachment 295228 [details] [review]
audiomixer: Clear GstAudioInfo the the caps

When clearing the caps, also clear the matching GstAudioInfo

Comment 17 Olivier Crête 2015-01-22 22:45:48 UTC

Created attachment 295229 [details] [review]
aggregator: Consistenly lock the flow_return state

Use the object's lock to protect it.

Comment 18 Olivier Crête 2015-01-22 22:45:53 UTC

Created attachment 295230 [details] [review]
audiomixer: Make flush start/stop test non-racy

The sender must send a new segment event after a flush stop, so drop
buffers in the meantime and then re-send the segment event after a flush.

Comment 19 Olivier Crête 2015-01-22 22:45:59 UTC

Created attachment 295231 [details] [review]
audiomixer: Avoid race in caps negotiation

With the current audiomixer, the input caps need to be the same,
otherwise there is an unavoidable race in the caps negotiation. So
enforce that using capsfilters

Comment 20 Olivier Crête 2015-01-22 22:46:06 UTC

Created attachment 295232 [details] [review]
aggregator: Protect the tags with the object lock

The tags related variables were sometimes protected, sometimes not and
sometimes atomic. Put them all under the object lock.

Comment 21 Olivier Crête 2015-01-22 22:46:12 UTC

Created attachment 295233 [details] [review]
aggregator: Protect the srcpad caps negotiation with the stream lock

Instead of adding another lock, use the srcpad stream lock, which is already
taken anyway to push out the new caps if needed.

Comment 22 Olivier Crête 2015-01-22 22:46:19 UTC

Created attachment 295234 [details] [review]
aggregator: Document locking for gst_aggregator_get_latency_unlocked()

Renamed it to _unlocked() to make it clear.

Comment 23 Olivier Crête 2015-01-22 22:46:30 UTC

Created attachment 295235 [details] [review]
aggregator: Protect all latency related members with the object lock

The locking was not consistent, now consistently use the object lock.

Comment 24 Olivier Crête 2015-01-22 22:46:36 UTC

Created attachment 295236 [details] [review]
aggregator: Document how the segment is protected

Document that it can only be accessed with the object lock.

Comment 25 Olivier Crête 2015-01-22 22:46:42 UTC

Created attachment 295237 [details] [review]
aggregator: Document locking of GstAggregatorPrivate members

Most of them are protected by the object lock, specify
which ones use a different lock.

Comment 26 Olivier Crête 2015-01-22 22:46:49 UTC

Created attachment 295238 [details] [review]
audiomixer: Replace racy timeout based tested with drain query

Using the drain query, we can be certain that the buffer has done going
through the aggregator by taking the stream locks.

Comment 27 Olivier Crête 2015-01-22 23:25:09 UTC

Created attachment 295245 [details] [review]
audiomixer: Make flush start/stop test non-racy

The flush stop could have happened between the source trying
to push the segment event and the buffer, this would cause a warning.
Prevent that by taking the source's stream lock while flushing.

Comment 28 Thibault Saunier 2015-01-23 14:47:53 UTC

Review of attachment 295222 [details] [review]:

::: gst-libs/gst/base/gstaggregator.c
@@ -700,2 @@
   gst_pad_stop_task (self->srcpad);
-  SRC_STREAM_BROADCAST (self);

Why do you remove the broadcasting here??

@@ +847,3 @@
        * called
        */
+      SRC_STREAM_LOCK (self);

Why do you take the lock so early? I can't see any data you protect between that line and the broadcasting.

@@ +994,3 @@
   GST_INFO_OBJECT (pad, "Removing pad");
 
+  SRC_STREAM_LOCK (self);

Why do you take the lock so early? I can't see any data you protect between that line and the broadcasting.

@@ +1763,3 @@
   }
 
+  SRC_STREAM_LOCK (self);

Why did you move it here?

@@ +1885,2 @@
     PAD_LOCK_EVENT (aggpad);
+    g_atomic_int_set (&aggpad->priv->flushing, FALSE);

Why did you move that here? ->flushing is not protected by the event lock.

Comment 29 Thibault Saunier 2015-01-23 14:50:07 UTC

Review of attachment 295223 [details] [review]:

I really do not think it is a good idea to use GST_OBJECT_LOCK here, moreover we have a cond associated with that lock and I can very well imagine deadlock happening because of that approach.

Also, I do not see any reason why you would hide GST_OBJECT_LOCK in a new macro instead of using it directly?

Comment 30 Thibault Saunier 2015-01-23 14:51:22 UTC

Review of attachment 295224 [details] [review]:

::: gst-libs/gst/base/gstaggregator.c
@@ +195,2 @@
   aggpad->eos = FALSE;
   aggpad->priv->flushing = FALSE;

Flushing should not need locking and should be exclusively used atomically.

@@ +362,3 @@
     pad = l->data;
 
+    PAD_LOCK (pad);

Yeah, crazy that was not done!

Comment 31 Thibault Saunier 2015-01-23 14:55:25 UTC

Review of attachment 295225 [details] [review]:

I see why you used the GST_OBJECT_LOCK to replace the EVENT_LOCK now, but I am really afraid it is dangerous (mainly because that lock is associated with the GCond), we should maybe expose a GST_AGGREGATOR_PAD_LOCK or provide getters for those field and hide them in the Private structure?

Comment 32 Thibault Saunier 2015-01-23 14:56:18 UTC

Review of attachment 295226 [details] [review]:

Are you sure about that? I though it forced a new negotiation, doesn't it?

Comment 33 Thibault Saunier 2015-01-23 14:59:22 UTC

Review of attachment 295227 [details] [review]:

Thanks for rationalize those things!

You say you documented that but I can't see the documentation in that commit! :)

Comment 34 Thibault Saunier 2015-01-23 15:01:01 UTC

Review of attachment 295231 [details] [review]:

Isn't there code in audiomixer to avoid the use to requiere to do that? I thought there was one.

Comment 35 Thibault Saunier 2015-01-23 15:02:19 UTC

Review of attachment 295232 [details] [review]:

Looks right

Comment 36 Thibault Saunier 2015-01-23 15:04:30 UTC

Review of attachment 295233 [details] [review]:

I did not think about doing that, it might be right indeed. I think slomo had a failling testcase without that patch so it would be nice of him to tell us if it is still good this way :)

Comment 37 Thibault Saunier 2015-01-23 15:06:56 UTC

Review of attachment 295234 [details] [review]:

I do not like much methods that force the user to held a lock, what is the reason for not doing it in the method itself?

Comment 38 Thibault Saunier 2015-01-23 15:07:58 UTC

Review of attachment 295235 [details] [review]:

OK

Comment 39 Thibault Saunier 2015-01-23 15:08:16 UTC

Review of attachment 295236 [details] [review]:

OK

Comment 40 Tim-Philipp Müller 2015-01-23 15:09:04 UTC

Comment on attachment 295224 [details] [review]
aggregator: Protect exported pad members with the pad's object lock

>+    PAD_LOCK (pad);
>+    if (pad->buffer == NULL && !pad->eos) {
>+      GST_OBJECT_UNLOCK (pad);
>       goto pad_not_ready;
>+    }
>+    PAD_UNLOCK (pad);

Surely the middle bit should be PAD_UNLOCK as well for consistency? (perhaps it'd be nicer with a temp variable)

Comment 41 Thibault Saunier 2015-01-23 15:11:44 UTC

Review of attachment 295238 [details] [review]:

Sounds like a nice trick indeed :)

Comment 42 Thibault Saunier 2015-01-23 15:12:34 UTC

Review of attachment 295245 [details] [review]:

OK

Comment 43 Thibault Saunier 2015-01-23 15:12:36 UTC

Review of attachment 295245 [details] [review]:

OK

Comment 44 Thibault Saunier 2015-01-23 15:16:18 UTC

Mathieu would also like to have another look. The code changed a **lot** since our first version and we did not follow well enough the various changes which indeed lead to many inconsistencies.

Comment 45 Olivier Crête 2015-01-23 21:51:02 UTC

Thanks for the thoughtful review, I'll try to answer some of your questions/comments below. I've also ran the libs/aggregator, elements/compositor and elements/audiomixer tests in a loop overnight with no failures. I'm getting a timeout after a number of iterators in the nleoperation test, so I'll investigate that.

I think the big open question is if it should use the object lock of the pad of have a private lock for the members of GstAggregatorPad (and for the PAD_EVENT* stuff). If it was a private lock, I think it would be much easier for subclasses if that lock was taken after the object lock, and then make every member of GstAggregatorPad private and have accessors. But that makes subclasses a bit more complex.

(In reply to comment #28)
> Review of attachment 295222 [details] [review]:
> 
> ::: gst-libs/gst/base/gstaggregator.c
> @@ -700,2 @@
>    gst_pad_stop_task (self->srcpad);
> -  SRC_STREAM_BROADCAST (self);
> 
> Why do you remove the broadcasting here??

Because the thing waiting on it is in the streaming thread, which has just been stopped, so there is no point in broadcasting.

> @@ +847,3 @@
>         * called
>         */
> +      SRC_STREAM_LOCK (self);
> 
> Why do you take the lock so early? I can't see any data you protect between
> that line and the broadcasting.

The SRC_STREAM_WAIT() waits for one of self->priv->running, self->priv->send_eos, pad->buffer or pad->eos, so I included changes to those in the lock that matches it.

> @@ +994,3 @@
>    GST_INFO_OBJECT (pad, "Removing pad");
> 
> +  SRC_STREAM_LOCK (self);
> 
> Why do you take the lock so early? I can't see any data you protect between
> that line and the broadcasting.

Same as the previous one.

> @@ +1763,3 @@
>    }
> 
> +  SRC_STREAM_LOCK (self);
> 
> Why did you move it here?

To take it before reading the variables that the matching condition cares about.
 
> @@ +1885,2 @@
>      PAD_LOCK_EVENT (aggpad);
> +    g_atomic_int_set (&aggpad->priv->flushing, FALSE);
> 
> Why did you move that here? ->flushing is not protected by the event lock.

Actually, which lock is protecteing aggpad->priv->flushing? I didn't go through the locking off the GstAggregatorPadPrivate structure member, it's not consistently using atomics (not in gst_aggregator_pad_flush). And since it's one of the things that PAD_WAIT_EVENT() depends on, it needs to be protected by the same lock.

(In reply to comment #29)
> Review of attachment 295223 [details] [review]:
> 
> I really do not think it is a good idea to use GST_OBJECT_LOCK here, moreover
> we have a cond associated with that lock and I can very well imagine deadlock
> happening because of that approach.

There could be deadlocks, and we could certainly revert that to be a separate lock, but in this case. The locking order between this lock and the other locks needs to be defined and documented. It also means that the members of the GstAggregatorPad structure should all be 

> Also, I do not see any reason why you would hide GST_OBJECT_LOCK in a new macro
> instead of using it directly?

Because macros were already used everywhere and I assume they had been useful for debugging.

(In reply to comment #30)
> Review of attachment 295224 [details] [review]:
> 
> ::: gst-libs/gst/base/gstaggregator.c
> @@ +195,2 @@
>    aggpad->eos = FALSE;
>    aggpad->priv->flushing = FALSE;
> 
> Flushing should not need locking and should be exclusively used atomically.

I'm not sure I understand all of the logic around the flushing, but it seems to work fine.. Except that this function wasn't atomic. And since this is used with the PAD_WAIT_EVENT, I'm not sure if it can actually be atomic and be safe. 


(In reply to comment #31)
> Review of attachment 295225 [details] [review]:
> 
> I see why you used the GST_OBJECT_LOCK to replace the EVENT_LOCK now, but I am
> really afraid it is dangerous (mainly because that lock is associated with the
> GCond), we should maybe expose a GST_AGGREGATOR_PAD_LOCK or provide getters for
> those field and hide them in the Private structure?

I wouldn't be unfavorable to hidding everything and using a private lock. As long as this lock's order is after the object lock.

(In reply to comment #32)
> Review of attachment 295226 [details] [review]:
> 
> Are you sure about that? I though it forced a new negotiation, doesn't it?

I though that too, but slomo corrected me. If you look for GST_EVENT_FLUSH_STOP in gstpad.c, it only removes EOS & Segment sticky events.

(In reply to comment #33)
> Review of attachment 295227 [details] [review]:
> 
> Thanks for rationalize those things!
> 
> You say you documented that but I can't see the documentation in that commit!
> :)

Ooops, should fix the commit message...

(In reply to comment #34)
> Review of attachment 295231 [details] [review]:
> 
> Isn't there code in audiomixer to avoid the use to requiere to do that? I
> thought there was one.

There is no audioresample/audioconvert library yet. But I definitely think that we should aim to do like compositor.


(In reply to comment #37)
> Review of attachment 295234 [details] [review]:
> 
> I do not like much methods that force the user to held a lock, what is the
> reason for not doing it in the method itself?

Not a big fan of that either, but subclasses seem to hold the object lock for a long time, so it makes it a bit harder to implement without it.

Comment 46 Olivier Crête 2015-01-23 23:08:00 UTC

Bad news, the nleoperation tests run with git master of gst-p-bad segfaults in the compositor's blending function when running test_complex_operations_bis. While it seems that with the patches attached here, the test just times out. Further investigation ongoing.

Comment 47 Mathieu Duponchelle 2015-01-24 12:53:24 UTC

Bad news indeed, but I can't reproduce the timeout with your branch by running this test, been at it for one hour. HEAD is at 

412b71c443029c5ed5873dda5e6a8d83b07cc670 :
audiomixer: Replace racy timeout based tested with drain query

Comment 48 Mathieu Duponchelle 2015-01-24 13:03:55 UTC

Nevermind I reproduced, using GST_CHECk= was not very clever, glad you're investigating that :)

Comment 49 Mathieu Duponchelle 2015-01-24 18:28:05 UTC

To be more specific, with my HEAD as specified in c47, I reproduce the segfault, that is indeed in the blending code. I've reproduced with some logs, and it seems to happen at the beginning of a new run, ie I see 
http://www.fpaste.org/174035/12398814/ , new run starting at L36 . Interestingly enough at that moment there's only one sink on the compositor anyway, kind of lame to crash here :)

Has me wondering if the problem has anything to do with aggregator anyway, could be a subclass issue, not enough info to determine that but I thought I'd share :)

Comment 50 Thibault Saunier 2015-01-26 10:40:15 UTC

Created attachment 295427 [details] [review]
aggregator: Hide GstAggregatorPad buffer and EOS fileds

And add a getter for the EOS.

The user should always use the various getters to access
those fields

Comment 51 Thibault Saunier 2015-01-26 10:40:21 UTC

Created attachment 295428 [details] [review]
aggregator: Make the PAD_LOCK private

Instead of using the GST_OBJECT_LOCK we should have
a dedicated mutex for the pad as it is also associated
with the mutex on the EVENT_MUTEX on which we wait
in the _chain function of the pad.

The GstAggregatorPad.segment is still protected with the
GST_OBJECT_LOCK.

Comment 52 Thibault Saunier 2015-01-26 10:40:28 UTC

Created attachment 295429 [details] [review]
aggregator: More fixes around locking when accessing protected private fields

In some more places we were accessing GstAggregator->segment
and GstAggregator->seqnum without holding the GST_OBJECT_LOCK

Comment 53 Mathieu Duponchelle 2015-01-26 12:15:45 UTC

Created attachment 295442 [details] [review]
this brings the aggregator a little closer back to the original design

Comment 54 Thibault Saunier 2015-01-26 14:44:51 UTC

Created attachment 295458 [details] [review]
aggregator: Hide GstAggregatorPad buffer and EOS fileds

And add a getter for the EOS.

The user should always use the various getters to access
those fields

Comment 55 Olivier Crête 2015-01-26 15:37:46 UTC

Review of attachment 295428 [details] [review]:

Can you add a comment in the GstAggregatorPadPrivate explicitly saying which lock to use.

Comment 56 Mathieu Duponchelle 2015-01-26 15:47:27 UTC

Review of attachment 295458 [details] [review]:

OK, I would not lock the pad in the getter though, cause the subclasses methods should always be called in conditions where EOS won't change, but that *shouldn't* hurt anyway

Comment 57 Mathieu Duponchelle 2015-01-26 15:48:08 UTC

Review of attachment 295429 [details] [review]:

OK :)

Comment 58 Mathieu Duponchelle 2015-01-26 15:48:55 UTC

Review of attachment 295428 [details] [review]:

setting patch status

Comment 59 Thibault Saunier 2015-01-26 16:22:31 UTC

Created attachment 295466 [details] [review]
aggregator: Cleanup locking around of AggregatorPad flush related fields

And document the locking

Comment 60 Thibault Saunier 2015-01-26 16:23:24 UTC

Created attachment 295467 [details] [review]
aggregator: Cleanup locking around AggregatorPad flush related fields

And document the locking

Comment 61 Thibault Saunier 2015-01-26 16:24:23 UTC

(In reply to comment #55)
> Review of attachment 295428 [details] [review]:
> 
> Can you add a comment in the GstAggregatorPadPrivate explicitly saying which
> lock to use.

I did that in that last 'aggregator: Cleanup locking around AggregatorPad flush related fields' commit

Comment 62 Olivier Crête 2015-01-26 22:30:02 UTC

Should probably squash your patches and mine and remove the _unlocked() variant that no longer makes sense if it's not the same lock. Also need to document the locking order between the GstAggregator pad lock and the GstObject lock of the element and the pad. In the STREAM_LOCK(sinkpad)->STREAM_LOCK(srcpad)->OBJECT_LOCK(element)->OBJECT_LOCK(pad) order, I guess it goes between OBJECT_LOCK(element) and OBJECT_LOCK(pad) ? Which is a bit strange, but oh well.

Comment 63 Thibault Saunier 2015-01-29 10:05:38 UTC

Not closing as we still need to document the locking order and in case we still have things
to discuss but I think locking is looking much better now.


Attachment 295222 [details] pushed as 0a2dc11 - aggregator: Protect data with the same mutex as GCond
Attachment 295223 [details] pushed as cc3f418 - aggregator: Replace event lock with pad's object lock
Attachment 295224 [details] pushed as 68ac643 - aggregator: Protect exported pad members with the pad's object lock
Attachment 295225 [details] pushed as ee04b09 - videoaggregator: Lock access to members of GstAggregatorPad
Attachment 295226 [details] pushed as 067b44e - audiomixer: Don't reset caps on flush
Attachment 295227 [details] pushed as cc605f4 - aggregator: Consistently lock some members
Attachment 295228 [details] pushed as a2f1aa3 - audiomixer: Clear GstAudioInfo the the caps
Attachment 295229 [details] pushed as eddd5fd - aggregator: Consistenly lock the flow_return state
Attachment 295231 [details] pushed as 94e2d78 - audiomixer: Avoid race in caps negotiation
Attachment 295232 [details] pushed as 4a5882e - aggregator: Protect the tags with the object lock
Attachment 295233 [details] pushed as f7070dc - aggregator: Protect the srcpad caps negotiation with the stream lock
Attachment 295234 [details] pushed as 41d2667 - aggregator: Document locking for gst_aggregator_get_latency_unlocked()
Attachment 295235 [details] pushed as 9df8ac0 - aggregator: Protect all latency related members with the object lock
Attachment 295236 [details] pushed as ea76d39 - aggregator: Document how the segment is protected
Attachment 295237 [details] pushed as 93d0b51 - aggregator: Document locking of GstAggregatorPrivate members
Attachment 295238 [details] pushed as f98e457 - audiomixer: Replace racy timeout based tested with drain query
Attachment 295245 [details] pushed as fb6ba27 - audiomixer: Make flush start/stop test non-racy
Attachment 295428 [details] pushed as 1a07467 - aggregator: Make the PAD_LOCK private
Attachment 295429 [details] pushed as ccf329d - aggregator: More fixes around locking when accessing protected private fields
Attachment 295458 [details] pushed as d8eef43 - aggregator: Hide GstAggregatorPad buffer and EOS fileds
Attachment 295467 [details] pushed as 71e4c48 - aggregator: Cleanup locking around AggregatorPad flush related fields

Comment 64 Tim-Philipp Müller 2015-02-07 09:51:20 UTC

Not a blocker bug any more then.

Who's going to document the locking order?

Comment 65 Olivier Crête 2015-02-18 22:26:09 UTC

Actually, the code is still wrong, doing this fails after a couple iterations:
GST_CHECKS=test_live_seeking GST_DEBUG=*agg*:6 CK_FORK=no G_DEBUG=fatal-warnings,fatal-criticals make elements/audiomixer.forever

I think the reason is that in the gst_aggregator_wait_and_check() function, it checks on a bunch of stuff before deciding to wait on the condition variable  or the clock. But those things can change without the matching "src_lock" being taken.. I think that the PAD_LOCK and the SRC_LOCK should be merged into one, but they're not next to each other in the locking order, so it's not trivial.


I documented the locking order, and I also renamed both locks that had "stream" in their name to something less confusing as they really were not what we called "stream locks" in GStreamer.

commit c37e82587c57b28337972c2ea5bee26ce9f9e8e3
Author: Olivier Crête <olivier.crete@collabora.com>
Date:   Wed Feb 18 15:53:53 2015 -0500

    aggregator: Document locking order
    
    https://bugzilla.gnome.org/show_bug.cgi?id=742684

commit 17df37d8cbbf2593b477a6de6014bf64134a3311
Author: Olivier Crête <olivier.crete@collabora.com>
Date:   Wed Feb 18 15:11:14 2015 -0500

    aggregator: Rename confusinly named SRC_STREAM_LOCK macros to SRC_LOCK
    
    This will match the name of the lock itself. It is also not a stream
    lock as it not recursive and not held while pushing.
    
    https://bugzilla.gnome.org/show_bug.cgi?id=742684

commit 3a3f2b534377c906a6a5d0fc5e5c4a2f14fac845
Author: Olivier Crête <olivier.crete@collabora.com>
Date:   Wed Feb 18 15:06:01 2015 -0500

    aggregator: Rename confusingly named stream lock to flush lock
    
    This lock is not what is commonly known as a "stream lock" in GStremer,
    it's not recursive and it's taken from the non-serialized FLUSH_START event.
    
    https://bugzilla.gnome.org/show_bug.cgi?id=742684

commit 36ef8f0bd48586111d9db0ca409f009eda3fa675
Author: Olivier Crête <olivier.crete@collabora.com>
Date:   Wed Feb 18 15:04:04 2015 -0500

    aggregator: Fix macro indendation
    
    Changes no code
    
    https://bugzilla.gnome.org/show_bug.cgi?id=742684

Comment 66 Tim-Philipp Müller 2015-02-18 22:59:19 UTC

Thanks for renaming some of the locks, I was meaning to do that as well, since it was quite confusing like that.

Comment 67 Olivier Crête 2015-02-20 00:06:27 UTC

With this commit, I can't make it fail anymore, so mark it as fixed!

commit 97f6b7a8aacc526ada14db8bcbd70008d6fc846b
Author: Olivier Crête <olivier.crete@collabora.com>
Date:   Thu Feb 19 18:30:35 2015 -0500

    aggregator: Don't try to push tags while flush seeking
    
    The downstream segment could have been flushed already, so
    need to re-send the segment event before re-sending the tags.
    
    https://bugzilla.gnome.org/show_bug.cgi?id=742684

Comment 68 Mathieu Duponchelle 2015-02-20 00:42:29 UTC

Thanks for the time you spent :)