After an evaluation, GNOME has moved from Bugzilla to GitLab. Learn more about GitLab.
No new issues can be reported in GNOME Bugzilla anymore.
To report an issue in a GNOME project, go to GNOME GitLab.
Do not go to GNOME Gitlab for: Bluefish, Doxygen, GnuCash, GStreamer, java-gnome, LDTP, NetworkManager, Tomboy.
Bug 692691 - queue: dangling pointer when doing allocation query on a flushing queue
queue: dangling pointer when doing allocation query on a flushing queue
Status: RESOLVED FIXED
Product: GStreamer
Classification: Platform
Component: gstreamer (core)
1.x
Other Linux
: Normal normal
: 1.0.6
Assigned To: GStreamer Maintainers
GStreamer Maintainers
Depends on:
Blocks:
 
 
Reported: 2013-01-28 10:04 UTC by Alexander Schrab
Modified: 2013-02-06 12:42 UTC
See Also:
GNOME target: ---
GNOME version: ---


Attachments
Patch for queue.c and queuearray.c (5.22 KB, patch)
2013-01-28 10:09 UTC, Alexander Schrab
committed Details | Review
Unit test patch for queue that triggers the bug (1.98 KB, patch)
2013-01-29 11:30 UTC, Alexander Schrab
none Details | Review
Unit test patch for queue that triggers the bug (2.41 KB, patch)
2013-01-29 11:45 UTC, Alexander Schrab
committed Details | Review
Unit test for queuearray find/drop element (5.14 KB, patch)
2013-01-30 08:14 UTC, Alexander Schrab
committed Details | Review
Cleaned up unit test (4.48 KB, patch)
2013-02-05 08:45 UTC, Alexander Schrab
needs-work Details | Review
Cleaned up unit test (5.26 KB, patch)
2013-02-06 08:59 UTC, Alexander Schrab
needs-work Details | Review

Description Alexander Schrab 2013-01-28 10:04:54 UTC
When querying a queue that is flushing we end up adding a query to the queuearray without taking a reference to that query (because the normal functionality is to block until that query is done and discarded from the queue). This later causes problem if the query is unreffed outside of the queue before we discard the queue. There is a check to avoid unreffing any lingering query-objects, but since the query has been deleted that check fails.

Solution:
Remove the query from the queue if it is still there in the out_flushing: error case in gstqueue.c:gst_queue_handle_sink_query
Comment 1 Alexander Schrab 2013-01-28 10:09:35 UTC
Created attachment 234598 [details] [review]
Patch for queue.c and queuearray.c

There was a side issue from fixing this bug: the gst_queue_array_find and gst_queue_array_drop_element methods in queuearray.c were completely wrong (cannot have been used at all, maybe they were left from an earlier implementation?). This proposed patch fixes the original issue and follow up issue from needing those two methods in a working state.
Comment 2 Tim-Philipp Müller 2013-01-28 14:56:24 UTC
A unit test would be nice :)

I wonder if it wouldn't be better/cleaner to just ref/unref the query while it's in the queue then?
Comment 3 Alexander Schrab 2013-01-29 06:49:51 UTC
I agree with you.. The thought hit me as well, but at the same time I didn't want to make too much of an operation in the code. I will make a patch that reffes queries. Question is what to do with the queuearray.c stuff? It is obviously broken, but on the other hand it might be better to just remove if noone uses it (and take a look at why it happens to work in the few places it is used).

(In reply to comment #2)
> A unit test would be nice :)
> 
> I wonder if it wouldn't be better/cleaner to just ref/unref the query while
> it's in the queue then?
Comment 4 Tim-Philipp Müller 2013-01-29 11:06:14 UTC
> Question is what to do with the queuearray.c stuff? It is
> obviously broken, but on the other hand it might be better to just remove if
> noone uses it (and take a look at why it happens to work in the few places it
> is used).

It should still be fixed in a separate patch IMHO (ideally with unit test addition to tests/check/libs/queuearray.c).

We can't really remove the API unless it was totally impossible that anyone ever used it at all (e.g. would crash immediately), and even then it's probably better to just fix it up, no?
Comment 5 Alexander Schrab 2013-01-29 11:28:08 UTC
I finally have a unit test that triggers the bug *sometimes*, I wanted but have so far failed in making it fail all the time. I narrowed the problem down to this:

Making allocation queries to the queue while simultaneously sending one flush_start to the queue can trigger this. The reason being as I said earlier: 

handle_sink_query falls out through out_flushing: while waiting for the query to complete. This results in a race between the thread doing a query that will most likely unref the query when the function returns and the queue that will try to flush out the query (doing an GST_IS_QUERY(data) on something that might already be deleted).

So what can be done?

1. as we said before, we can either ref or make sure the item is gone (I do prefer the ref since we can remove a number of special cases for the query data items)
2. Make sure we always call gst_queue_locked_flush right after setting srcresult to GST_FLOW_FLUSHING, before we let the query call return (it is waiting for GST_QUEUE_WAIT_DEL_CHECK (queue, out_flushing)). This behavior is already there when pushing buffers results in setting srcresult to flushing as far as I can see, but I have way to little knowledge of the queue to understand if the gst_queue_locked_flush can be used that way.


I'm adding the test case if someone wants to take a look... It's a pretty ugly unit test with sleep etc, so it's not in a state to import...
Comment 6 Alexander Schrab 2013-01-29 11:30:23 UTC
Created attachment 234716 [details] [review]
Unit test patch for queue that triggers the bug

This unit test is not in a very nice state, it's a race condition bug and I'm adding this test if anyone else wants to try to trigger the bug. You might need to run the test through valgrind to get any real errors...
Comment 7 Alexander Schrab 2013-01-29 11:36:35 UTC
Just realized this patch is broken, I need to get my git act together :) 

(In reply to comment #6)
> Created an attachment (id=234716) [details] [review]
> Unit test patch for queue that triggers the bug
> 
> This unit test is not in a very nice state, it's a race condition bug and I'm
> adding this test if anyone else wants to try to trigger the bug. You might need
> to run the test through valgrind to get any real errors...
Comment 8 Alexander Schrab 2013-01-29 11:45:18 UTC
Created attachment 234720 [details] [review]
Unit test patch for queue that triggers the bug

Patch that adds a test that triggers the bug... sometimes...

Use valgrind and run it a few times. The test has a few sleeps etc, so it's just for debugging. Haven't been able to reproduce it more clean yet...
Comment 9 Tim-Philipp Müller 2013-01-29 12:01:42 UTC
> Patch that adds a test that triggers the bug... sometimes... 
> Use valgrind and run it a few times. The test has a few sleeps etc, so it's
> just for debugging. Haven't been able to reproduce it more clean yet...

Cool, thanks!

It's tricky with racy stuff like that, but if the issue pops up again it will eventually show up in form of spurious test failures.

It is possible to do  cd tests/check/; make foo/bar.torture  or make foo/bar.forever  btw.

Only other way might be to wrap the whole test in a loop that runs it a hundred times or so, and hope that makes it show up more reliably. But it shuould be fine as it is, thanks a lot.
Comment 10 Alexander Schrab 2013-01-29 13:47:32 UTC
Doing ref/unref won't be a good solution, at least not only doing that. Even though we correctly unref after a while, I get into new problems when the callee expects the query to be mutable (ref count == 1). From base source:

...
  query = gst_query_new_allocation (caps, TRUE);
  if (!gst_pad_peer_query (basesrc->srcpad, query)) {
    /* not a problem, just debug a little */
    GST_DEBUG_OBJECT (basesrc, "peer ALLOCATION query failed");
  }

  g_assert (bclass->decide_allocation != NULL);
  result = bclass->decide_allocation (basesrc, query);

  GST_DEBUG_OBJECT (basesrc, "ALLOCATION (%d) params: %" GST_PTR_FORMAT, result,
      query);

  if (!result)
    goto no_decide_allocation;

  /* we got configuration from our peer or the decide_allocation method,
   * parse them */
  if (gst_query_get_n_allocation_params (query) > 0) {
    gst_query_parse_nth_allocation_param (query, 0, &allocator, &params);
  } else {
    allocator = NULL;
    gst_allocation_params_init (&params);
  }
...

Here, even though the query fails, we continue to use the query object. Which should be fine! But the extra ref in the queue is a problem in this case.
Comment 11 Tim-Philipp Müller 2013-01-29 15:47:43 UTC
Ah, I see, I didn't expect anything to be still (ab)using the query after _query() fails, but it's a good point. Let's go with your original plan then.
Comment 12 Alexander Schrab 2013-01-30 07:25:46 UTC
Great, I'll build unit tests for my queuearray changes as well


(In reply to comment #11)
> Ah, I see, I didn't expect anything to be still (ab)using the query after
> _query() fails, but it's a good point. Let's go with your original plan then.
Comment 13 Alexander Schrab 2013-01-30 08:14:24 UTC
Created attachment 234805 [details] [review]
Unit test for queuearray find/drop element
Comment 14 Tim-Philipp Müller 2013-01-30 11:39:41 UTC
Thanks. This is all a bit messy, because the API has been made public again in git master and changed slightly. I have fixed up your unit test a bit (g_random_int -> g_random_int() and somesuch).

1.0 branch:

commit eb37f4e5905f2c29bfeb1e8345d2af54438548f0
Author: Tim-Philipp Müller <tim@centricular.net>
Date:   Tue Jan 29 22:54:21 2013 +0000

    tests: one more test for gst_queue_array_drop_element()
    
    https://bugzilla.gnome.org/show_bug.cgi?id=692691

commit 53a7fe04a0fc64d6e11d53291d7b22d999fe0b5d
Author: Alexander Schrab <alexas@axis.com>
Date:   Wed Jan 30 09:11:24 2013 +0100

    tests: add checks for gst_queue_array_find() and _drop_element()
    
    https://bugzilla.gnome.org/show_bug.cgi?id=692691

commit 0b2ccd5ff14c2f217bd0057555c63fc5925f18a3
Author: Alexander Schrab <alexas@axis.com>
Date:   Mon Jan 28 11:05:28 2013 +0100

    queuearray: fix gst_queue_array_find()
    
    https://bugzilla.gnome.org/show_bug.cgi?id=692691

commit a1568a30544a10e428d531cb796b1c6ef9db6069
Author: Alexander Schrab <alexas@axis.com>
Date:   Mon Jan 28 11:05:28 2013 +0100

    queuearray: fix gst_queue_array_drop_element()
    
    https://bugzilla.gnome.org/show_bug.cgi?id=692691


master:


commit 8ea19a48ceb35440dacc91b947afe95ccce11df4
Author: Tim-Philipp Müller <tim.muller@collabora.co.uk>
Date:   Wed Jan 30 11:34:15 2013 +0000

    tests: check return value of gst_queue_array_drop_element() too
    
    Was added when the API was made public in git master.
    
    https://bugzilla.gnome.org/show_bug.cgi?id=692691

commit bc397c780cdeb7c9165dcdcf4003fe730cf473a6
Author: Tim-Philipp Müller <tim@centricular.net>
Date:   Tue Jan 29 22:54:21 2013 +0000

    tests: one more test for gst_queue_array_drop_element()
    
    https://bugzilla.gnome.org/show_bug.cgi?id=692691
    
    Conflicts:
        tests/check/libs/queuearray.c

commit 5a7c1b56dcecf187b19c2e2f9b7f98f94ff95df5
Author: Alexander Schrab <alexas@axis.com>
Date:   Mon Jan 28 11:05:28 2013 +0100

    queuearray: fix gst_queue_array_find()
    
    https://bugzilla.gnome.org/show_bug.cgi?id=692691

commit af8ff1bed8e8fb9104ef400d09aa2d56b4deea7d
Author: Alexander Schrab <alexas@axis.com>
Date:   Mon Jan 28 11:05:28 2013 +0100

    queuearray: fix gst_queue_array_drop_element()
    
    https://bugzilla.gnome.org/show_bug.cgi?id=692691
    
    Conflicts:
        libs/gst/base/gstqueuearray.c
Comment 15 Tim-Philipp Müller 2013-01-30 12:14:50 UTC
master:


commit 745821d5f1af00a693d4e5076f20cddd1b1da6a6
Author: Alexander Schrab <alexas@axis.com>
Date:   Tue Jan 29 12:40:52 2013 +0100

    tests: unit test to trigger the queue/flushing race condition bug for allocation queries
    
    https://bugzilla.gnome.org/show_bug.cgi?id=692691

commit 09bb0c2cdb1e0abd7b674bea3e30ac2adb25c663
Author: Alexander Schrab <alexas@axis.com>
Date:   Mon Jan 28 11:05:28 2013 +0100

    queue: remove query from queue if queue is flushing
    
    When querying a queue that is flushing we end up adding
    a query to the queuearray without taking a reference to
    that query (because the normal functionality is to block
    until that query is done and discarded from the queue).
    This later causes problem if the query is unreffed outside
    of the queue before we discard the queue. There is a check
    to avoid unreffing any lingering query-objects, but since
    the query has been deleted that check fails.
    
    This commit depends on other fixes done to gst_queue_array_find()
    and gst_queue_array_drop_element().
    
    https://bugzilla.gnome.org/show_bug.cgi?id=692691

commit ae8940e6f7057b4528d7b73d69672c0a8ff46cf1
Author: Tim-Philipp Müller <tim.muller@collabora.co.uk>
Date:   Wed Jan 30 11:55:18 2013 +0000

    queuearray: make _find() find the value if no compare function is provided
    
    Allow NULL as compare function for direct value lookup.
    
    https://bugzilla.gnome.org/show_bug.cgi?id=692691


1.0:


commit fd366ce62834ac2065437531665431e775d726b8
Author: Alexander Schrab <alexas@axis.com>
Date:   Tue Jan 29 12:40:52 2013 +0100

    tests: unit test to trigger the queue/flushing race condition bug for allocation queries
    
    https://bugzilla.gnome.org/show_bug.cgi?id=692691

commit ca5628ad8c39ad8735430a2091970926cde3bd76
Author: Alexander Schrab <alexas@axis.com>
Date:   Mon Jan 28 11:05:28 2013 +0100

    queue: remove query from queue if queue is flushing
    
    When querying a queue that is flushing we end up adding
    a query to the queuearray without taking a reference to
    that query (because the normal functionality is to block
    until that query is done and discarded from the queue).
    This later causes problem if the query is unreffed outside
    of the queue before we discard the queue. There is a check
    to avoid unreffing any lingering query-objects, but since
    the query has been deleted that check fails.
    
    This commit depends on other fixes done to gst_queue_array_find()
    and gst_queue_array_drop_element().
    
    https://bugzilla.gnome.org/show_bug.cgi?id=692691

commit 1b46969f7e7bd0fbd9d08c06174e69f2f82acef1
Author: Tim-Philipp Müller <tim.muller@collabora.co.uk>
Date:   Wed Jan 30 11:55:18 2013 +0000

    queuearray: make _find() find the value if no compare function is provided
    
    Allow NULL as compare function for direct value lookup.
    
    https://bugzilla.gnome.org/show_bug.cgi?id=692691
    
    Conflicts:
        plugins/elements/gstqueuearray.c
Comment 16 Tim-Philipp Müller 2013-01-30 12:21:57 UTC
queue2 doesn't handle serialized queries at all.

multiqueue looks ok at first glance.
Comment 17 Alexander Schrab 2013-02-05 08:45:42 UTC
Created attachment 235196 [details] [review]
Cleaned up unit test

This is an improved unit test to trigger the patch. I removed sleeps etc to avoid long tests.
Comment 18 Tim-Philipp Müller 2013-02-05 18:46:11 UTC
Comment on attachment 235196 [details] [review]
Cleaned up unit test

Thanks, but this patch seems to neither apply to git master nor the 1.0 branch (even with -pX)
Comment 19 Alexander Schrab 2013-02-06 06:52:40 UTC
Sorry, I'll grab master and apply the fix there instead... 

/Alex

(In reply to comment #18)
> (From update of attachment 235196 [details] [review])
> Thanks, but this patch seems to neither apply to git master nor the 1.0 branch
> (even with -pX)
Comment 20 Alexander Schrab 2013-02-06 08:59:07 UTC
Created attachment 235292 [details] [review]
Cleaned up unit test

This one is on top of a clean git clone from master...
Comment 21 Tim-Philipp Müller 2013-02-06 10:41:33 UTC
Comment on attachment 235292 [details] [review]
Cleaned up unit test

Thanks, but this doesn't work particularly well for me at all. I applied this patch and then reverted 09bb0c2c again to see if the test still fails. But it doesn't, not for me anyway. Not even with  make elements/queue.torture. So I think I prefer the current state then, which fails much more reliably for me (i.e. every time).

Is test execution time an issue for  you? It finishes almost immediately here, so I wouldn't expect it to take very long anywhere.
Comment 22 Alexander Schrab 2013-02-06 12:22:29 UTC
I just tested here again and while I do get the error with the new patch it's not as common (as you say). Maybe we should just leave it as it is then. We got some timeouts in our build env, but I'm not sure about the status of those. If there really is a real problem with timeouts in some situations I'll see what I can do. Ok?
Comment 23 Tim-Philipp Müller 2013-02-06 12:42:38 UTC
Ok, thanks. I'm happy to leave it as is for now.

After how much time does it time out? This really finishes within milliseconds for me, and the normal timeout is 30s or so (IIRC), so your system must be a lot slower, or something else is going on.