GNOME Bugzilla – Bug 692691
queue: dangling pointer when doing allocation query on a flushing queue
Last modified: 2013-02-06 12:42:38 UTC
When querying a queue that is flushing we end up adding a query to the queuearray without taking a reference to that query (because the normal functionality is to block until that query is done and discarded from the queue). This later causes problem if the query is unreffed outside of the queue before we discard the queue. There is a check to avoid unreffing any lingering query-objects, but since the query has been deleted that check fails. Solution: Remove the query from the queue if it is still there in the out_flushing: error case in gstqueue.c:gst_queue_handle_sink_query
Created attachment 234598 [details] [review] Patch for queue.c and queuearray.c There was a side issue from fixing this bug: the gst_queue_array_find and gst_queue_array_drop_element methods in queuearray.c were completely wrong (cannot have been used at all, maybe they were left from an earlier implementation?). This proposed patch fixes the original issue and follow up issue from needing those two methods in a working state.
A unit test would be nice :) I wonder if it wouldn't be better/cleaner to just ref/unref the query while it's in the queue then?
I agree with you.. The thought hit me as well, but at the same time I didn't want to make too much of an operation in the code. I will make a patch that reffes queries. Question is what to do with the queuearray.c stuff? It is obviously broken, but on the other hand it might be better to just remove if noone uses it (and take a look at why it happens to work in the few places it is used). (In reply to comment #2) > A unit test would be nice :) > > I wonder if it wouldn't be better/cleaner to just ref/unref the query while > it's in the queue then?
> Question is what to do with the queuearray.c stuff? It is > obviously broken, but on the other hand it might be better to just remove if > noone uses it (and take a look at why it happens to work in the few places it > is used). It should still be fixed in a separate patch IMHO (ideally with unit test addition to tests/check/libs/queuearray.c). We can't really remove the API unless it was totally impossible that anyone ever used it at all (e.g. would crash immediately), and even then it's probably better to just fix it up, no?
I finally have a unit test that triggers the bug *sometimes*, I wanted but have so far failed in making it fail all the time. I narrowed the problem down to this: Making allocation queries to the queue while simultaneously sending one flush_start to the queue can trigger this. The reason being as I said earlier: handle_sink_query falls out through out_flushing: while waiting for the query to complete. This results in a race between the thread doing a query that will most likely unref the query when the function returns and the queue that will try to flush out the query (doing an GST_IS_QUERY(data) on something that might already be deleted). So what can be done? 1. as we said before, we can either ref or make sure the item is gone (I do prefer the ref since we can remove a number of special cases for the query data items) 2. Make sure we always call gst_queue_locked_flush right after setting srcresult to GST_FLOW_FLUSHING, before we let the query call return (it is waiting for GST_QUEUE_WAIT_DEL_CHECK (queue, out_flushing)). This behavior is already there when pushing buffers results in setting srcresult to flushing as far as I can see, but I have way to little knowledge of the queue to understand if the gst_queue_locked_flush can be used that way. I'm adding the test case if someone wants to take a look... It's a pretty ugly unit test with sleep etc, so it's not in a state to import...
Created attachment 234716 [details] [review] Unit test patch for queue that triggers the bug This unit test is not in a very nice state, it's a race condition bug and I'm adding this test if anyone else wants to try to trigger the bug. You might need to run the test through valgrind to get any real errors...
Just realized this patch is broken, I need to get my git act together :) (In reply to comment #6) > Created an attachment (id=234716) [details] [review] > Unit test patch for queue that triggers the bug > > This unit test is not in a very nice state, it's a race condition bug and I'm > adding this test if anyone else wants to try to trigger the bug. You might need > to run the test through valgrind to get any real errors...
Created attachment 234720 [details] [review] Unit test patch for queue that triggers the bug Patch that adds a test that triggers the bug... sometimes... Use valgrind and run it a few times. The test has a few sleeps etc, so it's just for debugging. Haven't been able to reproduce it more clean yet...
> Patch that adds a test that triggers the bug... sometimes... > Use valgrind and run it a few times. The test has a few sleeps etc, so it's > just for debugging. Haven't been able to reproduce it more clean yet... Cool, thanks! It's tricky with racy stuff like that, but if the issue pops up again it will eventually show up in form of spurious test failures. It is possible to do cd tests/check/; make foo/bar.torture or make foo/bar.forever btw. Only other way might be to wrap the whole test in a loop that runs it a hundred times or so, and hope that makes it show up more reliably. But it shuould be fine as it is, thanks a lot.
Doing ref/unref won't be a good solution, at least not only doing that. Even though we correctly unref after a while, I get into new problems when the callee expects the query to be mutable (ref count == 1). From base source: ... query = gst_query_new_allocation (caps, TRUE); if (!gst_pad_peer_query (basesrc->srcpad, query)) { /* not a problem, just debug a little */ GST_DEBUG_OBJECT (basesrc, "peer ALLOCATION query failed"); } g_assert (bclass->decide_allocation != NULL); result = bclass->decide_allocation (basesrc, query); GST_DEBUG_OBJECT (basesrc, "ALLOCATION (%d) params: %" GST_PTR_FORMAT, result, query); if (!result) goto no_decide_allocation; /* we got configuration from our peer or the decide_allocation method, * parse them */ if (gst_query_get_n_allocation_params (query) > 0) { gst_query_parse_nth_allocation_param (query, 0, &allocator, ¶ms); } else { allocator = NULL; gst_allocation_params_init (¶ms); } ... Here, even though the query fails, we continue to use the query object. Which should be fine! But the extra ref in the queue is a problem in this case.
Ah, I see, I didn't expect anything to be still (ab)using the query after _query() fails, but it's a good point. Let's go with your original plan then.
Great, I'll build unit tests for my queuearray changes as well (In reply to comment #11) > Ah, I see, I didn't expect anything to be still (ab)using the query after > _query() fails, but it's a good point. Let's go with your original plan then.
Created attachment 234805 [details] [review] Unit test for queuearray find/drop element
Thanks. This is all a bit messy, because the API has been made public again in git master and changed slightly. I have fixed up your unit test a bit (g_random_int -> g_random_int() and somesuch). 1.0 branch: commit eb37f4e5905f2c29bfeb1e8345d2af54438548f0 Author: Tim-Philipp Müller <tim@centricular.net> Date: Tue Jan 29 22:54:21 2013 +0000 tests: one more test for gst_queue_array_drop_element() https://bugzilla.gnome.org/show_bug.cgi?id=692691 commit 53a7fe04a0fc64d6e11d53291d7b22d999fe0b5d Author: Alexander Schrab <alexas@axis.com> Date: Wed Jan 30 09:11:24 2013 +0100 tests: add checks for gst_queue_array_find() and _drop_element() https://bugzilla.gnome.org/show_bug.cgi?id=692691 commit 0b2ccd5ff14c2f217bd0057555c63fc5925f18a3 Author: Alexander Schrab <alexas@axis.com> Date: Mon Jan 28 11:05:28 2013 +0100 queuearray: fix gst_queue_array_find() https://bugzilla.gnome.org/show_bug.cgi?id=692691 commit a1568a30544a10e428d531cb796b1c6ef9db6069 Author: Alexander Schrab <alexas@axis.com> Date: Mon Jan 28 11:05:28 2013 +0100 queuearray: fix gst_queue_array_drop_element() https://bugzilla.gnome.org/show_bug.cgi?id=692691 master: commit 8ea19a48ceb35440dacc91b947afe95ccce11df4 Author: Tim-Philipp Müller <tim.muller@collabora.co.uk> Date: Wed Jan 30 11:34:15 2013 +0000 tests: check return value of gst_queue_array_drop_element() too Was added when the API was made public in git master. https://bugzilla.gnome.org/show_bug.cgi?id=692691 commit bc397c780cdeb7c9165dcdcf4003fe730cf473a6 Author: Tim-Philipp Müller <tim@centricular.net> Date: Tue Jan 29 22:54:21 2013 +0000 tests: one more test for gst_queue_array_drop_element() https://bugzilla.gnome.org/show_bug.cgi?id=692691 Conflicts: tests/check/libs/queuearray.c commit 5a7c1b56dcecf187b19c2e2f9b7f98f94ff95df5 Author: Alexander Schrab <alexas@axis.com> Date: Mon Jan 28 11:05:28 2013 +0100 queuearray: fix gst_queue_array_find() https://bugzilla.gnome.org/show_bug.cgi?id=692691 commit af8ff1bed8e8fb9104ef400d09aa2d56b4deea7d Author: Alexander Schrab <alexas@axis.com> Date: Mon Jan 28 11:05:28 2013 +0100 queuearray: fix gst_queue_array_drop_element() https://bugzilla.gnome.org/show_bug.cgi?id=692691 Conflicts: libs/gst/base/gstqueuearray.c
master: commit 745821d5f1af00a693d4e5076f20cddd1b1da6a6 Author: Alexander Schrab <alexas@axis.com> Date: Tue Jan 29 12:40:52 2013 +0100 tests: unit test to trigger the queue/flushing race condition bug for allocation queries https://bugzilla.gnome.org/show_bug.cgi?id=692691 commit 09bb0c2cdb1e0abd7b674bea3e30ac2adb25c663 Author: Alexander Schrab <alexas@axis.com> Date: Mon Jan 28 11:05:28 2013 +0100 queue: remove query from queue if queue is flushing When querying a queue that is flushing we end up adding a query to the queuearray without taking a reference to that query (because the normal functionality is to block until that query is done and discarded from the queue). This later causes problem if the query is unreffed outside of the queue before we discard the queue. There is a check to avoid unreffing any lingering query-objects, but since the query has been deleted that check fails. This commit depends on other fixes done to gst_queue_array_find() and gst_queue_array_drop_element(). https://bugzilla.gnome.org/show_bug.cgi?id=692691 commit ae8940e6f7057b4528d7b73d69672c0a8ff46cf1 Author: Tim-Philipp Müller <tim.muller@collabora.co.uk> Date: Wed Jan 30 11:55:18 2013 +0000 queuearray: make _find() find the value if no compare function is provided Allow NULL as compare function for direct value lookup. https://bugzilla.gnome.org/show_bug.cgi?id=692691 1.0: commit fd366ce62834ac2065437531665431e775d726b8 Author: Alexander Schrab <alexas@axis.com> Date: Tue Jan 29 12:40:52 2013 +0100 tests: unit test to trigger the queue/flushing race condition bug for allocation queries https://bugzilla.gnome.org/show_bug.cgi?id=692691 commit ca5628ad8c39ad8735430a2091970926cde3bd76 Author: Alexander Schrab <alexas@axis.com> Date: Mon Jan 28 11:05:28 2013 +0100 queue: remove query from queue if queue is flushing When querying a queue that is flushing we end up adding a query to the queuearray without taking a reference to that query (because the normal functionality is to block until that query is done and discarded from the queue). This later causes problem if the query is unreffed outside of the queue before we discard the queue. There is a check to avoid unreffing any lingering query-objects, but since the query has been deleted that check fails. This commit depends on other fixes done to gst_queue_array_find() and gst_queue_array_drop_element(). https://bugzilla.gnome.org/show_bug.cgi?id=692691 commit 1b46969f7e7bd0fbd9d08c06174e69f2f82acef1 Author: Tim-Philipp Müller <tim.muller@collabora.co.uk> Date: Wed Jan 30 11:55:18 2013 +0000 queuearray: make _find() find the value if no compare function is provided Allow NULL as compare function for direct value lookup. https://bugzilla.gnome.org/show_bug.cgi?id=692691 Conflicts: plugins/elements/gstqueuearray.c
queue2 doesn't handle serialized queries at all. multiqueue looks ok at first glance.
Created attachment 235196 [details] [review] Cleaned up unit test This is an improved unit test to trigger the patch. I removed sleeps etc to avoid long tests.
Comment on attachment 235196 [details] [review] Cleaned up unit test Thanks, but this patch seems to neither apply to git master nor the 1.0 branch (even with -pX)
Sorry, I'll grab master and apply the fix there instead... /Alex (In reply to comment #18) > (From update of attachment 235196 [details] [review]) > Thanks, but this patch seems to neither apply to git master nor the 1.0 branch > (even with -pX)
Created attachment 235292 [details] [review] Cleaned up unit test This one is on top of a clean git clone from master...
Comment on attachment 235292 [details] [review] Cleaned up unit test Thanks, but this doesn't work particularly well for me at all. I applied this patch and then reverted 09bb0c2c again to see if the test still fails. But it doesn't, not for me anyway. Not even with make elements/queue.torture. So I think I prefer the current state then, which fails much more reliably for me (i.e. every time). Is test execution time an issue for you? It finishes almost immediately here, so I wouldn't expect it to take very long anywhere.
I just tested here again and while I do get the error with the new patch it's not as common (as you say). Maybe we should just leave it as it is then. We got some timeouts in our build env, but I'm not sure about the status of those. If there really is a real problem with timeouts in some situations I'll see what I can do. Ok?
Ok, thanks. I'm happy to leave it as is for now. After how much time does it time out? This really finishes within milliseconds for me, and the normal timeout is 30s or so (IIRC), so your system must be a lot slower, or something else is going on.