GNOME Bugzilla – Bug 647493
Regression: gstbus: Not getting all messages posted on the bus
Last modified: 2012-06-27 17:47:47 UTC
We had a major regression in our tests where we quite often would not receive some messages posted on the bus. This was particularly visible when we waited for a certain message to come through, and we would wait forever. This is a regression for us, and by reverting GstBus we get our messages back. (the tests stop failing) We mainly see this on Windows, but I chose "OS:All" because we have seen it on OSX as well. (not yet on Linux) I don´t have any test-case, patch or more details on this, because we have not yet looked into it, but I decided to post a bug anyway, since I thought you should be aware of the problem.
Without any additional information this will be hard to fix ;) I'm still marking this as a blocker for 0.10.33, maybe you can give some additional information before the release :)
Wim had a comment on #irc about some Event-stuff on Windows he thought looked wrong. Could you expand on this Wim?
The conversation from IRC: 11-04-2011 21:56:22 hgr : but the real reason I came was this: https://bugzilla.gnome.org/show_bug.cgi?id=647493 :) wtay : maybe because of the poll changes, I think there is a race on windows hgr : and we think we have seen it on osx as well wtay : because SetEvent is not really a counter, you can have 2 SetEvent being called wtay : hm hgr : an EOS that never came... sounds like it, no? wtay : there was also a bug wtay : http://cgit.freedesktop.org/gstreamer/gstreamer/commit/?id=7c6d9c2725a9632fbd96fd41b60ed814649d3bec wtay : but I think that's a fix for the oposite of what you see.. wtay : so, I can see how your bug happens when ReleaseEvent takes away the effect of multiple SetEvent calls hgr : cool! hgr : as I said, I have just observed it going wrong, and seen it was messages not arriving from the bus, not done any digging wtay : ok hgr : but we are using git from yesterday for all modules, so pretty fresh stuff wtay : hm ok
Håvard: are you guys looking into this? It's a bit hard for us to do anything about this if we have no logs, no details, no way to reproduce it...
(In reply to comment #4) > Håvard: are you guys looking into this? > > It's a bit hard for us to do anything about this if we have no logs, no > details, no way to reproduce it... At the moment we are not, no. I guess I could try to write some simple tests and see if it will reproduce reliably. It won´t be for another few days though. Did Wim do anything about the stuff he saw about ReleaseEvent killing more than one SetEvent? Sounds like a plausible cause to me?
> > Håvard: are you guys looking into this? (...) > > At the moment we are not, no. I guess I could try to write some simple tests > and see if it will reproduce reliably. It won´t be for another few days though. That would be great, thanks. > Did Wim do anything about the stuff he saw about ReleaseEvent killing more than > one SetEvent? Sounds like a plausible cause to me? Not that I know of. Windows hackers need to look into this I think.
Hm. So far no luck. I wrote a test, trying to mimic what goes on in our application, but so far all messages sent are accounted for. I will have to keep trying, but maybe someone else should be looking at this as well? Surly we can't be the only windows-users out there who have a problem? And as I said, I am pretty sure we have seen it on OSX as well, as an EOS that never arrived. Help?
An EOS that never arrived might be bug #647756 but that's no regression
(In reply to comment #8) > An EOS that never arrived might be bug #647756 but that's no regression If the Bus we used up until now was almost a year old, is it not possible that this might be a regression for us then? Because the test that failed for us (with the missing EOS) has *never* failed before, and then we talk about a test that would have ran thousands of times the last year. And by reverting our Bus, the test(s) stopped failing.
Ok, that's not bug #647756 then
Ok, so let's revert to the non-lockfree GstBus for the release then and go back to this afterwards, so there's enough time to investigate and fix this. commit 06ca2759167f8ea0cdaf34db41d5a4c2cf98b75a Author: Tim-Philipp Müller <tim.muller@collabora.co.uk> Date: Tue Apr 26 15:42:46 2011 +0100 Revert lockfree GstBus for the release Drop in old GstBus code for the release to play it safe, since regressions that are apparently hard to track down and reproduce have been reported (on windows/OSX mostly) against the lockfree version, and more time is needed to fix them. This reverts commit 03391a897001d35d1d290f27dd12e98a8b729fb4. This reverts commit 43cdbc17e6f944cdf02aeed78d1d5f6bde5190c9. This reverts commit 80eb160e0f62350271f061daa5f289d9d4277cf4. This reverts commit c41b0ade28790ffdb0e484b41cd7929c4e145dec. This reverts commit 874d60e5899dd5b89854679d1a4ad016a58ba4e0. This reverts commit 79370d4b1781af9c9a65f2d1e3498124d8c4c413. This reverts commit 2cb3e5235196eb71fb25e0a4a4b8749d6d0a8453. This reverts commit bd1c40011434c1efaa696dc98ef855ef9cce9b28. This reverts commit 4bf8f1524f6e3374b3f3bc57322337723d06b928. This reverts commit 14d7db1b527b05f029819057aef5c123ac7e013d. https://bugzilla.gnome.org/show_bug.cgi?id=647493
Ok, so I found the OSX problem! The mistake is in gst_bus_timed_pop_filtered: if (timeout > elapsed) timeout -= elapsed; else timeout = 0; This is wrong. You can't subtract elapsed from timeout. That way you can eat up 20 seconds (our case) in a few milliseconds. It should probably read: if (elapsed >= timeout) timeout = 0; or maybe even: if (elapsed > timeout) break; I am no closer to the Windows-issue though, which seems to be a different one. (Probably related to GstPoll somehow) I'll keep digging, even if you have decided to revert for now, I am assuming it is coming back, so might as well do it now while I am at it :)
Forgot to say in last post: There should probably be a gst_bus_timed_pop_filtered unittest. Make it wait for 10 seconds, and post a message after 1 second, and it will fail.
I don't see the error.. Are you saying that it causes rounding errors when it's called quickly because the elapsed time is bigger than it really is?
(In reply to comment #14) > I don't see the error.. Are you saying that it causes rounding errors when it's > called quickly because the elapsed time is bigger than it really is? Nope, it is making the time go a bit like this: http://en.wikipedia.org/wiki/Fibonacci_number :)
Elapsed: 0 Timeout: 100 Elapsed: 1 Timeout: 100 - 1 (99) (100 - 99 = 1) Elapsed: 2 Timeout: 99 - 2 (97) (100 - 97 = 3) Elapsed: 3 Timeout: 97 - 3 (94) (100 - 94 = 6) Elapsed: 4 Timeout: 94 - 4 (91) (100 - 91 = 9) Elapsed: 5 Timeout: 91 - 5 (86) (100 - 86 = 14) Elapsed: 6 Timeout: 86 - 6 (80) (100 - 80 = 20)
Actually it's http://en.wikipedia.org/wiki/Triangular_number :) Interesting and easy to fix.
Created attachment 186753 [details] [review] possible patch
Comment on attachment 186753 [details] [review] possible patch This patch seems to be in 0.11 now. Do we want to re-add the lockless GstBus to 0.10 now, including this patch?
Now is a good time to ping this, or is it too late for the branch tpm ?
> Now is a good time to ping this, or is it too late for the branch tpm ? Hrm? What's too late for which branch?
(In reply to comment #21) > > Now is a good time to ping this, or is it too late for the branch tpm ? > > Hrm? What's too late for which branch? Adding the lockless GstBus to the 0.10 branch, as slomo mentioned. Unless that's been done already, in which case this can be closed.
Ah, right. It's an optimisation, it's a bit late to push that into the release branch at this point in the release cycle. Could add it back into 0.11 (don't know about master).
Patch has been committed. can this bug be closed ?
This seems to all be in 0.11 now incl. the timeout patch, and was reverted for 0.10 (let's keep the old version in 0.10), so closing. Please re-open if I missed something.