Bug 647493 – Regression: gstbus: Not getting all messages posted on the bus

After an evaluation, GNOME has moved from Bugzilla to GitLab. Learn more about GitLab.
No new issues can be reported in GNOME Bugzilla anymore.
To report an issue in a GNOME project, go to GNOME GitLab.
Do not go to GNOME Gitlab for: Bluefish, Doxygen, GnuCash, GStreamer, java-gnome, LDTP, NetworkManager, Tomboy.

Bug 647493 - Regression: gstbus: Not getting all messages posted on the bus


Summary:	Regression: gstbus: Not getting all messages posted on the bus


Status:	RESOLVED FIXED

Product:	GStreamer
Classification:	Platform
Component:	gstreamer (core)
Version:	git master
Hardware:	Other All

Importance:	Normal critical
Target Milestone:	0.11.x
Assigned To:	GStreamer Maintainers
QA Contact:	GStreamer Maintainers

URL:
Whiteboard:

Depends on:
Blocks:

Reported:	2011-04-11 19:55 UTC by Håvard Graff (hgr)
Modified:	2012-06-27 17:47 UTC

See Also:
GNOME target:	---
GNOME version:	---

Attachments
possible patch (1.31 KB, patch) 2011-04-27 16:12 UTC, Wim Taymans	committed	Details \| Review

Description Håvard Graff (hgr) 2011-04-11 19:55:14 UTC

We had a major regression in our tests where we quite often would not receive some messages posted on the bus. This was particularly visible when we waited for a certain message to come through, and we would wait forever. This is a regression for us, and by reverting GstBus we get our messages back. (the tests stop failing)

We mainly see this on Windows, but I chose "OS:All" because we have seen it on OSX as well. (not yet on Linux)

I don´t have any test-case, patch or more details on this, because we have not yet looked into it, but I decided to post a bug anyway, since I thought you should be aware of the problem.

Comment 1 Sebastian Dröge (slomo) 2011-04-13 14:41:02 UTC

Without any additional information this will be hard to fix ;) I'm still marking this as a blocker for 0.10.33, maybe you can give some additional information before the release :)

Comment 2 Håvard Graff (hgr) 2011-04-13 14:43:26 UTC

Wim had a comment on #irc about some Event-stuff on Windows he thought looked wrong. Could you expand on this Wim?

Comment 3 Tim-Philipp Müller 2011-04-13 15:27:50 UTC

The conversation from IRC:

11-04-2011 21:56:22
hgr  : but the real reason I came was this: https://bugzilla.gnome.org/show_bug.cgi?id=647493 :)
wtay : maybe because of the poll changes, I think there is a race on windows
hgr  : and we think we have seen it on osx as well
wtay : because SetEvent is not really a counter, you can have 2 SetEvent being called
wtay : hm
hgr  : an EOS that never came... sounds like it, no?
wtay : there was also a bug
wtay : http://cgit.freedesktop.org/gstreamer/gstreamer/commit/?id=7c6d9c2725a9632fbd96fd41b60ed814649d3bec
wtay : but I think that's a fix for the oposite of what you see..
wtay : so, I can see how your bug happens when ReleaseEvent takes away the effect of multiple SetEvent calls
hgr  : cool!
hgr  : as I said, I have just observed it going wrong, and seen it was messages not arriving from the bus, not done any digging
wtay : ok
hgr  : but we are using git from yesterday for all modules, so pretty fresh stuff
wtay : hm ok

Comment 4 Tim-Philipp Müller 2011-04-20 06:44:51 UTC

Håvard: are you guys looking into this?

It's a bit hard for us to do anything about this if we have no logs, no details, no way to reproduce it...

Comment 5 Håvard Graff (hgr) 2011-04-20 07:09:51 UTC

(In reply to comment #4)
> Håvard: are you guys looking into this?
> 
> It's a bit hard for us to do anything about this if we have no logs, no
> details, no way to reproduce it...

At the moment we are not, no. I guess I could try to write some simple tests and see if it will reproduce reliably. It won´t be for another few days though.

Did Wim do anything about the stuff he saw about ReleaseEvent killing more than one SetEvent? Sounds like a plausible cause to me?

Comment 6 Tim-Philipp Müller 2011-04-20 16:16:14 UTC

> > Håvard: are you guys looking into this? (...)
> 
> At the moment we are not, no. I guess I could try to write some simple tests
> and see if it will reproduce reliably. It won´t be for another few days though.

That would be great, thanks.


> Did Wim do anything about the stuff he saw about ReleaseEvent killing more than
> one SetEvent? Sounds like a plausible cause to me?

Not that I know of. Windows hackers need to look into this I think.

Comment 7 Håvard Graff (hgr) 2011-04-26 14:04:32 UTC

Hm. So far no luck. I wrote a test, trying to mimic what goes on in our application, but so far all messages sent are accounted for. I will have to keep trying, but maybe someone else should be looking at this as well? Surly we can't be the only windows-users out there who have a problem? And as I said, I am pretty sure we have seen it on OSX as well, as an EOS that never arrived.

Help?

Comment 8 Sebastian Dröge (slomo) 2011-04-26 14:07:38 UTC

An EOS that never arrived might be bug #647756 but that's no regression

Comment 9 Håvard Graff (hgr) 2011-04-26 14:18:10 UTC

(In reply to comment #8)
> An EOS that never arrived might be bug #647756 but that's no regression

If the Bus we used up until now was almost a year old, is it not possible that this might be a regression for us then? Because the test that failed for us (with the missing EOS) has *never* failed before, and then we talk about a test that would have ran thousands of times the last year. And by reverting our Bus, the test(s) stopped failing.

Comment 10 Sebastian Dröge (slomo) 2011-04-26 14:20:09 UTC

Ok, that's not bug #647756 then

Comment 11 Tim-Philipp Müller 2011-04-26 15:16:09 UTC

Ok, so let's revert to the non-lockfree GstBus for the release then and go back to this afterwards, so there's enough time to investigate and fix this.


commit 06ca2759167f8ea0cdaf34db41d5a4c2cf98b75a
Author: Tim-Philipp Müller <tim.muller@collabora.co.uk>
Date:   Tue Apr 26 15:42:46 2011 +0100

    Revert lockfree GstBus for the release
    
    Drop in old GstBus code for the release to play it safe, since
    regressions that are apparently hard to track down and reproduce
    have been reported (on windows/OSX mostly) against the lockfree
    version, and more time is needed to fix them.
    
    This reverts commit 03391a897001d35d1d290f27dd12e98a8b729fb4.
    This reverts commit 43cdbc17e6f944cdf02aeed78d1d5f6bde5190c9.
    This reverts commit 80eb160e0f62350271f061daa5f289d9d4277cf4.
    This reverts commit c41b0ade28790ffdb0e484b41cd7929c4e145dec.
    This reverts commit 874d60e5899dd5b89854679d1a4ad016a58ba4e0.
    This reverts commit 79370d4b1781af9c9a65f2d1e3498124d8c4c413.
    This reverts commit 2cb3e5235196eb71fb25e0a4a4b8749d6d0a8453.
    This reverts commit bd1c40011434c1efaa696dc98ef855ef9cce9b28.
    This reverts commit 4bf8f1524f6e3374b3f3bc57322337723d06b928.
    This reverts commit 14d7db1b527b05f029819057aef5c123ac7e013d.
    
    https://bugzilla.gnome.org/show_bug.cgi?id=647493

Comment 12 Håvard Graff (hgr) 2011-04-27 14:35:43 UTC

Ok, so I found the OSX problem!

The mistake is in gst_bus_timed_pop_filtered:

if (timeout > elapsed)
  timeout -= elapsed;
else
  timeout = 0;

This is wrong. You can't subtract elapsed from timeout. That way you can eat up 20 seconds (our case) in a few milliseconds. 

It should probably read:

if (elapsed >= timeout)
  timeout = 0;

or maybe even:

if (elapsed > timeout)
  break;


I am no closer to the Windows-issue though, which seems to be a different one. (Probably related to GstPoll somehow)

I'll keep digging, even if you have decided to revert for now, I am assuming it is coming back, so might as well do it now while I am at it :)

Comment 13 Håvard Graff (hgr) 2011-04-27 14:38:43 UTC

Forgot to say in last post: There should probably be a gst_bus_timed_pop_filtered unittest. Make it wait for 10 seconds, and post a message after 1 second, and it will fail.

Comment 14 Wim Taymans 2011-04-27 14:56:51 UTC

I don't see the error.. Are you saying that it causes rounding errors when it's called quickly because the elapsed time is bigger than it really is?

Comment 15 Håvard Graff (hgr) 2011-04-27 15:12:02 UTC

(In reply to comment #14)
> I don't see the error.. Are you saying that it causes rounding errors when it's
> called quickly because the elapsed time is bigger than it really is?

Nope, it is making the time go a bit like this:
http://en.wikipedia.org/wiki/Fibonacci_number

:)

Comment 16 Håvard Graff (hgr) 2011-04-27 15:24:23 UTC

Elapsed: 0 Timeout: 100
Elapsed: 1 Timeout: 100 - 1 (99) (100 - 99 = 1)
Elapsed: 2 Timeout: 99  - 2 (97) (100 - 97 = 3)
Elapsed: 3 Timeout: 97  - 3 (94) (100 - 94 = 6)
Elapsed: 4 Timeout: 94  - 4 (91) (100 - 91 = 9)
Elapsed: 5 Timeout: 91  - 5 (86) (100 - 86 = 14)
Elapsed: 6 Timeout: 86  - 6 (80) (100 - 80 = 20)

Comment 17 Wim Taymans 2011-04-27 16:11:54 UTC

Actually it's http://en.wikipedia.org/wiki/Triangular_number :) Interesting and easy to fix.

Comment 18 Wim Taymans 2011-04-27 16:12:31 UTC

Created attachment 186753 [details] [review]
possible patch

Comment 19 Sebastian Dröge (slomo) 2011-05-26 11:02:03 UTC

Comment on attachment 186753 [details] [review]
possible patch

This patch seems to be in 0.11 now. Do we want to re-add the lockless GstBus to 0.10 now, including this patch?

Comment 20 Vincent Penquerc'h 2012-02-03 11:43:51 UTC

Now is a good time to ping this, or is it too late for the branch tpm ?

Comment 21 Tim-Philipp Müller 2012-02-06 10:32:27 UTC

> Now is a good time to ping this, or is it too late for the branch tpm ?

Hrm? What's too late for which branch?

Comment 22 Vincent Penquerc'h 2012-02-06 11:13:12 UTC

(In reply to comment #21)
> > Now is a good time to ping this, or is it too late for the branch tpm ?
> 
> Hrm? What's too late for which branch?

Adding the lockless GstBus to the 0.10 branch, as slomo mentioned.
Unless that's been done already, in which case this can be closed.

Comment 23 Tim-Philipp Müller 2012-02-06 11:28:52 UTC

Ah, right. It's an optimisation, it's a bit late to push that into the release branch at this point in the release cycle.

Could add it back into 0.11 (don't know about master).

Comment 24 Akhil Laddha 2012-06-18 16:52:21 UTC

Patch has been committed. can this bug be closed ?

Comment 25 Tim-Philipp Müller 2012-06-27 17:47:47 UTC

This seems to all be in 0.11 now incl. the timeout patch, and was reverted for 0.10 (let's keep the old version in 0.10), so closing. Please re-open if I missed something.