Bug 137460 – dangling timeout event source ID causes crashes

After an evaluation, GNOME has moved from Bugzilla to GitLab. Learn more about GitLab.
No new issues can be reported in GNOME Bugzilla anymore.
To report an issue in a GNOME project, go to GNOME GitLab.
Do not go to GNOME Gitlab for: Bluefish, Doxygen, GnuCash, GStreamer, java-gnome, LDTP, NetworkManager, Tomboy.

Bug 137460 - dangling timeout event source ID causes crashes


Summary:	dangling timeout event source ID causes crashes


Status:	RESOLVED FIXED

Product:	rhythmbox
Classification:	Other
Component:	general
Version:	0.6.5
Hardware:	Other Linux

Importance:	Normal major
Target Milestone:	---
Assigned To:	RhythmBox Maintainers
QA Contact:	RhythmBox Maintainers

URL:
Whiteboard:

Depends on:
Blocks:

Reported:	2004-03-17 04:25 UTC by Ben Liblit
Modified:	2004-12-22 21:47 UTC

See Also:
GNOME target:	---
GNOME version:	---

Attachments
proposed (and tested) fix: do not keep around stale event source IDs (476 bytes, patch) 2004-03-17 04:31 UTC, Ben Liblit	none	Details \| Review
alternate fix with "extra paranoia" suggested by <teuf@users.sourceforge.net> (956 bytes, patch) 2004-03-17 17:40 UTC, Ben Liblit	none	Details \| Review
corrected alternate fix with "extra paranoia" (965 bytes, patch) 2004-03-17 22:19 UTC, Ben Liblit	none	Details \| Review

Description Ben Liblit 2004-03-17 04:25:52 UTC

In brief:

The browser disclosure widget uses a timeout event source to animate its
arrow glyph.  However, it keeps around this timeout event source's ID even
after the timeout object itself has already been destroyed.  Later on, this
can cause the code to mistakenly destroy some other event source at random.
 Experiments show that this causes a significant number of crashes.


Methodology:

This bug was uncovered as part of the Cooperative Bug Isolation Project
<http://www.cs.berkeley.edu/~liblit/sampler/>, which attempts to find bugs
by discovering statistically significant differences in the behavior of
good and bad runs.  Rhythmbox is part of our public deployment, and we are
collecting good and bad feedback profiles on an ongoing basis.  This bug
was not discovered as a direct result of that public feedback data, which
has not yet reached the level of statistical significance.  Instead, it
appeared during an examination of several thousand scripted runs of
Rhythmbox in shuffle mode over a large music library.

We have identified several program behaviors which are strong crash
predictors for Rhythmbox on these scripted runs.  One of them concerns the
call to g_source_remove() in cddb_disclosure_destroy() on line 77 of
"lib/disclosure-widget.c".  Specifically, when g_source_remove() returns a
positive value on this line, Rhythmbox is much more likely to crash.

Unfortunately our public deployment of Rhythmbox does not measure this
"g_source_remove() > 0" predicate, so I cannot give you an estimate of how
often this crash is in the hands of real users.  In our scripted
experiment, this behavior is implicated in roughly 8% of failures, or 0.6%
of all runs.  Note however that our scripted tests interact with Rhythmbox
in fairly limited ways; actual user bugs are likely to be more varied and
therefore this bug likely accounts for less than 8% of real failures in the
field.


Interpretation:

All of the action takes place in Rhythmbox source file
"lib/disclosure-widget.c".  See function toggled() starting on line 215. 
This function is generally called by the underlying toolkit when the
show/hide widget switches states.  However, it is also called once at
startup to set the initial state of the show/hide widget.  So every run of
Rhythmbox is going to call toggled() at least once, even with no user action.

On line 221, toggled() calls do_animation(), which appears on line 201 of
this same file.

On line 212, do_animation() calls g_timeout_add() to create a new source of
timer events.  You can find documentation for g_timeout_add() at
<http://developer.gnome.org/doc/API/2.0/glib/glib-The-Main-Event-Loop.html#g-timeout-add>.
 The important bit is that the trigger function is
expand_collapse_timeout(), which will be called once every fifty milliseconds.

Also on line 212, g_timeout_add() returns an event source ID.
do_animation() stores this ID in disclosure->priv->expand_id.  Up until
this point, that field was zero.  (See line 303 for the place where that
"priv" structure was allocated and cleared to zeros.)

So in fifty milliseconds the main event loop will wake up and call
expand_collapse_timeout(), which appears starting on line 141. According to
the documentation for g_timeout_add(), this function should return TRUE if
the timeout is to be retained and triggered again in another fifty
milliseconds.  If instead expand_collapse_timeout() returns FALSE, "the
timeout is automatically destroyed and the function will not be called
again."  Looking at the body of expand_collapse_timeout(), most of what's
going on is arrow painting logic that is not important here.  What is
important is:

    1. expand_collapse_timeout() eventually does return FALSE, which
       will cause destruction of the timeout event source.

    2. expand_collapse_timeout() never modifies the value of
       disclosure->priv->expand_id.

These two facts taken together mean that disclosure->priv->expand_id will
eventually hold the ID of an event source which has already been destroyed.
 In effect, we have a dangling pointer.  A dangling ID, really.

This dangling ID can bite the application in two different places:

    1. On line 77, cddb_disclosure_destroy() calls g_source_remove() to
       destroy the timeout event source.  But the behavior of
       expand_collapse_timeout() means we are using the ID for an event
       source which has already been destroyed.

    2. On line 208, do_animation(), does essentially the same thing.  It
       destroys an event source using a stale ID.

So if the ID is stale, then one of two things happens.  If the ID has not
been reused, then g_source_remove() returns 0 indicating that no such event
source exists.  If the ID has been reused, then g_source_remove() finds and
destroys whichever unlucky event source now holds that ID.  There are about
twenty other places where Rhythmbox manipulates event sources, and one of
those is probably going to get blown away.  As our scripted runs show,
Rhythmbox doesn't just become deaf to some source of events, though.  When
g_source_remove() thinks it found something to destroy, Rhythmbox
subsequently crashes.

The stack traces associated with this bug are uninteresting.  Here is a
typical example:

    #3  <signal handler called>

    #4  0x409b5a05 in g_hash_table_lookup () from /usr/lib/libglib-2.0.so.0

    #5  0x085dc329 in rhythmdb_query_model_entry_to_iter
        (model=0x8b78168, entry=0x8ba4fb8, iter=0xbfffb920) at
        rhythmdb-query-model.c:870

    #6  0x0831b451 in rb_entry_view_get_entry_contained
        (view=0x8a059c8, entry=0x8ba4fb8) at rb-entry-view.c:1902

    #7  0x080e0460 in rb_shell_jump_to_entry_with_source
        (shell=0x8979218, source=0x8a038a0, entry=0x8ba4fb8) at
        rb-shell.c:2118

    #8  0x080e0552 in rb_shell_play_entry (shell=0x8979218,
        entry=0x8ba4fb8) at rb-shell.c:2136

    #9  0x08082eb6 in idle_do_action (data=0x8b94bf0) at rb-shell.c:534

    #10 0x409c2743 in g_timeout_add () from /usr/lib/libglib-2.0.so.0

    #11 0x409bfb35 in g_get_current_time () from
        /usr/lib/libglib-2.0.so.0

    #12 0x409c0b78 in g_main_context_dispatch () from
        /usr/lib/libglib-2.0.so.0

    #13 0x409c0e8d in g_main_context_dispatch () from
        /usr/lib/libglib-2.0.so.0

    #14 0x409c158f in g_main_loop_run () from
        /usr/lib/libglib-2.0.so.0

    #15 0x40739328 in bonobo_main () from /usr/lib/libbonobo-2.so.0

    #16 0x08075473 in main (argc=2, argv=0xbfffded4) at main.c:173

    #17 0x420156a4 in __libc_start_main () from /lib/tls/libc.so.6

I say this is uninteresting because I'm seeing the same stack for various
other failures that have distinct causes.  A stack like this, with
g_hash_table_lookup() and rhythmdb_query_model_entry_to_iter() on the top,
appears to be a common place that Rhythmbox processes go to die once
they've already done something bad elsewhere.


Fix:

I will attach a proposed patch to this report shortly.  We have re-run our
scripted tests with the patch in place and can confirm that the indicated
failure no longer occurs.  That is, "g_source_remove() > 0" on line on line
77 of "lib/disclosure-widget.c" is no longer a good predictor of failure.

Comment 1 Ben Liblit 2004-03-17 04:31:01 UTC

Created attachment 25713 [details] [review]
proposed (and tested) fix: do not keep around stale event source IDs

Comment 2 Christophe Fergeau 2004-03-17 08:09:57 UTC

The id should be set to 0 after each g_source_remove call for extra
paranoia.

Comment 3 Ben Liblit 2004-03-17 17:40:31 UTC

Created attachment 25736 [details] [review]
alternate fix with "extra paranoia" suggested by <teuf@users.sourceforge.net>

Comment 4 Ben Liblit 2004-03-17 17:53:46 UTC

I expect that this mistake is pretty common.  I searched for other
instances of it within Rhythmbox and found one other example, in the
song position slider code.  That one might be responsible for the
crash reported as bug #130788, and I've posted a corresponding patch
to that report.

If this happened twice in Rhythmbox, though, no doubt it happens in
other projects as well.  I'm considering posting a message to
<desktop-devel> suggesting that developers examine their own code to
see if they have made the same mistake.

Comment 5 Crispin Flowerday (not receiving bugmail) 2004-03-17 22:09:52 UTC

So, I was just looking at the patches (having seen the post on d-d-l),
and noticed that the second patch is incorrect, it should read (for
the second hunk)

if (ret == FALSE)
    disclosure->priv->expand_id = 0;

Comment 6 Ben Liblit 2004-03-17 22:19:24 UTC

Created attachment 25747 [details] [review]
corrected alternate fix with "extra paranoia"

Comment 7 Ben Liblit 2004-03-17 22:20:57 UTC

D'oh!  Crispin Flowerday is completely right, of course.  I've
attached a corrected "extra paranoia" patch.  Please disregard the
patch with timestamp "03/17/04 12:40".  In fact, if anybody has
permission to remove that patch from this bug report entirely, please
do so.

Thanks, Crispin!

Comment 8 Colin Walters 2004-03-19 17:23:16 UTC

This is one of the best bug reports I've ever seen :)  Thanks for the
report and the patch!

* committed
rhythmbox-devel@gnome.org--2004/rhythmbox--main--0.7--patch-153