After an evaluation, GNOME has moved from Bugzilla to GitLab. Learn more about GitLab.
No new issues can be reported in GNOME Bugzilla anymore.
To report an issue in a GNOME project, go to GNOME GitLab.
Do not go to GNOME Gitlab for: Bluefish, Doxygen, GnuCash, GStreamer, java-gnome, LDTP, NetworkManager, Tomboy.
Bug 137834 - dangling RBShellPlayer callbacks cause crashes
dangling RBShellPlayer callbacks cause crashes
Status: RESOLVED FIXED
Product: rhythmbox
Classification: Other
Component: general
0.6.5
Other Linux
: Normal major
: ---
Assigned To: RhythmBox Maintainers
RhythmBox Maintainers
Depends on:
Blocks:
 
 
Reported: 2004-03-21 07:43 UTC by Ben Liblit
Modified: 2004-12-22 21:47 UTC
See Also:
GNOME target: ---
GNOME version: ---


Attachments
suggested fix: remove callbacks during finalization (2.07 KB, patch)
2004-03-21 07:45 UTC, Ben Liblit
none Details | Review

Description Ben Liblit 2004-03-21 07:43:31 UTC
In brief:

The singleton instance of RBShellPlayer installs several callbacks at
initialization time.  None of these callbacks are removed during
RBShellPlayer finalization.  This creates a race condition during
application shutdown.  If any of those callbacks are triggered after the
RBShellPlayer singleton has been destroyed, the callback functions will be
manipulating dangling pointers to freed memory.  Experiments show that this
causes a significant number of crashes.

Methodology:

This bug was uncovered as part of the Cooperative Bug Isolation Project
<http://www.cs.berkeley.edu/~liblit/sampler/>, which attempts to find bugs
by discovering statistically significant differences in the behavior of
good and bad runs.  Rhythmbox is part of our public deployment, and we are
collecting good and bad feedback profiles on an ongoing basis.  This bug
was not discovered as a direct result of that public feedback data, which
has not yet reached the level of statistical significance.  Instead, it
appeared during an examination of several thousand scripted runs of
Rhythmbox in shuffle mode over a large music library.

We have identified several program behaviors which are strong crash
predictors for Rhythmbox on these scripted runs.  One of them concerns the
calls to monkey_media_player_get_uri() in info_available_cb() on lines 1769
and 1775 of "shell/rb-shell-player.c".  Specifically, when
monkey_media_player_get_uri() returns NULL on these lines, Rhythmbox is
much more likely to crash.

Unfortunately our public deployment of Rhythmbox does not measure this
"monkey_media_player_get_uri() == NULL" predicate, so I cannot give you an
estimate of how often this crash is in the hands of real users.  In our
scripted experiment, this behavior is at least partially implicated in
roughly 96% of failures, or 1.4% of all runs.  Note however that our
scripted tests interact with Rhythmbox in fairly limited ways; actual user
bugs are likely to be more varied and therefore it is difficult to estimate
how common this failure is in the field.

Interpretation:

The interesting activity is in "shell/rb-shell-player.c".  This code
implements a RBShellPlayer object which is typically has a singleton
instance that lives for the duration of a Rhythmbox process.  The
constructor, rb_shell_player_init(), calls monkey_media_player_new() to
create a subsidiary MonkeyMediaPlayer instance.  This will also typically
be a singleton, and is effectively "owned" by the RBShellPlayer instance. 
The MonkeyMediaPlayer instance is stored in player->priv->mmplayer.  Note
that a newly allocated object initially has reference count 1; this will
become important later.

Continuing on, rb_shell_player_init() now calls g_signal_connect() six
times to install six callbacks on various MonkeyMediaPlayer events: "info",
"eos", "tick", etc.  In each g_signal_connect() call, observe that the last
argument is "player".  This is the callback data.  So
player->priv->mmplayer will actually be holding on to a few pointers to our
RBShellPlayer instance.

Also in rb_shell_player_init() you will see three calls to
eel_gconf_notification_add().  These are also installing callbacks, though
to a different subsystem.  Again, note that "player" is given as the
callback data in each case.

Now shift your attention to rb_shell_player_finalize(), which is the
destructor for RBShellPlayer.  This will typically be called only when the
application is quitting.  Inside this function you will find a call to
g_object_unref() on player->priv->mmplayer.  In most runs, this drives the
reference count of player->priv->mmplayer to 0.  The MonkeyMediaPlayer
instance will be finalized and freed from within the g_object_unref() call.
 Part of that finalization process removes all signal callbacks, so the six
callbacks registered when rb_shell_player_init() called g_signal_connect()
will be removed and will never fire again.

Problem #1: in some cases, the g_object_unref() call does not actually
drive the MonkeyMediaPlayer instance's reference count to 0.  The
MonkeyMediaPlayer instance remains alive, at least for a little while. It
will be reclaimed eventually, but for now it remains alive.  That delay is
long enough for it to trigger some signal callbacks, including any of the
six callbacks registered when rb_shell_player_init() called
g_signal_connect().  But when those callbacks are invoked, they are given a
pointer to the RBShellPlayer which has just been finalized.  The callback
data is a dangling pointer to freed memory, and at that point anything can
happen.

Problem #2: the three additional callbacks registered when
rb_shell_player_init() called eel_gconf_notification_add() are also still
active after rb_shell_player_finalize() has completed and the RBShellPlayer
instance has been reclaimed.  If one of those were to trigger, they would
also be using a dangling pointer to freed memory as their callback data. 
This is unlikely to happen much in practice, though, as GConf settings do
not change very often in typical use.

Suggested fix for problem #1: either the MonkeyMediaPlayer instance must
always be reclaimed within rb_shell_player_finalize() or else the six
callbacks installed on the MonkeyMediaPlayer instance must be removed
within rb_shell_player_finalize().  The former is tricky; there are many
other places which temporarily increase the reference count on this
instance, and it would be difficult to ensure that none of them have
anything going on at RBShellPlayer finalization time.  So I've implemented
the second approach, which is to unregister all six callbacks during
RBShellPlayer finalization.

Suggested fix for problem #2: unregister all three GConf notification
callbacks within rb_shell_player_finalize().

How would these bugs result in the behavior detected by the Cooperative Bug
Isolation Project instrumentation?  The two strong returns predictors show
that crashes happen when info_available_cb() calls
monkey_media_player_get_uri() and gets a NULL result.  info_available_cb()
is one of the six callbacks discussed earlier.  When problem #1 occurs, and
info_available_cb() is called, the callback data is a dangling pointer but
the data it points to still looks like a valid RBShellPlayer instance.  The
freed memory has not yet been overwritten.  However, the whole application
is in the midst of shutting down.  Playback has stopped, threads are being
terminated, it's the end of the world.  The dangling pointer still lets us
find the MonkeyMediaPlayer instance, but that object has already stopped
playback and therefore its currently-playing URI is NULL.

How could my instrumentation have been better?  Honestly, this is a tricky
bug for my system to detect.  All objects are eventually reclaimed, so it's
not a matter of detecting that some finalize code is called in good runs
but not called in bad ones.  I doubt that it would help to add
instrumentation inside g_object_unref() either, as this code is very
heavily used from many different places.

However, I am now considering a new instrumentation scheme that looks at
calls to g_object_unref().  Before any such call, count how often the
reference count is 1 and how often it is >1.  That is, count how often each
g_object_unref() call will cause the object to be reclaimed and how often
it won't.  A scheme like that would have revealed that failure is more
likely when the MonkeyMediaPlayer instance is not reclaimed within
rb_shell_player_finalize().  That fact would have led me to the bug much
more quickly.

I'll note for the record that the stack traces when this bug appears are
not useful.  In fact, they are indistinguishable from those resulting from
bug #137460, which is also a heap corruption bug.  I suppose it is
interesting to note that this stack trace, ending in a call from
rhythmdb_query_model_entry_to_iter() to g_hash_table_lookup(), keeps coming
back time after time.  It appears to be the place where Rhythmbox goes to
die once it has trashed its heap.
Comment 1 Ben Liblit 2004-03-21 07:45:44 UTC
Created attachment 25849 [details] [review]
suggested fix: remove callbacks during finalization
Comment 2 Ben Liblit 2004-03-21 07:48:24 UTC
The patch attached above is against the official Rhythmbox 0.6.5
source code.  I believe that the same problem still exists in CVS
HEAD.  If adapting this patch to CVS HEAD proves difficult, please let
me know and I will try to put together a proper CVS HEAD patch.
Comment 3 Colin Walters 2004-03-26 02:07:41 UTC
Thanks for another great bug report.  I solved the player callbacks in a
slightly different way - I just used g_signal_connect_object.  This should
ensure that if the object is finalized, the callback will be removed.

* committed rhythmbox-devel@gnome.org--2004/rhythmbox--main--0.7--patch-168
Comment 4 Ben Liblit 2004-03-26 20:45:44 UTC
Aah, I didn't know about g_signal_connect_object().  Cool.  No doubt there are a
few other g_signal_connect() calls lurking about that could benefit from being
switched to g_signal_connect_object() as well....