GNOME Bugzilla – Bug 137834
dangling RBShellPlayer callbacks cause crashes
Last modified: 2004-12-22 21:47:04 UTC
In brief: The singleton instance of RBShellPlayer installs several callbacks at initialization time. None of these callbacks are removed during RBShellPlayer finalization. This creates a race condition during application shutdown. If any of those callbacks are triggered after the RBShellPlayer singleton has been destroyed, the callback functions will be manipulating dangling pointers to freed memory. Experiments show that this causes a significant number of crashes. Methodology: This bug was uncovered as part of the Cooperative Bug Isolation Project <http://www.cs.berkeley.edu/~liblit/sampler/>, which attempts to find bugs by discovering statistically significant differences in the behavior of good and bad runs. Rhythmbox is part of our public deployment, and we are collecting good and bad feedback profiles on an ongoing basis. This bug was not discovered as a direct result of that public feedback data, which has not yet reached the level of statistical significance. Instead, it appeared during an examination of several thousand scripted runs of Rhythmbox in shuffle mode over a large music library. We have identified several program behaviors which are strong crash predictors for Rhythmbox on these scripted runs. One of them concerns the calls to monkey_media_player_get_uri() in info_available_cb() on lines 1769 and 1775 of "shell/rb-shell-player.c". Specifically, when monkey_media_player_get_uri() returns NULL on these lines, Rhythmbox is much more likely to crash. Unfortunately our public deployment of Rhythmbox does not measure this "monkey_media_player_get_uri() == NULL" predicate, so I cannot give you an estimate of how often this crash is in the hands of real users. In our scripted experiment, this behavior is at least partially implicated in roughly 96% of failures, or 1.4% of all runs. Note however that our scripted tests interact with Rhythmbox in fairly limited ways; actual user bugs are likely to be more varied and therefore it is difficult to estimate how common this failure is in the field. Interpretation: The interesting activity is in "shell/rb-shell-player.c". This code implements a RBShellPlayer object which is typically has a singleton instance that lives for the duration of a Rhythmbox process. The constructor, rb_shell_player_init(), calls monkey_media_player_new() to create a subsidiary MonkeyMediaPlayer instance. This will also typically be a singleton, and is effectively "owned" by the RBShellPlayer instance. The MonkeyMediaPlayer instance is stored in player->priv->mmplayer. Note that a newly allocated object initially has reference count 1; this will become important later. Continuing on, rb_shell_player_init() now calls g_signal_connect() six times to install six callbacks on various MonkeyMediaPlayer events: "info", "eos", "tick", etc. In each g_signal_connect() call, observe that the last argument is "player". This is the callback data. So player->priv->mmplayer will actually be holding on to a few pointers to our RBShellPlayer instance. Also in rb_shell_player_init() you will see three calls to eel_gconf_notification_add(). These are also installing callbacks, though to a different subsystem. Again, note that "player" is given as the callback data in each case. Now shift your attention to rb_shell_player_finalize(), which is the destructor for RBShellPlayer. This will typically be called only when the application is quitting. Inside this function you will find a call to g_object_unref() on player->priv->mmplayer. In most runs, this drives the reference count of player->priv->mmplayer to 0. The MonkeyMediaPlayer instance will be finalized and freed from within the g_object_unref() call. Part of that finalization process removes all signal callbacks, so the six callbacks registered when rb_shell_player_init() called g_signal_connect() will be removed and will never fire again. Problem #1: in some cases, the g_object_unref() call does not actually drive the MonkeyMediaPlayer instance's reference count to 0. The MonkeyMediaPlayer instance remains alive, at least for a little while. It will be reclaimed eventually, but for now it remains alive. That delay is long enough for it to trigger some signal callbacks, including any of the six callbacks registered when rb_shell_player_init() called g_signal_connect(). But when those callbacks are invoked, they are given a pointer to the RBShellPlayer which has just been finalized. The callback data is a dangling pointer to freed memory, and at that point anything can happen. Problem #2: the three additional callbacks registered when rb_shell_player_init() called eel_gconf_notification_add() are also still active after rb_shell_player_finalize() has completed and the RBShellPlayer instance has been reclaimed. If one of those were to trigger, they would also be using a dangling pointer to freed memory as their callback data. This is unlikely to happen much in practice, though, as GConf settings do not change very often in typical use. Suggested fix for problem #1: either the MonkeyMediaPlayer instance must always be reclaimed within rb_shell_player_finalize() or else the six callbacks installed on the MonkeyMediaPlayer instance must be removed within rb_shell_player_finalize(). The former is tricky; there are many other places which temporarily increase the reference count on this instance, and it would be difficult to ensure that none of them have anything going on at RBShellPlayer finalization time. So I've implemented the second approach, which is to unregister all six callbacks during RBShellPlayer finalization. Suggested fix for problem #2: unregister all three GConf notification callbacks within rb_shell_player_finalize(). How would these bugs result in the behavior detected by the Cooperative Bug Isolation Project instrumentation? The two strong returns predictors show that crashes happen when info_available_cb() calls monkey_media_player_get_uri() and gets a NULL result. info_available_cb() is one of the six callbacks discussed earlier. When problem #1 occurs, and info_available_cb() is called, the callback data is a dangling pointer but the data it points to still looks like a valid RBShellPlayer instance. The freed memory has not yet been overwritten. However, the whole application is in the midst of shutting down. Playback has stopped, threads are being terminated, it's the end of the world. The dangling pointer still lets us find the MonkeyMediaPlayer instance, but that object has already stopped playback and therefore its currently-playing URI is NULL. How could my instrumentation have been better? Honestly, this is a tricky bug for my system to detect. All objects are eventually reclaimed, so it's not a matter of detecting that some finalize code is called in good runs but not called in bad ones. I doubt that it would help to add instrumentation inside g_object_unref() either, as this code is very heavily used from many different places. However, I am now considering a new instrumentation scheme that looks at calls to g_object_unref(). Before any such call, count how often the reference count is 1 and how often it is >1. That is, count how often each g_object_unref() call will cause the object to be reclaimed and how often it won't. A scheme like that would have revealed that failure is more likely when the MonkeyMediaPlayer instance is not reclaimed within rb_shell_player_finalize(). That fact would have led me to the bug much more quickly. I'll note for the record that the stack traces when this bug appears are not useful. In fact, they are indistinguishable from those resulting from bug #137460, which is also a heap corruption bug. I suppose it is interesting to note that this stack trace, ending in a call from rhythmdb_query_model_entry_to_iter() to g_hash_table_lookup(), keeps coming back time after time. It appears to be the place where Rhythmbox goes to die once it has trashed its heap.
Created attachment 25849 [details] [review] suggested fix: remove callbacks during finalization
The patch attached above is against the official Rhythmbox 0.6.5 source code. I believe that the same problem still exists in CVS HEAD. If adapting this patch to CVS HEAD proves difficult, please let me know and I will try to put together a proper CVS HEAD patch.
Thanks for another great bug report. I solved the player callbacks in a slightly different way - I just used g_signal_connect_object. This should ensure that if the object is finalized, the callback will be removed. * committed rhythmbox-devel@gnome.org--2004/rhythmbox--main--0.7--patch-168
Aah, I didn't know about g_signal_connect_object(). Cool. No doubt there are a few other g_signal_connect() calls lurking about that could benefit from being switched to g_signal_connect_object() as well....