GNOME Bugzilla – Bug 167714
Large performance leak in signal_impl::notify and sweep
Last modified: 2017-11-09 08:56:18 UTC
The static member function signal_impl::notify() is called every time a connection to the signal is disconnected. It then sweeps the entire list of slots to find the one that was disconnected and removes it from the list. If notify() is called during an emission, it defers this sweeping until the end of the emission. However, the flag that is set to trigger the sweep at the end of emission is never unset, which causes sweeping to happen even more often than necessary.

sweep() is O(n), and performing it repeatedly (on every disconnection) causes big problems: 52% of my application's running time was spent in these functions.

I have produced a patch that fixes this problem, although I don't think it is ideal; the ideal solution would likely require more fundamental internal design changes. With the patch, notify() is O(1) (it no longer calls sweep()) unless it runs during an emission, in which case it is still O(n) (and still defers the work to sweep() at the end of the emission).
Created attachment 37598 [details] [review]
Fix performance leak in signal_impl::notify
I have committed the 'deferred_ = false' fix in signal_impl::sweep(). For the O(n) problem we need a more thorough fix. I will try to add, in a less hackish way, a list that keeps track of all slots for which notify() is called during signal emission.
From what I can see, a thorough fix requires breaking the ABI of signal_impl. I will prepare a bugfix that can go into the next major release.
I read over the comments again, and I think I had originally misunderstood one. I get the impression that comment #2 implies a belief that this is only a problem for slots disconnecting during signal emission. I wanted to clarify that this occurs on -every- disconnect. If this was already clear, ignore this additional comment :)
Even if it breaks ABI, a patch for this would be informative.
A patch would still be nice.
As bug 154498, bug 564005, and bug 642203 show, there are cases where the slot, which shall call signal_impl::notify(), is deleted before it makes that call. I have suggested in bug 564005, comments 8 and 12, that slot_rep::notify() shall not call the slot's parent (e.g. signal_impl::notify()) in such cases. If my suggested fix is implemented, the patch in this bug's comment 1 will replace a performance leak with a memory leak: in cases where signal_impl::notify() is not called, the self_and_iter struct allocated in signal_impl::insert() will never be deallocated.

sigc::signal_impl is not involved in the problematic parts of the test cases in the 3 mentioned bugs; instead it's Glib::SignalProxyConnectionNode which is part of the problem. But in bug 564005 comment 14, I have attached a test case without Glib::SignalProxyConnectionNode. If that test case is run without the patch in bug 564005 comment 12, signal_impl::notify() is called, the self_and_iter object is deallocated, and there is no memory leak. But the call to signal_impl::notify() is made while the called signal_impl object is being destructed. valgrind reports several invalid reads and writes during that call.

Conclusion: Avoid allocating memory which will be deallocated only if signal_impl::notify() is called. Or suggest another fix of bug 564005.
What's the status now that those bugs are fixed?
The status of this bug is the same as it was before I added comment 7. Bug 564005 and its relatives have been fixed in a better way than discussed in comment 7. The pushed patch in bug 564005 comment 18 does not introduce new memory leaks if it's combined with the patch in this bug's comment 1.

But now that I've looked at it once again, I've discovered another way of getting a memory leak with the patch in comment 1. This is not related to bug 564005, but the test case in bug 564005 comment 14 can be used to demonstrate the memory leak. (Also available at http://git.gnome.org/browse/libsigc++2/tree/tests/test_bind_refptr.cc) Comment out or delete the line "#define ACTIVATE_BUG 1" in test_bind_refptr.cc, compile with this bug's patch applied, and run with valgrind. It will report a memory leak: the self_and_iter object, allocated by signal_impl::insert(), is not deleted. In this case the Action object, and thus also Action::sig1, are deleted before the slot connected to Action::sig1 has been disconnected. signal_impl::notify(), which should have deleted self_and_iter, is not called. Without the patch in comment 1, this behaviour does not cause any problem.

Boost.signals uses a similar technique to the one proposed by Neal in this bug, disconnecting slots in O(1) time instead of O(n). There's no memory leak there. (Test case in bug 564005 comment 15.) I don't know how it's done.

Conclusion: Avoid allocating memory which will be deallocated only if signal_impl::notify() is called, unless you can be sure that it _will_ be called.
(In reply to comment #9)
> Boost.signals uses a similar technique as the one proposed by Neal in this
> bug, disconnecting slots in O(1) time instead of O(n). There's no memory
> leak there. (Test case in bug 564005 comment 15.) I don't know how it's done.

Probably the slots are stored in some kind of intrusive list. In other words, each slot has "next" and "prev" pointers that can be used to add it to a list. The list itself is implemented as a cyclic list with a sentinel: the parent list structure also has those next and prev pointers, which point at the head and tail, respectively. An empty list has prev == next == &list. This way, a slot can remove itself from the list in O(1) time:

  slot->next->prev = slot->prev;
  slot->prev->next = slot->next;
[Resetting QA Contact and Assignee to default - see bug 659799 for more info]
Created attachment 206439 [details] [review]
patch: signal_base, signal_impl: Speed up disconnection of slots.

This patch adds Neal's self_and_iter struct without causing memory leaks. The trick is to disconnect every slot (calling signal_impl::notify()) before it's removed from the signal's slot list. I believe that this is also how boost::signals avoids memory leaks.
Created attachment 206440 [details]
A test case to measure execution time for disconnection.

This program can be used for measuring the execution time with and without the patch in comment 12. Some results on my fairly old PC:

Slot list size   # disconnections   Without patch   With patch
--------------   ----------------   -------------   ----------
           100                100            5 ms         4 ms
           100               1000            6 ms         5 ms
           100              10000           16 ms        15 ms
          1000                100            6 ms         5 ms
          1000               1000           14 ms         6 ms
          1000              10000          103 ms        16 ms
         10000              10000         1080 ms        26 ms

The difference is significant only when at least about 1000 slots are connected to the signal and slots are connected and disconnected thousands of times. Neal's program, where half of the execution time is spent disconnecting slots, must be exceptional.
I haven't pushed the patch in comment 12 because I think the time measurements in comment 13 indicate that it's unnecessary. Is there an application where the patch makes a noticeable difference? Neal's application, mentioned in comment 0, might be an example, but it's possible that Martin's commit in comment 2 plugged most of the performance leak. As mentioned in TODO comments in the patch, there is a marginally better way to make slot disconnection O(1) when ABI can be broken. Perhaps a fix can be reconsidered at the next ABI break.
Krzysztof, you say in https://mail.gnome.org/archives/gtkmm-list/2013-July/msg00052.html that this is the worst libsigc++ bug. You should have objected to my conclusion in comment 14. Have you got a program where disconnection of slots makes up a considerable part of the program's execution time? Is something wrong with my timings in comment 13? The patch in comment 12 (or an updated version of it) can be applied without breaking ABI.
My opinion was very subjective. Some time ago I tried using libsigc++ in Inkscape's node editor, and the result was that selecting/deselecting a large path caused pauses of several seconds even on fast machines. I later rewrote the code to use virtual functions to work around this performance problem. I'm not sure whether it was caused by this O(n) behavior or just by the overhead of creating and deleting so many slots. I assumed that nothing was done about the bug for such a long time because the fix was ABI-breaking. I must have missed your earlier comments.
I have pushed the signal_base.cc part of the patch in comment 12. The other parts are obsolete and not related to this bug. https://git.gnome.org/browse/libsigc++2/commit/?id=ab3f6cd266e9cfa342847cde908b717cdacecea1
I forgot to update the documentation of signal_impl::notify() when that function was updated. Done now, at last. https://git.gnome.org/browse/libsigc++2/commit/?id=755a723b721da6b490b7f98ad29f0a5b2e0cd5a6
Created attachment 322971 [details] [review]
0001-signal_base-clear-signal_impl-in-its-own-destructor.patch

> As mentioned in TODO comments in the patch, there is a marginally better way
> to make slot disconnection O(1) when ABI can be broken. Perhaps a fix can be
> reconsidered at the next ABI break.

What ABI-breaking change would have to be made? I guess it's more than just what's in this patch to libsigc++-3.0 (in the variadic_bind4 branch today). This patch causes several test cases to segfault.
It's slightly more difficult than I thought. But only slightly. All test cases pass (in the master branch), if you change ~signal_impl() to

signal_impl::~signal_impl()
{
  // unreference() must not call ~signal_impl() while clear() is executing.
  ++ref_count_;

  // Disconnect all slots before *this is deleted.
  clear();

  // Now ref_count_ can be cleared again (not really necessary), but don't do it
  // with a call to unreference(). That would invoke ~signal_impl() recursively.
  --ref_count_;
}

Of course you can skip --ref_count_ and the corresponding comment.
Thanks. I have pushed that for libsigc++-3.0: https://git.gnome.org/browse/libsigc++2/commit/?h=variadic_bind4&id=06e86e782035b187ccd9509af0d82276dfc21698
However, what about that would have been ABI breaking?
Now ~signal_impl() is compiler-generated. A compiler can choose to inline it. Don't you think this scenario is possible?

- An application is built with a compiler and an optimization level that makes ~signal_impl() inlined.
- libsigc++ is upgraded to a version where ~signal_impl() is user-supplied, and can't be inlined. The application is not rebuilt.
- Now the application calls its old inlined ~signal_impl(), which does not call clear(), and the new user-supplied ~signal_base(), which does not call impl_->clear().

Result: signal_impl::clear() will not be called when a signal is deleted.
Yes, well spotted.
These changes appear to have introduced leaks in at least 2 cases I've found. These leaks were not present in libsigc++-2.3.1 (RHEL 7.4 updated libsigc++ from that version to 2.10.0). Both patterns result in leaks if existing applications follow them but the dynamically linked library is upgraded.

The first occurs when an empty sigc::slot is connected to a signal:

#include <sigc++/signal.h>

int main(void)
{
  sigc::slot<void()> empty;
  sigc::signal<void> sig;
  sig.connect(empty);
}

The empty slot has no rep_, and thus the set_parent() call that would normally hook signal_impl::notify() does not happen, resulting in the self_and_iter object never getting deleted. An interesting related case: if empty is defined using the deprecated sigc::slot<void> syntax, the leak vanishes when recompiled against 2.10.0 headers, as the slot then goes through copy construction (which results in a rep_).

I'm not a sigc expert, so this may have issues I haven't found, but perhaps a fix here would be to alter slot_base::set_parent() to create a dummy rep_ so the cleanup callback can be held. It looks to me like this should be safe, though again, I'm not an expert in this codebase. At present, slot_base::set_parent() will only call slot_rep::set_parent() iff rep_ is set. This change would construct a dummy rep if it's not set and always call slot_rep::set_parent():

void slot_base::set_parent(void* parent, void* (*cleanup)(void*)) const noexcept
{
  if (!rep_)
    rep_ = new slot_rep(nullptr, nullptr, nullptr);
  rep_->set_parent(parent, cleanup);
}

============

The second case involves deletion of the signal while signal emission is in progress. This leaks the same self_and_iter object, though through a different path. It's unclear whether this is intended to be supported. It worked fine in earlier versions, and the bookkeeping of signal_impl by the signal_exec objects suggests it was intended to work.
Here's a minimal example of this pattern:

#include <sigc++/adaptors/bind.h>
#include <sigc++/signal.h>

void reap_sig(sigc::signal<void>* p_sig)
{
  delete p_sig;
}

int main(void)
{
  sigc::signal<void>* p_sig = new sigc::signal<void>();
  p_sig->connect(sigc::bind(sigc::ptr_fun(&reap_sig), p_sig));
  p_sig->emit();
}

This has the signal object deleted by the function called during emission. The pattern is typically not so bare; more often an object is constructed and fed into a reactor, and the signal is an event triggering its self-destruction.

The leak happens because, during signal emission, reference counting means that notify() (where self_and_iter would be deleted) is never called, as destruction of signal_base does not call clear() if the impl is still referenced. The reference-count drop in the unwind after signal emission ends up done via signal_impl::unreference_exec(), which does a "delete this" without any path to clear() or notify(). The signal_exec/signal_impl interaction appears to be entirely in header code, meaning there's no easy way to fix this purely in library code (I haven't been able to come up with one, at least).

This appears to be the hazard described in comment #7 and comment #23. It's unclear from the documentation I've dug through whether this is a pattern that's intended to be supported. Given that libsigc++-2.10.0 has been out for more than a year and nobody has pointed this out until now, it may not be; but if it is a forbidden pattern, there should either be some protection against it, or at least documentation, especially since this behavior worked fine prior to this version.
I forgot to mention in my report above: a workaround for the deletion-during-emission leak is to copy the signal prior to emission, though that obviously has big performance implications. Changing the emit in my above example to the following works around the leak successfully:

sigc::signal<void>(*p_sig).emit();
Thanks for your very detailed bug comment. I've pushed fixes for the empty slot problem to the master branch and the libsigc++-2-10 branch. Your suggested fix of set_parent() would work fine in the libsigc++-2-10 branch, but I chose a slightly different solution, mainly because your fix can't be applied to the master branch with the coming ABI-breaking libsigc++3: slot_rep is an abstract base class there, and only subclasses of slot_rep can be instantiated.

I don't know if deletion of a signal during emission was ever supposed to work, but perhaps it was. Anyway, it works without a memory leak in the master branch, where signal_impl::clear() is called from the signal_impl dtor instead of the signal_base dtor. This is an ABI-breaking change, and can't be applied to libsigc++2.

I've added some documentation to signal_base in the libsigc++-2-10 branch, mentioning your workaround.