Bug 754687 - Drop the GSlice allocator
Status: RESOLVED OBSOLETE
Product: glib
Classification: Platform
Component: gslice
Version: unspecified
Hardware: Other All
Importance: Normal normal
Target Milestone: ---
Assigned To: gtkdev
QA Contact: gtkdev
Depends on:
Blocks:
 
 
Reported: 2015-09-07 15:18 UTC by Matthias Clasen
Modified: 2018-05-24 18:13 UTC
See Also:
GNOME target: ---
GNOME version: ---


Attachments
Drop the GSlice allocator (63.23 KB, patch)
2015-09-07 15:18 UTC, Matthias Clasen
Status: none
with gslice, but disabled by valgrind detection (692.07 KB, application/x-bzip2)
2015-09-08 17:23 UTC, Matthias Clasen
without gslice (the patch in this bug applied) (692.61 KB, application/x-bzip2)
2015-09-08 17:23 UTC, Matthias Clasen
with slice (697.96 KB, application/x-bzip2)
2015-09-08 17:46 UTC, Matthias Clasen
Add GtkContainer api to handle children as an array (6.81 KB, patch)
2015-09-10 10:01 UTC, Alexander Larsson
Status: none
box: Use gtk_container_get_children_array() (2.36 KB, patch)
2015-09-10 10:01 UTC, Alexander Larsson
Status: none
gdkwindow: Store children list nodes in GdkWindow structure (6.59 KB, patch)
2015-09-10 11:01 UTC, Alexander Larsson
Status: committed
gdkwindow: Avoid list allocation and object refs during repaint (3.02 KB, patch)
2015-09-14 07:45 UTC, Alexander Larsson
Status: committed
fetch last linked list node and count children at same time (1.49 KB, patch)
2015-09-14 19:27 UTC, Christian Hergert
Status: committed

Description Matthias Clasen 2015-09-07 15:18:42 UTC
Since malloc implementations have caught up since GSlice was
introduced, it is no longer an advantage to have our own allocator
here. Just always use malloc. The g_slice API is preserved.
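
For illustration only, here is a minimal sketch of what "always use malloc while preserving the API" could look like. The names demo_slice_alloc, demo_slice_alloc0 and demo_slice_free1 are hypothetical stand-ins shaped like slice-style entry points (the caller passes the block size on both allocation and release). This is not the attached patch, and aborting on allocation failure is an assumption of the sketch.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for a slice-style allocation call: same shape
 * as before, but every request goes straight to the system allocator. */
void *demo_slice_alloc(size_t block_size)
{
    void *mem = malloc(block_size);
    /* Assumption of this sketch: treat allocation failure as fatal. */
    assert(mem != NULL || block_size == 0);
    return mem;
}

/* Zero-filled variant built on top of the plain allocation call. */
void *demo_slice_alloc0(size_t block_size)
{
    void *mem = demo_slice_alloc(block_size);
    if (mem != NULL)
        memset(mem, 0, block_size);
    return mem;
}

/* The size argument is kept so existing callers still compile,
 * but malloc-backed memory does not need it for release. */
void demo_slice_free1(size_t block_size, void *mem)
{
    (void) block_size;
    free(mem);
}
```

Callers keep passing the block size on both sides; the wrapper ignores it when releasing.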
Comment 1 Matthias Clasen 2015-09-07 15:18:51 UTC
Created attachment 310831 [details] [review]
Drop the GSlice allocator
Comment 2 Colin Walters 2015-09-07 15:33:07 UTC
Do you have any references here?  This feels like a patch that's following some external discussion - is it on a mailing list/blog etc. that we could link to?

AFAICS there is very little going on in glibc malloc:
https://sourceware.org/git/?p=glibc.git;a=history;f=malloc/malloc.c;h=452f036387e0b5699e8f0fa33ed027abf066115e;hb=HEAD

jemalloc is a lot more active:
https://github.com/jemalloc/jemalloc

Notably wrt gslice: jemalloc has a "sized deallocation" which AIUI helps it with small chunks.

It'd be nice obviously to have some performance numbers with this.  I have a fear we're going to regress some application+platform combo (e.g. gedit on Windows).

Can we instead make this a build-time option for now?  That would allow people who don't want it to disable it today (e.g. if we think glibc on Linux is good enough now, people there can disable it.  Likewise for jemalloc on FreeBSD etc.)
Comment 3 Emmanuele Bassi (:ebassi) 2015-09-07 15:41:43 UTC
(In reply to Colin Walters from comment #2)
> Do you have any references here?  This feels like a patch that's following
> some external discussion - is it on a mailing list/blog etc. that we could
> link to?

This issue was discussed at the GTK+ hackfest in Berlin, 2015, and at GUADEC 2015:

  https://wiki.gnome.org/Projects/GTK%2B/Guadec2015

but it's also been discussed multiple times on the gtk-devel-list, the latest one in the discussion about dropping Windows XP:

  https://mail.gnome.org/archives/gtk-devel-list/2015-April/msg00003.html

> AFAICS there is very little going on in glibc malloc:
> https://sourceware.org/git/?p=glibc.git;a=history;f=malloc/malloc.c;
> h=452f036387e0b5699e8f0fa33ed027abf066115e;hb=HEAD
> 
> jemalloc is a lot more active:
> https://github.com/jemalloc/jemalloc

As soon as we start introducing a separate allocator, we'll have to deal with a mechanism to disable that, so that applications like Firefox can replace our allocator with their own.

If we just use malloc/free, then applications can replace those weak symbols with their own, without any additional cost for us, in terms of API and maintenance.

> It'd be nice obviously to have some performance numbers with this.  I have a
> fear we're going to regress some application+platform combo (e.g. gedit on
> Windows).

Yes, that would be a good thing to have. I know that Christian had some benchmark code that needed to be updated.
Comment 4 Matthias Clasen 2015-09-07 17:07:25 UTC
Christian pointed me at https://github.com/chergert/alloctest
Comment 5 Alexander Larsson 2015-09-07 19:32:19 UTC
At GUADEC it was also reported that there were a lot of backtraces against gslice in multithreaded setups. Has anyone researched this?
Comment 6 Alexander Larsson 2015-09-07 19:37:23 UTC
I think we need some basic benchmark results to go forward, but I believe the best approach is to just use malloc() as a standard api, and then allow apps to override this with whatever other malloc (like jemalloc) based on the specific allocation pattern in the app.

The one case where I think a specialized allocator in glib would make sense is for GList and GSList nodes specifically. We are heavy users of these, and using malloc for them is likely to grow them a bit. Also, with some ahead-of-time knowledge and integration between the list allocator and the list implementation, we can perhaps get better cache behaviour for list traversals (use consecutive list nodes, etc).
Comment 7 Tim-Philipp Müller 2015-09-07 19:41:58 UTC
Crashes in gslice in multi-threaded setups are usually caused by memory corruption and bad memory handling, just like with single-threaded setups. GStreamer uses a lot of GSlice-allocated memory, and a lot of threads, and I'm not aware of anything that indicates that there are problems with GSlice in a multi-threaded environment.
Comment 8 Matthias Clasen 2015-09-08 13:32:58 UTC
(In reply to Alexander Larsson from comment #6)

> The one case where I think a specialized allocator in glib would make sense
> is for the specific case for GList and GSList nodes. We are heavy users of
> these, and using malloc for them is likely to grow them a bit. Also, with
> some ahead of time knowledge and integration between the list allocator and
> the list implementation we can perhaps get better cache behaviour for
> list traversals (use consecutive list nodes, etc).

Go back to trash stacks and GAllocator?
Comment 9 Colin Walters 2015-09-08 13:58:28 UTC
We use linked lists *way* too much.  GPtrArray is almost always better.  And in the places where linked lists' performance characteristics match the problem domain, it's usually better to have them be intrusive.

Let's just ignore GList performance?
Comment 11 Sebastian Dröge (slomo) 2015-09-08 14:10:01 UTC
(In reply to Colin Walters from comment #9)
> We use linked lists *way* too much.
> [...]
> Let's just ignore GList performance?

You just gave the reason *not* to ignore GList performance in the first sentence ;)
Comment 12 Alexander Larsson 2015-09-08 14:49:30 UTC
So, I traced the g_slice codepath taken for GLists (i.e. the magazines), and it's pretty damn efficient. It's a per-thread free-list, so it's completely lock- and atomic-free (except for a single TLS lookup), and it's super simple. I'm pretty sure we could just extract that from g_slice and use it for G(S)List, and then drop g_slice for everything else (just fall back to malloc).

I had some ideas where we could have a g_list_alloc_near(GList *other) method that tries to put list nodes near each other to improve locality. This could be automatically used by things like g_list_append(). Not sure if this is worth the work.
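
To make the per-thread free-list idea concrete, here is a minimal sketch. DemoNode, demo_node_alloc and demo_node_free are hypothetical names, and this is not the magazine code from g_slice; it only shows why a thread-local cache needs no locks or atomics.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical node type shaped like a doubly linked list node. */
typedef struct DemoNode {
    void *data;
    struct DemoNode *next;
    struct DemoNode *prev;
} DemoNode;

/* One cache per thread: no other thread ever touches this pointer,
 * so pushing and popping need neither locks nor atomic operations. */
static _Thread_local DemoNode *free_nodes = NULL;

DemoNode *demo_node_alloc(void)
{
    DemoNode *node = free_nodes;
    if (node != NULL) {
        free_nodes = node->next;      /* pop from the thread-local cache */
    } else {
        node = malloc(sizeof *node);  /* cache empty: fall back to malloc */
        assert(node != NULL);
    }
    node->data = NULL;
    node->next = NULL;
    node->prev = NULL;
    return node;
}

void demo_node_free(DemoNode *node)
{
    /* Push onto the current thread's cache instead of calling free(). */
    node->next = free_nodes;
    free_nodes = node;
}
```

The sketch never bounds the cache or returns memory to the system, and a node released on another thread simply lands in that thread's cache; a real implementation would have to deal with both.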
Comment 13 Matthias Clasen 2015-09-08 17:23:29 UTC
Created attachment 310924 [details]
with gslice, but disabled by valgrind detection
Comment 14 Matthias Clasen 2015-09-08 17:23:51 UTC
Created attachment 310925 [details]
without gslice (the patch in this bug applied)
Comment 15 Matthias Clasen 2015-09-08 17:26:47 UTC
Not sure how representative this is, but this is the output of

libtool --mode execute valgrind --tool=callgrind ./demos/gtk-demo/gtk3-demo --run listbox --autoquit 1

Once with plain glib master, and once with this patch applied. I should mention that this is a somewhat unfair comparison, since the current code already enables always-malloc if it detects that it is running under valgrind. So the difference you see here is probably, to a good extent, the cost of the pthread_getspecific call we're still doing for each allocation, even with always-malloc.

You can open those files in kcachegrind to dig through the differences.
Comment 16 Matthias Clasen 2015-09-08 17:46:13 UTC
Created attachment 310926 [details]
with slice
Comment 17 Matthias Clasen 2015-09-08 17:48:29 UTC
The last attachment was produced by 

G_SLICE=help libtool --mode execute valgrind --tool=callgrind ./demos/gtk-demo/gtk3-demo --run listbox --autoquit 1

with an unpatched glib (setting G_SLICE defeats the valgrind autodetection, so the slice allocator was in use)
Comment 18 Matthias Clasen 2015-09-08 17:52:11 UTC
Here are the total instruction counts for the 3 runs:

gslice enabled      6 600 328 293
gslice patched out  6 776 239 372
gslice disabled     6 933 853 635
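
To put those counts in relative terms, a small helper like the following can be used. relative_overhead_percent is a hypothetical name introduced for this illustration only.

```c
#include <assert.h>
#include <stdint.h>

/* Extra instructions of `count` relative to `baseline`, in percent. */
double relative_overhead_percent(uint64_t baseline, uint64_t count)
{
    assert(baseline > 0);
    return ((double) count - (double) baseline) * 100.0 / (double) baseline;
}
```

Fed the counts above with "gslice enabled" as the baseline, the patched-out run comes to roughly 2.7% more instructions and the disabled run to roughly 5.1% more.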
Comment 19 Colin Walters 2015-09-08 20:28:21 UTC
(In reply to Sebastian Dröge (slomo) from comment #11)

> You just gave the reason *not* to ignore GList performance in the first
> sentence ;)

Yes...but I argue we should look at finding any remaining performance sensitive GList consumers and fix them.  To take a random example, `g_resolver_lookup_by_name`...probably not?  That case is kind of ugly because it's taking the intrusive list from the system and re-allocating it.  And we're allocating GObjects.

(In reply to Alexander Larsson from comment #12)
> So, I traced the g_slice codepath taken for GLists (i.e. the magazines), and
> it's pretty damn efficient. It's a per-thread free-list and so it's completely
> lock and atomic free (except a single TLS lookup),

I think both tcmalloc and jemalloc do similar things.


Anyways, I think what I'm arguing against here is keeping around GSlice just for GList.  That's kind of the worst of both worlds in that we'd still need to maintain the code - e.g. honor the `G_SLICE` environment variable so apps like Firefox can do memory accounting etc.
Comment 20 Alexander Larsson 2015-09-09 09:44:06 UTC
(In reply to Colin Walters from comment #19)
 
> Anyways, I think what I'm arguing against here is keeping around GSlice just
> for GList.  That's kind of the worst of both worlds in that we'd still need
> to maintain the code - e.g. honor the `G_SLICE` environment variable so apps
> like Firefox can do memory accounting etc.

Oh, I did not mean that. I meant we should drop g_slice_* (make it just call malloc), and then implement a super tiny allocator for g_list_alloc() and g_list_free().
Comment 21 Matthias Clasen 2015-09-09 10:37:41 UTC
Just to point this out: gslice or not, gmarkup has a cache of slist nodes that sits on top of g_slist_alloc.
Comment 22 Alexander Larsson 2015-09-09 14:38:07 UTC
(In reply to Matthias Clasen from comment #21)
> Just to point this out: gslice or not, gmarkup has a cache of slist nodes
> that sits on top of g_slist_alloc.

Did someone actually profile that before adding it?
Comment 23 Alexander Larsson 2015-09-09 14:40:26 UTC
gmarkup cache was added here: https://bugzilla.gnome.org/show_bug.cgi?id=572508

It's kind of silly though. If the g_slice codepath were not so convoluted from also supporting other kinds of allocation crap, it would end up doing pretty much the same as the gmarkup cache...
Comment 24 Colin Walters 2015-09-09 18:57:13 UTC
(In reply to Sebastian Dröge (slomo) from comment #11)
> (In reply to Colin Walters from comment #9)
> > We use linked lists *way* too much.
> > [...]
> > Let's just ignore GList performance?
> 
> You just gave the reason *not* to ignore GList performance in the first
> sentence ;)

Anyone using them in code that's medium-to-high performance sensitive is just Doing It Wrong though, and would be better off with something else.  (Modulo other constraints listed in that blog I posted)
Comment 25 Alexander Larsson 2015-09-10 07:30:48 UTC
(In reply to Colin Walters from comment #24)
> Anyone using them in code that's medium-to-high performance sensitive is
> just Doing It Wrong though, and would be better off with something else. 
> (Modulo other constraints listed in that blog I posted)

We do have them in our APIs all over the place though, including returning them as transfer=none, thus forcing us to continue using them internally. This means we can't totally ignore their performance.

I'm not sure which of these (if any) are problematic wrt performance though, as generally the lists are short.

A few of them are: gdk_window_get_children*, gtk_container_get_children, _gtk_box_get_children. In particular, the a11y code uses the container child list as an array (calling g_list_nth), which is a recipe for bad performance. Perhaps we should introduce alternative calls for this that work on arrays.
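
As a rough illustration of why indexing into a linked list is costly, the sketch below counts pointer hops for index-based access versus a plain traversal. SNode and the demo_* functions are hypothetical and do not come from GTK or GLib.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical singly linked node, standing in for a child list. */
typedef struct SNode {
    void *data;
    struct SNode *next;
} SNode;

/* Walks from the head every time, like an nth-element lookup on a list.
 * Every pointer hop is added to *steps. */
SNode *demo_list_nth(SNode *list, size_t n, size_t *steps)
{
    assert(steps != NULL);
    while (n-- > 0 && list != NULL) {
        list = list->next;
        (*steps)++;
    }
    return list;
}

/* Array-style access: the total number of hops grows quadratically. */
size_t demo_visit_by_index(SNode *list, size_t length)
{
    size_t steps = 0;
    for (size_t i = 0; i < length; i++)
        (void) demo_list_nth(list, i, &steps);
    return steps;
}

/* Plain traversal: one step per element. */
size_t demo_visit_by_traversal(SNode *list)
{
    size_t steps = 0;
    for (SNode *l = list; l != NULL; l = l->next)
        steps++;
    return steps;
}
```

For a list of n elements the indexed variant performs n(n-1)/2 hops, while the traversal touches each element once.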
Comment 26 Alexander Larsson 2015-09-10 10:00:56 UTC
So, I started making an array API for GtkContainer children, so that
performance critical code can use this. I got as far as adding it and
converting GtkBox to use it. Then I started looking at the travesty
that is the a11y code...

GtkContainerAccessible keeps a separate list of children, so that it
can do things like find the index of children (in particular, I guess,
of recently removed items), which seems ripe to be replaced with the
new API.... Unfortunately the whole thing seems broken, it tracks
additions/removals via calls to _gtk_container_accessible_add() in
gtk_container_add(), and similar with remove. This obviously breaks
for any other container adder (such as gtk_box_pack_start()). Not to
mention that the indexes it uses seem useless as there is no
child-reorder event, so the consumer cannot rely on the indexes in any
sensible way. For this to really work we should instead hook into
gtk_widget_set_parent/unparent, and possibly have some sort of reorder
event.

Furthermore, in the GtkMenuItemAccessible code it adds all submenus
set on the menu items as children in the container, when they are not
actually children of the GtkContainer implementation, which breaks the
correspondence between GtkContainer indexes and the a11y indexes.
I'm not sure how to fix this other than moving the custom children list
from GtkContainerAccessible into GtkMenuItemAccessible.

I'm not sure if this work is worth it. How much of the a11y code is
running when there is no a11y agent active?

Attaching the initial patches for possible future work.
Comment 27 Alexander Larsson 2015-09-10 10:01:33 UTC
Created attachment 311049 [details] [review]
Add GtkContainer api to handle children as an array

This lets you get the children as an array and use an index
of the children. The array is a more cache and allocation efficient way
to deal with large amounts of children, and the index handling methods
are useful when working with containers array-like, such as in the
a11y code.

This is an initial version based on the existing generic forall
vfunc. It can later be made more efficient for critical container
implementations by adding vfuncs.
Comment 28 Alexander Larsson 2015-09-10 10:01:39 UTC
Created attachment 311050 [details] [review]
box: Use gtk_container_get_children_array()

This is slightly more efficient.
Comment 29 Alexander Larsson 2015-09-10 11:01:16 UTC
Created attachment 311052 [details] [review]
gdkwindow: Store children list nodes in GdkWindow structure

This avoids a bunch of allocations, and additionally it has better
cache behaviour, as we don't follow pointers to the separate GList
node memory areas during traversal.
Comment 30 Alexander Larsson 2015-09-10 11:29:26 UTC
The last patch removes a bunch of allocations in GdkWindow, at the expense of allocating an extra GList-sized chunk of memory for leaf GdkWindows. It's a bit weird using GList like this, but it can't be avoided because we have to keep supporting gdk_window_peek_children().
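
The idea of storing the list node inside the structure can be sketched as follows. DemoList, DemoWindow and the demo_window_* functions are hypothetical and much simpler than the committed GdkWindow patch; the point is only that linking a child requires no separate node allocation.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical list node with the usual data/next/prev layout. */
typedef struct DemoList {
    void *data;
    struct DemoList *next;
    struct DemoList *prev;
} DemoList;

/* Hypothetical window type: the node linking it into its parent's
 * child list lives inside the window itself. */
typedef struct DemoWindow {
    struct DemoWindow *parent;
    DemoList *children;      /* head of this window's child list */
    DemoList children_node;  /* this window's link in parent->children */
} DemoWindow;

/* Prepend child to parent's list without allocating a node. */
void demo_window_add_child(DemoWindow *parent, DemoWindow *child)
{
    DemoList *node = &child->children_node;
    assert(child->parent == NULL);
    child->parent = parent;
    node->data = child;
    node->prev = NULL;
    node->next = parent->children;
    if (parent->children != NULL)
        parent->children->prev = node;
    parent->children = node;
}

/* Unlink child; there is nothing to free because the node is embedded. */
void demo_window_remove_child(DemoWindow *parent, DemoWindow *child)
{
    DemoList *node = &child->children_node;
    assert(child->parent == parent);
    if (node->prev != NULL)
        node->prev->next = node->next;
    else
        parent->children = node->next;
    if (node->next != NULL)
        node->next->prev = node->prev;
    node->next = NULL;
    node->prev = NULL;
    child->parent = NULL;
}
```

The list head still points at ordinary nodes, so code that only reads the child list keeps working, at the cost of one unused embedded node per leaf window.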
Comment 31 Christian Hergert 2015-09-13 19:44:16 UTC
I just applied the GdkWindow embedded GList node patch to my builds for quartz. I'm seeing a non-trivial performance improvement here.

This machine is a Retina MacBook Pro, so I've been working on getting GtkTextView (GtkPixelCache) up to our performance level on X11/Wayland. I'm seeing a jump from about 43 FPS to about 50 FPS.
Comment 32 Alexander Larsson 2015-09-13 19:50:04 UTC
Comment on attachment 311052 [details] [review]
gdkwindow: Store children list nodes in GdkWindow structure

Attachment 311052 [details] pushed as ea294fd - gdkwindow: Store children list nodes in GdkWindow structure
Comment 33 Alexander Larsson 2015-09-14 07:45:59 UTC
Created attachment 311258 [details] [review]
gdkwindow: Avoid list allocation and object refs during repaint

There is no need to ref the windows we're ignoring, so collect and ref
only the affected child windows. Also, use an on-stack array rather
than allocating a linked list.

Also, we don't need to ref during the event emissions either, as we
already hold a ref.
Comment 34 Alexander Larsson 2015-09-14 09:02:42 UTC
Comment on attachment 311258 [details] [review]
gdkwindow: Avoid list allocation and object refs during repaint

Attachment 311258 [details] pushed as eafedfb - gdkwindow: Avoid list allocation and object refs during repaint
Comment 35 Christian Persch 2015-09-14 16:04:07 UTC
Comment on attachment 311258 [details] [review]
gdkwindow: Avoid list allocation and object refs during repaint

Drive-by comment:

+  n_children = g_list_length (window->children);

+  for (l = g_list_last (window->children); l != NULL; l = l->prev)

This walks the list twice. You could just open-code getting the list length and the last element at the same time.
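
A minimal sketch of the single-pass variant suggested here, using a hypothetical DNode type and helper name rather than the actual GdkWindow code:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical doubly linked node with the usual data/next/prev layout. */
typedef struct DNode {
    void *data;
    struct DNode *next;
    struct DNode *prev;
} DNode;

/* Returns the last node and stores the number of nodes in *n_nodes,
 * walking the list once instead of once for the length and once
 * more for the tail. */
DNode *demo_list_last_and_length(DNode *list, size_t *n_nodes)
{
    DNode *last = NULL;
    size_t n = 0;

    assert(n_nodes != NULL);
    for (DNode *l = list; l != NULL; l = l->next) {
        last = l;
        n++;
    }
    *n_nodes = n;
    return last;
}
```

The caller can then iterate backwards from the returned node via prev, as the loop quoted above does.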
Comment 36 Christian Hergert 2015-09-14 19:27:20 UTC
Created attachment 311309 [details] [review]
fetch last linked list node and count children at same time

I mentioned the same thing as Christian yesterday. Here is a quick patch to do so.
Comment 37 Christian Hergert 2015-09-14 19:41:07 UTC
Comment on attachment 311309 [details] [review]
fetch last linked list node and count children at same time

ack'd on irc by mclasen
Comment 38 David Jaša 2016-04-21 11:24:32 UTC
G_SLICE=debug-blocks helps catch some otherwise hard-to-detect memory errors. Will there be any replacement when GSlice is phased out?
Comment 39 Bastien Nocera 2016-04-21 11:25:50 UTC
(In reply to David Jaša from comment #38)
> G_SLICE=debug-blocks helps catch some otherwise hard-to-detect memory
> errors. Will there be any replacement when GSlice is phased out?

MALLOC_PERTURB_ on glibc systems, valgrind, heck, even ElectricFence.
Comment 40 Sebastian Dröge (slomo) 2018-05-05 11:10:24 UTC
Christian was doing some tests recently and seemed to have seen tcmalloc being much faster than glibc malloc. So maybe it should be reconsidered whether GSlice should be dropped (i.e. always use malloc), or if GSlice should just become a wrapper around tcmalloc (or jemalloc?) instead. In case of the latter it should probably be a configure option though, to keep library size smaller if needed.

Maybe Christian can share his benchmarks :)
Comment 41 Christian Hergert 2018-05-05 23:50:53 UTC
I haven't done any new tests since 2014.

  https://github.com/chergert/alloctest

But I think the concern at the time was that GSlice was better on Windows in
terms of fragmentation by keeping similar allocations near each other?

tcmalloc and similar allocators do some of that automatically IIRC.

In terms of Linux, tcmalloc outperformed both GSlice and g_malloc. But I don't
think I was tracking total memory usage (and we should probably modify the tests
to do so before making any decisions).
Comment 42 Emmanuele Bassi (:ebassi) 2018-05-06 09:56:50 UTC
I've taken the alloctest that Christian wrote and tweaked it a bit:

 - fewer threads, to look at the profile closer to 1 thread per core on a typical desktop machine
 - smaller memory allocation (64 kB instead of 256 kB) to simulate the typical use case for a GSlice
 - 250k iterations instead of 100k

Then I've tested the GNU libc system allocator on:

 - glibc 2.25 (Fedora 26)
 - glibc 2.26 (Fedora 27)
 - glibc 2.27 (Fedora 28)

on my Dell XPS with an Intel 8th gen Core i7 "Kaby Lake", 4 cores.

The results are that:

 - the GNU libc allocator has gotten incrementally better over the past few releases, though it's still not that great
 - glibc now allocates memory *slightly* more efficiently (lower VM peak memory, lower latency) than GSlice, *as long as* the number of threads is less than or equal to the number of CPU cores — then glibc takes a hit that gets amortized over a bunch of additional threads, and GSlice becomes slightly more efficient
 - the gmalloc abstraction is not as zero cost as we'd like; we should probably move it to pre-processor macros instead of using real functions, though it would make adding dtrace probes much harder
 - tcmalloc generally outperforms both the system allocator (a lot) and GSlice (moderately)

Given the performance profile of tcmalloc, I thought to test the case where we swapped out GSlice with tcmalloc, and ran the tests after making GSlice always go through the system allocator and replacing the system allocator with tcmalloc:

 - even if we swap out GSlice with tcmalloc when creating GObject instances, we're not going to get much in terms of performance
 - gobject instantiation is *not* dominated by the slice allocator, and it gets worse as soon as you involve more than one thread; the locks and synchronisation points are what kills us, even if we're not hitting property and signal code paths

I have the profile data available on request.
Comment 43 Philip Withnall 2018-05-09 11:12:03 UTC
(In reply to Emmanuele Bassi (:ebassi) from comment #42)
> I've taken the alloctest that Christian wrote and tweaked it a bit:
>  …

Thanks Emmanuele. Is it suitable to write up as a blog post (maybe to post once we’ve worked out what to do based on your research)?

We should not forget Windows in these performance measurements, though. GSlice has historically been a big win on Windows. Has the Windows system allocator performance improved enough that we can consider dropping GSlice there?

>  - the gmalloc abstraction is not as zero cost as we'd like; we should
> probably move it to pre-processor macros instead of using real functions,
> though it would make adding dtrace probes much harder

I’d like to not lose the dtrace probe support — unless we could rely on dtrace probes in the system malloc()?

>  - tcmalloc generally outperforms both the system allocator (a lot) and
> GSlice (moderately)

Over all allocation sizes/thread counts/core counts?

>  - even if we swap out GSlice with tcmalloc when creating GObject instances,
> we're not going to get much in terms of performance

Right. What about performance for other GLib allocations? afaik we’ve never really focussed on scaling GObject so you can create millions of GObjects really quickly; but we do care a bit more about that kind of thing for string allocations, hash tables, linked lists, pointer arrays, etc.

>  - gobject instantiation is *not* dominated by the slice allocator, and it
> gets worse as soon as you involve more than one thread; the locks and
> synchronisation points are what kills us, even if we're not hitting property
> and signal code paths

This is obviously not something we’re going to fix with allocator changes, and can be handled separately.

> I have the profile data available on request.

Might be good to attach it here for posterity anyway.

---

I’m currently thinking that we could deprecate the GSlice API, remove the GSlice implementation (and get the deprecated API to use g_malloc() instead), and recommend that people use tcmalloc() to replace the system allocator if they care about performance. I’d be open to using tcmalloc by default in GLib, as long as applications can swap that out for their own allocators if they want.

However, this all depends on the Windows (and OS X) allocator performance. If we’re still going to need GSlice to provide reasonable allocator performance on Windows, and can’t use tcmalloc instead, then we should keep it for all platforms, otherwise it’s going to be even more unmaintainable.
Comment 44 Nirbheek Chauhan 2018-05-09 11:29:47 UTC
(In reply to Philip Withnall from comment #43)
> We should not forget Windows in these performance measurements, though.
> GSlice has historically been a big win on Windows. Has the Windows system
> allocator performance improved enough that we can consider dropping GSlice
> there?
> 

Tim ran some tests for this on the "see also" bug 795828. I'm pasting some relevant comments here:

> It's hard to do proper measurements for our purposes, we need to test alloc from one thread and free in another thread for realistic usage, while at the same time having a test case where allocation/free takes up most of the cycles. 
> I have run some tests on a low-powered windows ec2 machine, that showed GSlice being ridiculously slow compared to the system allocator there (windows server 2006 =~ win10).
> IIRC main reason to use GSlice was because the sys allocator on ~Windows XP was horrendous, but that doesn't seem to be the case any longer
Comment 45 Emmanuele Bassi (:ebassi) 2018-05-09 12:06:30 UTC
(In reply to Philip Withnall from comment #43)
> (In reply to Emmanuele Bassi (:ebassi) from comment #42)
> > I've taken the alloctest that Christian wrote and tweaked it a bit:
> >  …
> 
> Thanks Emmanuele. Is it suitable to write up as a blog post (maybe to post
> once we’ve worked out what to do based on your research)?

I can blog about it, but in the meantime I can write a wiki page with the results:

  https://wiki.gnome.org/Projects/GLib/GSlicePeformanceTests

> We should not forget Windows in these performance measurements, though.
> GSlice has historically been a big win on Windows. Has the Windows system
> allocator performance improved enough that we can consider dropping GSlice
> there?

I'll let somebody who uses GLib on Windows comment on this.

> >  - the gmalloc abstraction is not as zero cost as we'd like; we should
> > probably move it to pre-processor macros instead of using real functions,
> > though it would make adding dtrace probes much harder
> 
> I’d like to not lose the dtrace probe support — unless we could rely on
> dtrace probes in the system malloc()?

Yes, we would rely on instrumentation on the system allocator for this.

> >  - tcmalloc generally outperforms both the system allocator (a lot) and
> > GSlice (moderately)
> 
> Over all allocation sizes/thread counts/core counts?

Over all thread counts, yes.

Size is left as a constant, but in general the curve of tcmalloc is pretty flat for various sizes up to a 1M as far as I can see.

> >  - even if we swap out GSlice with tcmalloc when creating GObject instances,
> > we're not going to get much in terms of performance
> 
> Right. What about performance for other GLib allocations? afaik we’ve never
> really focussed on scaling GObject so you can create millions of GObjects
> really quickly; but we do care a bit more about that kind of thing for
> string allocations, hash tables, linked lists, pointer arrays, etc.

Over non-GObject allocations tcmalloc still wins over GSlice.

> >  - gobject instantiation is *not* dominated by the slice allocator, and it
> > gets worse as soon as you involve more than one thread; the locks and
> > synchronisation points are what kills us, even if we're not hitting property
> > and signal code paths
> 
> This is obviously not something we’re going to fix with allocator changes,
> and can be handled separately.
> 
> > I have the profile data available on request.
> 
> Might be good to attach it here for posterity anyway.

I'll attach the output files to the wiki page above.
 
> ---
> 
> I’m currently thinking that we could deprecate the GSlice API, remove the
> GSlice implementation (and get the deprecated API to use g_malloc()
> instead), and recommend that people use tcmalloc() to replace the system
> allocator if they care about performance. I’d be open to using tcmalloc by
> default in GLib, as long as applications can swap that out for their own
> allocators if they want.

That's a bit complicated to do; I'd really try and keep GLib allocator agnostic, and suggest people use their system's allocator, and replace it using LD_PRELOAD to replace malloc()/free() if they want to change the allocator wholesale.

If we want to provide an additional allocator like we did with GMemChunks and GSlice, we can use things like tcmalloc or jemalloc — but as an optional API that people can opt into for specific use cases.
Comment 46 Philip Withnall 2018-05-09 12:32:38 UTC
(In reply to Emmanuele Bassi (:ebassi) from comment #45)
> (In reply to Philip Withnall from comment #43)
> > I’m currently thinking that we could deprecate the GSlice API, remove the
> > GSlice implementation (and get the deprecated API to use g_malloc()
> > instead), and recommend that people use tcmalloc() to replace the system
> > allocator if they care about performance. I’d be open to using tcmalloc by
> > default in GLib, as long as applications can swap that out for their own
> > allocators if they want.
> 
> That's a bit complicated to do; I'd really try and keep GLib allocator
> agnostic, and suggest people use their system's allocator, and replace it
> using LD_PRELOAD to replace malloc()/free() if they want to change the
> allocator wholesale.

I was thinking of potentially just compiling GLib with
   -fno-builtin-malloc -fno-builtin-calloc -fno-builtin-realloc -fno-builtin-free -ltcmalloc
which would essentially be equal to the LD_PRELOAD approach, just done at the GLib level rather than the client program level. I would want client programs to still be able to override the allocator themselves with LD_PRELOAD or -l if they wanted.
Comment 47 Nirbheek Chauhan 2018-05-09 12:39:29 UTC
(In reply to Emmanuele Bassi (:ebassi) from comment #45)
> That's a bit complicated to do; I'd really try and keep GLib allocator
> agnostic, and suggest people use their system's allocator, and replace it
> using LD_PRELOAD to replace malloc()/free() if they want to change the
> allocator wholesale.
> 

Note that LD_PRELOAD does not work on OSes such as Windows and macOS.

It will also not work on Linux/BSD when built with -Wl,-Bsymbolic, which is the default.
Comment 48 Christian Hergert 2018-05-09 20:33:12 UTC
(In reply to Emmanuele Bassi (:ebassi) from comment #45)
> That's a bit complicated to do; I'd really try and keep GLib allocator
> agnostic, and suggest people use their system's allocator, and replace it
> using LD_PRELOAD to replace malloc()/free() if they want to change the
> allocator wholesale.

We can't realistically require calling an API to set a default allocator anyway, given how many projects allocate from static constructors.

> If we want to provide an additional allocator like we did with GMemChunks
> and GSlice, we can use things like tcmalloc or jemalloc — but as an optional
> API that people can opt into for specific use cases.

I have a number of situations where having a GAllocator API would be useful. But we can discuss that separately.
Comment 49 Emmanuele Bassi (:ebassi) 2018-05-10 15:47:07 UTC
(In reply to Christian Hergert from comment #48)
> (In reply to Emmanuele Bassi (:ebassi) from comment #45)
> > That's a bit complicated to do; I'd really try and keep GLib allocator
> > agnostic, and suggest people use their system's allocator, and replace it
> > using LD_PRELOAD to replace malloc()/free() if they want to change the
> > allocator wholesale.
> 
> We can't realistically require calling an API to set a default allocator anyway
> given how many projects allocate from static constructors.

We can have a GAllocator API that gets used by all our data structures that are currently using GSlice or g_malloc() directly, and that allows people to set their own allocators in their own process. Then we'd have various allocators provided by GLib — system allocator, slice allocator, jemalloc, tcmalloc, whatever. If no allocator is set, we default to the system one. The hard part is porting our data structures to that model — we'd have to keep a pointer to the allocator that was used internally, in order to free memory with the right one in case somebody pushed/popped their own in between. This cannot always work for things that can be placed on the stack, like GList and GQueue, but for those we'd still keep a slice allocator by default, as it's faster when it comes to freeing them.
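
The "keep a pointer to the allocator that was used" part could look roughly like the sketch below. Everything here is hypothetical: DemoAllocator, the push/pop functions and the counting allocator are invented for illustration and are not an existing or proposed GLib API. The sketch is also not thread-safe, and it uses a per-block header, which is exactly what cannot work for nodes placed on the stack.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical allocator interface. */
typedef struct DemoAllocator {
    void *(*alloc)(size_t size);
    void  (*release)(void *mem);
} DemoAllocator;

static const DemoAllocator system_allocator = { malloc, free };

/* Small fixed-depth stack of pushed allocators; empty means "system". */
#define DEMO_MAX_DEPTH 8
static const DemoAllocator *allocator_stack[DEMO_MAX_DEPTH];
static size_t allocator_depth = 0;

void demo_allocator_push(const DemoAllocator *a)
{
    assert(allocator_depth < DEMO_MAX_DEPTH);
    allocator_stack[allocator_depth++] = a;
}

void demo_allocator_pop(void)
{
    assert(allocator_depth > 0);
    allocator_depth--;
}

/* Header stored in front of every block so the block can later be
 * released by the allocator that created it. */
typedef union DemoHeader {
    const DemoAllocator *owner;
    max_align_t align;  /* keeps the payload suitably aligned */
} DemoHeader;

void *demo_alloc(size_t size)
{
    const DemoAllocator *a = allocator_depth > 0
        ? allocator_stack[allocator_depth - 1] : &system_allocator;
    DemoHeader *h = a->alloc(sizeof *h + size);
    assert(h != NULL);
    h->owner = a;
    return h + 1;
}

void demo_free(void *mem)
{
    if (mem == NULL)
        return;
    DemoHeader *h = (DemoHeader *) mem - 1;
    h->owner->release(h);  /* not necessarily the current top of stack */
}

/* Example custom allocator that counts live blocks, e.g. for the kind
 * of memory accounting a language binding might want. */
static size_t demo_live_blocks = 0;
static void *counting_alloc(size_t size) { demo_live_blocks++; return malloc(size); }
static void counting_release(void *mem) { demo_live_blocks--; free(mem); }
const DemoAllocator demo_counting_allocator = { counting_alloc, counting_release };
size_t demo_counting_live(void) { return demo_live_blocks; }
```

A block allocated while the counting allocator is pushed is still released through it after the pop, which is the bookkeeping that porting the data structures to such a model would have to get right.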

For instance, I think language bindings would want to "push" an allocator using their own runtimes, in order to account for memory allocated by GLib and GTK.

Indeed, though, this would be a very separate issue.
Comment 50 GNOME Infrastructure Team 2018-05-24 18:13:11 UTC
-- GitLab Migration Automatic Message --

This bug has been migrated to GNOME's GitLab instance and has been closed from further activity.

You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.gnome.org/GNOME/glib/issues/1079.