Bug 118439 - Add GFreeList, deprecate GMemChunk
Status: RESOLVED FIXED
Product: glib
Classification: Platform
Component: general
Version: 2.2.x
OS: Other Linux
Priority: Normal
Severity: normal
Target Milestone: ---
Assigned To: gtkdev
QA Contact: gtkdev
Duplicates: 84536 116805 169389 (view as bug list)
Depends on:
Blocks:
 
 
Reported: 2003-07-27 20:48 UTC by Soren Sandmann Pedersen
Modified: 2011-02-18 15:49 UTC
See Also:
GNOME target: ---
GNOME version: ---


Attachments
patch (36.40 KB, patch)
2003-07-27 20:50 UTC, Soren Sandmann Pedersen
needs-work Details | Review
GFreeList header file (3.15 KB, text/plain)
2003-07-27 20:50 UTC, Soren Sandmann Pedersen
  Details
GFreeList source file (2.95 KB, text/plain)
2003-07-27 20:51 UTC, Soren Sandmann Pedersen
  Details
the patch (47.12 KB, patch)
2005-09-13 05:00 UTC, Matthias Clasen
none Details | Review
thread private free lists (5.50 KB, text/plain)
2005-09-13 16:15 UTC, Matthias Clasen
  Details
another approach (8.69 KB, text/plain)
2005-09-14 17:05 UTC, Matthias Clasen
  Details
header (224 bytes, text/plain)
2005-09-14 17:13 UTC, Matthias Clasen
  Details
memchunk.c (5.57 KB, text/plain)
2005-09-18 06:19 UTC, Matthias Clasen
  Details
memchunk.h (209 bytes, text/plain)
2005-09-18 06:19 UTC, Matthias Clasen
  Details
gmem.[hc] patch (11.26 KB, patch)
2005-09-26 04:26 UTC, Matthias Clasen
none Details | Review
use the new allocator for GList, GSList, GHashTable (13.60 KB, patch)
2005-09-26 04:27 UTC, Matthias Clasen
none Details | Review
necessary gthread changes (1.89 KB, patch)
2005-09-26 05:09 UTC, Matthias Clasen
none Details | Review
listtest1.c (3.19 KB, text/x-csrc)
2005-09-26 13:21 UTC, Matthias Clasen
  Details
Cleaned up patch that was sent to timj for review (34.84 KB, patch)
2005-09-27 17:42 UTC, Matthias Clasen
none Details | Review

Description Soren Sandmann Pedersen 2003-07-27 20:48:39 UTC
Here is a patch that adds new GFreeList API and deprecates the GMemChunks.

The new GFreeList is a data structure that allocates objects from chunks of
memory and puts them on a free-list when they are freed. It doesn't attempt
to free the chunks so it can be much simpler than GMemChunk.

The alloc and free functions are inlined and expand to a dozen instructions
or so.

     - Freeing a list becomes slower because the list will have to be
       walked. This is no worse asymptotically, though, because the n
       elements must have been allocated at some point, which is already O(n).
       
     - The only remaining use of GMemChunk is in ghook.[ch]. The reason
       is that ghook.h exports GMemChunk as public API.
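
For illustration, a minimal sketch of the idea (not the attached patch; the
names, the block handling and the lack of thread safety are all
simplifications) might look like this:

  /* Carve fixed-size atoms out of larger blocks and recycle freed
   * atoms through a singly linked free list.  Blocks are never
   * returned to the system, which keeps the code trivial. */
  #include <glib.h>

  typedef struct _Atom Atom;
  struct _Atom { Atom *next; };

  typedef struct {
    gsize  atom_size;    /* must be >= sizeof (Atom) */
    guint  n_preallocs;  /* atoms carved per block */
    Atom  *free_atoms;   /* head of the free list */
  } FreeList;

  static gpointer
  free_list_alloc (FreeList *fl)
  {
    Atom *atom;

    if (fl->free_atoms == NULL)
      {
        /* Out of atoms: carve a new block and push every atom. */
        guint8 *block = g_malloc (fl->atom_size * fl->n_preallocs);
        guint i;

        for (i = 0; i < fl->n_preallocs; i++)
          {
            atom = (Atom *) (block + i * fl->atom_size);
            atom->next = fl->free_atoms;
            fl->free_atoms = atom;
          }
      }

    atom = fl->free_atoms;
    fl->free_atoms = atom->next;
    return atom;
  }

  static void
  free_list_free (FreeList *fl, gpointer mem)
  {
    Atom *atom = mem;

    atom->next = fl->free_atoms;
    fl->free_atoms = atom;
  }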
Comment 1 Soren Sandmann Pedersen 2003-07-27 20:50:06 UTC
Created attachment 18652 [details] [review]
patch
Comment 2 Soren Sandmann Pedersen 2003-07-27 20:50:55 UTC
Created attachment 18653 [details]
GFreeList header file
Comment 3 Soren Sandmann Pedersen 2003-07-27 20:51:33 UTC
Created attachment 18654 [details]
GFreeList source file
Comment 4 Havoc Pennington 2003-07-28 00:59:14 UTC
It feels a little not-opaque-enough to me but maybe it's worth it 
for the performance win.

A possibly-nice thing DBusMemPool has is a flag to zero the 
blocks on alloc (like g_malloc0/calloc)
Comment 5 Matthias Clasen 2003-07-28 07:28:38 UTC
*** Bug 84536 has been marked as a duplicate of this bug. ***
Comment 6 Matthias Clasen 2003-07-28 07:31:05 UTC
There is a typo in the GFreeList C source:

DISBLE_MEM_POOLS should be DISABLE_MEM_POOLS
Comment 7 Soren Sandmann Pedersen 2003-07-28 12:58:45 UTC
- The g_free_list_free_internal() function is not used, so it can be
deleted

- If we want the zero-on-alloc feature, I'd do it as a new function
g_free_list_alloc0 ()

- It may be a good idea to not let the user specify the n_preallocs
and have the free list decide the block size dynamically (say by
doubling every time it runs out of atoms).

This might also make it possible to avoid malloc() overhead completely
by using anonymous mmap()s of complete pages.

However, the n_preallocs argument to g_allocator_new() would have to 
be ignored, or passed to the free list by some other (internal) means.

- I don't really see how to make the structure opaque without breaking
the G_FREE_LIST_DEFINE macros. You can't use malloc to initialize a
global variable in C.

Replacing the "blocks" pointer with a "priv" pointer might be a good
idea though.
Comment 8 Owen Taylor 2003-07-28 13:36:40 UTC
Some random thoughts:

 - It may be hard to replace the GList and GSList free
   lists with a generic freelist, at least without some
   extra API; the change of the GList freelist from 
   a list to a list of lists was a pretty big efficiency
   win, in my memory.

   (the point being that g_list_free() is currently O(1))

 - What Tim wanted to do in this area was to replace
   memchunks with fixed size allocators, so that every
   usage of blocks of size 8 shared the same free
   list.

 - I think it would be a mistake to add new API here
   without thinking about threading and locking; a 
   big issue with GLib currently in heavily threaded
   apps is lock contention on the free lists.
   (Bug 116805; Havoc also noted this in the similarly
   structured DBus features)

   The GStreamer folks apparently have an atomic-operation
   trash stack to avoid lock overhead (this was mentioned
   to me at GUADEC, I've added Wim Taymans to the CC, 
   I think he was the person working on this)

   That may well be good enough; there are potentially
   still problems with:

    - Overhead of atomic primitive (lock the bus)
    - Locality of memory usage

   That might make some sort of scheme where free list
   elements normally stay local to a processor/thread 
   better, but the complexity and extra instructions
   there might be considerable.

 - Does GTrashStack also get deprecated? How does this
   relate to GTrashStack?

Anyways, just thought I'd add a few comments, and make
things a bit murkier :-)
Comment 9 Havoc Pennington 2003-07-28 13:47:01 UTC
I found that DBusMemPool (which is very similar to this API)
worked well when I had one mem pool per hash table _instance_, 
but seems to suck a bit for lists when used globally for all 
list nodes due to the thread issue.

Comment 10 Soren Sandmann Pedersen 2003-07-28 14:30:13 UTC
- You could imagine keeping track of the tail of the freelist:

     G_INLINE_FUNC void
     g_free_list_free (GFreeList *list, gpointer mem)
     {
       if (G_LIKELY (list->tail))
         {
           ((GFreeListAtom *) mem)->next = list->atoms;
           list->atoms = mem;
         }
       else
         g_free_list_free_internal (list, mem);
     }

  then having a g_free_list_free_many (GFreeList *list, 
                                       gpointer   mem,
                                       gpointer  *next);

  that would reuse the next-pointer if it happened to match the
  one in GFreeListAtom. 

- A big part of the problem in bug 116805 is probably that the
  lock is held for a very long time as the memchunk updates all
  its internal trees and stuff. The GString memchunk currently
  does not have a trash stack list in front of it.

- Having a global 8-bytes free list would impose locking even in
  cases where it wouldn't otherwise be necessary (like inside gtk+
  in a multithreaded application).
Comment 11 Tim Janik 2004-02-20 02:33:38 UTC
the glib memory caching code needs a major overhaul. it should:
- make use of libc malloc/free performance advantages where it makes
sense (e.g. the glibc allocators are *fast*, there's not much point in
caching here)
- avoid separate trash lists for structures of different sizes (not
just 8 bytes)
- do singly-linked-list frees in O(1) for all structures that use
*next as second pointer (GSList, GList, GNode, ...)

basically, that means getting rid of memchunks internally and
providing one global lock-free allocator with trash-list support.
Comment 12 Matthias Clasen 2004-12-02 15:38:28 UTC
When we get around to doing something about this, we should also make sure
that we get rid of the ABI change due to --disable-mem-pools...
Comment 13 Andy Wingo 2005-03-31 13:57:39 UTC
Hey, some notes from the GStreamer camp.

== What we have ==

We do have a GstMemChunk[0] that is mt-safe. Its init allocates a block of
memory and pushes the chunks onto a trash stack. After the init, all operations
are on the trash stack.

[0] http://cvs.freedesktop.org/gstreamer/gstreamer/gst/gstmemchunk.h?view=markup
    http://cvs.freedesktop.org/gstreamer/gstreamer/gst/gstmemchunk.c?view=markup

GstMemChunk doesn't have any locks itself; it's our trash stack implementation,
GstTrashStack[1], that's threadsafe. The trash stack can't be implemented with
the stock atomic ops in gatomic.c because of the so-called "ABA problem" (google
for it). The x86 implementation is lock-free because it supports double word
compare and swap. The fallback C implementation uses a mutex around the stack,
but is still fast on linux 2.6. Optimized implementations for other
architectures are possible.

[1] http://cvs.freedesktop.org/gstreamer/gstreamer/gst/gsttrashstack.h?view=markup
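
To make the fallback concrete, here is a rough sketch of what a
mutex-protected trash stack looks like (this is not the GstTrashStack code;
it uses the current GMutex API for brevity, and the lock-free x86 version
replaces the mutex with a double word compare-and-swap):

  #include <glib.h>

  typedef struct {
    GMutex   lock;
    gpointer head;   /* freed chunks, linked through their first word */
  } TrashStack;

  static void
  trash_stack_push (TrashStack *stack, gpointer chunk)
  {
    g_mutex_lock (&stack->lock);
    *(gpointer *) chunk = stack->head;   /* clobber the first word */
    stack->head = chunk;
    g_mutex_unlock (&stack->lock);
  }

  static gpointer
  trash_stack_pop (TrashStack *stack)
  {
    gpointer chunk;

    g_mutex_lock (&stack->lock);
    chunk = stack->head;
    if (chunk != NULL)
      stack->head = *(gpointer *) chunk;
    g_mutex_unlock (&stack->lock);
    return chunk;
  }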

== An aside regarding GMemChunk ==

The default defines for glib don't have G_DISABLE_CHECKS, which is good. What's
not good is that without G_DISABLE_CHECKS defined, gmemchunk instruments each
call to alloc and free, updating TLS call counters. The result is that it is
slower than malloc on my box, even without being threadsafe. That profiling code
has no place in a production allocator. GMemChunk is significantly faster than
malloc if the profiling code is removed. I removed it in the tests below.

== Performance of the x86 version ==

I wrote a small test[2] to run N threads doing M allocations, freeing
immediately after each allocation. The wall clock time is then printed out for
each test. Here are some typical results on a P4 2.8GHz single-processor machine.

[2]
http://cvs.freedesktop.org/gstreamer/gstreamer/tests/memchunk/gmemchunktest.c?view=markup

  1000000 alloc+frees X 1 threads
  0.396343s - GMemChunk
  0.184425s - GstMemChunk
  0.343142s - g_malloc/g_free

This is without GMemChunk's TLS profiling, but with a lock around the memchunk.
If I remove the lock (#undef GMEMCHUNK_THREADSAFE in the test), I get something
like this:

  1000000 alloc+frees X 1 threads
  0.143318s - GMemChunk
  0.183703s - GstMemChunk
  0.343673s - g_malloc/g_free

If I increase the number of threads, I have to put the lock back, of course. On
my single-processor machine the relative times are the same, and the same per
thread. On a dual CPU Xeon with hyperthreading, the results are like this (I'm
doing 10 times as much):

  1000000 alloc+frees X 1 threads
  0.398000s (0.398000s/thread) - GMemChunk
  0.149681s (0.149681s/thread) - GstMemChunk
  0.299621s (0.299621s/thread) - g_malloc/g_free

  1000000 alloc+frees X 10 threads
  3.786688s (0.378669s/thread) - GMemChunk
  2.231514s (0.223151s/thread) - GstMemChunk
  3.084715s (0.308471s/thread) - g_malloc/g_free
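
For reference, the shape of the test is roughly the following (a hypothetical
skeleton only -- the real code is gmemchunktest.c linked above; this sketch
uses the current GLib thread and timer API and only times plain
g_malloc/g_free):

  #include <glib.h>

  #define CHUNK_SIZE 32          /* illustrative chunk size */

  static gpointer
  worker (gpointer data)
  {
    guint m = GPOINTER_TO_UINT (data);
    guint i;

    for (i = 0; i < m; i++)
      g_free (g_malloc (CHUNK_SIZE));   /* alloc, then free immediately */
    return NULL;
  }

  static void
  run_test (guint n_threads, guint m_allocs)
  {
    GThread **threads = g_new (GThread *, n_threads);
    GTimer   *timer   = g_timer_new ();
    gdouble   elapsed;
    guint     i;

    for (i = 0; i < n_threads; i++)
      threads[i] = g_thread_new ("alloc", worker, GUINT_TO_POINTER (m_allocs));
    for (i = 0; i < n_threads; i++)
      g_thread_join (threads[i]);

    elapsed = g_timer_elapsed (timer, NULL);
    g_print ("%u alloc+frees X %u threads\n%fs (%fs/thread) - g_malloc/g_free\n",
             m_allocs, n_threads, elapsed, elapsed / n_threads);

    g_timer_destroy (timer);
    g_free (threads);
  }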

== Performance of the fallback implementation ==

AMD64 running in 64-bit mode on linux 2.6:

  100000 alloc+frees X 1 threads
  0.012986s - GMemChunk
  0.008702s - GstMemChunk
  0.010135s - g_malloc/g_free

  100000 alloc+frees X 10 threads
  0.126268s - GMemChunk
  0.105127s - GstMemChunk
  0.099638s - g_malloc/g_free

== Compatibility ==

GstMemChunk offers the same API as GMemChunk, so it can be a drop-in replacement.
Since GMemChunk is an opaque structure, this would work out nicely.

GstTrashStack is larger than GTrashStack, for the counter and mutex needed for
threadsafety. Also, it doesn't implement the _peek operation. So it should
probably be a new API, and GTrashStack should be deprecated. Perhaps
GstTrashStack can be called GFreeList.

== Issues ==

GstMemChunk will never free any memory until the user calls
gst_mem_chunk_destroy(). Neither will GFreeList (Soeren's structure) though.

GstMemChunk doesn't do O(1) list frees by itself. That's kind of nasty. But
here's how you might do it:

The trash stack currently clobbers the first word in the structure, to use as
its own next pointer. But the list next pointer is the second word in GList,
GSList, and GNode. So you could have GstMemChunk check the second word after
popping an item, and if it's not NULL push it back on the trash stack.
g_mem_chunk_free would by default set the second word to NULL, so it's not
followed, while we could add g_mem_chunk_free_list to not NULL out the second word.

So for a normal case, you still have one pop operation. For the case of the
list, you'd have a pop and a push, but a push is cheaper than a pop because it
doesn't have to avoid the ABA problem.

This would imply a minimum chunk size of two words. Not terrible.
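
In code, the idea might look roughly like this (a sketch only; it assumes a
push/pop trash stack like the one above and an illustrative two-word chunk
layout):

  typedef struct {
    gpointer stack_next;   /* first word: clobbered by the trash stack */
    gpointer list_next;    /* second word: GList/GSList/GNode next pointer */
  } Chunk;

  /* O(1) list free: push the whole chain as a single stack item.
   * The nodes stay linked to each other through their second word. */
  static void
  chunk_free_list (TrashStack *stack, gpointer list_head)
  {
    trash_stack_push (stack, list_head);
  }

  /* Plain free: NULL the second word so pop() won't follow it. */
  static void
  chunk_free (TrashStack *stack, gpointer mem)
  {
    ((Chunk *) mem)->list_next = NULL;
    trash_stack_push (stack, mem);
  }

  static gpointer
  chunk_alloc (TrashStack *stack)
  {
    Chunk *chunk = trash_stack_pop (stack);

    if (chunk != NULL && chunk->list_next != NULL)
      {
        /* We popped the head of a freed list: keep the head for the
         * caller and push the rest of the chain back (a push, which
         * is cheaper than a pop). */
        trash_stack_push (stack, chunk->list_next);
        chunk->list_next = NULL;
      }
    /* chunk == NULL means the stack is empty; a real allocator would
     * carve a fresh block here (omitted). */
    return chunk;
  }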

To make all mem chunk allocators of the same size use the same mem chunk... well
you'd have to make the mem chunk atomically refcounted, and have g_mem_chunk_new
have a private hash table with a lock, etc. That's a separate issue though; I'd
prefer not to address it in this bug.

== Summary ==

GStreamer has avoided GMemChunk because it's slower than malloc and isn't
threadsafe. However because GMemChunk is used by data structures that we use
(notably arrays), all protected by mutexen in GLib, it really needs to be dealt
with. GstMemChunk is API-compatible with GMemChunk, is threadsafe, and only
slightly slower in the single-threaded case. It's implemented as a threadsafe
GFreeList.

So, as a way forward, I think we should:

  0) Figure out the O(1) free business
  1) Get GstTrashStack in as GFreeList (or some other name)
  2) Get GMemChunk replaced by GstMemChunk
  3) Patch all users of GMemChunk to not have locks
     - Drop GTrashStack front-end to GMemChunk in gqueue.c
  4) Do to glist/gnode/gslist what we decide in 0)

How does this sound as a plan? And then, more specifically, should we implement
the complexity of the O(1) free?
Comment 14 Andy Wingo 2005-03-31 13:58:37 UTC
Add to CC, fix wim's email
Comment 15 Matthias Clasen 2005-03-31 14:46:29 UTC
Isn't your test slightly unfair, since it uses G_ALLOC_AND_FREE for the
GMemChunk, while GstMemChunk does the equivalent of G_ALLOC_ONLY ?

Anyway, I think the general plan looks pretty good, if we can implement the O(1)
freeing as you outlined above.
Comment 16 Andy Wingo 2005-03-31 15:07:19 UTC
Ah, you're right. The version pointed to in the link now does ALLOC_ONLY. Here's
a sample run on my uniprocessor x86:

1000000 alloc+frees X 1 threads
0.345726s (0.345726s/thread) - GMemChunk
0.141357s (0.141357s/thread) - GstMemChunk
0.266718s (0.266718s/thread) - g_malloc/g_free

1000000 alloc+frees X 10 threads
3.367218s (0.336722s/thread) - GMemChunk
1.413494s (0.141349s/thread) - GstMemChunk
3.105659s (0.310566s/thread) - g_malloc/g_free

That's with GMemChunk having a lock around it, but no profiling.

I'll get to implementing the O(1) freeing soon.
Comment 17 Wim Taymans 2005-03-31 17:26:42 UTC
I propose an atomic GTrashStack that just clobbers the first pointer of
structures pushed onto it. It would be used to implement GMemChunk with its
current API. 

For the O(1) freeing of GList, GSList, GNode,... a new data structure GFreeList
would be used. GFreeList also uses GTrashStack internally but the pop() function
would also check the second pointer in the structure, follow it and push() it
back onto the GTrashStack. 
Comment 18 Ben Maurer 2005-03-31 18:38:18 UTC
Somebody should look at numbers on an smp box. Also, can somebody compare to a
malloc with threading stuff built into the design. The google malloc comes to mind.
Comment 19 Andy Wingo 2005-04-01 10:31:32 UTC
Wow. Google's malloc is really really fast. The gmemchunktest.c linked to above
will now load libtcmalloc if it's available, and it is fast:

Uniprocessor P4 2.8GHz:

  100000 alloc+frees X 1 threads
  0.055437s (0.055437s/thread) - GMemChunk
  0.023199s (0.023199s/thread) - GstMemChunk
  0.040129s (0.040129s/thread) - g_malloc/g_free
  0.010877s (0.010877s/thread) - google malloc/free

  100000 alloc+frees X 10 threads
  0.372564s (0.037256s/thread) - GMemChunk
  0.139306s (0.013931s/thread) - GstMemChunk
  0.273138s (0.027314s/thread) - g_malloc/g_free
  0.079811s (0.007981s/thread) - google malloc/free

Dual hyperthreaded Xeon (pseudo 4-way) 2.4GHz:

  1000000 alloc+frees X 1 threads
  0.381471s (0.381471s/thread) - GMemChunk
  0.150774s (0.150774s/thread) - GstMemChunk
  0.297379s (0.297379s/thread) - g_malloc/g_free
  0.086142s (0.086142s/thread) - google malloc/free

  1000000 alloc+frees X 10 threads
  3.802258s (0.380226s/thread) - GMemChunk
  1.511163s (0.151116s/thread) - GstMemChunk
  3.081182s (0.308118s/thread) - g_malloc/g_free
  0.939856s (0.093986s/thread) - google malloc/free

Um.... Faced with these numbers, why do we have memchunks at all? You still have
the theoretical ability to mlockall a chunk, but damn. Google malloc needs to be
pushed either to glibc or glib.
Comment 20 Edward Hervey 2005-04-01 11:09:48 UTC
Ok, I needed to try it too on amd64 :)

First I encountered problems when compiling the google-performance library. I
had to get rid of all the profiling stuff, but that shouldn't change anything
(since that code isn't actually used by libtcmalloc).

CPU : AMD Athlon(tm) 64 Processor 2800+ (1.6GHz)

100000 alloc+frees X 1 threads
0.015604s (0.015604s/thread) - GMemChunk
0.008723s (0.008723s/thread) - GstMemChunk
0.009573s (0.009573s/thread) - g_malloc/g_free
0.008909s (0.008909s/thread) - google malloc/free

100000 alloc+frees X 10 threads
0.162649s (0.016265s/thread) - GMemChunk
0.099027s (0.009903s/thread) - GstMemChunk
0.099321s (0.009932s/thread) - g_malloc/g_free
0.091263s (0.009126s/thread) - google malloc/free

Same run but with 10 times more allocations (1 Million)

1000000 alloc+frees X 1 threads
0.161057s (0.161057s/thread) - GMemChunk
0.086330s (0.086330s/thread) - GstMemChunk
0.098534s (0.098534s/thread) - g_malloc/g_free
0.088163s (0.088163s/thread) - google malloc/free

1000000 alloc+frees X 10 threads
1.641056s (0.164106s/thread) - GMemChunk
0.938081s (0.093808s/thread) - GstMemChunk
0.986813s (0.098681s/thread) - g_malloc/g_free
0.893960s (0.089396s/thread) - google malloc/free


So google malloc really isn't THAT much faster on amd64... funny though that
amd64s kick the crap out of Xeons :)
Comment 21 Ben Maurer 2005-04-01 17:51:18 UTC
I am not sure how much Google's malloc impl is tuned for the desktop. For
example, they don't do tricks like mmaping mallocs > 128 kb so that memory
doesn't get fragmented. Also, for sizes that are not multiples of 8 bytes, they
will waste 4 bytes / object from what I can tell (this would affect hash nodes
if I remember correctly how fields are done).

For AMD: does anybody have a dual amd?

Also, what is the test case used here? We may be missing one extremely important
stat: how does the test handle the cache. Take a look at the blog from ricom (a
.NET performance guru from msft):

http://blogs.msdn.com/ricom/archive/2005/01/05/347361.aspx
http://blogs.msdn.com/ricom/archive/2005/01/07/348691.aspx

Note how badly the freelist does in this case. I don't know if google's malloc
is smarter in how it handles things here (it probably can't be anywhere near as
smart as a GC, but it can be close). Also, the fact that it uses thread local
storage (rather than allocating from a global freelist) may be acting to its
advantage.
Comment 22 Ben Maurer 2005-04-01 17:54:40 UTC
Also, on smp boxes, we should try tests that use the same number of threads as
cpus (try both logical and physical here: so for the dual xeon, try 2 and 4).

I also wonder why the google malloc did so horribly on amd64. One idea is
because they are using a different data structure to do their lookup thingies.
Another is maybe the CFLAGS are not as tuned as they are for glibc.
Comment 23 Manish Singh 2005-04-01 21:50:16 UTC
The current google malloc implementation has some caveats that make it a pretty
poor choice for replacement of the standard libc malloc. The big 2 being that it
doesn't actually return memory to the system, and it gobbles up 6 MB of memory
on startup. It's also in C++, which makes the chances of inclusion into glib or
glibc pretty slim.

These all can be addressed, but as it stands, it's not really a general drop in
replacement.
Comment 24 Ben Maurer 2005-04-01 21:54:41 UTC
The 6 MB of memory is easy for them to address (I've talked to them about this
already). They just need to get zeroed memory from the system so that it comes
from the shared zero page.

As for releasing memory, that is probably a bit of a bigger problem.

We probably want to steal the best things from google malloc that we can rather
than trying to use it as a drop-in replacement. With the freelist, we already
know much more than a malloc does: we know the size of the object we get back.
This means we can avoid costly lookups.
Comment 25 Matthias Clasen 2005-07-26 02:15:53 UTC
Wim, did you ever get around to implementing O(1) freeing as outlined in
comment #17?
Comment 26 Matthias Clasen 2005-09-13 04:51:42 UTC
Hmm, I converted the GstMemChunk code into a GLib patch and
added O(1) list freeing, but I can't reproduce Wim's numbers.

On this dual-Xeon box, malloc clearly outperforms the patched GMemChunk
code as soon as the number of threads is > 1:

[mclasen@golem mem]$ ./gmemchunktest 1 100000
100000 alloc+frees X 1 threads
0.037062s (0.037062s/thread) - g_malloc/g_free
0.014173s (0.014173s/thread) - GMemChunk
[mclasen@golem mem]$ ./gmemchunktest 2 100000
100000 alloc+frees X 2 threads
0.045879s (0.022940s/thread) - g_malloc/g_free
0.073822s (0.036911s/thread) - GMemChunk
[mclasen@golem mem]$ ./gmemchunktest 5 100000
100000 alloc+frees X 5 threads
0.093965s (0.018793s/thread) - g_malloc/g_free
0.207979s (0.041596s/thread) - GMemChunk
[mclasen@golem mem]$ ./gmemchunktest 10 100000
100000 alloc+frees X 10 threads
0.180635s (0.018064s/thread) - g_malloc/g_free
0.413845s (0.041385s/thread) - GMemChunk
Comment 27 Matthias Clasen 2005-09-13 04:59:38 UTC
Hmm, I converted the GstMemChunk code into a GLib patch and
added O(1) list freeing, but I can't reproduce Wim's numbers.

On this dual-Xeon box, malloc clearly outperforms the patched GMemChunk
code as soon as the number of threads is > 1:

[mclasen@golem mem]$ ./gmemchunktest 1 100000
100000 alloc+frees X 1 threads
0.037504s (0.037504s/thread) - g_malloc/g_free
0.014732s (0.014732s/thread) - GMemChunk
[mclasen@golem mem]$ ./gmemchunktest 2 100000
100000 alloc+frees X 2 threads
0.040671s (0.020335s/thread) - g_malloc/g_free
0.074221s (0.037110s/thread) - GMemChunk
[mclasen@golem mem]$ ./gmemchunktest 5 100000
100000 alloc+frees X 5 threads
0.107214s (0.021443s/thread) - g_malloc/g_free
0.223165s (0.044633s/thread) - GMemChunk
[mclasen@golem mem]$ ./gmemchunktest 10 100000
100000 alloc+frees X 10 threads
0.199082s (0.019908s/thread) - g_malloc/g_free
0.460681s (0.046068s/thread) - GMemChunk

If I compile GLib with G_DISABLE_MEM_POOLS to reduce
the GMemPool to malloc() + atomic freelist, I
get basically the same numbers, so the overhead
must be in the atomic free list.

[mclasen@golem mem]$ ./gmemchunktest 1 100000
100000 alloc+frees X 1 threads
0.037062s (0.037062s/thread) - g_malloc/g_free
0.014173s (0.014173s/thread) - GMemChunk
[mclasen@golem mem]$ ./gmemchunktest 2 100000
100000 alloc+frees X 2 threads
0.045879s (0.022940s/thread) - g_malloc/g_free
0.073822s (0.036911s/thread) - GMemChunk
[mclasen@golem mem]$ ./gmemchunktest 5 100000
100000 alloc+frees X 5 threads
0.093965s (0.018793s/thread) - g_malloc/g_free
0.207979s (0.041596s/thread) - GMemChunk
[mclasen@golem mem]$ ./gmemchunktest 10 100000
100000 alloc+frees X 10 threads
0.180635s (0.018064s/thread) - g_malloc/g_free
0.413845s (0.041385s/thread) - GMemChunk
Comment 28 Matthias Clasen 2005-09-13 05:00:41 UTC
Created attachment 52154 [details] [review]
the patch
Comment 29 Benoît Dejean 2005-09-13 09:47:20 UTC
Sorry to add noise, but are you seriously comparing execution times that are
less than a second?
Comment 30 Andy Wingo 2005-09-13 11:27:17 UTC
Weird, I'm seeing similar results on the dual-xeon box running fedora here. The
only change has been an upgrade from fc2 to fc4.

My desktop is now an x86-64, so I can't test the optimized code. I tried it out
on a coworker's single-processor machine running ubuntu breezy, and gstmemchunk
still kicks the crap out of malloc. Still investigating...
Comment 31 Andy Wingo 2005-09-13 11:30:19 UTC
Both breezy and FC4 have glibc 2.3.5, so it can't be a difference in upstream glibc.
Comment 32 Andy Wingo 2005-09-13 13:26:39 UTC
The breezy box I was testing on didn't have libc6-686 installed. Installing it
and munging the necessary ld.so.conf files didn't make a difference in
timing. Perhaps Ubuntu uses different configure options from Fedora.

There is one more optimization that could be made, which is having per-thread
trash stacks (freelists). Wim indicated his desire to flesh this out, but he's
working remotely until Monday -- will see if I can get him to comment.

One more data point -- on my x86-64 single-xeon breezy system, hyperthreaded,
g_malloc is about 4x faster than the fallback gstmemchunk implementation for the
10-thread case, and still marginally faster in the single-thread case.

Open item: is this really worth it? Why not just punt all this to malloc? The
only real reason I can see would be that you can mlockall(2) a memchunk, which
is good for realtime allocation as well, but glib isn't really a realtime library.
Comment 33 Matthias Clasen 2005-09-13 13:35:42 UTC
> There is one more optimization that could be made, which is having per-thread
> trash stacks (freelists). Wim indicated his desire to flesh this out, but he's
> working remotely until Monday -- will see if I can get him to comment.

Yes, that's the one thing that we could and maybe should still try out.

> Open item: is this really worth it? Why not just punt all this to malloc? The
> only real reason I can see would be that you can mlockall(2) a memchunk, which
> is good for realtime allocation as well, but glib isn't really a realtime 
> library.

I was slowly coming to this conclusion as well. The one argument in favor of
memchunks is that they allow us to avoid some memory overhead (glibc malloc
requires 4 bytes of extra room for bookkeeping, and won't allocate blocks
smaller than 16 bytes anyway) - we should probably measure how much we actually
waste due to that.
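
A quick way to eyeball the rounding part of that waste, assuming glibc, is
malloc_usable_size(); the per-chunk bookkeeping header comes on top of what
this reports (diagnostic sketch only):

  #include <malloc.h>   /* malloc_usable_size() - glibc specific */
  #include <stdio.h>
  #include <stdlib.h>

  int
  main (void)
  {
    size_t sizes[] = { 8, 12, 16, 24, 32 };  /* typical small structs */
    size_t i;

    for (i = 0; i < sizeof (sizes) / sizeof (sizes[0]); i++)
      {
        void *p = malloc (sizes[i]);
        printf ("requested %2zu bytes, usable %2zu bytes\n",
                sizes[i], malloc_usable_size (p));
        free (p);
      }
    return 0;
  }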
Comment 34 Matthias Clasen 2005-09-13 15:24:41 UTC
*** Bug 116805 has been marked as a duplicate of this bug. ***
Comment 35 Matthias Clasen 2005-09-13 16:14:58 UTC
Here is a very quick hack using a thread private free list. It seems to scale
better here:

[mclasen@golem mem]$ ./gmemchunktest 1 100000
100000 alloc+frees X 1 threads
0.038504s (0.038504s/thread) - g_malloc/g_free
0.029129s (0.029129s/thread) - GMemChunk
0.013549s (0.013549s/thread) - XMemChunk
[mclasen@golem mem]$ ./gmemchunktest 10 100000
100000 alloc+frees X 10 threads
0.210929s (0.021093s/thread) - g_malloc/g_free
0.354550s (0.035455s/thread) - GMemChunk
0.083927s (0.008393s/thread) - XMemChunk
[mclasen@golem mem]$ ./gmemchunktest 100 100000
100000 alloc+frees X 100 threads
1.847946s (0.018479s/thread) - g_malloc/g_free
3.934647s (0.039346s/thread) - GMemChunk
0.693209s (0.006932s/thread) - XMemChunk
Comment 36 Matthias Clasen 2005-09-13 16:15:42 UTC
Created attachment 52179 [details]
thread private free lists
Comment 37 Matthias Clasen 2005-09-14 04:24:38 UTC
Of course, this simple hack doesn't handle some interesting questions:

1) what to do with the free list when a thread terminates
2) how to handle transfer of memory between threads
Comment 38 Matthias Clasen 2005-09-14 17:05:13 UTC
Hmm, I spent way too much time thinking about these issues, and
came up with yet another design.
Comment 39 Matthias Clasen 2005-09-14 17:05:48 UTC
Created attachment 52238 [details]
another approach
Comment 40 Matthias Clasen 2005-09-14 17:13:10 UTC
Created attachment 52239 [details]
header
Comment 41 Andy Wingo 2005-09-16 16:16:15 UTC
After changing POOL_HEADER_SIZE to sizeof(PoolHeader) for my 64-bit x86
hyperthreaded pentium, here are the results of that benchmark:

1000000 alloc+frees X 10 threads
1.627959s (0.162796s/thread) - MemChunk
18.364705s (1.836471s/thread) - GMemChunk
8.814012s (0.881401s/thread) - GstMemChunk
1.356361s (0.135636s/thread) - g_malloc/g_free
0.663839s (0.066384s/thread) - google malloc/free

MemChunk is your slab-allocator thing.

I then changed the benchmark to touch every byte in the allocated pieces, just
to see. I got segv's doing that with your slab allocator.

I suspect malloc is as good as we can maintainably hope to get.
Comment 42 Soren Sandmann Pedersen 2005-09-17 02:00:20 UTC
> I suspect malloc is as good as we can maintainably hope to get.

I agree. A good malloc/free has the advantage that it can often hand out memory
that it knows is in cache. All the freelists can only do that if the application
repeatedly frees and allocates blocks of the same size. This is an effect that
will show up when the application actually uses the memory, so simple
allocate/free benchmarks are biased in favor of freelists. 

It's hard to come up with credible benchmarks, other than measuring how fast
real-world applications with real-world memory allocation patterns actually run.
Comment 43 Matthias Clasen 2005-09-18 03:42:39 UTC
Yes, maybe we just need to wait for a lockfree malloc implementation, as described
in http://www.research.ibm.com/people/m/michael/pldi-2004.pdf
Comment 44 Matthias Clasen 2005-09-18 06:17:32 UTC
I have made one final attempt to come up with something simple but scalable.
It is inspired by the "magazine layer" of the classical slab allocator. 
Lists of free chunks are kept below a fixed maximum size. There is one global list
of free lists, and each thread keeps two lists to itself. Lists are only
transferred between the global list and the thread-local storage when necessary. 
In order to handle thread termination, the mem chunk also keeps an extra list
of "rest chunks". The minimum block size is 2 pointers, since we store two next 
pointers in the chunks in the global free list. Memory is never freed, since 
we don't keep track of the memory area a chunk belongs to. 

This code seems to be consistently faster than malloc/free in the trivial
alloc/free test scenario:

[mclasen@golem mem]$ ./gmemchunktest 1 1000000
1000000 alloc+frees X 1 threads
0.340801s (0.340801s/thread) - g_malloc/g_free
0.298237s (0.298237s/thread) - GMemChunk
0.077505s (0.077505s/thread) - MemChunk
[mclasen@golem mem]$ ./gmemchunktest 2 1000000
1000000 alloc+frees X 2 threads
0.346512s (0.173256s/thread) - g_malloc/g_free
0.546255s (0.273128s/thread) - GMemChunk
0.235085s (0.117542s/thread) - MemChunk
[mclasen@golem mem]$ ./gmemchunktest 10 1000000
1000000 alloc+frees X 10 threads
1.690034s (0.169003s/thread) - g_malloc/g_free
3.363173s (0.336317s/thread) - GMemChunk
0.553422s (0.055342s/thread) - MemChunk
[mclasen@golem mem]$ ./gmemchunktest 20 1000000
1000000 alloc+frees X 20 threads
3.460256s (0.173013s/thread) - g_malloc/g_free
6.731648s (0.336582s/thread) - GMemChunk
0.890570s (0.044528s/thread) - MemChunk
Comment 45 Matthias Clasen 2005-09-18 06:19:04 UTC
Created attachment 52354 [details]
memchunk.c
Comment 46 Matthias Clasen 2005-09-18 06:19:32 UTC
Created attachment 52355 [details]
memchunk.h
Comment 47 Matthias Clasen 2005-09-26 04:19:43 UTC
I tested this allocator in GLib, and found that it was slightly slower than the
current code when allocating and freeing lists of length ~100 in a single thread,
because the current code frees lists in O(1). Therefore, I modified my allocator
to support O(1) list freeing.
Comment 48 Matthias Clasen 2005-09-26 04:26:51 UTC
Created attachment 52649 [details] [review]
gmem.[hc] patch
Comment 49 Matthias Clasen 2005-09-26 04:27:50 UTC
Created attachment 52650 [details] [review]
use the new allocator for GList, GSList, GHashTable
Comment 50 Matthias Clasen 2005-09-26 04:42:35 UTC
Maybe I should write a quick outline of how my allocator works.
Memory is allocated in reasonably large blocks (1k currently), chopped into
equal-sized chunks, and put on a free list. The free list is actually a list
of "magazines", which hold a fixed number of chunks. Each thread has its own
local freelist which is limited in size to two magazines. Transferring memory
between the global and local freelists is always done in full magazines (except
for thread termination, which is handled separately using a "rest" list).
Each thread allocates and frees chunks out of its local freelist, as long
as that's possible. As an extra twist, the 2 magazines of the local freelist
can be swapped, which can sometimes avoid exchanging memory with the global free
list.

For O(1) list freeing, we give up the idea that the members of a magazine are
individual chunks. Instead, they can be lists themselves, and the code handles
that when allocating a chunk.
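
A very stripped-down sketch of that scheme (illustrative only -- initialization,
the 1k block carving, the chaining for O(1) list frees and the thread-termination
handling are all left out; chunks come straight from g_malloc() here, and the
depot keeps both full and empty magazines):

  #include <glib.h>

  #define MAGAZINE_SIZE 64   /* chunks per magazine (illustrative) */

  typedef struct _Magazine Magazine;
  struct _Magazine {
    Magazine *next;                  /* link when parked in the depot */
    guint     count;
    gpointer  chunks[MAGAZINE_SIZE];
  };

  typedef struct {
    GMutex    lock;
    Magazine *full;                  /* full magazines */
    Magazine *empty;                 /* empty magazines */
    gsize     chunk_size;
  } Depot;

  typedef struct {                   /* lives in per-thread data */
    Magazine *current;               /* allocate/free from this one */
    Magazine *previous;              /* swap partner */
  } ThreadCache;

  static gpointer
  cache_alloc (ThreadCache *cache, Depot *depot)
  {
    Magazine *cur = cache->current;

    if (cur->count == 0)
      {
        /* Swap trick: the other local magazine may still have chunks. */
        cache->current = cache->previous;
        cache->previous = cur;
        cur = cache->current;
      }
    if (cur->count == 0)
      {
        /* Exchange our empty magazine for a full one from the depot. */
        g_mutex_lock (&depot->lock);
        if (depot->full != NULL)
          {
            cur->next = depot->empty;
            depot->empty = cur;
            cur = depot->full;
            depot->full = cur->next;
            cache->current = cur;
          }
        g_mutex_unlock (&depot->lock);
      }
    if (cur->count == 0)
      {
        /* Depot exhausted: fill the magazine with fresh chunks. */
        guint i;
        for (i = 0; i < MAGAZINE_SIZE; i++)
          cur->chunks[i] = g_malloc (depot->chunk_size);
        cur->count = MAGAZINE_SIZE;
      }
    return cur->chunks[--cur->count];
  }

  static void
  cache_free (ThreadCache *cache, Depot *depot, gpointer chunk)
  {
    Magazine *cur = cache->current;

    if (cur->count == MAGAZINE_SIZE)
      {
        /* Swap trick again: the other magazine may have room. */
        cache->current = cache->previous;
        cache->previous = cur;
        cur = cache->current;
      }
    if (cur->count == MAGAZINE_SIZE)
      {
        /* Both full: park one in the depot, take (or make) an empty one. */
        g_mutex_lock (&depot->lock);
        cur->next = depot->full;
        depot->full = cur;
        if (depot->empty != NULL)
          {
            cur = depot->empty;
            depot->empty = cur->next;
          }
        else
          cur = NULL;
        g_mutex_unlock (&depot->lock);
        if (cur == NULL)
          cur = g_new0 (Magazine, 1);
        cache->current = cur;
      }
    cur->chunks[cur->count++] = chunk;
  }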
Comment 51 Matthias Clasen 2005-09-26 04:52:58 UTC
Here are numbers for allocating and freeing 1000000 GList elements, where the
freeing is done on lists of length 96. The runs are with 1, 5, 10 and 20 threads.

Current code:

[mclasen@golem mem]$ ./listtest1 1 1000000
0.339554s (0.339554s/thread)
[mclasen@golem mem]$ ./listtest1 5 1000000
5.239699s (1.047940s/thread)
[mclasen@golem mem]$ ./listtest1 10 1000000
11.194050s (1.119405s/thread)
[mclasen@golem mem]$ ./listtest1 20 1000000
25.882314s (1.294116s/thread)

New allocator:

[mclasen@golem mem]$ ./listtest1 1 1000000
0.188417s (0.188417s/thread)
[mclasen@golem mem]$ ./listtest1 5 1000000
0.560558s (0.112112s/thread)
[mclasen@golem mem]$ ./listtest1 10 1000000
0.987478s (0.098748s/thread)
[mclasen@golem mem]$ ./listtest1 20 1000000
2.058154s (0.102908s/thread)
Comment 52 Matthias Clasen 2005-09-26 05:09:07 UTC
Created attachment 52651 [details] [review]
necessary gthread changes
Comment 53 Matthias Clasen 2005-09-26 05:10:52 UTC
Using GPrivate for the thread local magazines turned out to be very hard, since
GSList, GHashTable, GArray all use memchunks. I ended up embedding the per-thread
pointers directly in GRealThread.
Comment 54 Andy Wingo 2005-09-26 08:39:50 UTC
- Should be doing the listtest with malloc/free as well -- at this point the
  difference is not so much your allocator versus the existing, it's yours versus
  malloc, no?
- Care to post your listtest1.c?
Comment 55 Matthias Clasen 2005-09-26 13:18:16 UTC
Ok, here are the same numbers for plain malloc:

[mclasen@golem mem]$ ./listtest1 1 1000000
0.229110s (0.229110s/thread)
[mclasen@golem mem]$ ./listtest1 5 1000000
0.794192s (0.158838s/thread)
[mclasen@golem mem]$ ./listtest1 10 1000000
1.810608s (0.181061s/thread)
[mclasen@golem mem]$ ./listtest1 20 1000000
3.834949s (0.191747s/thread)
Comment 56 Matthias Clasen 2005-09-26 13:21:00 UTC
Created attachment 52672 [details]
listtest1.c
Comment 57 Matthias Clasen 2005-09-27 17:42:00 UTC
Created attachment 52734 [details] [review]
Cleaned up patch that was sent to timj for review
Comment 58 Matthias Clasen 2005-11-18 13:47:16 UTC
New g_slice_* API is in place, and will get a real implementation before 2.10.
The current implementation is just a malloc wrapper.
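
For the record, typical use of the new API looks like this (just an
illustration of the public g_slice calls, not code from the implementation):

  #include <glib.h>

  typedef struct {
    gint   id;
    gchar *name;
  } Record;

  static void
  example (void)
  {
    /* type-safe convenience macros */
    Record *r = g_slice_new0 (Record);
    r->id = 42;
    g_slice_free (Record, r);

    /* raw variants for sizes only known at runtime */
    gpointer mem = g_slice_alloc (128);
    g_slice_free1 (128, mem);
  }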
Comment 59 Tim Janik 2005-12-02 10:38:17 UTC
*** Bug 169389 has been marked as a duplicate of this bug. ***
Comment 60 Tim Janik 2005-12-02 10:40:19 UTC
the new g_slice_*() implementation is in CVS now, as described in:
http://mail.gnome.org/archives/gtk-devel-list/2005-December/msg00031.html