GNOME Bugzilla – Bug 118439
Add GFreeList, deprecate GMemChunk
Last modified: 2011-02-18 15:49:17 UTC
Here is a patch that adds a new GFreeList API and deprecates GMemChunk. GFreeList is a data structure that allocates objects from chunks of memory and puts them on a free list when they are freed. It doesn't attempt to free the chunks, so it can be much simpler than GMemChunk. The alloc and free functions are inlined and expand to a dozen instructions or so.

- Freeing a list becomes slower, because the list has to be walked. This is no worse in the algorithmic sense, because the n elements must have been allocated at some point, which is O(n).

- The only remaining use of GMemChunk is in ghook.[ch]. The reason is that ghook.h exports GMemChunk as public API.
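The idea described above can be sketched roughly like this (the names here are invented for illustration; the actual API is in the attached header and source files, and the real GFreeList never frees its blocks, just as this sketch doesn't):

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical sketch of the free-list idea: carve atoms out of
 * large blocks, never free the blocks, and thread freed atoms onto
 * a singly linked free list so alloc/free are a few instructions. */
typedef struct FreeAtom FreeAtom;
struct FreeAtom { FreeAtom *next; };

typedef struct {
  size_t    atom_size;     /* must be >= sizeof (FreeAtom) */
  size_t    n_preallocs;   /* atoms carved out of each new block */
  FreeAtom *atoms;         /* head of the free list */
} FreeList;

static void *
free_list_alloc (FreeList *list)
{
  FreeAtom *atom;

  if (list->atoms == NULL)
    {
      /* Out of atoms: carve a fresh block and push every atom.
       * The block itself is intentionally never freed. */
      char *block = malloc (list->atom_size * list->n_preallocs);
      size_t i;

      for (i = 0; i < list->n_preallocs; i++)
        {
          atom = (FreeAtom *) (block + i * list->atom_size);
          atom->next = list->atoms;
          list->atoms = atom;
        }
    }

  atom = list->atoms;
  list->atoms = atom->next;
  return atom;
}

static void
free_list_free (FreeList *list, void *mem)
{
  FreeAtom *atom = mem;

  /* O(1): push the atom back onto the free list. */
  atom->next = list->atoms;
  list->atoms = atom;
}
```

A freed atom is reused by the very next allocation of the same size, which is where the speedup over a general-purpose malloc comes from.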
Created attachment 18652 [details] [review] patch
Created attachment 18653 [details] GFreeList header file
Created attachment 18654 [details] GFreeList source file
It feels a little not-opaque-enough to me, but maybe it's worth it for the performance win. A possibly nice thing DBusMemPool has is a flag to zero the blocks on alloc (like g_malloc0/calloc).
*** Bug 84536 has been marked as a duplicate of this bug. ***
There is a typo in the GFreeList C source: DISBLE_MEM_POOLS should be DISABLE_MEM_POOLS.
- The g_free_list_free_internal() function is not used, so it can be deleted.

- If we want the zero-on-alloc feature, I'd do it as a new function g_free_list_alloc0().

- It may be a good idea not to let the user specify n_preallocs, and have the free list decide the block size dynamically (say, by doubling every time it runs out of atoms). This might also make it possible to avoid malloc() overhead completely by using anonymous mmap()s of complete pages. However, the n_preallocs argument to g_allocator_new() would have to be ignored, or passed to the free list by some other (internal) means.

- I don't really see how to make the structure opaque without breaking the G_FREE_LIST_DEFINE macros. You can't use malloc to initialize a global variable in C. Replacing the "blocks" pointer with a "priv" pointer might be a good idea, though.
Some random thoughts:

- It may be hard to replace the GList and GSList free lists with a generic free list, at least without some extra API; the change of the GList free list from a list to a list of lists was a pretty big efficiency win, as I remember it. (The point being that g_list_free() is currently O(1).)

- What Tim wanted to do in this area was to replace memchunks with fixed-size allocators, so that every user of blocks of size 8 shared the same free list.

- I think it would be a mistake to add new API here without thinking about threading and locking; a big issue with GLib currently in heavily threaded apps is lock contention on the free lists. (Bug 116805; Havoc also noted this in the similarly structured DBus features.) The GStreamer folks apparently have an atomic-operation trash stack to avoid lock overhead (this was mentioned to me at GUADEC; I've added Wim Taymans to the CC, I think he was the person working on this). That may well be good enough; there are potentially still problems with:
  - overhead of the atomic primitive (lock the bus)
  - locality of memory usage
  That might make some sort of scheme where free list elements normally stay local to a processor/thread better, but the complexity and extra instructions there might be considerable.

- Does GTrashStack also get deprecated? How does this relate to GTrashStack?

Anyways, just thought I'd add a few comments, and make things a bit murkier :-)
I found that DBusMemPool (which is very similar to this API) worked well when I had one mem pool per hash table _instance_, but seems to suck a bit for lists when used globally for all list nodes due to the thread issue.
- You could imagine keeping track of the tail of the free list:

    G_INLINE_FUNC void
    g_free_list_free (GFreeList *list,
                      gpointer   mem)
    {
      GFreeListAtom *atom = mem;

      if (G_LIKELY (list->tail))
        {
          atom->next = list->atoms;
          list->atoms = atom;
        }
      else
        g_free_list_free_internal (list, mem);
    }

  and then having a

    void g_free_list_free_many (GFreeList *list, gpointer mem, gpointer *next);

  that would reuse the next pointer if it happened to match the one in GFreeListAtom.

- A big part of the problem in bug 116805 is probably that the lock is held for a very long time as the memchunk updates all its internal trees and stuff. The GString memchunk currently does not have a trash stack in front of it.

- Having a global 8-byte free list would impose locking even in cases where it wouldn't otherwise be necessary (like inside gtk+ in a multithreaded application).
the glib memory caching code needs a major overhaul. it should:
- make use of libc malloc/free performance advantages where it makes sense (e.g. the glibc allocators are *fast*, there's not much point in caching here)
- avoid separate trash lists for structures of different sizes (not just 8 bytes)
- do singly-linked-list frees in O(1) for all structures that use *next as second pointer (GSList, GList, GNode, ...)
basically, that means getting rid of memchunks internally and providing one global lock-free allocator with trash-list support.
When we get around to doing something about this, we should also make sure that we get rid of the ABI change due to --disable-mem-pools...
Hey, some notes from the GStreamer camp.

== What we have ==

We do have a GstMemChunk[0] that is mt-safe. Its init allocates a block of memory and pushes the chunks onto a trash stack. After the init, all operations are on the trash stack.

[0] http://cvs.freedesktop.org/gstreamer/gstreamer/gst/gstmemchunk.h?view=markup
    http://cvs.freedesktop.org/gstreamer/gstreamer/gst/gstmemchunk.c?view=markup

GstMemChunk doesn't have any locks itself; it's our trash stack implementation, GstTrashStack[1], that's threadsafe. The trash stack can't be implemented with the stock atomic ops in gatomic.c because of the so-called "ABA problem" (google for it). The x86 implementation is lock-free because x86 supports double-word compare-and-swap. The fallback C implementation uses a mutex around the stack, but is still fast on linux 2.6. Optimized implementations for other architectures are possible.

[1] http://cvs.freedesktop.org/gstreamer/gstreamer/gst/gsttrashstack.h?view=markup

== An aside regarding GMemChunk ==

The default defines for glib don't have G_DISABLE_CHECKS, which is good. What's not good is that without G_DISABLE_CHECKS defined, gmemchunk instruments each call to alloc and free, updating TLS call counters. The result is that it is slower than malloc on my box, even without being threadsafe. That profiling code has no place in a production allocator. GMemChunk is significantly faster than malloc if the profiling code is removed. I removed it in the tests below.

== Performance of the x86 version ==

I wrote a small test[2] to run N threads doing M allocations, freeing immediately after each allocation. The wall clock time is then printed out for each test. Here are some typical results on a P4 2.8GHz single-processor machine.
[2] http://cvs.freedesktop.org/gstreamer/gstreamer/tests/memchunk/gmemchunktest.c?view=markup

1000000 alloc+frees X 1 threads
0.396343s - GMemChunk
0.184425s - GstMemChunk
0.343142s - g_malloc/g_free

This is without GMemChunk's TLS profiling, but with a lock around the memchunk. If I remove the lock (#undef GMEMCHUNK_THREADSAFE in the test), I get something like this:

1000000 alloc+frees X 1 threads
0.143318s - GMemChunk
0.183703s - GstMemChunk
0.343673s - g_malloc/g_free

If I increase the number of threads, I have to put the lock back, of course. On my single-processor machine the relative times are the same, and the same per thread. On a dual-CPU Xeon with hyperthreading, the results are like this (I'm doing 10 times as much):

1000000 alloc+frees X 1 threads
0.398000s (0.398000s/thread) - GMemChunk
0.149681s (0.149681s/thread) - GstMemChunk
0.299621s (0.299621s/thread) - g_malloc/g_free

1000000 alloc+frees X 10 threads
3.786688s (0.378669s/thread) - GMemChunk
2.231514s (0.223151s/thread) - GstMemChunk
3.084715s (0.308471s/thread) - g_malloc/g_free

== Performance of the fallback implementation ==

AMD64 running in 64-bit mode on linux 2.6:

100000 alloc+frees X 1 threads
0.012986s - GMemChunk
0.008702s - GstMemChunk
0.010135s - g_malloc/g_free

100000 alloc+frees X 10 threads
0.126268s - GMemChunk
0.105127s - GstMemChunk
0.099638s - g_malloc/g_free

== Compatibility ==

GstMemChunk offers the same API as GMemChunk, so it can be a drop-in replacement. Since GMemChunk is an opaque structure, this would work out nicely.

GstTrashStack is larger than GTrashStack, because of the counter and mutex needed for threadsafety. Also, it doesn't implement the _peek operation. So it should probably be a new API, and GTrashStack should be deprecated. Perhaps GstTrashStack can be called GFreeList.

== Issues ==

GstMemChunk will never free any memory until the user calls gst_mem_chunk_destroy(). Neither will GFreeList (Soeren's structure), though.

GstMemChunk doesn't do O(1) list frees by itself.
That's kind of nasty. But here's how you might do it: the trash stack currently clobbers the first word in the structure, to use as its own next pointer. But the list next pointer is the second word in GList, GSList, and GNode. So you could have GstMemChunk check the second word after popping an item, and if it's not NULL, push it back on the trash stack. g_mem_chunk_free would by default set the second word to NULL, so it's not followed, while we could add g_mem_chunk_free_list to not NULL out the second word.

So for the normal case, you still have one pop operation. For the list case, you'd have a pop and a push, but a push is cheaper than a pop because it doesn't have to avoid the ABA problem. This would imply a minimum chunk size of two words. Not terrible.

To make all mem chunk allocators of the same size use the same mem chunk... well, you'd have to make the mem chunk atomically refcounted, and have g_mem_chunk_new keep a private hash table with a lock, etc. That's a separate issue though; I'd prefer not to address it in this bug.

== Summary ==

GStreamer has avoided GMemChunk because it's slower than malloc and isn't threadsafe. However, because GMemChunk is used by data structures that we use (notably arrays), all protected by mutexen in GLib, it really needs to be dealt with.

GstMemChunk is API-compatible with GMemChunk, is threadsafe, and only slightly slower in the single-threaded case. It's implemented as a threadsafe GFreeList. So, as a way forward, I think we should:

0) Figure out the O(1) free business
1) Get GstTrashStack in as GFreeList (or some other name)
2) Get GMemChunk replaced by GstMemChunk
3) Patch all users of GMemChunk to not have locks
   - Drop the GTrashStack front-end to GMemChunk in gqueue.c
4) Do to glist/gnode/gslist what we decide in 0)

How does this sound as a plan? And then, more specifically, should we implement the complexity of the O(1) free?
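The second-word trick described above can be sketched in a few lines (a hypothetical single-threaded model with invented names, ignoring the atomic-ops side; the real code would do the push/pop with compare-and-swap):

```c
#include <stddef.h>

/* The stack's own link lives in the first word; list-like structures
 * (GList, GSList, GNode) keep their "next" pointer in the second word.
 * Freeing a whole list is then one push; pop unfolds it lazily. */
typedef struct Chunk Chunk;
struct Chunk {
  Chunk *stack_next;  /* clobbered by the trash stack */
  Chunk *list_next;   /* the structure's own next pointer */
};

typedef struct { Chunk *head; } TrashStack;

static void
trash_push (TrashStack *stack, Chunk *chunk)
{
  chunk->stack_next = stack->head;
  stack->head = chunk;
}

/* Free one chunk: NULL the second word so pop won't follow it. */
static void
chunk_free (TrashStack *stack, Chunk *chunk)
{
  chunk->list_next = NULL;
  trash_push (stack, chunk);
}

/* Free a whole list in O(1): push only the head and leave the
 * second words intact; the rest is recovered pop by pop. */
static void
chunk_free_list (TrashStack *stack, Chunk *list_head)
{
  trash_push (stack, list_head);
}

static Chunk *
chunk_alloc (TrashStack *stack)
{
  Chunk *chunk = stack->head;

  if (chunk == NULL)
    return NULL;              /* caller would fall back to a new block */
  stack->head = chunk->stack_next;

  if (chunk->list_next != NULL)
    {
      /* The popped item headed a freed list: push the remainder
       * back (a push, which is cheaper than a pop). */
      trash_push (stack, chunk->list_next);
      chunk->list_next = NULL;
    }
  return chunk;
}
```

Note the minimum chunk size of two words, exactly as the comment predicts.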
Add to CC, fix wim's email
Isn't your test slightly unfair, since it uses G_ALLOC_AND_FREE for the GMemChunk, while GstMemChunk does the equivalent of G_ALLOC_ONLY ? Anyway, I think the general plan looks pretty good, if we can implement the O(1) freeing as you outlined above.
Ah, you're right. The version pointed to in the link now does ALLOC_ONLY. Here's a sample run on my uniprocessor x86:

1000000 alloc+frees X 1 threads
0.345726s (0.345726s/thread) - GMemChunk
0.141357s (0.141357s/thread) - GstMemChunk
0.266718s (0.266718s/thread) - g_malloc/g_free

1000000 alloc+frees X 10 threads
3.367218s (0.336722s/thread) - GMemChunk
1.413494s (0.141349s/thread) - GstMemChunk
3.105659s (0.310566s/thread) - g_malloc/g_free

That's with GMemChunk having a lock around it, but no profiling. I'll get to implementing the O(1) freeing soon.
I propose an atomic GTrashStack that just clobbers the first pointer of structures pushed onto it. It would be used to implement GMemChunk with its current API. For the O(1) freeing of GList, GSList, GNode, ... a new data structure GFreeList would be used. GFreeList also uses GTrashStack internally, but the pop() function would also check the second pointer in the structure, follow it, and push() it back onto the GTrashStack.
Somebody should look at numbers on an SMP box. Also, can somebody compare to a malloc with threading support built into the design? Google's malloc comes to mind.
Wow. Google's malloc is really, really fast. The gmemchunktest.c linked to above will now load libtcmalloc if it's available, and it is fast:

Uniprocessor P4 2.8GHz:

100000 alloc+frees X 1 threads
0.055437s (0.055437s/thread) - GMemChunk
0.023199s (0.023199s/thread) - GstMemChunk
0.040129s (0.040129s/thread) - g_malloc/g_free
0.010877s (0.010877s/thread) - google malloc/free

100000 alloc+frees X 10 threads
0.372564s (0.037256s/thread) - GMemChunk
0.139306s (0.013931s/thread) - GstMemChunk
0.273138s (0.027314s/thread) - g_malloc/g_free
0.079811s (0.007981s/thread) - google malloc/free

Dual hyperthreaded Xeon (pseudo 4-way) 2.4GHz:

1000000 alloc+frees X 1 threads
0.381471s (0.381471s/thread) - GMemChunk
0.150774s (0.150774s/thread) - GstMemChunk
0.297379s (0.297379s/thread) - g_malloc/g_free
0.086142s (0.086142s/thread) - google malloc/free

1000000 alloc+frees X 10 threads
3.802258s (0.380226s/thread) - GMemChunk
1.511163s (0.151116s/thread) - GstMemChunk
3.081182s (0.308118s/thread) - g_malloc/g_free
0.939856s (0.093986s/thread) - google malloc/free

Um.... Faced with these numbers, why do we have memchunks at all? You still have the theoretical ability to mlockall a chunk, but damn. Google malloc needs to be pushed either to glibc or glib.
Ok, I needed to try it too, on amd64 :) First I encountered problems when compiling the google-performance library. I had to get rid of all the profiling stuff, but that shouldn't change anything (since that code isn't actually used by libtcmalloc).

CPU: AMD Athlon(tm) 64 Processor 2800+ (1.6GHz)

100000 alloc+frees X 1 threads
0.015604s (0.015604s/thread) - GMemChunk
0.008723s (0.008723s/thread) - GstMemChunk
0.009573s (0.009573s/thread) - g_malloc/g_free
0.008909s (0.008909s/thread) - google malloc/free

100000 alloc+frees X 10 threads
0.162649s (0.016265s/thread) - GMemChunk
0.099027s (0.009903s/thread) - GstMemChunk
0.099321s (0.009932s/thread) - g_malloc/g_free
0.091263s (0.009126s/thread) - google malloc/free

Same run, but with 10 times more allocations (1 million):

1000000 alloc+frees X 1 threads
0.161057s (0.161057s/thread) - GMemChunk
0.086330s (0.086330s/thread) - GstMemChunk
0.098534s (0.098534s/thread) - g_malloc/g_free
0.088163s (0.088163s/thread) - google malloc/free

1000000 alloc+frees X 10 threads
1.641056s (0.164106s/thread) - GMemChunk
0.938081s (0.093808s/thread) - GstMemChunk
0.986813s (0.098681s/thread) - g_malloc/g_free
0.893960s (0.089396s/thread) - google malloc/free

So google malloc really isn't THAT much faster on amd64... but funny that amd64s kick the crap out of Xeons :)
I am not sure how much Google's malloc implementation is tuned for the desktop. For example, they don't do tricks like mmapping mallocs > 128 kb so that memory doesn't get fragmented. Also, for sizes that are not multiples of 8 bytes, they will waste 4 bytes per object from what I can tell (this would affect hash nodes, if I remember correctly how the fields are laid out).

For AMD: does anybody have a dual AMD? Also, what is the test case used here? We may be missing one extremely important stat: how the test interacts with the cache. Take a look at the blog from ricom (a .NET performance guru from msft):

http://blogs.msdn.com/ricom/archive/2005/01/05/347361.aspx
http://blogs.msdn.com/ricom/archive/2005/01/07/348691.aspx

Note how badly the freelist does in this case. I don't know if google's malloc is smarter in how it handles things here (it probably can't be anywhere near as smart as a GC, but it can be close). Also, the fact that it uses thread-local storage may be acting to its advantage (rather than allocating from a global freelist).
Also, on SMP boxes, we should try tests that use the same number of threads as cpus (try both logical and physical here: so for the dual Xeon, try 2 and 4). I also wonder why google malloc did so much worse on amd64. One idea is that they are using a different data structure to do their lookup thingies. Another is that maybe the CFLAGS are not as tuned as they are for glibc.
The current google malloc implementation has some caveats that make it a pretty poor choice as a replacement for the standard libc malloc, the big two being that it doesn't actually return memory to the system, and that it gobbles up 6 MB of memory on startup. It's also in C++, which makes the chances of inclusion into glib or glibc pretty slim. These can all be addressed, but as it stands, it's not really a general drop-in replacement.
The 6 MB of memory is easy for them to address (I've talked to them about this already). They just need to get zeroed memory from the system, so that things come from the shared zero page. As for releasing memory, that is probably a bit of a bigger problem. We probably want to steal the best ideas from google malloc rather than trying to use it as a drop-in replacement. With the freelist, we already know much more than a malloc does: we know the size of the object we get back. This means we can avoid costly lookups.
Wim, did you ever get around to implementing the O(1) freeing as outlined in comment #17?
Hmm, I converted the GstMemChunk code into a GLib patch and added O(1) list freeing, but I can't reproduce Wim's numbers. On this dual-Xeon box, malloc clearly outperforms the patched GMemChunk code as soon as the number of threads is > 1:

[mclasen@golem mem]$ ./gmemchunktest 1 100000
100000 alloc+frees X 1 threads
0.037062s (0.037062s/thread) - g_malloc/g_free
0.014173s (0.014173s/thread) - GMemChunk
[mclasen@golem mem]$ ./gmemchunktest 2 100000
100000 alloc+frees X 2 threads
0.045879s (0.022940s/thread) - g_malloc/g_free
0.073822s (0.036911s/thread) - GMemChunk
[mclasen@golem mem]$ ./gmemchunktest 5 100000
100000 alloc+frees X 5 threads
0.093965s (0.018793s/thread) - g_malloc/g_free
0.207979s (0.041596s/thread) - GMemChunk
[mclasen@golem mem]$ ./gmemchunktest 10 100000
100000 alloc+frees X 10 threads
0.180635s (0.018064s/thread) - g_malloc/g_free
0.413845s (0.041385s/thread) - GMemChunk
Hmm, I converted the GstMemChunk code into a GLib patch and added O(1) list freeing, but I can't reproduce Wim's numbers. On this dual-Xeon box, malloc clearly outperforms the patched GMemChunk code as soon as the number of threads is > 1:

[mclasen@golem mem]$ ./gmemchunktest 1 100000
100000 alloc+frees X 1 threads
0.037504s (0.037504s/thread) - g_malloc/g_free
0.014732s (0.014732s/thread) - GMemChunk
[mclasen@golem mem]$ ./gmemchunktest 2 100000
100000 alloc+frees X 2 threads
0.040671s (0.020335s/thread) - g_malloc/g_free
0.074221s (0.037110s/thread) - GMemChunk
[mclasen@golem mem]$ ./gmemchunktest 5 100000
100000 alloc+frees X 5 threads
0.107214s (0.021443s/thread) - g_malloc/g_free
0.223165s (0.044633s/thread) - GMemChunk
[mclasen@golem mem]$ ./gmemchunktest 10 100000
100000 alloc+frees X 10 threads
0.199082s (0.019908s/thread) - g_malloc/g_free
0.460681s (0.046068s/thread) - GMemChunk

If I compile GLib with G_DISABLE_MEM_POOLS to reduce the GMemChunk to malloc() + atomic freelist, I get basically the same numbers, so the overhead must be in the atomic free list:

[mclasen@golem mem]$ ./gmemchunktest 1 100000
100000 alloc+frees X 1 threads
0.037062s (0.037062s/thread) - g_malloc/g_free
0.014173s (0.014173s/thread) - GMemChunk
[mclasen@golem mem]$ ./gmemchunktest 2 100000
100000 alloc+frees X 2 threads
0.045879s (0.022940s/thread) - g_malloc/g_free
0.073822s (0.036911s/thread) - GMemChunk
[mclasen@golem mem]$ ./gmemchunktest 5 100000
100000 alloc+frees X 5 threads
0.093965s (0.018793s/thread) - g_malloc/g_free
0.207979s (0.041596s/thread) - GMemChunk
[mclasen@golem mem]$ ./gmemchunktest 10 100000
100000 alloc+frees X 10 threads
0.180635s (0.018064s/thread) - g_malloc/g_free
0.413845s (0.041385s/thread) - GMemChunk
Created attachment 52154 [details] [review] the patch
Sorry to add noise, but are you seriously comparing execution times of less than a second?
Weird, I'm seeing similar results on the dual-xeon box running fedora here. The only change has been an upgrade from fc2 to fc4. My desktop is now an x86-64, so I can't test the optimized code. I tried it out on a coworker's single-processor machine running ubuntu breezy, and gstmemchunk still kicks the crap out of malloc. Still investigating...
Both breezy and FC4 have glibc 2.3.5, so it can't be a difference in upstream glibc.
The breezy box I was testing on didn't have libc6-686 installed. Installing it and munging the necessary ld.so.conf files didn't make a difference in timing. Perhaps Ubuntu uses different configure options from Fedora.

There is one more optimization that could be made, which is having per-thread trash stacks (freelists). Wim indicated his desire to flesh this out, but he's working remotely until Monday -- will see if I can get him to comment.

One more data point -- on my x86-64 single-xeon breezy system, hyperthreaded, g_malloc is about 4x faster than the fallback gstmemchunk implementation for the 10-thread case, and still marginally faster in the single-thread case.

Open item: is this really worth it? Why not just punt all this to malloc? The only real reason I can see would be that you can mlockall(2) a memchunk, which is good for realtime allocation as well, but glib isn't really a realtime library.
> There is one more optimization that could be made, which is having per-thread
> trash stacks (freelists). Wim indicated his desire to flesh this out, but he's
> working remotely until Monday -- will see if I can get him to comment.

Yes, that's the one thing that we could and maybe should still try out.

> Open item: is this really worth it? Why not just punt all this to malloc? The
> only real reason I can see would be that you can mlockall(2) a memchunk, which
> is good for realtime allocation as well, but glib isn't really a realtime
> library.

I was slowly coming to this conclusion as well. The one argument in favor of memchunks is that they allow us to avoid some memory overhead (glibc malloc requires 4 bytes of extra room for bookkeeping, and won't allocate blocks smaller than 16 bytes anyway) - we should probably measure how much we actually waste due to that.
*** Bug 116805 has been marked as a duplicate of this bug. ***
Here is a very quick hack using a thread-private free list. It seems to scale better here:

[mclasen@golem mem]$ ./gmemchunktest 1 100000
100000 alloc+frees X 1 threads
0.038504s (0.038504s/thread) - g_malloc/g_free
0.029129s (0.029129s/thread) - GMemChunk
0.013549s (0.013549s/thread) - XMemChunk
[mclasen@golem mem]$ ./gmemchunktest 10 100000
100000 alloc+frees X 10 threads
0.210929s (0.021093s/thread) - g_malloc/g_free
0.354550s (0.035455s/thread) - GMemChunk
0.083927s (0.008393s/thread) - XMemChunk
[mclasen@golem mem]$ ./gmemchunktest 100 100000
100000 alloc+frees X 100 threads
1.847946s (0.018479s/thread) - g_malloc/g_free
3.934647s (0.039346s/thread) - GMemChunk
0.693209s (0.006932s/thread) - XMemChunk
Created attachment 52179 [details] thread private free lists
Of course, this simple hack doesn't handle some interesting questions:
1) what to do with the free list when a thread terminates
2) how to handle transfer of memory between threads
Hmm, I spent way too much time thinking about these issues, and came up with yet another design.
Created attachment 52238 [details] another approach
Created attachment 52239 [details] header
After changing POOL_HEADER_SIZE to sizeof(PoolHeader) for my 64-bit x86 hyperthreaded pentium, here are the results of that benchmark:

1000000 alloc+frees X 10 threads
1.627959s (0.162796s/thread) - MemChunk
18.364705s (1.836471s/thread) - GMemChunk
8.814012s (0.881401s/thread) - GstMemChunk
1.356361s (0.135636s/thread) - g_malloc/g_free
0.663839s (0.066384s/thread) - google malloc/free

MemChunk is your slab-allocator thing. I then changed the benchmark to touch every byte in the allocated pieces, just to see. I got segvs doing that with your slab allocator.

I suspect malloc is as good as we can maintainably hope to get.
> I suspect malloc is as good as we can maintainably hope to get.

I agree. A good malloc/free has the advantage that it can often hand out memory that it knows is in cache. All the freelists can only do that if the application repeatedly frees and allocates blocks of the same size. This is an effect that will show up when the application actually uses the memory, so simple allocate/free benchmarks are biased in favor of freelists. It's hard to come up with credible benchmarks, other than measuring how fast real-world applications with real-world memory allocation patterns actually run.
Yes, maybe we just need to wait for a lockfree malloc implementation, as described in http://www.research.ibm.com/people/m/michael/pldi-2004.pdf
I have made one final attempt to come up with something simple but scalable. It is inspired by the "magazine layer" of the classical slab allocator. Lists of free chunks are kept below a fixed maximum size. There is one global list of free lists, and each thread keeps two lists to itself. Lists are only transferred between the global list and thread-local storage when necessary. To handle thread termination, the mem chunk also keeps an extra list of "rest chunks". The minimum block size is 2 pointers, since we store two next pointers in the chunks on the global free list. Memory is never freed, since we don't keep track of the memory area a chunk belongs to.

This code seems to be consistently faster than malloc/free in the trivial alloc/free test scenario:

[mclasen@golem mem]$ ./gmemchunktest 1 1000000
1000000 alloc+frees X 1 threads
0.340801s (0.340801s/thread) - g_malloc/g_free
0.298237s (0.298237s/thread) - GMemChunk
0.077505s (0.077505s/thread) - MemChunk
[mclasen@golem mem]$ ./gmemchunktest 2 1000000
1000000 alloc+frees X 2 threads
0.346512s (0.173256s/thread) - g_malloc/g_free
0.546255s (0.273128s/thread) - GMemChunk
0.235085s (0.117542s/thread) - MemChunk
[mclasen@golem mem]$ ./gmemchunktest 10 1000000
1000000 alloc+frees X 10 threads
1.690034s (0.169003s/thread) - g_malloc/g_free
3.363173s (0.336317s/thread) - GMemChunk
0.553422s (0.055342s/thread) - MemChunk
[mclasen@golem mem]$ ./gmemchunktest 20 1000000
1000000 alloc+frees X 20 threads
3.460256s (0.173013s/thread) - g_malloc/g_free
6.731648s (0.336582s/thread) - GMemChunk
0.890570s (0.044528s/thread) - MemChunk
Created attachment 52354 [details] memchunk.c
Created attachment 52355 [details] memchunk.h
I tested this allocator in GLib, and found that it was slightly slower than the current code when allocating and freeing lists of length ~100 in a single thread, because the current code has O(1) freeing of lists. Therefore, I modified my allocator to support O(1) list freeing as well.
Created attachment 52649 [details] [review] gmem.[hc] patch
Created attachment 52650 [details] [review] use the new allocator for GList, GSList, GHashTable
Maybe I should write a quick outline of how my allocator works. Memory is allocated in reasonably large blocks (1k currently), chopped into equal-sized chunks, and put on a free list. The free list is actually a list of "magazines", which hold a fixed number of chunks. Each thread has its own local freelist, which is limited in size to two magazines. Transferring memory between the global and local freelists is always done in full magazines (except for thread termination, which is handled separately using a "rest" list). Each thread allocates and frees chunks out of its local freelist, as long as that's possible. As an extra twist, the 2 magazines of the local freelist can be swapped, which can sometimes avoid exchanging memory with the global free list. For O(1) list freeing, we give up the idea that the members of a magazine are individual chunks. Instead, they can be lists themselves, and the code handles that when allocating a chunk.
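The magazine scheme outlined above can be sketched like this (a hypothetical simplification with invented names: a mutex-protected array stands in for the global list of magazines, per-chunk malloc stands in for carving 1k blocks, and the O(1) list-freeing twist and the "rest" list are left out; the real code is in the attached gmem.[hc] patch):

```c
#include <pthread.h>
#include <stddef.h>
#include <stdlib.h>

#define MAGAZINE_SIZE 8   /* chunks per magazine; illustrative only */
#define DEPOT_MAX     64

/* Chunks travel in fixed-size "magazines".  Each thread owns two
 * (mag_a is the working one); only the global depot needs a lock. */
typedef struct {
  void *chunks[MAGAZINE_SIZE];
  int   count;
} Magazine;

static pthread_mutex_t depot_lock = PTHREAD_MUTEX_INITIALIZER;
static Magazine *depot[DEPOT_MAX];
static int depot_count;

static _Thread_local Magazine *mag_a, *mag_b;

static Magazine *
magazine_new (int full, size_t chunk_size)
{
  Magazine *mag = malloc (sizeof (Magazine));

  mag->count = 0;
  if (full)
    while (mag->count < MAGAZINE_SIZE)
      mag->chunks[mag->count++] = malloc (chunk_size);
  return mag;
}

static void *
mag_chunk_alloc (size_t chunk_size)
{
  if (mag_a == NULL)
    mag_a = magazine_new (1, chunk_size);
  if (mag_a->count == 0)
    {
      if (mag_b != NULL && mag_b->count > 0)
        {
          /* The swap trick: the other local magazine has chunks. */
          Magazine *tmp = mag_a; mag_a = mag_b; mag_b = tmp;
        }
      else
        {
          /* Trade the empty magazine for a full one from the depot. */
          pthread_mutex_lock (&depot_lock);
          if (depot_count > 0)
            { free (mag_a); mag_a = depot[--depot_count]; }
          pthread_mutex_unlock (&depot_lock);
          if (mag_a->count == 0)   /* depot was empty: allocate fresh */
            { free (mag_a); mag_a = magazine_new (1, chunk_size); }
        }
    }
  return mag_a->chunks[--mag_a->count];
}

static void
mag_chunk_free (void *chunk)
{
  if (mag_a == NULL)
    mag_a = magazine_new (0, 0);
  if (mag_a->count == MAGAZINE_SIZE)
    {
      if (mag_b == NULL)
        mag_b = magazine_new (0, 0);
      if (mag_b->count < MAGAZINE_SIZE)
        { Magazine *tmp = mag_a; mag_a = mag_b; mag_b = tmp; }
      else
        {
          /* Both local magazines full: hand one back to the depot. */
          pthread_mutex_lock (&depot_lock);
          if (depot_count < DEPOT_MAX)
            depot[depot_count++] = mag_a;
          else
            free (mag_a);   /* sketch: drop it; memory is never returned */
          pthread_mutex_unlock (&depot_lock);
          mag_a = magazine_new (0, 0);
        }
    }
  mag_a->chunks[mag_a->count++] = chunk;
}
```

The point of the design is that the hot path (alloc or free within a non-empty, non-full local magazine) touches no shared state at all; the lock is only taken once per MAGAZINE_SIZE operations in the worst case.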
Here are numbers for allocating and freeing 1000000 GList elements, where the freeing is done on lists of length 96. The runs are with 1, 5, 10 and 20 threads.

Current code:
[mclasen@golem mem]$ ./listtest1 1 1000000
0.339554s (0.339554s/thread)
[mclasen@golem mem]$ ./listtest1 5 1000000
5.239699s (1.047940s/thread)
[mclasen@golem mem]$ ./listtest1 10 1000000
11.194050s (1.119405s/thread)
[mclasen@golem mem]$ ./listtest1 20 1000000
25.882314s (1.294116s/thread)

New allocator:
[mclasen@golem mem]$ ./listtest1 1 1000000
0.188417s (0.188417s/thread)
[mclasen@golem mem]$ ./listtest1 5 1000000
0.560558s (0.112112s/thread)
[mclasen@golem mem]$ ./listtest1 10 1000000
0.987478s (0.098748s/thread)
[mclasen@golem mem]$ ./listtest1 20 1000000
2.058154s (0.102908s/thread)
Created attachment 52651 [details] [review] necessary gthread changes
Using GPrivate for the thread local magazines turned out to be very hard, since GSList, GHashTable, GArray all use memchunks. I ended up embedding the per-thread pointers directly in GRealThread.
- Should be doing the listtest with malloc/free as well -- at this point the difference is not so much your allocator versus the existing one, it's yours versus malloc, no?
- Care to post your listtest1.c?
Ok, here are the same numbers for plain malloc:

[mclasen@golem mem]$ ./listtest1 1 1000000
0.229110s (0.229110s/thread)
[mclasen@golem mem]$ ./listtest1 5 1000000
0.794192s (0.158838s/thread)
[mclasen@golem mem]$ ./listtest1 10 1000000
1.810608s (0.181061s/thread)
[mclasen@golem mem]$ ./listtest1 20 1000000
3.834949s (0.191747s/thread)
Created attachment 52672 [details] listtest1.c
Created attachment 52734 [details] [review] Cleaned up patch that was to timj for review
New g_slice_* API is in place, and will get a real implementation before 2.10. The current implementation is just a malloc wrapper.
*** Bug 169389 has been marked as a duplicate of this bug. ***
the new g_slice_*() implementation is in CVS now, as described in: http://mail.gnome.org/archives/gtk-devel-list/2005-December/msg00031.html