GNOME Bugzilla – Bug 585375
Performance and Contention problems with g_type_class_ref and type_rw_lock
Last modified: 2011-02-18 15:59:31 UTC
------------------------
The gory part:

After seeing huge performance decreases past a certain point in a GStreamer-based multithreaded application, I started profiling the mutexes in glib and found:

Profiling result (note that this backtrace is only the first path taken for the mutex):

Mutex : 0x7f89b902aca8 , type 0, num_locks : 8143886
num_read_locks 0, num_write_locks 0
max_read_wait 0, max_write_wait 0
bt : /opt/trx-HEAD/glib/2.18.0//lib/libglib-2.0.so.0(g_my_profiling_add_mutex+0xe1) [0x7f89b82170d1]
bt : /opt/trx-HEAD/glib/2.18.0//lib/libgthread-2.0.so.0 [0x7f89b89e2601]
bt : /opt/trx-HEAD/glib/2.18.0//lib/libglib-2.0.so.0(g_static_rw_lock_writer_lock+0x64) [0x7f89b8217494]
bt : /opt/trx-HEAD/glib/2.18.0//lib/libgobject-2.0.so.0(g_type_init_with_debug_flags+0x4a) [0x7f89b8e0bc8a]
bt : /opt/trx-HEAD/gstreamer/gstreamer/0.10.20/lib/libgstreamer-0.10.so.0 [0x7f89b9052b57]
bt : /opt/trx-HEAD/glib/2.18.0//lib/libglib-2.0.so.0(g_option_context_parse+0x58) [0x7f89b81fd288]
bt : /opt/trx-HEAD/gstreamer/gstreamer/0.10.20/lib/libgstreamer-0.10.so.0(gst_init_check+0xad) [0x7f89b9051e8d]
bt : /opt/trx-HEAD/gstreamer/gstreamer/0.10.20/lib/libgstreamer-0.10.so.0(gst_init+0x17) [0x7f89b9051f77]

This mutex is taken 8 million times for 100 GStreamer transcoding pipelines over 5 minutes, or about 26 000 times per second!! The next most-taken mutex is taken about 800 000 times.

The mutex in question is g_type_rw_lock (or more precisely its own underlying ->mutex). This is a read/write lock that is used to protect the class information of a type.

------------------------
To better understand the problem, here's a quick intro on how types work in glib:

Type generation:
A type is created by a get_type function like gst_buffer_get_type (). This sets up a static GTypeInfo structure, which contains, for example, the function pointers used to do class and instance initialisation.

Type registration:
The first time the get_type function is called for a type, g_type_register_static will be called with the GTypeInfo structure and a parent type. This registers the type in a tree of types using the parent type. The tree consists of TypeNode entries, with the fundamental glib types at its root. Once this is done, a dynamic TypeData structure is allocated (data = g_malloc (sizeof (TypeData) ...)); this data pointer is part of the TypeNode, and the GTypeInfo information is copied into that structure.

Refcounting:
One important thing about this type system, directly related to the mutex contention problem, is the refcounting of these types. Once the TypeData structure is created, any new object of that type increments a refcount in this structure, so that once there are no more objects using that type, the TypeData structure can be freed. (Do not get confused by register_"static": static types are still refcounted and are "dynamic" in that sense; the "static" in that function name has nothing to do with that.)

Instance creation:
To create an object you use g_object_new or, more generally, g_type_create_instance. This calls g_type_class_ref on the type of the object, since a new object of that class is being created, and g_type_class_unref when the object is destroyed. And here the fun starts...

Locking:
Since the TypeNode->data pointer to the TypeData structure for a type can be created and destroyed as its refcount reaches 0, it must somehow be locked. Glib uses a read/write lock called g_type_rw_lock for that.
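To make the registration path above concrete, a typical get_type function looks roughly like this. This is purely illustrative: "MyThing" and my_thing_get_type are made-up names, the type derives directly from GObject, and real code would usually guard the registration with g_once_init_enter() or use G_DEFINE_TYPE:

  #include <glib-object.h>

  /* Hypothetical example type "MyThing", deriving directly from GObject. */
  GType
  my_thing_get_type (void)
  {
    static GType type = 0;

    if (type == 0)
      {
        /* The static GTypeInfo: class/instance sizes and init functions. */
        static const GTypeInfo info = {
          sizeof (GObjectClass),     /* class_size */
          NULL, NULL,                /* base_init, base_finalize */
          NULL, NULL, NULL,          /* class_init, class_finalize, class_data */
          sizeof (GObject),          /* instance_size */
          0,                         /* n_preallocs */
          NULL,                      /* instance_init */
          NULL                       /* value_table */
        };

        /* First call only: register the type under its parent.  This is
         * where gtype.c allocates the TypeData and copies the GTypeInfo
         * into it, as described above. */
        type = g_type_register_static (G_TYPE_OBJECT, "MyThing", &info, 0);
      }

    return type;
  }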
At this point, I think it's worth taking a look at how a read/write lock works in glib. The two functions used to take a read or write lock on g_type_rw_lock are g_static_rw_lock_reader_lock and g_static_rw_lock_writer_lock.

Checking the implementation of g_static_rw_lock_reader_lock:

  g_static_mutex_lock (&lock->mutex);
  lock->want_to_read++;
  while (lock->have_writer || lock->want_to_write)
    g_static_rw_lock_wait (&lock->read_cond, &lock->mutex);
  lock->want_to_read--;
  lock->read_counter++;
  g_static_mutex_unlock (&lock->mutex);

And for g_static_rw_lock_writer_lock:

  g_static_mutex_lock (&lock->mutex);
  lock->want_to_write++;
  while (lock->have_writer || lock->read_counter)
    g_static_rw_lock_wait (&lock->write_cond, &lock->mutex);
  lock->want_to_write--;
  lock->have_writer = TRUE;
  g_static_mutex_unlock (&lock->mutex);

g_static_rw_lock_wait is a simple g_cond_wait on lock->mutex. The unlock functions take lock->mutex and change the counts/flags as expected.

So in fact each _lock operation on a rw lock can lock the mutex twice: once to change its counter and once to retake the lock after waiting. And each unlock takes the mutex once. So every rw_read_lock/unlock pair does at least 2 locks and 2 unlocks on the same mutex, and these locks are only there to protect the rw lock's counters. I think this could be done atomically and create much less overhead for a lot of use cases.

Ok, to end my somewhat lengthy parenthesis on the rw_lock mutexes: suffice it to say that if you use a rw lock "mutex" a lot, you will lock a lot more real mutexes along the way and multiply your contention.

Contention:

So now that we know that it's possible to have contention with a rw lock, why do we have so much contention? Because each call to g_object_new / g_type_create_instance must call g_type_class_ref, the destruction must call _unref, and both take the type_rw_lock!! So each time an object is created or destroyed anywhere in your application, this static rw lock is taken!!!

On multithreaded applications that create a lot of objects, like applications based on GStreamer that create an enormous amount of GstBuffers, this becomes a real problem. I did some profiling with LTTng and saw that at some point the contention was so great that the futex started taking the slow path in the kernel; this created a huge performance decrease and multiple calls from the kernel to switch the process from one cpu to another, literally killing the performance of the whole application.

So what can be done:

The ugly patch:
A quick and easy workaround is to optimise the common path of g_type_class_ref for one type of object, like GstBuffer, so that the lock is not taken: the type is reffed only once, never unreffed, and the class information is returned directly by g_type_class_ref. But this assumes that your type will stay reffed for the whole duration of the program. In the case of my gstreamer based application this gets rid of the problem, as GstBuffers are by far the most instantiated objects.

The clean patch:
This would make use of atomic operations to do the refcounting, possibly doing that refcounting in the static GTypeInfo structure... and use a lock only to protect the creation/destruction of the TypeNode->data pointer.

Or some other idea... ?
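A minimal sketch of the "clean patch" idea, assuming a stand-in TypeNode struct and a made-up function name (this is not the real gtype.c code): take the reference atomically when the class data already exists, and only fall back to the write lock on the 0->1 and 1->0 transitions.

  #include <glib.h>

  /* Stand-in for the real TypeNode; only the refcount matters here. */
  typedef struct {
    gint ref_count;        /* > 0  <=>  the TypeData exists */
    /* ... class data pointer, etc. ... */
  } TypeNode;

  /* Try to take a reference without any lock.  Returns FALSE if the
   * refcount is 0, i.e. the TypeData doesn't exist yet; in that case the
   * caller has to fall back to the write lock and create it. */
  static gboolean
  type_node_ref_unlocked (TypeNode *node)
  {
    gint old;

    do
      {
        old = g_atomic_int_get (&node->ref_count);
        if (old == 0)
          return FALSE;    /* slow path: take the write lock */
      }
    while (!g_atomic_int_compare_and_exchange (&node->ref_count, old, old + 1));

    return TRUE;
  }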
oops, forget about adding refcounting to the GTypeInfo struct... it is const, and changing it would require a large rewrite of the apps using it... :( There must be another way.
Or change glib to use proper POSIX read/write locks. I might give that a try today.
also, see this: http://0pointer.de/blog/projects/mutrace
*** Bug 595394 has been marked as a duplicate of this bug. ***
I took Wim's work and rewrote GStaticRWLock to use a native rwlock. It's pending work, but brings a 10-15% boost on the performance test, on top of Alexander's patches. http://cgit.freedesktop.org/~bilboed/glib/log/
Using a native rwlock doesn't really help us in the end :(

I wrote a simple stress case for GStreamer which spawns threads that create/destroy GStreamer buffers (which are miniobjects, as lightweight as you can get). (I'm on a dual core system.)

  # ./gstbufferstress 1 500000
  main(): Creating 1 threads.
  total 0:00:00.283225000 - average 0:00:00.000000566 - Thread 0
  *** total 0:00:00.283791000 - average 0:00:00.000000567 - Done creating 500000 buffers

When using 2 threads... the results can vary drastically:

  # ./gstbufferstress 2 500000
  main(): Creating 2 threads.
  total 0:00:02.457827000 - average 0:00:00.000004915 - Thread 1
  total 0:00:02.464033000 - average 0:00:00.000004928 - Thread 0
  *** total 0:00:02.464540000 - average 0:00:00.000002464 - Done creating 1000000 buffers

  # ./gstbufferstress 2 500000
  main(): Creating 2 threads.
  total 0:00:00.779137000 - average 0:00:00.000001558 - Thread 0
  total 0:00:00.802021000 - average 0:00:00.000001604 - Thread 1
  *** total 0:00:00.823024000 - average 0:00:00.000000823 - Done creating 1000000 buffers

So... it ranges from being 50% slower than a single thread... to being 5 times slower (while the numbers for an individual thread go up 3-9 times).

FWIW, let's imagine the top scenario with 100 pipelines:

  $ ./gstbufferstress 100 500000
  main(): Creating 100 threads.
  total 0:01:27.863730000 - average 0:00:00.000175727 - Thread 37
  total 0:01:31.278407000 - average 0:00:00.000182556 - Thread 64
  ...
  total 0:02:09.661608000 - average 0:00:00.000259323 - Thread 13
  total 0:02:09.707792000 - average 0:00:00.000259415 - Thread 50
  *** total 0:02:09.756068000 - average 0:00:00.000002595 - Done creating 50000000 buffers

Ideally we want to reach (time_per_buffer_single_threaded / num_cores)... not go in the completely opposite direction and have that time multiplied by 5.

The culprit is the "one-lock-to-rule-them-all" design in gtype.c. I'm having a look at splitting this up into a more MT-friendly design (while obviously keeping it API/ABI compatible).
Created attachment 143571 [details]
gstbuffer stress test

The buffer stress test used above. Compile with `pkg-config --cflags --libs gstreamer-0.10`.
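For readers who can't grab the attachment, a minimal approximation of such a stress test might look like the following. This is my own sketch against GStreamer 0.10-era APIs, not the attached file, and its output format differs from the runs shown above:

  #include <stdlib.h>
  #include <gst/gst.h>

  static gpointer
  stress_thread (gpointer data)
  {
    gint num = GPOINTER_TO_INT (data);
    GTimer *timer = g_timer_new ();
    gint i;

    /* Every gst_buffer_new() goes through g_type_create_instance(), which
     * refs/unrefs the GstBuffer class and therefore hits type_rw_lock. */
    for (i = 0; i < num; i++)
      gst_buffer_unref (gst_buffer_new ());

    g_print ("thread done: %.6f s total, %.9f s per buffer\n",
             g_timer_elapsed (timer, NULL),
             g_timer_elapsed (timer, NULL) / num);
    g_timer_destroy (timer);
    return NULL;
  }

  int
  main (int argc, char **argv)
  {
    gint num_threads, num_buffers, i;
    GThread **threads;

    gst_init (&argc, &argv);
    if (argc < 3)
      return 1;
    num_threads = atoi (argv[1]);
    num_buffers = atoi (argv[2]);

    g_print ("main(): Creating %d threads.\n", num_threads);
    threads = g_new0 (GThread *, num_threads);
    for (i = 0; i < num_threads; i++)
      threads[i] = g_thread_create (stress_thread,
                                    GINT_TO_POINTER (num_buffers), TRUE, NULL);
    for (i = 0; i < num_threads; i++)
      g_thread_join (threads[i]);

    g_free (threads);
    return 0;
  }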
(In reply to comment #6)
> Using a native rwlock doesn't really help us in the end :(
>

It will, but the first problem to solve is not using native rw locks; it's changing glib to not take the writer lock all the time.
(In reply to comment #8)
> (In reply to comment #6)
> > Using a native rwlock doesn't really help us in the end :(
> >
>
> It will but the first problem to solve is not using native rw locks but
> changing glib to not take the writer lock all the time.

Ergo the "really" in my sentence :)
Created attachment 143818 [details] [review]
gobject/gtype.c: Lockless type reffing/unreffing. Fixes #585375

This is basically done by the following:
* Move ref_count from TypeData to TypeNode
* Only access ref_count unlocked with g_atomic_int_*()

Passes all checks.
It does need some cleanups though (about to go out for Merce Festival, but thought people would like to try it out). Testing, feedback, comments more than welcome.
The patch is a bit big; it'd be nice if the stuff that is marked as /* (UN)LOCKED */ were a different patch. I also don't like that you make the locked and the unlocked code call the same function.

Last but not least, the patch is buggy:
- You need to do the first ref/last unref while holding the lock. Otherwise there's a race between unlocking the mutex and changing the refcount.
- You need to unref first thing after locking, otherwise other code can happily take a reference while you're destructing the class data.

All of this is a bit ugly because you can call g_type_class_ref() on a type without a reference and the code needs to account for that.

Do you have a test where you just do

  do {
    g_type_class_unref (g_type_class_ref (my_type));
  } while (TRUE);

in lots of threads? Anyway, I guess I'm gonna look at this tomorrow :)
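A quick self-contained sketch of that kind of torture test (my own illustration, not an attached test): the loop is bounded so it terminates, and G_TYPE_OBJECT is just a placeholder; to really exercise class destruction you would use a dynamic type whose class can be torn down.

  #include <glib-object.h>

  #define N_THREADS    8
  #define N_ITERATIONS 1000000

  static gpointer
  hammer_thread (gpointer data)
  {
    GType type = (GType) GPOINTER_TO_SIZE (data);
    gint i;

    /* Each iteration refs and immediately unrefs the class. */
    for (i = 0; i < N_ITERATIONS; i++)
      g_type_class_unref (g_type_class_ref (type));

    return NULL;
  }

  int
  main (void)
  {
    GThread *threads[N_THREADS];
    GType my_type = G_TYPE_OBJECT;   /* placeholder: use your own type here */
    gint i;

    g_thread_init (NULL);
    g_type_init ();

    for (i = 0; i < N_THREADS; i++)
      threads[i] = g_thread_create (hammer_thread,
                                    GSIZE_TO_POINTER (my_type), TRUE, NULL);
    for (i = 0; i < N_THREADS; i++)
      g_thread_join (threads[i]);

    return 0;
  }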
will fix that tomorrow. Just thought people could have a shot at it.
I'm cleaning/fixing the patch according to comment #12
Created attachment 143873 [details] [review]
gobject/gtype.c: Lockless type reffing/unreffing. Fixes #585375

This is basically done by the following:
* Move ref_count from TypeData to TypeNode
* Only access ref_count unlocked with g_atomic_int_*()
* ref_count corresponds to the existence of TypeData, therefore only the creation or destruction of TypeData is protected by the WRITE_LOCK.

Modified according to comments by Benjamin Otte.

Passes all checks.
Created attachment 143883 [details] [review] gobject/tests: Add a test that refs/unrefs a GType in several threads.
Created attachment 143890 [details] [review]
gobject/tests: New test for dynamic class creation/destruction

Starts 3 threads which ref/unref a dynamic class.
Created attachment 143891 [details] [review] Move ref_count from TypeNode->data to TypeNode
Created attachment 143892 [details] [review]
Pass the TypeNode to type_data_last_unref_Wm()

Previously the GType was looked up just for calling the function. Also moves the unref functions together in the code.
Created attachment 143893 [details] [review]
Move setting the refcount to the end of the function

This is done for safety when making it atomic later.
Created attachment 143894 [details] [review]
Add a NODE_REFCOUNT getter

This is useful when moving the code to be atomic. It will also make that patch smaller.
Created attachment 143895 [details] [review]
type_data_unref_Wm => type_data_unref_U

Make the type unref function not hold any locks when called. This makes it easier to optimize it to be atomic later.
Created attachment 143896 [details] [review]
Make all accesses of Node->ref_count atomic

This does not change any locking behavior at all, it just replaces simple getters/setters of the variable with atomic versions. The ref_count variable was kept as unsigned, even though that requires casting for all operations, to mirror GObject->refcount.
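A guess at what such a getter might look like (illustrative only; the TypeNode here is a stand-in and the actual patch may differ):

  #include <glib.h>

  /* Stand-in node: the refcount stays unsigned, mirroring GObject->ref_count. */
  typedef struct {
    guint ref_count;
  } TypeNode;

  /* The gint-based atomic helpers then need a cast at every call site,
   * which a single getter macro keeps in one place. */
  #define NODE_REFCOUNT(node) \
    ((guint) g_atomic_int_get ((gint *) &(node)->ref_count))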
Created attachment 143897 [details] [review] Make type_data_unref_U not take locks in the common case
Created attachment 143898 [details] [review]
Make ClassData->init_State atomic

This is necessary to make g_type_class_ref() lockless.
Created attachment 143899 [details] [review]
Reorganize g_type_class_ref()

Moves the first check out of the lock, as it's not required.
Created attachment 143900 [details] [review]
Add type_data_ref_U() and use it in g_type_class_ref()

The function returns TRUE if the type was previously initialized and can be easily reused. It returns FALSE and does not take a reference if the type is not referenced yet. g_type_class_ref() uses this to avoid taking locks in the common path, which speeds up object creation a lot - in particular in multithreaded applications.
Here's a huge patchset I came up with when splitting the above patch from Edward to understand it better. The patches are on top of the gobject-performance branch (where I intend to push them once Edward says it looks ok to him). I hope they work the intended way, I didn't do extensive testing. They certainly pass the attached testcase.
Antoine, any chance that you could try the optimisations that have been done in the gobject-performance branch in glib git?
I should have time next week to test it :)
The gobject-performance branch is now rebased onto the gobject-performance-2 branch based on glib 2.18, with some stuff merged.
Alex, you got me scared there :) It's rebased against 2.22.1, not 2.18. I'll clean up my personal repos soon; I've got some additions to the performance and performance-threaded tests, as well as some other tests (that unfortunately show bugs that were present before).
As gobject-performance-2 has been merged, can this bug be closed?