GNOME Bugzilla – Bug 594525
Implement lock-free interface lookup
Last modified: 2009-12-01 09:21:45 UTC
Currently all interface lookups, i.e. g_type_is_a (object, SOME_INTERFACE_TYPE) or interface vfunc calls, take a global lock. This is pretty poor from a threaded scalability perspective, as two threads doing otherwise unrelated work that happens to use interfaces suffer unnecessary lock contention. GObject allows adding interfaces to classes after the classes have been constructed, which makes this a bit complex, but it's doable.
Created attachment 142711 [details] [review] Implement lock free interface lookup

We implement lock-free interface lookup by moving the n_ifaces counter into the memory pointed to by TypeNode->iface_entries, and then updating it RCU-style: always copy the array, modify the copy, and once the modification is done replace the old pointer with g_atomic_pointer_set.

There is one additional complexity when freeing the old memory, since the old memory region can still be in use. To handle this we keep an atomic refcount on the iface_entries that readers take while accessing the memory and release afterwards. When freeing, if there are no outstanding refs we free immediately; otherwise we put the memory on a free list that we sweep semi-regularly for blocks that have become freeable.

With this infrastructure the patch then implements a lock-free version of type_lookup_iface_entry_L called type_lookup_iface_entry_I and uses it in g_type_interface_peek, g_type_interface_peek_parent and type_node_check_conformities_UorL.

Using the performance tests from bug 557100 shows that the general performance difference is negligible, but removing the lock from each type check and interface vfunc call should greatly improve threaded scalability.
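For illustration, here is a minimal sketch of the copy-update-swap scheme described above. IFaceEntries, IFaceEntry and the helper names are simplified stand-ins, not the actual symbols from the patch; retire_iface_entries stands for the ref-checked free / free-list logic.

#include <glib.h>
#include <string.h>

/* Simplified stand-ins for the real GType data structures. */
typedef struct {
  GType    iface_type;
  gpointer vtable;
} IFaceEntry;

typedef struct {
  gint       ref_count;  /* readers hold a ref while scanning the array */
  guint      n_entries;  /* the counter now lives inside the block      */
  IFaceEntry entries[1]; /* variable-length array of entries            */
} IFaceEntries;

/* Hypothetical helper: free now if unreferenced, else queue on the
 * free list that is swept later. */
static void retire_iface_entries (IFaceEntries *old);

static void
add_iface_entry (IFaceEntries **location, const IFaceEntry *new_entry)
{
  IFaceEntries *old = g_atomic_pointer_get (location);
  guint n = old ? old->n_entries : 0;
  IFaceEntries *copy;

  /* Copy the current array and apply the modification to the copy
   * (entries[1] already provides room for one entry). */
  copy = g_malloc (sizeof (IFaceEntries) + n * sizeof (IFaceEntry));
  copy->ref_count = 0;
  copy->n_entries = n + 1;
  if (old)
    memcpy (copy->entries, old->entries, n * sizeof (IFaceEntry));
  copy->entries[n] = *new_entry;

  /* Publish the new array with a single atomic pointer update;
   * readers see either the old or the new block, never a mix. */
  g_atomic_pointer_set (location, copy);

  if (old)
    retire_iface_entries (old);
}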
atomic_ref_aquire_I is racy; we're working on a replacement.
Created attachment 142775 [details] [review] Implement lock free interface lookup

We implement lock-free interface lookup by moving the n_ifaces counter into the memory pointed to by TypeNode->iface_entries, and then updating it RCU-style: always copy the array, modify the copy, and once the modification is done replace the old pointer with g_atomic_pointer_set.

There is one additional complexity when freeing the old memory, since the old memory region can still be in use. To handle this we never free such memory; instead we put it on a free list and reuse it later. This means that lock-free lookups must be able to handle the memory being modified in arbitrary ways without crashing, and at the end we verify that the memory didn't change, i.e. that the transaction is ok.

With this infrastructure the patch then implements a lock-free version of type_lookup_iface_entry_L called type_lookup_iface_vtable_I and uses it in g_type_interface_peek, g_type_interface_peek_parent and type_node_check_conformities_UorL.

Using the performance tests from bug 557100 shows that the general performance difference is negligible, but removing the lock from each type check and interface vfunc call should greatly improve threaded scalability.
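A minimal sketch of what such a validating lookup can look like, reusing the IFaceEntries layout from the sketch above. It assumes retired blocks are only ever reused for other arrays, never unmapped, so a stale read cannot fault; the names are illustrative, and the real code additionally has to bound the scan against a block being recycled at a different size.

static gpointer
lookup_iface_vtable (IFaceEntries **location, GType iface_type)
{
  IFaceEntries *entries;
  gpointer vtable;
  guint i;

retry:
  entries = g_atomic_pointer_get (location);
  if (entries == NULL)
    return NULL;

  /* Scan the array. A writer may swap in a new block and recycle this
   * one while we read, so everything we see here is tentative. */
  vtable = NULL;
  for (i = 0; i < entries->n_entries; i++)
    if (entries->entries[i].iface_type == iface_type)
      {
        vtable = entries->entries[i].vtable;
        break;
      }

  /* Validate the transaction: if the pointer changed, our reads may
   * have raced with a reuse of the block, so discard them and retry. */
  if (g_atomic_pointer_get (location) != entries)
    goto retry;

  return vtable;
}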
Created attachment 142791 [details] [review] tests

This adds some tests. Each of them uses a liststore type (a direct subclass of GObject implementing 5 interfaces) and per run does 1000 iterations of:

"liststore-is-a" - run g_type_is_a (liststore_type, interface_type) for each of the implemented interfaces and one non-implemented interface
"liststore-interface-peek" - run g_type_interface_peek (liststore_class, interface_type) for each of the implemented interfaces
"liststore-interface-peek-same" - run g_type_interface_peek (liststore_class, interface_type) for the same implemented interface 5 times

The results are ordered with the numbers before the patch on the top line and the numbers after applying the current patch on the bottom.

Running the tests single-threaded:

Running test "liststore-is-a"
 2175 runs, min/avg/max = 0.818/0.916/4.115 ms
 2542 runs, min/avg/max = 0.718/0.783/1.538 ms
Running test "liststore-interface-peek"
 4169 runs, min/avg/max = 0.474/0.476/1.286 ms
 4702 runs, min/avg/max = 0.396/0.422/7.459 ms
Running test "liststore-interface-peek-same"
 5019 runs, min/avg/max = 0.372/0.395/0.487 ms
 5790 runs, min/avg/max = 0.337/0.342/1.742 ms

For two threads running concurrently on a dual-core machine:

Running test "liststore-is-a"
 443 runs, min/avg/max = 1.941/9.037/39.310 ms
 5067 runs, min/avg/max = 0.719/0.785/18.201 ms
Running test "liststore-interface-peek"
 545 runs, min/avg/max = 1.495/7.337/15.112 ms
 8987 runs, min/avg/max = 0.417/0.440/14.300 ms
Running test "liststore-interface-peek-same"
 583 runs, min/avg/max = 1.236/6.859/20.641 ms
 9924 runs, min/avg/max = 0.381/0.398/16.410 ms

The last numbers look scary, but only until you consider that this testcase produces very heavy contention on the type lock. In a normal program this should be less of an issue.
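For reference, the structure of one test run is roughly the following; this is an illustrative reconstruction, not the actual code from the attachment.

#include <glib-object.h>

/* Illustrative shape of one "liststore-is-a" run: 1000 iterations,
 * each checking every interface in the list (the 5 implemented ones
 * plus one that is not implemented). */
static void
run_liststore_is_a (GType        liststore_type,
                    const GType *iface_types,
                    guint        n_iface_types)
{
  guint i, j;

  for (i = 0; i < 1000; i++)
    for (j = 0; j < n_iface_types; j++)
      g_type_is_a (liststore_type, iface_types[j]);
}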
I just noticed that the improvement is nontrivial even on a single thread. The previous numbers were taken without calling g_thread_init(), but if you do call it (i.e. run the attached testcase with "-t 1") you get these results:

Running test "liststore-is-a"
 979 runs, min/avg/max = 1.913/2.040/3.817 ms
 2500 runs, min/avg/max = 0.657/0.795/6.723 ms
Running test "liststore-interface-peek"
 1293 runs, min/avg/max = 1.475/1.542/4.159 ms
 4460 runs, min/avg/max = 0.431/0.444/3.722 ms
Running test "liststore-interface-peek-same"
 1557 runs, min/avg/max = 1.265/1.280/3.870 ms
 4664 runs, min/avg/max = 0.415/0.424/4.485 ms

So it seems we gain roughly a 3x performance improvement from avoiding the lock.
While the min/avg are always lower after the patch, I'm wondering why the max time is 6x higher for the single-threaded liststore-interface-peek case, and 3.5x higher for liststore-interface-peek-same?
(In reply to comment #6)
> While the min/avg are always lower after the patch, I'm wondering why the max
> time is 6x higher for the single-threaded liststore-interface-peek case,
> and 3.5x higher for liststore-interface-peek-same?

Likely because a context switch to some other process happens. If we assume that lots of processes use g_timeout_add_seconds(), then once per second there are a lot of processes the kernel prefers to schedule. Once those are done after ~4 ms, control goes back to the otherwise idle performance test. What I didn't yet add (but should) is dropping the bottom 10% of results, as those are usually way off due to external factors.
All these patches are now in the gobject-performance branch in git
I added some instrumentation to GAtomicArray to see how much memory ends up "leaked" on the free list. This is what I get in nautilus:

GAtomicArray leaks 19 items, total size 1301
 size 1: 2 (2 bytes)
 size 3: 1 (3 bytes)
 size 4: 2 (8 bytes)
 size 32: 1 (32 bytes)
 size 56: 9 (504 bytes)
 size 152: 1 (152 bytes)
 size 176: 1 (176 bytes)
 size 200: 1 (200 bytes)
 size 224: 1 (224 bytes)

For e.g. gedit I get similar results, and I believe most apps would too (at least the same order of magnitude).

I had various ideas for improving reuse (generation counters in the allocated block, using the low 2 bits of the pointer, "rounding up" allocations to larger chunks, etc.). However, given the small cost here I don't think any of that makes much sense. One thing that would be nice to do, however, is to make the free list store the next pointer in the memory chunk itself instead of using a separate list node, as sketched below.
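A sketch of that idea, assuming pushes and pops happen under the existing writer-side lock and that every chunk is at least pointer-sized; the names are made up, not from the branch.

/* Intrusive free list: the first pointer-sized word of a retired chunk
 * doubles as the "next" link, so no separate list node is allocated.
 * Chunks on the free list are dead, so their contents may be reused. */
static gpointer free_list_head = NULL;

static void
free_list_push (gpointer chunk)
{
  *(gpointer *) chunk = free_list_head;
  free_list_head = chunk;
}

static gpointer
free_list_pop (void)
{
  gpointer chunk = free_list_head;

  if (chunk != NULL)
    free_list_head = *(gpointer *) chunk;
  return chunk;
}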
The gobject-performance branch has now been rebased into the gobject-performance-2 branch, which is based on glib 2.18, with some of the work merged.
now in master