Bug 594525 - Implement lock-free interface lookup
Implement lock-free interface lookup
Status: RESOLVED FIXED
Product: glib
Classification: Platform
Component: gobject
Version: unspecified
Hardware: Other
OS: Linux
Importance: Normal enhancement
Target Milestone: ---
Assigned To: gtkdev
QA Contact: gtkdev
Depends on:
Blocks: 594650
 
 
Reported: 2009-09-08 16:44 UTC by Alexander Larsson
Modified: 2009-12-01 09:21 UTC
See Also:
GNOME target: ---
GNOME version: ---


Attachments
Implement lock free interface lookup (21.14 KB, patch)
2009-09-08 16:46 UTC, Alexander Larsson
Implement lock free interface lookup (25.27 KB, patch)
2009-09-09 11:24 UTC, Alexander Larsson
tests (10.14 KB, patch)
2009-09-09 14:22 UTC, Benjamin Otte (Company)

Description Alexander Larsson 2009-09-08 16:44:34 UTC
Currently all interface lookups, i.e. g_type_is_a (object, SOME_INTERFACE_TYPE) or interface vfunc calls, take a global lock. This is pretty poor from a threaded scalability perspective, as two threads that do otherwise unrelated things that use interfaces get unnecessary lock contention.

GObject allows adding interfaces to classes after the classes have been constructed, which makes this a bit complex, but it is doable.
Comment 1 Alexander Larsson 2009-09-08 16:46:25 UTC
Created attachment 142711 [details] [review]
Implement lock free interface lookup

We implement lock-free interface lookup by moving the n_ifaces
counter into the memory pointed to by TypeNode->iface_entries, and
then updating this in RCU style: always copy it, modify the copy,
and then, when the modification is done, replace the old
pointer with g_atomic_pointer_set().

There is one additional complexity when freeing the old memory,
since the old memory region can be in use. To handle this we
have an atomic refcounter on the iface_entries that we ref when
accessing the memory and unref after. When freeing if there are
no outstanding refs we free immediately, otherwise we put the
memory on a free list that we semi-regularly look at for possible
freeing.

With this infrastructure the patch then implements a lock-free
version of type_lookup_iface_entry_L called type_lookup_iface_entry_I
and uses it in g_type_interface_peek, g_type_interface_peek_parent
and type_node_check_conformities_UorL.

Using the performance tests from bug 557100 shows that the general
performance difference is negligible, but the lack of a lock for each
type check and interface vfunc call should greatly enhance threaded
scalability.
Comment 2 Alexander Larsson 2009-09-09 08:11:21 UTC
atomic_ref_aquire_I is racy, we're working on a replacement.
Comment 3 Alexander Larsson 2009-09-09 11:24:00 UTC
Created attachment 142775 [details] [review]
Implement lock free interface lookup

We implement lock-free interface lookup by moving the n_ifaces
counter into the memory pointed to by TypeNode->iface_entries, and
then updating this in RCU style: always copy it, modify the copy,
and then, when the modification is done, replace the old
pointer with g_atomic_pointer_set().

There is one additional complexity when freeing the old memory,
since the old memory region can be in use. To handle this we
don't free such memory, but put it on a free list and reuse it
later. This means that lock-free lookups must be able to
handle the memory being modified in random ways without crashing,
and at the end we verify that the memory didn't change and the
transaction is ok.

With this infrastructure the patch then implements a lock-free
version of type_lookup_iface_entry_L called type_lookup_iface_vtable_I
and uses it in g_type_interface_peek, g_type_interface_peek_parent
and type_node_check_conformities_UorL.

Using the performance tests from bug 557100 shows that the general
performance difference is negligible, but the lack of a lock for each
type check and interface vfunc call should greatly enhance threaded
scalability.
Comment 4 Benjamin Otte (Company) 2009-09-09 14:22:13 UTC
Created attachment 142791 [details] [review]
tests

This adds some tests - each of them uses a liststore type (a direct subclass of GObject implementing 5 interfaces) and then per run does the following 1000 times:
"liststore-is-a" - runs g_type_is_a (liststore_type, interface_type) for each of the implemented interfaces and one non-implemented interface
"liststore-interface-peek" - runs g_type_interface_peek (liststore_class, interface_type) for each of the implemented interfaces
"liststore-interface-peek-same" - runs g_type_interface_peek (liststore_class, interface_type) for the same implemented interface 5 times

The results are ordered with the numbers before the patch on the top line and the numbers after applying the current patch on the bottom line.

Running the test single-threaded:

Running test "liststore-is-a"
  2175 runs, min/avg/max = 0.818/0.916/4.115 ms
  2542 runs, min/avg/max = 0.718/0.783/1.538 ms
Running test "liststore-interface-peek"
  4169 runs, min/avg/max = 0.474/0.476/1.286 ms
  4702 runs, min/avg/max = 0.396/0.422/7.459 ms
Running test "liststore-interface-peek-same"
  5019 runs, min/avg/max = 0.372/0.395/0.487 ms
  5790 runs, min/avg/max = 0.337/0.342/1.742 ms

For two threads running concurrently on a dual core machine:

Running test "liststore-is-a"
  443 runs, min/avg/max = 1.941/9.037/39.310 ms
  5067 runs, min/avg/max = 0.719/0.785/18.201 ms
Running test "liststore-interface-peek"
  545 runs, min/avg/max = 1.495/7.337/15.112 ms
  8987 runs, min/avg/max = 0.417/0.440/14.300 ms
Running test "liststore-interface-peek-same"
  583 runs, min/avg/max = 1.236/6.859/20.641 ms
  9924 runs, min/avg/max = 0.381/0.398/16.410 ms

The last numbers look scary, but only until you consider that this testcase produces very heavy contention on the type lock. In a normal program this should be much less of an issue.
Comment 5 Benjamin Otte (Company) 2009-09-09 14:25:59 UTC
I just noticed that the change is nontrivial even on a single thread.
The previous numbers were taken without calling g_thread_init(), but if you call it (i.e. run the attached testcase with "-t 1") you get these results:

Running test "liststore-is-a"
  979 runs, min/avg/max = 1.913/2.040/3.817 ms
  2500 runs, min/avg/max = 0.657/0.795/6.723 ms
Running test "liststore-interface-peek"
  1293 runs, min/avg/max = 1.475/1.542/4.159 ms
  4460 runs, min/avg/max = 0.431/0.444/3.722 ms
Running test "liststore-interface-peek-same"
  1557 runs, min/avg/max = 1.265/1.280/3.870 ms
  4664 runs, min/avg/max = 0.415/0.424/4.485 ms

So it seems we gain a roughly 3x performance improvement from avoiding the lock.
Comment 6 Steve Frécinaux 2009-09-09 14:39:43 UTC
While the min/avg are always lower after the patch, I'm wondering why the max time is 6x higher for the single-threaded liststore-interface-peek case, and 3.5x higher for liststore-interface-peek-same?
Comment 7 Benjamin Otte (Company) 2009-09-09 15:26:11 UTC
(In reply to comment #6)
> While the min/avg are always lower after the patch, I'm wondering why the max
> time has been 6x times higher for the single-threaded liststore-interface-peek
> case, and 3.5x times for liststore-interface-peek-same ?
>
Likely because a context switch to some other process happened. Assuming that lots of processes use g_timeout_add_seconds(), once per second there will be a burst of processes that the kernel prefers scheduling. Once those are done after ~4 ms, control goes back to the otherwise idle performance test.

What I didn't yet add (but should), is dropping the bottom 10% of results, as those are usually way off due to external factors.
Comment 8 Alexander Larsson 2009-09-09 19:12:41 UTC
All these patches are now in the gobject-performance branch in git
Comment 9 Alexander Larsson 2009-09-16 12:39:27 UTC
I added some instrumentation to GAtomicArray to see how much memory ends up "leaked" on the free list. This is what I get in nautilus:

GAtomicArray leaks 19 items, total size 1301
size 1: 2 (2 bytes)
size 3: 1 (3 bytes)
size 4: 2 (8 bytes)
size 32: 1 (32 bytes)
size 56: 9 (504 bytes)
size 152: 1 (152 bytes)
size 176: 1 (176 bytes)
size 200: 1 (200 bytes)
size 224: 1 (224 bytes)

For gedit, for example, I get similar results, and I believe most apps would too (at least to the same order of magnitude).

I had various ideas about how reuse could be improved (generation counters in the allocated block, using the low 2 bits of the pointer, "rounding up" allocations to larger chunks, etc.). However, given the small cost here I don't think any of this makes much sense.

One thing that would be nice to do, however, is to make the free list store the next pointer in the memory chunk itself instead of using a separate list node.
Comment 10 Alexander Larsson 2009-10-02 19:14:18 UTC
The gobject-performance branch is now rebased onto the gobject-performance-2 branch based on glib 2.18, with some stuff merged.
Comment 11 Alexander Larsson 2009-12-01 09:21:45 UTC
now in master