GNOME Bugzilla – Bug 650458
reduce overhead in g_object_set/get_data
Last modified: 2011-06-04 01:14:37 UTC
Created attachment 188015 [details] [review] patch

It irks me that we use g_datalist for this, which (for reasons that are not entirely clear to me) uses a global lock. That lock is unnecessary here, considering that it is not allowed to use the same object from multiple threads anyway without your own locking. Also, g_datalist is a linked list, which means that lookups are slower than they should be (and e.g. GtkWidget has some 20 keys that it uses with g_object_set_data).

The attached patch is a proof of concept that replaces GData by a GArray. A quick test that just sets/gets 14 key/value pairs on an object in multiple threads (each thread has its own object, of course) shows the following numbers on a dual-core machine:

Current code, using a linked list with a lock (set_qdata/get_qdata pairs per second):
1 thread:  6365485
2 threads: 2819527
4 threads: 2730305
8 threads: 2703831

New code, using an array without a lock:
1 thread:  15117908
2 threads: 28068796
4 threads: 28101023
8 threads: 28517746
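For reference, a minimal sketch of that kind of benchmark (the iteration count, the missing timing harness and the quark setup are assumptions; only the general shape matches the test described above):

#include <glib-object.h>

#define N_KEYS   14
#define N_ROUNDS 1000000

static GQuark keys[N_KEYS];  /* assumed to be filled via g_quark_from_static_string() in main() */

/* Each thread hammers set_qdata/get_qdata on its own object, so there is no
 * application-level sharing; any contention comes from GLib's own locking. */
static gpointer
bench_thread (gpointer unused)
{
  GObject *obj = g_object_new (G_TYPE_OBJECT, NULL);
  gint i, k;

  for (i = 0; i < N_ROUNDS; i++)
    for (k = 0; k < N_KEYS; k++)
      {
        g_object_set_qdata (obj, keys[k], GINT_TO_POINTER (i + 1));
        g_object_get_qdata (obj, keys[k]);
      }

  g_object_unref (obj);
  return NULL;
}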
I would guess that you would break quite a few things by removing the lock on the datalist, because we use the datalist for a bunch of things - like weak refs or toggle refs - that need to be thread-safe (for GStreamer and GCs). And you cannot use a custom lock for those, because you'd need to share this lock with all the other code that might potentially access the datalist...
Sounds like an argument against sharing the data structure used for things that do (toggle, weak refs) and don't (object data) need locks...
Indeed. But then, where do we put this stuff? The datalist has historically been (ab)used for everything, as it's the only place where we can find spare space on GObject. We could make the datalist pointer a private pointer and have a GObjectPrivate of course, but that wouldn't make things less tricky.
As I said in https://bugzilla.gnome.org/show_bug.cgi?id=608695#c18 I do believe we have one spare bit in the data flags. This could be used for a bit-lock, allowing a non-global lock for the qdata. Additionally, on 64-bit arches we have a 32-bit hole in the GObject struct that could be reused to hold the flags, avoiding all the masking out of the flags in the qdata pointer.
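Roughly the idea, in sketch form (the mask values and bit numbers here are illustrative assumptions, not the actual GLib source):

/* datalist flags live in the low bits of the GData pointer; one spare low
 * bit could serve as a per-datalist lock bit (values assumed) */
#define DATALIST_FLAGS_MASK  ((gsize) 0x3)   /* existing flag bits */
#define DATALIST_LOCK_MASK   ((gsize) 0x4)   /* the spare bit, used as a lock */

static inline GData *
datalist_clean_pointer (GData **datalist)
{
  /* mask out flag and lock bits to recover the real allocation address */
  gsize ptr = (gsize) g_atomic_pointer_get (datalist);

  return (GData *) (ptr & ~(DATALIST_FLAGS_MASK | DATALIST_LOCK_MASK));
}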
I have an initial patch based on this, but it's got some bug that I'm tracking down; I will put it up soon.
Created attachment 188168 [details] [review] Make g_datalist not use a global lock and perform better This implementation uses a per-list bitlock for user data, and a simple array rather than a linked list, which uses less memory and fewer allocations. It also gets better cache behaviour, since related things are stored close to each other.
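For illustration, the layout is roughly along these lines (the names and fields are a sketch, not necessarily the exact structs in the patch):

typedef struct
{
  GQuark         key;      /* the qdata key */
  gpointer       data;     /* user data */
  GDestroyNotify destroy;  /* optional destroy notify */
} GDataElt;

struct _GData
{
  guint32  len;      /* number of elements in use */
  guint32  alloc;    /* number of elements allocated */
  GDataElt data[1];  /* flexible array: lookups are a linear scan over
                      * contiguous memory instead of chasing list nodes */
};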
Created attachment 188169 [details] [review] Make quark to string lockless We do this by assigning to g_quarks atomically and leaking the old array when replacing it atomically. Then it's safe to consume the array on the reader side (atomically). Also, since we're leaking quarks on growing, bump the block size so that we're not leaking as much. gtk3-demo allocates > 1500 quarks, and gnome apps > 3000. I'm setting the block to 2048, which means no leaks for small gtk3 apps and just one leak for gnome apps.
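The reader side then looks roughly like this (a sketch; the variable names are assumptions). The writer installs the grown array before bumping the counter, so whatever snapshot a reader sees covers every id the counter says is valid:

const gchar *
quark_to_string_lockless (GQuark quark)
{
  /* read the counter first, then the array pointer */
  guint   seq_id  = (guint) g_atomic_int_get (&quark_seq_id);
  gchar **strings = g_atomic_pointer_get (&quarks);

  if (quark < seq_id)
    return strings[quark];  /* strings[0] is reserved and stays NULL */

  return NULL;
}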
Created attachment 188170 [details] [review] Make g_datalist_get_data not look up the quark Instead of converting the string to a quark and comparing quarks, we use the new lockless g_quark_to_string and just compare the quarks in the datalist with the given string. This means we avoid the global lock for string-to-quark conversion. Additionally, most of the time the data list will be quite short, so the cost of doing the string comparisons is likely similar to that of the quark hashtable lookup (which does at least one string comparison for a successful lookup).
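In sketch form (reusing the illustrative GDataElt/GData layout from the earlier sketch, not the literal patch):

static gpointer
datalist_get_data (GData *d, const gchar *key)
{
  guint i;

  /* walk the short array and compare strings via the lockless
   * quark-to-string, so no string->quark hash lookup is needed */
  for (i = 0; i < d->len; i++)
    {
      GDataElt *elt = &d->data[i];

      if (strcmp (g_quark_to_string (elt->key), key) == 0)
        return elt->data;
    }

  return NULL;
}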
Created attachment 188171 [details] [review] Make g_object_get_data use the new faster g_datalist_get_data
Here are a bunch of performance fixes. Lightly tested; they pass make check and trivial gnome app use. Needs careful review. Particularly sensitive is the use of an extra flag in GDataList for the bitlock, as this requires that the GData allocations are aligned to 8 bytes, which I think is safe to assume since the struct is > 8 bytes, but I'm not 100% sure (there is an assert to catch it). I have not really done any performance testing on this yet. Do you have the test app you used, Matthias?
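The alignment assumption could be guarded roughly like this (illustrative; 'n_alloc' is a made-up variable):

/* the low three pointer bits get reused for flags and the lock bit,
 * so every GData allocation must be 8-byte aligned */
d = g_malloc (G_STRUCT_OFFSET (GData, data) + n_alloc * sizeof (GDataElt));
g_assert (((gsize) d & 0x7) == 0);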
Review of attachment 188169 [details] [review]: ::: glib/gdataset.c @@ +1210,1 @@ + quark_seq_id = g_atomic_int_get (&g_quark_seq_id); Shouldn't both quark_seq_id and g_quark_seq_id be gint, then? I guess it doesn't really make a difference.
Created attachment 188195 [details] not exactly hitech testing...
That looks like the gparam test case...
Review of attachment 188169 [details] [review]: ::: glib/gdataset.c @@ +1210,1 @@ + quark_seq_id = g_atomic_int_get (&g_quark_seq_id); GQuark is a guint32, which is unlikely to be incompatible with int, but since these are all internal variables we might as well make them all ints to avoid some ugly casts.
Created attachment 188222 [details] try again
Here are some numbers from my Core2 Duo system. All numbers are get_data/set_data calls per second.

First with the current code:

keys \ threads        1        2        4        8
 1              6363509  3040576  2878283  2876310
 5              5267876  2655079  2622530  2536603
10              5062729  2639883  2560441  2497500
15              4922879  2587071  2489951  2382917
20              4785701  2363509  2212731  2224256

With Alex' patches:

keys \ threads        1        2        4        8
 1              3587006  6506900  6528459  6564964
 5              3163814  5692822  5766011  5752818
10              2829766  5147011  5211217  5217748
15              2673185  4889400  4916322  4869587
20              2520503  4594003  4664669  4486572

You can clearly see how the loss of the global lock makes things scale here, but the drop in the single-threaded case is troublesome. So, I looked at our current atomic operations and tried to use the GCC builtins more properly (see patches in bug 617491). With that, I get:

keys \ threads        1        2        4        8
 1              8226796 14668671 15040647 15076873
 5              6811859 12232341 12444288 12389128
10              6557484 12023884 11782712 11932618
15              6298347 11444877 11428105 11433832
20              6040943 10955581 11038980 10976504
What performance do you get with proper atomics, but without my fixes?
With proper atomics, and the global lock:

keys \ threads        1        2        4        8
 1              9325922  4200101  3849140  4120139
 5              7190313  3484227  3401579  3362393
10              7076114  3280597  3441523  3316965
15              6571332  3166933  3303017  3110672
20              6300911  3333817  3101632  3034678
So, the uncontended slowdown is not that bad, but there still is one. The core difference is in g_data_set_internal, and the main difference there is that the old one does a mutex lock + a linear search over a linked list, whereas the new code does a bitlock and a search over an array. My *guess* would be that an array search is faster than a list search (and the greater scalability over N_KEYS makes this likely), so the difference would be in the locking methods in the uncontended case? Although it could also just be the case that g_data_set_internal is a bit too general and could use some micro-optimizations. Even then, it's still faster than the old broken-atomics version, and it's much more scalable. So, I think it's worth it (although it's still useful to see if we can avoid the slowdown entirely).
One obvious problem with the g_bit_lock API is that it takes the bit number as an argument, so even if the value is constant at build time it will do the shift each time it runs. This could be fixed with a simple macro that expands the shift on the caller side. Not sure how much this would help, though.
Here is what I get with bitlocks + a patch to use sync_fetch_and_and/or in the bitlock implementation + simple macro wrappers a la:

#define g_bit_lock(address,lock_bit) \
  __extension__ ({ \
    gint v; \
    v = g_atomic_int_exchange_and_or (address, 1u << lock_bit); \
    if (v & (1u << lock_bit)) \
      g_bit_lock_impl (address, v, 1u << lock_bit); \
  })

#define g_bit_trylock(address,lock_bit) \
  __extension__ ({ \
    gint v; \
    v = g_atomic_int_exchange_and_or (address, 1u << lock_bit); \
    (v & (1u << lock_bit)) == 0; \
  })

#define g_bit_unlock(address,lock_bit) \
  g_bit_unlock_impl (address, 1u << lock_bit);

keys \ threads        1        2        4        8
 1              9034545 15632174 16332697 16830987
 5              8031757 14643620 14775454 14779180
10              7356896 13445988 13476118 13519465
15              7090342 12873340 12861367 12958982
20              6746167 12289838 12317209 12333660

Note that this beats the global lock in all cases but the 1 thread / 1 key one. Given that widgets have on the order of 20 keys in their gdatalists, this should be a win almost everywhere.
Created attachment 188362 [details] [review] Some microoptimizations to the g_datalist_id_get/set_data
These microoptimizations give me a few hundred thousand ops per second more.
I'm getting second thoughts about the bitlock approach used in my patch. It passes the pointer to the right uint32 part of the pointer for 64-bit pointers, but I'm not sure atomicity works right if the atomic data is accessed as 64-bit in some places and as 32-bit in others. To be 100% sure we should perhaps implement a bitlock on pointer-sized ints.
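A pointer-sized bit-lock would look roughly like this (a sketch using the GCC __sync builtins; the names and the spinning fallback are assumptions, not the actual patch):

static void
pointer_bit_lock (volatile void *address, guint lock_bit)
{
  volatile gsize *ptr = address;
  gsize mask = (gsize) 1 << lock_bit;

  /* set and test the bit on the full pointer-sized word, so the location is
   * never accessed with mixed 32-bit/64-bit atomic widths */
  while (__sync_fetch_and_or (ptr, mask) & mask)
    g_thread_yield ();  /* simple spin; a real implementation would block */
}

static void
pointer_bit_unlock (volatile void *address, guint lock_bit)
{
  volatile gsize *ptr = address;
  gsize mask = (gsize) 1 << lock_bit;

  __sync_fetch_and_and (ptr, ~mask);
}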
Created attachment 188572 [details] [review] Use g_atomic_pointer_exchange_and_or/and to set datalist flags
Created attachment 188573 [details] [review] Make g_datalist not use a global lock and perform better This implementation uses a per-list bitlock for user data, and a simple array rather than a linked list, which uses less memory and fewer allocations. It also gets better cache behaviour, since related things are stored close to each other.
Created attachment 188574 [details] [review] Make quark to string lockless We do this by assigning to g_quarks atomically and leaking the old array when replacing it atomically. Then it's safe to consume the array on the reader side (atomically). Also, since we're leaking quarks on growing, bump the block size so that we're not leaking as much. gtk3-demo allocates > 1500 quarks, and gnome apps > 3000. I'm setting the block to 2048, which means no leaks for small gtk3 apps and just one leak for gnome apps.
Created attachment 188575 [details] [review] Make g_datalist_get_data not look up the quark Instead of converting the string to a quark and comparing quarks, we use the new lockless g_quark_to_string and just compare the quarks in the datalist with the given string. This means we avoid the global lock for string-to-quark conversion. Additionally, most of the time the data list will be quite short, so the cost of doing the string comparisons is likely similar to that of the quark hashtable lookup (which does at least one string comparison for a successful lookup).
Created attachment 188576 [details] [review] Make g_object_get_data use the new faster g_datalist_get_data
New series uses pointer atomics and bitlocks from bug 650823 and fixes the gint/gquark mixing.
Created attachment 188880 [details] [review] Use g_atomic_pointer_or/and to set datalist flags
Updated one patch to work with what's in HEAD. The whole series now only depends on git master and the pointer bitlock patch in bug 651467.