After an evaluation, GNOME has moved from Bugzilla to GitLab. Learn more about GitLab.
No new issues can be reported in GNOME Bugzilla anymore.
To report an issue in a GNOME project, go to GNOME GitLab.
Do not go to GNOME Gitlab for: Bluefish, Doxygen, GnuCash, GStreamer, java-gnome, LDTP, NetworkManager, Tomboy.
Bug 650458 - reduce overhead in g_object_set/get_data
reduce overhead in g_object_set/get_data
Status: RESOLVED FIXED
Product: glib
Classification: Platform
Component: gobject
unspecified
Other Linux
: Normal normal
: ---
Assigned To: gtkdev
gtkdev
Depends on:
Blocks:
 
 
Reported: 2011-05-18 05:31 UTC by Matthias Clasen
Modified: 2011-06-04 01:14 UTC
See Also:
GNOME target: ---
GNOME version: ---


Attachments
patch (14.65 KB, patch)
2011-05-18 05:31 UTC, Matthias Clasen
none Details | Review
Make g_datalist not use a global lock and perform better (15.12 KB, patch)
2011-05-19 19:59 UTC, Alexander Larsson
none Details | Review
Make quark to string lockless (2.79 KB, patch)
2011-05-19 19:59 UTC, Alexander Larsson
none Details | Review
Make g_datalist_get_data not look up the quark (3.75 KB, patch)
2011-05-19 19:59 UTC, Alexander Larsson
none Details | Review
Make g_object_get_data use the new faster g_datalist_get_data (921 bytes, patch)
2011-05-19 19:59 UTC, Alexander Larsson
none Details | Review
not exactly hitech testing... (8.97 KB, text/plain)
2011-05-20 02:55 UTC, Matthias Clasen
  Details
try again (1.18 KB, text/plain)
2011-05-20 13:04 UTC, Matthias Clasen
  Details
Some microoptimizations to the g_datalist_id_get/set_data (5.79 KB, patch)
2011-05-23 09:21 UTC, Alexander Larsson
none Details | Review
Use g_atomic_pointer_exchange_and_or/and to set datalist flags (1.60 KB, patch)
2011-05-25 10:30 UTC, Alexander Larsson
none Details | Review
Make g_datalist not use a global lock and perform better (15.15 KB, patch)
2011-05-25 10:30 UTC, Alexander Larsson
none Details | Review
Make quark to string lockless (3.09 KB, patch)
2011-05-25 10:30 UTC, Alexander Larsson
none Details | Review
Make g_datalist_get_data not look up the quark (3.82 KB, patch)
2011-05-25 10:30 UTC, Alexander Larsson
none Details | Review
Make g_object_get_data use the new faster g_datalist_get_data (921 bytes, patch)
2011-05-25 10:30 UTC, Alexander Larsson
none Details | Review
Use g_atomic_pointer_or/and to set datalist flags (1.52 KB, patch)
2011-05-30 12:14 UTC, Alexander Larsson
none Details | Review

Description Matthias Clasen 2011-05-18 05:31:51 UTC
Created attachment 188015 [details] [review]
patch

It irks me that we use g_datalist for this, which (for reasons that are not entirely clear to me) uses a global lock, which is unncessary here, considering it is not allowed to use the same object from multiple threads anyway, without your own locking. Also, g_datalist is a linked list, which means that lookups are slower than they should be (and e.g. GtkWidget has some 20 keys that it uses with g_object_set_data).

The attached patch is a proof of concept that replaces GData by a GArray. 
A quick test that just sets/gets 14 key/value pairs on an object in multiple threads (each thread has its own object, of course), shows the following numbers on a dualcore machine:

current code, using a linked list with a lock

1 thread:   6365485 set_qdata/get_qdata pairs per second
2 threads:  2819527
4 threads:  2730305
8 threads:  2703831

new code, using an array without lock

1 thread:  15117908
2 threads: 28068796
4 threads: 28101023
8 threads: 28517746
Comment 1 Benjamin Otte (Company) 2011-05-18 11:22:37 UTC
I would guess that you would break quite a few things by removing the lock on the datalist, because we use the datalist for a bunch of things - like weak refs or toggle refs - that need to be threadsafe (for GStreamer and GCs). And you cannot use a custom lock for those, because you'd need to share this lock with all the other code that might potentially access the datalist...
Comment 2 Matthias Clasen 2011-05-18 11:41:56 UTC
Sounds like an argument against sharing the data structure used for things that do (toggle, weak refs) and don't (object data) need locks...
Comment 3 Benjamin Otte (Company) 2011-05-18 11:46:41 UTC
Indeed. But then, where do we put this stuff? The datalist has historically been (ab)used for everything, as it's the only place where we can find spare space on GObject. We could make the datalist pointer a private pointer and have a GObjectPrivate of course, but that wouldn't make things less tricky.
Comment 4 Alexander Larsson 2011-05-18 21:07:01 UTC
As i said in https://bugzilla.gnome.org/show_bug.cgi?id=608695#c18 I do believe we have one spare bit in the data flags. This could be used to put a bit-lock, allowing a non-global lock for the qdata.

Additionally, on 64bit arches we have a 32bit hole in the GObject struct that could be reused to put the flags in to avoid all the masking out of the flags in the qdata pointer.
Comment 5 Alexander Larsson 2011-05-19 12:53:55 UTC
i have an initial patch based on this, but its got some bug that i'm tracking down, will put it up soon
Comment 6 Alexander Larsson 2011-05-19 19:59:00 UTC
Created attachment 188168 [details] [review]
Make g_datalist not use a global lock and perform better

This implementation uses a per-list bitlock for user data, and a simple array rather
than a linked list which uses less memory and less allocations. It also gets better
cache behaviour since related things are stored close to each other.
Comment 7 Alexander Larsson 2011-05-19 19:59:04 UTC
Created attachment 188169 [details] [review]
Make quark to string lockless

We do this by assigning to g_quarks atomically and leaking it when
replacing it atomically. Then its safe to consume the array
on the reader side (atomically).

Also, since we're leaking quarks on growing, bump the block size
so that we're not leaking as much. gtk3-demo allocates > 1500 quarks,
and gnome apps > 3000. I'm setting the block to 2048 which means no
leaks for small gtk3 apps and just one leak for gnome apps.
Comment 8 Alexander Larsson 2011-05-19 19:59:07 UTC
Created attachment 188170 [details] [review]
Make g_datalist_get_data not look up the quark

Instead of converting the string to a quark and comparing quarks we
use the new lockless g_quark_to_string and just compare the quarks
in the datalist with the given string.

This means we avoid the global lock for string to quark. Additionally
most of the time the data list will be quite short, so the cost of
doing the sting comparisons is likely similar to that of the quark
hashtable lookup (which does at least one string comparison for a
successfull lookup).
Comment 9 Alexander Larsson 2011-05-19 19:59:12 UTC
Created attachment 188171 [details] [review]
Make g_object_get_data use the new faster g_datalist_get_data
Comment 10 Alexander Larsson 2011-05-19 20:02:18 UTC
Here are a bunch of performance fixes. Lightly tested, passed make check and trivial gnome app use. Needs careful review. Particularly sensitive is the using of an extra flag in GDataList for the bitlock, as this requires that the GData allocationa are aligned by 8, which i think is safe to assume since its > 8 bytes, but I'm not 100% sure (there is an assert to catch it).

I have not really done any performance testing on this yet. Do you have the test app you used matthias?
Comment 11 Matthias Clasen 2011-05-20 02:48:40 UTC
Review of attachment 188169 [details] [review]:

::: glib/gdataset.c
@@ +1210,1 @@
+  quark_seq_id = g_atomic_int_get (&g_quark_seq_id);

Shouldn't both quark_seq_id and g_quark_seq_id be gint, then ? 
I guess it doesn't really make a difference.
Comment 12 Matthias Clasen 2011-05-20 02:55:41 UTC
Created attachment 188195 [details]
not exactly hitech testing...
Comment 13 Alexander Larsson 2011-05-20 06:24:14 UTC
That looks like the gparam test case...
Comment 14 Alexander Larsson 2011-05-20 09:03:07 UTC
Review of attachment 188169 [details] [review]:

::: glib/gdataset.c
@@ +1210,1 @@
+  quark_seq_id = g_atomic_int_get (&g_quark_seq_id);

GQuark is an guint32, which is unlikely to be incompatible with int, but since these are all internal variables we might as well make them all ints to avoid some ugly casts.
Comment 15 Matthias Clasen 2011-05-20 13:04:42 UTC
Created attachment 188222 [details]
try again
Comment 16 Matthias Clasen 2011-05-22 04:49:56 UTC
Here are some numbers from my Core2 Duo system.
All numbers are get_data/set_data calls per second

First with the current code:

\threads   1       2       4       8
keys\
 1   6363509 3040576 2878283 2876310
 5   5267876 2655079 2622530 2536603
10   5062729 2639883 2560441 2497500
15   4922879 2587071 2489951 2382917
20   4785701 2363509 2212731 2224256

With Alex' patches:

         1       2       4       8
 1 3587006 6506900 6528459 6564964
 5 3163814 5692822 5766011 5752818
10 2829766 5147011 5211217 5217748
15 2673185 4889400 4916322 4869587
20 2520503 4594003 4664669 4486572

You can clearly see how the loss of the global lock makes things scale here, but the drop for the single-threaded case is troublesome.

So, I looked at our current atomic operations, and tried to use gcc builtins more properly (see patches in bug 617491 ). With that, I get:

         1        2        4        8
 1 8226796 14668671 15040647 15076873
 5 6811859 12232341 12444288 12389128
10 6557484 12023884 11782712 11932618
15 6298347 11444877 11428105 11433832
20 6040943 10955581 11038980 10976504
Comment 17 Alexander Larsson 2011-05-22 20:16:54 UTC
What performance do you get with proper atomics, but without my fixes?
Comment 18 Matthias Clasen 2011-05-22 20:58:34 UTC
with proper atomics, and the global lock:
       
         1       2      4       8
 1 9325922 4200101 3849140 4120139
 5 7190313 3484227 3401579 3362393
10 7076114 3280597 3441523 3316965
15 6571332 3166933 3303017 3110672
20 6300911 3333817 3101632 3034678
Comment 19 Alexander Larsson 2011-05-22 21:46:22 UTC
So, the uncontended slowdown is not that bad, but there is one still.
The core difference is in g_data_set_internal, and the main difference there is that the old one does a mutex lock + a linear search over a linked list, whereas the new code does a bitlock and a search over an array.

My *guess* would be that an array search is faster than a list search (and the greater scalability over N_KEYS makes this likely), so the difference would be in the locking methods in the uncontended case? Although it could also be just be the case that g_data_set_internal is a bit too general and could use some micro-optimizations.

Even then, its still faster than the old broken atomics version, and its much more scalable. So, i think its worth it. (although its still useful to see if we can avoid the slowdown totally)
Comment 20 Alexander Larsson 2011-05-22 21:48:42 UTC
One obvious problem with the g_bit_lock API is that it takes the bit nr as an argument, so even if the value is constant at build time it will do the shift each time it runs. This could be fixed with a simple macro call that expands the shift in the caller side. Not sure how much this would help though.
Comment 21 Matthias Clasen 2011-05-22 23:17:11 UTC
Here is what I get with bitlocks + a patch to use sync_fetch_and_and/or in bitlock implementations + simple macro wrappers a la:

#define g_bit_lock(address,lock_bit) \
  __extension__ ({ \
  gint v; \
  v = g_atomic_int_exchange_and_or (address, 1u << lock_bit); \
  if (v & (1u << lock_bit)) \
    g_bit_lock_impl (address, v, 1u << lock_bit); \
  })

#define g_bit_trylock(address,lock_bit) \
  __extension__ ({ \
  gint v; \
  v = g_atomic_int_exchange_and_or (address, 1u << lock_bit); \
  (v & (1u << lock_bit)) == 0; \
  })

#define g_bit_unlock(address,lock_bit) \
  g_bit_unlock_impl (address, 1u << lock_bit);


            1          2          4          8 
 1    9034545   15632174   16332697   16830987 
 5    8031757   14643620   14775454   14779180 
10    7356896   13445988   13476118   13519465 
15    7090342   12873340   12861367   12958982 
20    6746167   12289838   12317209   12333660 


Note that this beats the global lock in all cases but the
1 thread/1 key one. Given that widgets have on the order of 20 keys in their gdatalists, this should be a win almost everywhere.
Comment 22 Alexander Larsson 2011-05-23 09:21:39 UTC
Created attachment 188362 [details] [review]
Some microoptimizations to the g_datalist_id_get/set_data
Comment 23 Alexander Larsson 2011-05-23 09:22:22 UTC
These microoptimizations give me a few hundred thousand ops per second more.
Comment 24 Alexander Larsson 2011-05-23 18:47:54 UTC
I'm getting second thoughts about the bitlock approach used in my patch. It passes the pointer to the right uint32 part of the pointer for 64bit pointers, but i'm not sure atomicity works right if the atomic data is accessed as 64bit in some places and as 32bit in some other places.

To be 100% sure we should perhaps implement a bitlock on pointer-sized ints.
Comment 25 Alexander Larsson 2011-05-25 10:30:05 UTC
Created attachment 188572 [details] [review]
Use g_atomic_pointer_exchange_and_or/and to set datalist flags
Comment 26 Alexander Larsson 2011-05-25 10:30:12 UTC
Created attachment 188573 [details] [review]
Make g_datalist not use a global lock and perform better

This implementation uses a per-list bitlock for user data, and a
simple array rather than a linked list which uses less memory and less
allocations. It also gets better cache behaviour since related things
are stored close to each other.
Comment 27 Alexander Larsson 2011-05-25 10:30:15 UTC
Created attachment 188574 [details] [review]
Make quark to string lockless

We do this by assigning to g_quarks atomically and leaking it when
replacing it atomically. Then its safe to consume the array
on the reader side (atomically).

Also, since we're leaking quarks on growing, bump the block size
so that we're not leaking as much. gtk3-demo allocates > 1500 quarks,
and gnome apps > 3000. I'm setting the block to 2048 which means no
leaks for small gtk3 apps and just one leak for gnome apps.
Comment 28 Alexander Larsson 2011-05-25 10:30:18 UTC
Created attachment 188575 [details] [review]
Make g_datalist_get_data not look up the quark

Instead of converting the string to a quark and comparing quarks we
use the new lockless g_quark_to_string and just compare the quarks
in the datalist with the given string.

This means we avoid the global lock for string to quark. Additionally
most of the time the data list will be quite short, so the cost of
doing the sting comparisons is likely similar to that of the quark
hashtable lookup (which does at least one string comparison for a
successfull lookup).
Comment 29 Alexander Larsson 2011-05-25 10:30:22 UTC
Created attachment 188576 [details] [review]
Make g_object_get_data use the new faster g_datalist_get_data
Comment 30 Alexander Larsson 2011-05-25 10:32:24 UTC
New series uses pointer atomics and bitlocks from bug 650823 and fixes the gint/gquark mixing.
Comment 31 Alexander Larsson 2011-05-30 12:14:14 UTC
Created attachment 188880 [details] [review]
Use g_atomic_pointer_or/and to set datalist flags
Comment 32 Alexander Larsson 2011-05-30 12:16:00 UTC
Updated one patch to work with whats in head. The whole series now depends only depends on git master and the pointer bitlock patch in bug 651467.