GNOME Bugzilla – Bug 730807
GMutex performance regression
Last modified: 2014-06-06 16:35:31 UTC
Whilst looking at some performance-related issues in GStreamer, I noticed a performance regression from GLib 2.30 to 2.32 and later. A degenerate test case of 100 million uncontended lock/unlock cycles of an inited static mutex takes about 150msecs with 2.30 and 5.5secs with 2.32/2.41. I would expect the impact to be even worse on embedded systems, but haven't measured that yet. While the test case may be completely artificial, using pthread_mutex_lock/unlock directly instead of g_mutex_lock/unlock in main code paths in GStreamer led to a ~30% performance improvement in actual test pipelines where most of the work should be copying gigabytes of video data around.
I'm guessing this has something to do with the __sync_synchronize () in g_atomic_pointer_get()...
Created attachment 277296 [details] [review] gatomic: fix typo in deprecation attribute
Created attachment 277297 [details] [review] gatomic: whitespace fixes
Created attachment 277298 [details] [review] gatomic: use GCC C11-style atomics, if available GCC does not yet support ISO C11 atomic operations, but it has compatible versions available as an extension. Use these for load and store if they are available in order to avoid emitting a hard fence instruction (since in many cases, we do not need it -- on x86, for example). For now we use the fully seqentially-consistent memory model, since these APIs are documented rather explicitly: "This call acts as a full compiler and hardware memory barrier". In the future we can consider introducing new APIs for the more relaxed memory models, if they are available (or fall back to stricter ones otherwise).
With these I get an improvement from 10.8s -> 4.3s in a quick example that does a tight loop of lock(); unlock();. I'd be interested in knowing what impact this has on your real world case.
Thanks for the quick response! With the patches applied on top of GLib master and with GStreamer master recompiled: - degenerate mutex-benchmark: 5.5s -> 2.5s (-55%) - degenerate padpush-benchmark: 3.06s -> 1.5s (-51%) - (super unoptimised) real world case [*]: 10.0s -> 8.0s (-20%) which is quite impressive! [*] time gst-launch-1.0 videotestsrc pattern=black num-buffers=1000 ! video/x-raw,format=I420,width=1920,height=1080 ! rtpvrawpay ! fakesink So basically videotestsrc will 1000 times: allocate a 3MB buffer, fill it with the equivalent of black pixels, push the buffer to rtpvrawpay, which will then split each input buffer into ~2200 output buffers of ~1400 bytes, which it will push to the fakesink, which will discard them. So we have about 3GB of data that gets memset by videotestsrc, and then copied into new buffers by the payloader, and about 2.2 million gst_pad_push() calls altogether (this is all not very efficient and can be optimised in multiple ways, but it's a degenerate real world use case of some sort).
Some more research: After the above patches and with -O3, g_mutex_get_impl() is fast and it gets inlined. Here's g_mutex_lock(), for the record. void g_mutex_lock (GMutex *mutex) { gint status; if G_UNLIKELY ((status = pthread_mutex_lock (g_mutex_get_impl (mutex))) != 0) g_thread_abort (status, "pthread_mutex_lock"); } Doing the error check is sort of expensive because it means we can't tailcall pthread_mutex_lock(). Even if we remove this check, though, we get some very bad output from the compiler, even at -O3: a1440: 41 54 push %r12 a1442: 55 push %rbp a1443: 48 89 fd mov %rdi,%rbp a1446: 53 push %rbx a1447: 48 8b 1f mov (%rdi),%rbx a144a: 48 85 db test %rbx,%rbx a144d: 48 89 df mov %rbx,%rdi a1450: 74 0e je a1460 <g_mutex_lock+0x20> a1452: 5b pop %rbx a1453: 5d pop %rbp a1454: 41 5c pop %r12 a1456: e9 55 94 f7 ff jmpq 1a8b0 <pthread_mutex_lock@plt> (g_mutex_lock+0x20 is the slow path case where we must create the mutex for the first time). This means in the 99.999% case, we end up executing all of those instructions (with the jump not taken), including three completely pointless pairs of push/pop. It seems that gcc pushes r12 in case it will be used in the other path, even though we tag that path with G_UNLIKELY. I'd love suggestions about how I can convince it not to do this. For example, it would be great if we got code more like: mov esi, [edi] test esi, esi je <slow path> mov edi, esi jmp pthread_mutex_lock fwiw, I tried __attribute__((hot)) but it didn't seem to make much of a difference. One more note: it was the intention that we would replace the pthreads version of GMutex with a native home-grown implementation on Linux implemented using atomic operations and futex. In my testing I found I was able to easily beat the performance of pthread mutexes (which have to contain checks for the various modes that they can be in). It may be time to explore this...
(In reply to comment #7) > Some more research: > > After the above patches and with -O3, g_mutex_get_impl() is fast and it gets > inlined. > Wouldn't it be worth marking those various private g_*_get_impl() as inline ? Compiling all of glib/gobject with -O3 (as opposed to -O2) doesn't always return the expected speedup. Whereas here those functions are only used in a limited number of functions.
Created attachment 277624 [details] [review] gthread: use inline keyword for _get_impl() functions Give compiler a hint that these should be inlined, which doesn't seem to happen by default with -O2. Yields 5% speedup in artificial benchmarks, and 1% speedup in a real-world test case doing a lot of mutex locking and unlocking. https://bugzilla.gnome.org/show_bug.cgi?id=730807 This seems to be both trivial and a worthwhile thing to do, so here's a patch for that. I should also note that all my observations above have been with default glib cflags, so -O2 (gcc 4.8.3). The inlining appears to have an even greater effect on top of Ryan's gatomic patches (2.5% speedup in real-world test case, 5-7% in artificial benchmarks).
Attachment 277296 [details] pushed as aa0e873 - gatomic: fix typo in deprecation attribute Attachment 277297 [details] pushed as 875eeb2 - gatomic: whitespace fixes Attachment 277298 [details] pushed as db0e43d - gatomic: use GCC C11-style atomics, if available Attachment 277624 [details] pushed as 256305d - gthread: use inline keyword for _get_impl() functions