Bug 730807 – GMutex performance regression

After an evaluation, GNOME has moved from Bugzilla to GitLab. Learn more about GitLab.
No new issues can be reported in GNOME Bugzilla anymore.
To report an issue in a GNOME project, go to GNOME GitLab.
Do not go to GNOME Gitlab for: Bluefish, Doxygen, GnuCash, GStreamer, java-gnome, LDTP, NetworkManager, Tomboy.

Bug 730807 - GMutex performance regression


Summary:	GMutex performance regression


Status:	RESOLVED FIXED

Product:	glib
Classification:	Platform
Component:	gthread
Version:	2.41.x
Hardware:	Other Linux

Importance:	Normal normal
Target Milestone:	---
Assigned To:	gtkdev
QA Contact:	gtkdev

URL:
Whiteboard:

Depends on:
Blocks:

Reported:	2014-05-27 11:14 UTC by Tim-Philipp Müller
Modified:	2014-06-06 16:35 UTC

See Also:
GNOME target:	---
GNOME version:	---

Attachments
gatomic: fix typo in deprecation attribute (938 bytes, patch) 2014-05-27 14:36 UTC, Allison Karlitskaya (desrt)	committed	Details \| Review
gatomic: whitespace fixes (8.16 KB, patch) 2014-05-27 14:36 UTC, Allison Karlitskaya (desrt)	committed	Details \| Review
gatomic: use GCC C11-style atomics, if available (6.06 KB, patch) 2014-05-27 14:36 UTC, Allison Karlitskaya (desrt)	committed	Details \| Review
gthread: use inline keyword for _get_impl() functions (1.94 KB, patch) 2014-05-31 14:01 UTC, Tim-Philipp Müller	committed	Details \| Review

Description Tim-Philipp Müller 2014-05-27 11:14:05 UTC

Whilst looking at some performance-related issues in GStreamer, I noticed a performance regression from GLib 2.30 to 2.32 and later.

A degenerate test case of 100 million uncontended lock/unlock cycles of an inited static mutex takes about 150msecs with 2.30 and 5.5secs with 2.32/2.41. I would expect the impact to be even worse on embedded systems, but haven't measured that yet.

While the test case may be completely artificial, using pthread_mutex_lock/unlock directly instead of g_mutex_lock/unlock in main code paths in GStreamer led to a ~30% performance improvement in actual test pipelines where most of the work should be copying gigabytes of video data around.

Comment 1 Allison Karlitskaya (desrt) 2014-05-27 11:33:43 UTC

I'm guessing this has something to do with the __sync_synchronize () in g_atomic_pointer_get()...

Comment 2 Allison Karlitskaya (desrt) 2014-05-27 14:36:15 UTC

Created attachment 277296 [details] [review]
gatomic: fix typo in deprecation attribute

Comment 3 Allison Karlitskaya (desrt) 2014-05-27 14:36:22 UTC

Created attachment 277297 [details] [review]
gatomic: whitespace fixes

Comment 4 Allison Karlitskaya (desrt) 2014-05-27 14:36:27 UTC

Created attachment 277298 [details] [review]
gatomic: use GCC C11-style atomics, if available

GCC does not yet support ISO C11 atomic operations, but it has
compatible versions available as an extension.  Use these for load and
store if they are available in order to avoid emitting a hard fence
instruction (since in many cases, we do not need it -- on x86, for
example).

For now we use the fully seqentially-consistent memory model, since
these APIs are documented rather explicitly: "This call acts as a full
compiler and hardware memory barrier".

In the future we can consider introducing new APIs for the more relaxed
memory models, if they are available (or fall back to stricter ones
otherwise).

Comment 5 Allison Karlitskaya (desrt) 2014-05-27 14:37:23 UTC

With these I get an improvement from 10.8s -> 4.3s in a quick example that does a tight loop of lock(); unlock();.  I'd be interested in knowing what impact this has on your real world case.

Comment 6 Tim-Philipp Müller 2014-05-27 15:15:24 UTC

Thanks for the quick response!

With the patches applied on top of GLib master and with GStreamer master recompiled:

  - degenerate mutex-benchmark: 5.5s -> 2.5s (-55%)

  - degenerate padpush-benchmark: 3.06s -> 1.5s (-51%)

  - (super unoptimised) real world case [*]: 10.0s -> 8.0s (-20%)

which is quite impressive!

[*] time gst-launch-1.0 videotestsrc pattern=black num-buffers=1000 ! video/x-raw,format=I420,width=1920,height=1080 ! rtpvrawpay ! fakesink

So basically videotestsrc will 1000 times: allocate a 3MB buffer, fill it with the equivalent of black pixels, push the buffer to rtpvrawpay, which will then split each input buffer into ~2200 output buffers of ~1400 bytes, which it will push to the fakesink, which will discard them. So we have about 3GB of data that gets memset by videotestsrc, and then copied into new buffers by the payloader, and about 2.2 million gst_pad_push() calls altogether (this is all not very efficient and can be optimised in multiple ways, but it's a degenerate real world use case of some sort).

Comment 7 Allison Karlitskaya (desrt) 2014-05-27 15:26:36 UTC

Some more research:

After the above patches and with -O3, g_mutex_get_impl() is fast and it gets inlined.

Here's g_mutex_lock(), for the record.

void
g_mutex_lock (GMutex *mutex)
{
  gint status;
  
  if G_UNLIKELY ((status = pthread_mutex_lock (g_mutex_get_impl (mutex))) != 0)
    g_thread_abort (status, "pthread_mutex_lock");
}

Doing the error check is sort of expensive because it means we can't tailcall pthread_mutex_lock().  Even if we remove this check, though, we get some very bad output from the compiler, even at -O3:

   a1440:       41 54                   push   %r12
   a1442:       55                      push   %rbp
   a1443:       48 89 fd                mov    %rdi,%rbp
   a1446:       53                      push   %rbx
   a1447:       48 8b 1f                mov    (%rdi),%rbx
   a144a:       48 85 db                test   %rbx,%rbx
   a144d:       48 89 df                mov    %rbx,%rdi
   a1450:       74 0e                   je     a1460 <g_mutex_lock+0x20>
   a1452:       5b                      pop    %rbx
   a1453:       5d                      pop    %rbp
   a1454:       41 5c                   pop    %r12
   a1456:       e9 55 94 f7 ff          jmpq   1a8b0 <pthread_mutex_lock@plt>


(g_mutex_lock+0x20 is the slow path case where we must create the mutex for the first time).


This means in the 99.999% case, we end up executing all of those instructions (with the jump not taken), including three completely pointless pairs of push/pop.

It seems that gcc pushes r12 in case it will be used in the other path, even though we tag that path with G_UNLIKELY.

I'd love suggestions about how I can convince it not to do this.

For example, it would be great if we got code more like:

mov esi, [edi]
test esi, esi
je <slow path>
mov edi, esi
jmp pthread_mutex_lock


fwiw, I tried __attribute__((hot)) but it didn't seem to make much of a difference.



One more note: it was the intention that we would replace the pthreads version of GMutex with a native home-grown implementation on Linux implemented using atomic operations and futex.  In my testing I found I was able to easily beat the performance of pthread mutexes (which have to contain checks for the various modes that they can be in).  It may be time to explore this...

Comment 8 Edward Hervey 2014-05-30 08:42:05 UTC

(In reply to comment #7)
> Some more research:
> 
> After the above patches and with -O3, g_mutex_get_impl() is fast and it gets
> inlined.
> 

  Wouldn't it be worth marking those various private g_*_get_impl() as inline ?

  Compiling all of glib/gobject with -O3 (as opposed to -O2) doesn't always return the expected speedup. Whereas here those functions are only used in a limited number of functions.

Comment 9 Tim-Philipp Müller 2014-05-31 14:01:51 UTC

Created attachment 277624 [details] [review]
gthread: use inline keyword for _get_impl() functions

    Give compiler a hint that these should be inlined,
    which doesn't seem to happen by default with -O2.
    Yields 5% speedup in artificial benchmarks, and
    1% speedup in a real-world test case doing a lot
    of mutex locking and unlocking.
    
    https://bugzilla.gnome.org/show_bug.cgi?id=730807

This seems to be both trivial and a worthwhile thing to do, so here's a patch for that.

I should also note that all my observations above have been with default glib cflags, so -O2 (gcc 4.8.3).

The inlining appears to have an even greater effect on top of Ryan's gatomic patches (2.5% speedup in real-world test case, 5-7% in artificial benchmarks).

Comment 10 Allison Karlitskaya (desrt) 2014-06-06 16:35:19 UTC

Attachment 277296 [details] pushed as aa0e873 - gatomic: fix typo in deprecation attribute
Attachment 277297 [details] pushed as 875eeb2 - gatomic: whitespace fixes
Attachment 277298 [details] pushed as db0e43d - gatomic: use GCC C11-style atomics, if available
Attachment 277624 [details] pushed as 256305d - gthread: use inline keyword for _get_impl() functions