GNOME Bugzilla – Bug 795180
Investigate performance impacts of recent compiler features on hot functions
Last modified: 2018-05-18 10:42:32 UTC
Over the past couple of years, our compiler toolchains have gotten lots of features to help protect generic C code from common mistakes. Some distributions, like good stewards, enable this by default. I was slogging through some assembly this morning, and was rather dismayed of a proper fast path for our generated GObject get_type functions. I tried a number of compiler optimizations and tweaking the headers to get things smaller, but generally they had little effect. This is mostly a tracking bug so I can dump my thoughts, and get some feedback from peers. What is interesting about get_type functions is that they are called *a lot* in our stack, but are short enough to not really show up profiles when using a sampling profiler (perf, etc). The bit that seemed to explode the fast path (checking for an existing GType, returning it) is -fstack-protector(-strong). It's arguable if the get_type functions need this because if they fail, you got other issues to deal with. But in particular, if we cared about enabling it, we could separate the user supplied code fragments in _G_DEFINE_TYPE_EXTENDED() into a separate function. To disable the stack-protector, we need to supply __attribute__((optimize("no-stack-protector"))) to the function declaration. I thought supplying this to just the expansion in the .c file would be enough, but it was not. I had to put it in the G_DECLARE_*_TYPE() macros. The fast path for -O0 is atrocious, but that is to be expected. For -O2 we were roughly 15 instructions w/ 2 conditional jumps. With the stack-protector disabled, we're at 5 instructions with 1 conditional jump. A quick reminder that these get called from all over the place in our stack. From function guards, to cast checks, to type checks. Another possible approach would be to allow the macros to reserve space for the GObject type node statically (and then graft it into the type tree by passing a pointer to it at registration time). That would have the benefit of making get_type functions mostly a "return (GType)&type_node" situation to get the same effect. (But more involved process). Thoughts? Comments? Thanks for reading :)
Created attachment 370827 [details] [review] example patch to apply no-stack-protector This is a very minimal patch that is not intended for integration, but simply to illustrate the "fix" to make get_type functions succinct assembly again.
In a contrived test consisting of a tight loop calling an object's get_type() function, there is about a 12.5% performance improvement by reducing the instructions from 15 (with 2 conditional jumps) to 5 (with 1 conditional jump). - 15+2jump is the expected output on gcc 8 + -O2 -fomit-frame-pointer. - 5+1jump is the expected output with same optimizations but get_type changed to emit no-stack-protector. Tests on x86_64 w/ 4mb cache I expect this to also reduce the amount of instruction cache bloat which can have other benefits to surrounding code. I'd like to test on an Arm device to see if we get similar numbers. Instruction cache is much more important there.
I ran the tests on a couple year old arm-linux-gnueabihf Odroid X2 (Exynos) running a variant of Ubuntu LTS 14.04. There, the speedup (or rather, less slowdown from stack-protector) was greater. A 24% improvement on average. Test code is at https://github.com/chergert/stack-protector-test. Just ./run.sh
Created attachment 370919 [details] [review] macros: add G_GNUC_NO_INLINE function attribute
Created attachment 370920 [details] [review] gtype: improve get_type fast path The -fstack-protector-strong used in many distributions by default has a rather drastic slowdown of the fast path in generated _get_type() functions using G_DEFINE_* macros. To work around this, and return our fast-path to what we had previously, we need to break out the slow-path (registering the type) into a secondary function that is not allowed to be a candidate for inlining. This ensures that the common case (type registered, return the GType id) is the hot path and handled in the prologue of the generated assembly. Additionally, this has the benefit of moving the PRELUDE block added for GSocket deadlock prevention into the once function, meaning that we do not unconditionally call g_ensure_type() every time we enter _get_type() functions using PRELUDE. Compare the hot path for g_socket_get_type() before and after. Before: 7e8b0: f3 0f 1e fa endbr64 7e8b4: 55 push %rbp 7e8b5: 53 push %rbx 7e8b6: 48 83 ec 28 sub $0x28,%rsp 7e8ba: 64 48 8b 04 25 28 00 mov %fs:0x28,%rax 7e8c1: 00 00 7e8c3: 48 89 44 24 18 mov %rax,0x18(%rsp) 7e8c8: 31 c0 xor %eax,%eax 7e8ca: e8 d1 29 02 00 callq a12a0 <g_socket_family_get_type@@Base> 7e8cf: 48 89 c7 mov %rax,%rdi 7e8d2: e8 29 81 fb ff callq 36a00 <g_type_ensure@plt> 7e8d7: e8 44 2a 02 00 callq a1320 <g_socket_type_get_type@@Base> 7e8dc: 48 89 c7 mov %rax,%rdi 7e8df: e8 1c 81 fb ff callq 36a00 <g_type_ensure@plt> 7e8e4: e8 37 2b 02 00 callq a1420 <g_socket_protocol_get_type@@Base> 7e8e9: 48 89 c7 mov %rax,%rdi 7e8ec: e8 0f 81 fb ff callq 36a00 <g_type_ensure@plt> 7e8f1: e8 ea 64 00 00 callq 84de0 <g_socket_address_get_type@@Base> 7e8f6: 48 89 c7 mov %rax,%rdi 7e8f9: e8 02 81 fb ff callq 36a00 <g_type_ensure@plt> 7e8fe: e8 5d 08 ff ff callq 6f160 <g_networking_init@@Base> 7e903: 48 8b 05 3e 35 32 00 mov 0x32353e(%rip),%rax # 3a1e48 <_edata@@Base+0x900> 7e90a: 48 85 c0 test %rax,%rax 7e90d: 74 29 je 7e938 <g_socket_get_type@@Base+0x88> 7e90f: 48 8b 05 32 35 32 00 mov 0x323532(%rip),%rax # 3a1e48 <_edata@@Base+0x900> 7e916: 48 8b 4c 24 18 mov 0x18(%rsp),%rcx 7e91b: 64 48 33 0c 25 28 00 xor %fs:0x28,%rcx 7e922: 00 00 7e924: 0f 85 e5 00 00 00 jne 7ea0f <g_socket_get_type@@Base+0x15f> 7e92a: 48 83 c4 28 add $0x28,%rsp 7e92e: 5b pop %rbx 7e92f: 5d pop %rbp 7e930: c3 retq After: 7fbe0: 48 8b 05 21 ae 32 00 mov 0x32ae21(%rip),%rax # 3aaa08 <g_define_type_id__volatile.18420> 7fbe7: 48 85 c0 test %rax,%rax 7fbea: 74 0c je 7fbf8 <g_socket_get_type+0x18> 7fbec: 48 8b 05 15 ae 32 00 mov 0x32ae15(%rip),%rax # 3aaa08 <g_define_type_id__volatile.18420> 7fbf3: c3 retq
Alright, I think with the two patches I've attached here we have something to investigate starting to integrate into glib. The first patch just adds G_GNUC_NO_INLINE which allows us to tell the compiler that we really don't want this function to be a candidate for inlining. The second patch improves our G_DEFINE_*TYPE functions to break out the work that does the real type registration into a separate function. This is useful for a couple of reasons. - We can mark it noinline so that it doesn't ruin our fast path in _get_type - It fixes recent changes to add a PRELUDE which does g_type_ensure() unconditionally in the current hot path. They are now done in the get_type_once() helper funtion. - Fast path for _get_type() of objects, boxed, and pointer types are now just a few instructions. - We still get stack-protector on the real registration function which can include user-supplied code. Ready for feedback!
We use the same trick in the linux native implementation (atomics + futex) of g_mutex_lock() IIRC.
Review of attachment 370919 [details] [review]: Looks fine to me, except for the lack of documentation. Also, if we want the performance fix to go into the stable 2.56 branch, we need to avoid new API for it
Presumably we just need to add documentation for the G_GNUC_NO_INLINE correct? As for 2.56, I hadn't thought about backporting it. I guess for that we could do _G_GNUC_NO_INLINE so we don't have to duplicate the type macro w/ and w/o support for the attribute.
Created attachment 371070 [details] [review] macros: add G_GNUC_NO_INLINE function attribute
Created attachment 371071 [details] [review] gtype: improve get_type fast path The -fstack-protector-strong used in many distributions by default has a rather drastic slowdown of the fast path in generated _get_type() functions using G_DEFINE_* macros. To work around this, and return our fast-path to what we had previously, we need to break out the slow-path (registering the type) into a secondary function that is not allowed to be a candidate for inlining. This ensures that the common case (type registered, return the GType id) is the hot path and handled in the prologue of the generated assembly. Additionally, this has the benefit of moving the PRELUDE block added for GSocket deadlock prevention into the once function, meaning that we do not unconditionally call g_ensure_type() every time we enter _get_type() functions using PRELUDE. Compare the hot path for g_socket_get_type() before and after. Before: 7e8b0: f3 0f 1e fa endbr64 7e8b4: 55 push %rbp 7e8b5: 53 push %rbx 7e8b6: 48 83 ec 28 sub $0x28,%rsp 7e8ba: 64 48 8b 04 25 28 00 mov %fs:0x28,%rax 7e8c1: 00 00 7e8c3: 48 89 44 24 18 mov %rax,0x18(%rsp) 7e8c8: 31 c0 xor %eax,%eax 7e8ca: e8 d1 29 02 00 callq a12a0 <g_socket_family_get_type@@Base> 7e8cf: 48 89 c7 mov %rax,%rdi 7e8d2: e8 29 81 fb ff callq 36a00 <g_type_ensure@plt> 7e8d7: e8 44 2a 02 00 callq a1320 <g_socket_type_get_type@@Base> 7e8dc: 48 89 c7 mov %rax,%rdi 7e8df: e8 1c 81 fb ff callq 36a00 <g_type_ensure@plt> 7e8e4: e8 37 2b 02 00 callq a1420 <g_socket_protocol_get_type@@Base> 7e8e9: 48 89 c7 mov %rax,%rdi 7e8ec: e8 0f 81 fb ff callq 36a00 <g_type_ensure@plt> 7e8f1: e8 ea 64 00 00 callq 84de0 <g_socket_address_get_type@@Base> 7e8f6: 48 89 c7 mov %rax,%rdi 7e8f9: e8 02 81 fb ff callq 36a00 <g_type_ensure@plt> 7e8fe: e8 5d 08 ff ff callq 6f160 <g_networking_init@@Base> 7e903: 48 8b 05 3e 35 32 00 mov 0x32353e(%rip),%rax # 3a1e48 <_edata@@Base+0x900> 7e90a: 48 85 c0 test %rax,%rax 7e90d: 74 29 je 7e938 <g_socket_get_type@@Base+0x88> 7e90f: 48 8b 05 32 35 32 00 mov 0x323532(%rip),%rax # 3a1e48 <_edata@@Base+0x900> 7e916: 48 8b 4c 24 18 mov 0x18(%rsp),%rcx 7e91b: 64 48 33 0c 25 28 00 xor %fs:0x28,%rcx 7e922: 00 00 7e924: 0f 85 e5 00 00 00 jne 7ea0f <g_socket_get_type@@Base+0x15f> 7e92a: 48 83 c4 28 add $0x28,%rsp 7e92e: 5b pop %rbx 7e92f: 5d pop %rbp 7e930: c3 retq After: 7fbe0: 48 8b 05 21 ae 32 00 mov 0x32ae21(%rip),%rax # 3aaa08 <g_define_type_id__volatile.18420> 7fbe7: 48 85 c0 test %rax,%rax 7fbea: 74 0c je 7fbf8 <g_socket_get_type+0x18> 7fbec: 48 8b 05 15 ae 32 00 mov 0x32ae15(%rip),%rax # 3aaa08 <g_define_type_id__volatile.18420> 7fbf3: c3 retq
I don't think we need/should backport this to 2.56 because that doesn't have the PRELUDE fixes for the deadlock prevention in type initialization. So it would need to be another patch (similar in implementation but basically a rewrite).
Review of attachment 371070 [details] [review]: ::: glib/docs.c @@ +2050,3 @@ +/** + * G_GNUC_NO_INLINE: Needs to be added to docs/reference/glib/glib-sections.txt. @@ +2052,3 @@ + * G_GNUC_NO_INLINE: + * + * Expands to the GNU C noinline function attribute if the compiler is gcc. s/noinline/`noinline`/ And mention that it expands to nothing if the compiler isn’t gcc. @@ +2057,3 @@ + * + * The attribute may be placed before the declaration, right before the + * static keyword. s/static/`static`/ @@ +2059,3 @@ + * static keyword. + * + * See the GNU C documentation for more details. See the [GNU C documentation](https://gcc.gnu.org/onlinedocs/gcc/Common-Function-Attributes.html#index-noinline-function-attribute) for more details.
Review of attachment 371071 [details] [review]: This looks great, thanks Christian. One question/comment: ::: gobject/gtype.h @@ +2104,3 @@ #define _G_DEFINE_BOXED_TYPE_BEGIN(TypeName, type_name, copy_func, free_func) \ +static GType \ +type_name##_get_type_once (void) \ Shouldn’t this one be G_GNUC_NO_INLINE too?
Created attachment 371075 [details] [review] macros: add G_GNUC_NO_INLINE function attribute
Created attachment 371076 [details] [review] gtype: improve get_type fast path The -fstack-protector-strong used in many distributions by default has a rather drastic slowdown of the fast path in generated _get_type() functions using G_DEFINE_* macros. To work around this, and return our fast-path to what we had previously, we need to break out the slow-path (registering the type) into a secondary function that is not allowed to be a candidate for inlining. This ensures that the common case (type registered, return the GType id) is the hot path and handled in the prologue of the generated assembly. Additionally, this has the benefit of moving the PRELUDE block added for GSocket deadlock prevention into the once function, meaning that we do not unconditionally call g_ensure_type() every time we enter _get_type() functions using PRELUDE. Compare the hot path for g_socket_get_type() before and after. Before: 7e8b0: f3 0f 1e fa endbr64 7e8b4: 55 push %rbp 7e8b5: 53 push %rbx 7e8b6: 48 83 ec 28 sub $0x28,%rsp 7e8ba: 64 48 8b 04 25 28 00 mov %fs:0x28,%rax 7e8c1: 00 00 7e8c3: 48 89 44 24 18 mov %rax,0x18(%rsp) 7e8c8: 31 c0 xor %eax,%eax 7e8ca: e8 d1 29 02 00 callq a12a0 <g_socket_family_get_type@@Base> 7e8cf: 48 89 c7 mov %rax,%rdi 7e8d2: e8 29 81 fb ff callq 36a00 <g_type_ensure@plt> 7e8d7: e8 44 2a 02 00 callq a1320 <g_socket_type_get_type@@Base> 7e8dc: 48 89 c7 mov %rax,%rdi 7e8df: e8 1c 81 fb ff callq 36a00 <g_type_ensure@plt> 7e8e4: e8 37 2b 02 00 callq a1420 <g_socket_protocol_get_type@@Base> 7e8e9: 48 89 c7 mov %rax,%rdi 7e8ec: e8 0f 81 fb ff callq 36a00 <g_type_ensure@plt> 7e8f1: e8 ea 64 00 00 callq 84de0 <g_socket_address_get_type@@Base> 7e8f6: 48 89 c7 mov %rax,%rdi 7e8f9: e8 02 81 fb ff callq 36a00 <g_type_ensure@plt> 7e8fe: e8 5d 08 ff ff callq 6f160 <g_networking_init@@Base> 7e903: 48 8b 05 3e 35 32 00 mov 0x32353e(%rip),%rax # 3a1e48 <_edata@@Base+0x900> 7e90a: 48 85 c0 test %rax,%rax 7e90d: 74 29 je 7e938 <g_socket_get_type@@Base+0x88> 7e90f: 48 8b 05 32 35 32 00 mov 0x323532(%rip),%rax # 3a1e48 <_edata@@Base+0x900> 7e916: 48 8b 4c 24 18 mov 0x18(%rsp),%rcx 7e91b: 64 48 33 0c 25 28 00 xor %fs:0x28,%rcx 7e922: 00 00 7e924: 0f 85 e5 00 00 00 jne 7ea0f <g_socket_get_type@@Base+0x15f> 7e92a: 48 83 c4 28 add $0x28,%rsp 7e92e: 5b pop %rbx 7e92f: 5d pop %rbp 7e930: c3 retq After: 7fbe0: 48 8b 05 21 ae 32 00 mov 0x32ae21(%rip),%rax # 3aaa08 <g_define_type_id__volatile.18420> 7fbe7: 48 85 c0 test %rax,%rax 7fbea: 74 0c je 7fbf8 <g_socket_get_type+0x18> 7fbec: 48 8b 05 15 ae 32 00 mov 0x32ae15(%rip),%rax # 3aaa08 <g_define_type_id__volatile.18420> 7fbf3: c3 retq
Thanks for the review, patches updated to reflect all review requests.
Review of attachment 371075 [details] [review]: ++
Review of attachment 371076 [details] [review]: Lovely!
Review of attachment 371076 [details] [review]: This looks okay to me, but since it has that slight change of semantics for the classes that call g_type_ensure() inside their get_type() function, I'd like to get some testing done with a reproducer for bug #674885 to ensure we're not regressing there.
(In reply to Emmanuele Bassi (:ebassi) from comment #20) > Review of attachment 371076 [details] [review] [review]: > > This looks okay to me, but since it has that slight change of semantics for > the classes that call g_type_ensure() inside their get_type() function, I'd > like to get some testing done with a reproducer for bug #674885 to ensure > we're not regressing there. That’s a good point, and these patches do seem to directly contradict the reasoning in https://bugzilla.gnome.org/show_bug.cgi?id=674885#c64. They might have to be reworked to allow the prelude to remain outside the GOnce block.
Comment on attachment 371076 [details] [review] gtype: improve get_type fast path Dropping back to reviewed as per comment #21.
I was concerned that was the case. Should be easy enough to fix up. Not in any way thrilled about that implementation, but as long as it's private hopefully we can fix it later.
Created attachment 371682 [details] [review] gtype: improve get_type fast path The -fstack-protector-strong used in many distributions by default has a rather drastic slowdown of the fast path in generated _get_type() functions using G_DEFINE_* macros. The amount can vary by architecture, GCC version, and compiler flags. To work around this, and ensure a higher probability that our fast-path will match what we had previously, we need to break out the slow-path (registering the type) into a secondary function that is not a candidate for inlining. This ensures that the common case (type registered, return the GType id) is the hot path and handled in the prologue of the generated assembly even when -fstack-protector-strong is enabled.
Review of attachment 371682 [details] [review]: Looks good, thanks for updating it. Let’s get this into master now, so that any breakage happens fairly early this cycle.
Attachment 371075 [details] pushed as ede5c3f - macros: add G_GNUC_NO_INLINE function attribute Attachment 371682 [details] pushed as e924f77 - gtype: improve get_type fast path
Untested but following this web page: https://docs.microsoft.com/en-us/cpp/cpp/noinline I think we can do the following as well: diff --git a/glib/gmacros.h b/glib/gmacros.h index 55fb81e..86c0262 100644 --- a/glib/gmacros.h +++ b/glib/gmacros.h @@ -101,7 +101,12 @@ #else #define G_GNUC_PURE #define G_GNUC_MALLOC -#define G_GNUC_NO_INLINE + +# ifdef _MSC_VER +# define G_GNUC_NO_INLINE __declspec(noinline) +# else +# define G_GNUC_NO_INLINE +# endif #endif #if __GNUC__ >= 4
(In reply to Ignacio Casal Quinteiro (nacho) from comment #27) > Untested but following this web page: > https://docs.microsoft.com/en-us/cpp/cpp/noinline > > I think we can do the following as well: > diff --git a/glib/gmacros.h b/glib/gmacros.h > index 55fb81e..86c0262 100644 > --- a/glib/gmacros.h > +++ b/glib/gmacros.h > @@ -101,7 +101,12 @@ > #else > #define G_GNUC_PURE > #define G_GNUC_MALLOC > -#define G_GNUC_NO_INLINE > + > +# ifdef _MSC_VER > +# define G_GNUC_NO_INLINE __declspec(noinline) > +# else > +# define G_GNUC_NO_INLINE > +# endif > #endif > > #if __GNUC__ >= 4 That looks reasonable. Can you file a new bug about it please, since it’s a general improvement for anything which uses G_GNUC_NO_INLINE.