GNOME Bugzilla – Bug 661140
GSignal emission optimization
Last modified: 2012-03-05 11:40:34 UTC
Hi all, These days I'm mostly working with frameworks Mx and Clutter and contributing to projects like this Media Explorer : https://github.com/media-explorer/media-explorer A few months ago we started to profile our application to get better performances on devices with a more "embedded" profile (like Intel CE4100 chipsets). Our profiling showed up that we spend a lot of time at painting our Clutter scene. And a considerable amount of this time is spent on emitting paint signal in clutter actors, that are, most of the time not connected. At the end, most of the paint signals emitted end up calling the virtual function associated to the signal. We do quite a lot of things when calling the g_signal_emit() function. One of them is to transform the variable arguments of the function into an array of GValue which can be considered as negligible when several function are connected to a signal but is quite costly when you don't have anything to call. To avoid as much computation as possible during the signal emission, Robert Bragg came up with the idea of trying to get rid of the GValue boxing of the variable arguments of the g_signal_emit() function and instead transfer a argument va_list directly to the marshallers. On top of that I did some testing & improvements (around PLT symbols) with our application. The optimization can be deactivated at run time. And the results using the performance test show that we double the performances of the signal emission in the case nobody is listening to the emission. $ G_SIGNAL_ENABLE_VAR_ARGS=0 ./performance emit-unhandled Running test emit-unhandled Emissions per second: 1050115 $ G_SIGNAL_ENABLE_VAR_ARGS=1 ./performance emit-unhandled Running test emit-unhandled Emissions per second: 2112535 It also improve performances of connected signals : $ G_SIGNAL_ENABLE_VAR_ARGS=0 ./performance emit-handled Running test emit-handled Emissions per second: 804398 $ G_SIGNAL_ENABLE_VAR_ARGS=1 ./performance emit-handled Running test emit-handled Emissions per second: 1368504 This work started on top of the glib-2-28 branch and has been rebased on glib-2-30. I noticed that a few things have changed on the marshallers generation. It seems quite a few signal moved on using a generic marshaller using ffi. I haven't had the time to look at it yet. Any information on what it does differently from old marshaller would be appreciated. I could like to know whether we can rework our optimization on top of it and investigate performances as well. Thanks
Created attachment 198484 [details] [review] patch1
Created attachment 198485 [details] [review] patch2
Created attachment 198486 [details] [review] patch3
Created attachment 198487 [details] [review] patch4
Created attachment 198488 [details] [review] patch5
Created attachment 198489 [details] [review] patch6
Created attachment 198490 [details] [review] patch7
Created attachment 198491 [details] [review] patch8
Created attachment 198492 [details] [review] patch9
I wanted to do the same thing for the newly introduced generic closure marshaller, but that would require va_list support from libffi, which might be quite a pain in the ass to implement for every ABI... Considering the minimal number of people who manually use the closure API to connect signal handlers, it might could be interesting to create a fast path within gsignal to optimize the 90% cases where we emit signal that nobody really listens to.
(In reply to comment #10) > Considering the minimal number of people who manually use the closure API to > connect signal handlers, it might could be interesting to create a fast path > within gsignal to optimize the 90% cases where we emit signal that nobody > really listens to. As I said on the mailing list, I'd like to see this first - it might not be very hard at all to do.
I had a look at that yesterday and it doesn't look straight forward... Mostly because of these functions : void g_signal_override_class_closure (guint signal_id, GType instance_type, GClosure *class_closure); void g_signal_override_class_handler (const gchar *signal_name, GType instance_type, GCallback class_handler); void g_signal_chain_from_overridden (const GValue *instance_and_params, GValue *return_value); void g_signal_chain_from_overridden_handler (gpointer instance, ...); I don't know who's using these functions, but that makes the fast-path much more complicated to implement.
the _override_class_* functions are meant to be used by classes when installing signals or by their subclasses; they are the function equivalent of "klass->signal = class_closure" - i.e. they should be treated exactly as if there is a class handler when a signal is about to be emitted. signal_chain_from_overridden_handler() is admittedly more complex to handle; I believe is used by language bindings as well, for subclass override + chain up across the language barrier. this sadly can happen *during* the emission as opposed to before the emission.
Review of attachment 198491 [details] [review]: Moving the performances tests patchs to https://bugzilla.gnome.org/show_bug.cgi?id=662154
Updating the bug, since I saw a few new people following this. As discussed on #gtk+, I tried an another approach to specifically target the class closures on signal (i.e. the vfunc called when you emit a signal). This is used a lot in Clutter for instance, when painting actors and also queuing redraw/relayout. It seems to be used in GTK+ as well, but there are other specific stuff that might not benefit from this approach. So basically I now have a signal emission which is about 4 times faster than what we currently have, but only if you have no signal connected but have a class closure on the signal. Just with this, I already have a few Clutter apps that perform 10 fps faster than previously, I guess apps like gnome-shell could benefit from this. I'm currently in the process of cleaning that a bit, get rid of the first approach (patches attached here a few months ago), and also trying to fast past the signal emission with handlers, by squeezing completely GClosure. Of course that optimization won't work if you're using weird signals like a signal registered with g_signal_newv() or if you're using a non standard marshaller (like GtkWidget does on the "draw" signal), but both cases are very specific and don't apply to 90% of your signals.
And this would also speed up signals using the generic marshaller introduced in glib 2.30. :)
Created attachment 207716 [details] [review] signal connection tracking
Created attachment 207717 [details] [review] glib-genmarshal modification
Created attachment 207718 [details] [review] gsignal modification
Created attachment 207719 [details] [review] tests additions 1
Created attachment 207720 [details] [review] tests additions 2
Created attachment 207721 [details] [review] tests additions 3
So this is the 3rd iteration of the patch for gsignal. As explained above the approach is to avoid passing through gclosure as much as possible, so this adds a bit of code inside gsignal. With this modification, you end up having backtraces like that look more this : #6 clutter_stage_paint #5 g_cclosure_marshal_VOID__VOIDv #4 g_signal_emit_valist #3 g_signal_emit #2 clutter_actor_continue_paint #1 clutter_actor_paint instead of this (currently) : #9 clutter_stage_paint #8 g_cclosure_marshal_VOID__VOID #7 g_type_class_meta_marshal #6 g_closure_invoke #5 signal_emit_unlocked_R #4 g_signal_emit_valist #3 g_signal_emit #2 clutter_actor_continue_paint #1 clutter_actor_paint It's pretty obvious that we gain on the number of function call, but also something that you don't see here is that we don't collect the arguments into GValues as well. That avoids taking references on objects and copying strings etc... Of course we preserve the current behavior for signal emissions that need it. If you look at the "gsignal modification" patch, you'll appreciate the complexity of cases handled by gsignal. So with this I manage to run pretty complex applications like evolution (even though I found a small bug in GTK+ https://bugzilla.gnome.org/show_bug.cgi?id=670176) and clutter apps as well. I'll probably give try on a whole distro later, but in the meantime if people could give feedback, that would be greatly appreciated.
Review of attachment 207718 [details] [review]: No detailed review atm, but a free nitpick: 'vaarg' really hurts my sensibility. It should either be 'vararg', or 'va' (as shorthand for 'variable argument')
Created attachment 207826 [details] [review] gsignal whitespaces cleanup
Created attachment 207827 [details] [review] gsignal whitespaces cleanup
Created attachment 207828 [details] [review] gsignal emission optimization
As requested on IRC, split the whitespace cleanup from the actual optimization.
Review of attachment 207828 [details] [review]: pre-review, with a couple of points raised during the hackfest. ::: gobject/gsignal.c @@ +410,3 @@ + +static gpointer +{ this is pretty evil. how does this work on windows? how does it work for statically linked symbols? @@ +431,3 @@ + +void + if (types1[i] == types2[i]) registering marshallers is also pretty evil for compatibility. personally, I'd rather see generating marshallers go away, and just switch to libffi.
I am happy to flag the PLT part as linux specific, but I don't have much knowledge about windows. Any input about what happens on this platform? I could give a try to use libffi directly, but I already know we will loose the gain this patch tries to bring. The idea here is to optimize the common and heavily used marshallers, and leave the rest to libffi. Ideally if this patch lands, we could tell (most) developers to not use/generate marshallers and just give a NULL argument to g_signal_new().
Actually there is a problem _resolve_plt_marshal, I can fix that to make it work with static symbols (which aren't subject to the PLT problem).
Review of attachment 207716 [details] [review]: ::: gobject/gsignal.c @@ +138,3 @@ gpointer instance); static inline Handler* handler_new (gboolean after); +static void signal_handler_insert (SignalNode *node, Why are you renaming this function? It seems unnecessary and confusing with all the other handler_foo_bar functions @@ +590,3 @@ +{ + Handler *handler; + Shouldn't this initialize all_c_closures to TRUE here?
Review of attachment 207716 [details] [review]: Some feedback ::: gobject/gsignal.c @@ +138,3 @@ gpointer instance); static inline Handler* handler_new (gboolean after); +static void signal_handler_insert (SignalNode *node, Why are you renaming this function? It seems unnecessary and confusing with all the other handler_foo_bar functions @@ +590,3 @@ +{ + Handler *handler; + Shouldn't this initialize all_c_closures to TRUE here?
Review of attachment 207717 [details] [review]: The generated marshallers should use G_GNUC_UNUSED for e.g. return values that are not used, just like the normal marshallers. For ease of read we should probably generate the g_cclosure_marshal_VOID__BOOLEANv after the g_cclosure_marshal_VOID__BOOLEAN in the code. You need to handle the G_CCLOSURE_SWAP_DATA case too, as you can e.g. g_signal_override_class_handler with a g_cclosure_new_object_swap.
Review of attachment 207828 [details] [review]: All this weird magic to reverse map the ordinary marshaller to the varargs one just is too ugly to live. I think we should just drop it all and add a new g_signal_new which lets you specify the varargs marshaller. I tried to think of some preprocessor hackery to rewrite existing g_signal_new calls to the new call, but I don't think that is possible. Anyway, this is free software, changing all the core/importants signal registrations to use a new function is easy, and maintaining the ugly stuff forever is much worse.
I'm working on another version of this, but i noticed a real problem with the whole approach. We currently rely a lot on the fact that GObject keeps a ref of all data over the lifetime of the signal emission. For instance, if the first signal handler unrefs some argument it still continues to live in that handler and in the following ones. This is guaranteed by the boxing of the arguments in the GValues. We can still sort of solve this by having the va_list marshaller ref its arguments, but we can then only handle the case of a single signal handler.
Just one question : What the point of rewriting this if it ends up as slow as it was before? Actually if I understand correctly, what you're proposing would be even slower than what we currently have (you would ref/unref for each call to a marshaller instead of only once before calling marshallers). Besides, if you look at how things are done today, if 2 handlers unref one of the argument, you're already screwed. Also there is flag on gsignal : G_SIGNAL_MUST_COLLECT, you should use that if you REALLY want gsignal to take a ref on your arguments. But to me that's already fairly wrong.
Here is an update of the patches. I handled : * comment 29 : - PLT symbols specific code has been flagged * comment 32/33/34 : - renamed to handler_insert - all_c_closures is now set to TRUE at first * comment 35 : - marshallers use G_GNUC_UNUSED - G_CCLOSURE_SWAP_DATA is actually already handled in signal_emit_valist_unlocked_R
Created attachment 208183 [details] [review] glib-genmarshal modification
Created attachment 208184 [details] [review] gsignal emission optimization
If things are slower there is obviously no point in this whole rewrite. However, if it breaks existing code there is also no point in it. It not that I *like* the fact that the optimization might not be possible. There are hundreds of thousands ancient lines of code that depend on the details of GObject, we can't just change the behaviour of things and claim that they should have been using some flag all along (especially ALWAYS_COLLECT which was introduced less than a year ago). I don't think your "two handlers unreffing the argument" argument is quite right, the only bug-free way a handler can unref an argument is if it owns a ref to it, it should never unref the boxed signal arguments ref. Take this example for instance: Class Foo { GObject *prop; signal ping(GObject *obj) } foo emits ping with prop as the argument a signal handler on foo destroys the foo object (it was the only owner of it) a second signal handler was connected to the ping signal, it now receives a reference to the prop object that has been finalized, wheras before the object would be kept alive by the GValue until all handlers had run. So, we can't handle the case where there is multiple signal handlers if we're ref:ing things in each marshaller, which is why I said we can only handle the case with a single closure.
Ok, in that case we better only care about the vfunc class closure. It should just be a matter of removing signal_emit_valist_unlocked_R and fallback to the usual signal_emit_unlocked_R instead. I still think the case you're describing is so convoluted that it's basically a programming error if you end up with an unreffed argument in your handler. You should be aware of the signal semantic : it's going to unref stuff, therefore you should not connect with G_CONNECT_AFTER and define the signal with G_SIGNAL_RUN_CLEANUP.
Also, I don't think the return value collection is right, you can't assume everything thats not one of the primary types is a pointer, you need to do the full G_VALUE_COLLECT_INIT loop, handling collect_format and multiple arguments per value (although you obviously don't need to do the actual collection).
I have a new set of patches to handle this that is imho a bit cleaner. It introduces the concept of va_marshallers properly into GClosure rather than trying to keep the GClosure construction information on the side in gsignal.c. It only optimizes the case of a single handler with no or NULL vfunc class closure or the default class closure (assuming the corresponding closure supports va_marshallers). There is also a generic (ffi based) va_marshaller that goes with the normal generic marshaller. The patch series is mostly independent (although based on Lionels patches) but does rely on the existing "test additions 1" patch. I can run a the gobject tests with this and i've tried various apps a bit, but its not all that well tested. It needs review, more testcases and real life testing. The va_marshallers in the patches don't yet ref/copy objects/boxed arguments, so its not quite conformant. It does however ref the instance. Perormance before (not by disabling at runtime, but just running the tests before applying the optimizations, as we otherwise count the optimization bookkeeping overhead in the before case) is like: Running test emit-unhandled Emissions per second: 1877347 Running test emit-unhandled-empty Emissions per second: 23058544 Running test emit-unhandled-generic Emissions per second: 1686556 Running test emit-unhandled-generic-empty Emissions per second: 23041250 Running test emit-handled Emissions per second: 1411691 Running test emit-handled-empty Emissions per second: 1453611 Running test emit-handled-generic Emissions per second: 1195489 Running test emit-handled-generic-empty Emissions per second: 1333166 (empty means class closure set but vfunc is NULL) And after the optimizations: Running test emit-unhandled Emissions per second: 4107661 Running test emit-unhandled-empty Emissions per second: 22277069 Running test emit-unhandled-generic Emissions per second: 3114286 Running test emit-unhandled-generic-empty Emissions per second: 22271500 Running test emit-handled Emissions per second: 1367398 Running test emit-handled-empty Emissions per second: 3628843 Running test emit-handled-generic Emissions per second: 1141061 Running test emit-handled-generic-empty Emissions per second: 2839890 Its noteworthy that a lot of the cost in the fast path is the ref/unref of the instance. Without that we get: Running test emit-unhandled Emissions per second: 5951305 Running test emit-unhandled-empty Emissions per second: 22422150 Running test emit-unhandled-generic Emissions per second: 4074260 Running test emit-unhandled-generic-empty Emissions per second: 22421699 Running test emit-handled Emissions per second: 1326608 Running test emit-handled-empty Emissions per second: 4984977 Running test emit-handled-generic Emissions per second: 1129769 Running test emit-handled-generic-empty Emissions per second: 3591398 Which is significantly different. So, maybe we should look at making ref/unref faster.
Created attachment 208254 [details] [review] tests: Add generic and empty signal emission performace tests generic means it uses the generic marshaller empty means the vfunc pointer is NULL
Created attachment 208255 [details] [review] Add GRealClosure and move meta_marshallers there This means we're not abusing the notifiers for meta_marshallres, and we're able to later cleanly add other fields to GClosure. We still have to leave the ABI intact for the GClosure->meta_marshal bit, as old G_CLOSURE_N_NOTIFIERS macro instances still accesses it. However, we always set it to zero to keep those macros working.
Created attachment 208256 [details] [review] Add optional support for varargs marshallers to GClosure These closures support being invoked on a va_args which can be useful as you can then avoid boxing the va_args into GValues in certain cases.
Created attachment 208257 [details] [review] Support generating va marshallers in glib-genmarshal
Created attachment 208258 [details] [review] Add g_signal_set_va_marshaller This lets you set a va_marshaller on your signal which will be propagated to all closures for the signal. Also, automatically uses the generica va_marshaller if you specify a NULL c_marshaller.
Created attachment 208259 [details] [review] Optimize single-handler va_marshaller case When there is only one closure handling a signal emission and it doesn't have a bunch of complicated features enabled we can short circuit the va_args collection into GValues and call the callback via the va_marshaller directly. Note, this isn't yet 100% correct. The marshallers should also be extended to ref all non-primitive arguments.
Created attachment 208260 [details] [review] Use the generated va_marshallers in the performance tests
Review of attachment 208256 [details] [review]: ::: gobject/gclosure.c @@ +1476,3 @@ + for (i = 0; i < n_params; i++) + { + atypes[i+1] = va_to_ffi_type (param_types[i], This needs to mask out G_SIGNAL_TYPE_STATIC_SCOPE
Created attachment 208599 [details] [review] Add _g_closure_is_void to check for NULL vfuncs
Created attachment 208600 [details] [review] Precalculate as much as possible for signal emission fast path
Created attachment 208601 [details] [review] generic va_marshaller: Mask out G_SIGNAL_TYPE_STATIC_SCOPE from type
The last three patches try to avoid as much work as possible per-emission in the fastpath code by storing the precalculated state and invalidating it when needed. It also subsumes the existing signal emission NOP checks. This gives us another 50% or so performance increase compared to my old patcheset. Known issues: Still not reffing signal arguments as needed. Also, generally not tested very well.
I just pushed the signal-performance branch to master.