GNOME Bugzilla – Bug 558927
Performance improvement for g_function_info_invoke
Last modified: 2018-01-24 19:01:33 UTC
If we have TLS support from the platform, we can simply reuse the two arrays needed per ffi_call. I will attach a POC that shows that this works with libffi and a patch that adapts ginvoke.c.
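Stripped of libffi, the idea is a per-thread scratch array that replaces a fresh alloca on every call. A minimal dependency-free sketch (the names and the fixed capacity are illustrative, not the actual ginvoke.c code):

```c
#include <assert.h>
#include <stddef.h>

#define MAX_ARGS 16  /* fixed per-thread capacity; real code must bounds-check */

/* One scratch array per thread: allocated once per thread by the TLS
 * machinery instead of once per call by alloca(). */
static __thread void *scratch_values[MAX_ARGS];

/* Returns the per-thread scratch array if the call fits, NULL otherwise
 * (a real implementation would fall back to alloca or malloc). */
void **
get_scratch (size_t n_args)
{
  if (n_args > MAX_ARGS)
    return NULL;
  return scratch_values;
}
```

Two calls from the same thread get the same pointer back, so no per-call allocation happens; different threads get different arrays, so there is no cross-thread sharing.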
Created attachment 121817 [details] POC that executes 100 threads each of which reusing their own two ffi_call arrays
Created attachment 121818 [details] [review] Patch that improves performance of g_function_info_invoke. Patch to reuse, via TLS, the two arrays that are currently alloca'd on each function call.
Created attachment 121819 [details] Performance test for above POC

pvanhoof@tinc:~$ gcc test.c `pkg-config glib-2.0 --cflags --libs` -lffi -lpthread
pvanhoof@tinc:~$ ./a.out
alloca: 23.309869 vs tls: 20.782047
pvanhoof@tinc:~$ cat test.c

#include <ffi.h>
#include <stdio.h>
#include <pthread.h>
#include <glib.h>

static __thread ffi_type *arg_types[200];
static __thread void *arg_values[200];

unsigned char
foo (unsigned int x, float y)
{
  unsigned char result = x - y;
  return result;
}

static inline void
alloca_test (void)
{
  ffi_type **arg_typesi;
  void **arg_valuesi;

  arg_typesi = g_alloca (sizeof (ffi_type*) * 2);
  arg_valuesi = g_alloca (sizeof (gpointer) * 2);

  ffi_cif cif;
  ffi_status status;
  ffi_arg result;

  arg_typesi[0] = &ffi_type_uint;
  arg_typesi[1] = &ffi_type_float;

  status = ffi_prep_cif (&cif, FFI_DEFAULT_ABI, 2, &ffi_type_uint8, arg_typesi);

  unsigned int arg1 = 42;
  float arg2 = 5.1;

  arg_valuesi[0] = &arg1;
  arg_valuesi[1] = &arg2;

  ffi_call (&cif, FFI_FN (foo), &result, arg_valuesi);
}

static inline void
tls_test (void)
{
  ffi_cif cif;
  ffi_status status;
  ffi_arg result;

  arg_types[0] = &ffi_type_uint;
  arg_types[1] = &ffi_type_float;

  status = ffi_prep_cif (&cif, FFI_DEFAULT_ABI, 2, &ffi_type_uint8, arg_types);

  unsigned int arg1 = 42;
  float arg2 = 5.1;

  arg_values[0] = &arg1;
  arg_values[1] = &arg2;

  ffi_call (&cif, FFI_FN (foo), &result, arg_values);
}

int
main (int argc, const char **argv)
{
  GTimer *t1 = g_timer_new ();
  GTimer *t2 = g_timer_new ();
  gdouble tt1, tt2;
  guint i = 0;

  // Eliminate cache differences
  for (i = 0; i < 10000000; i++) {
    alloca_test ();
    tls_test ();
  }

  g_timer_start (t1);
  for (i = 0; i < 100000000; i++)
    alloca_test ();
  g_timer_stop (t1);
  tt1 = g_timer_elapsed (t1, NULL);

  g_timer_start (t2);
  for (i = 0; i < 100000000; i++)
    tls_test ();
  g_timer_stop (t2);
  tt2 = g_timer_elapsed (t2, NULL);

  g_print ("alloca: %f vs tls: %f\n", tt1, tt2);
}

pvanhoof@tinc:~$
This one proves that we should probably cache the ffi_cif too:

pvanhoof@tinc:~$ ./a.out
alloca: 2.285653 vs tls: 2.036822 vs hash: 0.795505
pvanhoof@tinc:~$

#include <ffi.h>
#include <stdio.h>
#include <pthread.h>
#include <glib.h>

static __thread ffi_type *arg_types[200];
static __thread void *arg_values[200];

GHashTable *funcs;

unsigned char
foo (unsigned int x, float y)
{
  unsigned char result = x - y;
  return result;
}

static inline void
alloca_test (void)
{
  ffi_type **arg_typesi;
  void **arg_valuesi;

  arg_typesi = g_alloca (sizeof (ffi_type*) * 2);
  arg_valuesi = g_alloca (sizeof (gpointer) * 2);

  ffi_cif cif;
  ffi_status status;
  ffi_arg result;

  arg_typesi[0] = &ffi_type_uint;
  arg_typesi[1] = &ffi_type_float;

  status = ffi_prep_cif (&cif, FFI_DEFAULT_ABI, 2, &ffi_type_uint8, arg_typesi);

  unsigned int arg1 = 42;
  float arg2 = 5.1;

  arg_valuesi[0] = &arg1;
  arg_valuesi[1] = &arg2;

  ffi_call (&cif, FFI_FN (foo), &result, arg_valuesi);
}

static inline void
tls_test (void)
{
  ffi_cif cif;
  ffi_status status;
  ffi_arg result;

  arg_types[0] = &ffi_type_uint;
  arg_types[1] = &ffi_type_float;

  status = ffi_prep_cif (&cif, FFI_DEFAULT_ABI, 2, &ffi_type_uint8, arg_types);

  unsigned int arg1 = 42;
  float arg2 = 5.1;

  arg_values[0] = &arg1;
  arg_values[1] = &arg2;

  ffi_call (&cif, FFI_FN (foo), &result, arg_values);
}

static inline void
hash_test (void)
{
  ffi_cif cif;
  ffi_cif *cifptr;
  void **arg_valuesi;
  ffi_status status;
  ffi_arg result;

  cifptr = g_hash_table_lookup (funcs, (gpointer) &foo);

  if (!cifptr) {
    ffi_type **arg_typesi;
    arg_typesi = g_alloca (sizeof (ffi_type*) * 2);
    arg_typesi[0] = &ffi_type_uint;
    arg_typesi[1] = &ffi_type_float;
    status = ffi_prep_cif (&cif, FFI_DEFAULT_ABI, 2, &ffi_type_uint8, arg_typesi);
    g_hash_table_insert (funcs, (gpointer) &foo, &cif);
  } else
    cif = *cifptr;

  unsigned int arg1 = 42;
  float arg2 = 5.1;

  arg_valuesi = g_alloca (sizeof (gpointer) * 2);
  arg_valuesi[0] = &arg1;
  arg_valuesi[1] = &arg2;

  ffi_call (&cif, FFI_FN (foo), &result, arg_valuesi);
}

int
main (int argc, const char **argv)
{
  GTimer *t1 = g_timer_new ();
  GTimer *t2 = g_timer_new ();
  GTimer *t3 = g_timer_new ();
  gdouble tt1, tt2, tt3;
  guint i = 0;

  funcs = g_hash_table_new (g_direct_hash, g_direct_equal);

  // Eliminate cache differences
  for (i = 0; i < 10000000; i++) {
    alloca_test ();
    tls_test ();
    hash_test ();
  }

  g_timer_start (t1);
  for (i = 0; i < 10000000; i++)
    alloca_test ();
  g_timer_stop (t1);
  tt1 = g_timer_elapsed (t1, NULL);

  g_timer_start (t2);
  for (i = 0; i < 10000000; i++)
    tls_test ();
  g_timer_stop (t2);
  tt2 = g_timer_elapsed (t2, NULL);

  g_timer_start (t3);
  for (i = 0; i < 10000000; i++)
    hash_test ();
  g_timer_stop (t3);
  tt3 = g_timer_elapsed (t3, NULL);

  g_print ("alloca: %f vs tls: %f vs hash: %f\n", tt1, tt2, tt3);
}
Created attachment 121821 [details] [review] Using a hashtable to cache the ffi_cif This is an untested patch that stores the ffi_cif in a hashtable. Because the ffi_cif seems to be allocated on the stack, I don't think this patch is a good idea (storing a pointer to it in a GHashTable): the first thread that called using this ffi_cif might have ended, making its ffi_cif-on-stack invalid for subsequent threads. A better way would be to put the ffi_cif in the struct of GIFunctionInfo, or to have something like: cif = g_function_info_get_cif (info); Anyway, perhaps caching the ffi_cif should be a different bug and a different patch proposal.
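The lifetime problem described here is general: inserting the address of a stack-allocated struct into a global table leaves a dangling pointer once the preparing frame (or its thread) is gone. A dependency-free sketch of the safe variant — heap-allocate the cached entry before inserting it (the struct and names are illustrative stand-ins, not the real ffi_cif):

```c
#include <assert.h>
#include <stdlib.h>

typedef struct { int n_args; /* stands in for a prepared ffi_cif */ } cached_cif;

/* BROKEN pattern (what the rejected patch did, in spirit):
 *   cached_cif local;
 *   table_insert (key, &local);   // &local dangles as soon as we return
 *
 * Safe pattern: copy to the heap so the entry outlives the preparing frame. */
cached_cif *
cache_entry_new (int n_args)
{
  cached_cif *c = malloc (sizeof *c);
  if (c != NULL)
    c->n_args = n_args;
  return c;  /* the table owner must free() entries when it is destroyed */
}
```

With g_hash_table_new_full and g_free as the value-destroy function, the cleanup would come for free; that detail is omitted here to keep the sketch dependency-free.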
Created attachment 121836 [details] [review] Caching the ffi_cif in GIFunctionInfo This patch caches the ffi_cif in the GIFunctionInfo structure. It also checks whether the current n_args equals the cached one; if not, we recreate the cif in the cache location. Note that this patch makes the GIBaseInfo larger, which might not be desired behaviour (in case the ABI has already been specified, for example).
These patches are built to speed up synthetic benchmarks. I'm going to reject them on that basis, since optimizations should be done with real-world applications in mind, backed up by real profiling data. Speed is currently not the main focus of gobject-introspection, as there are almost no applications written using it. Even so, when using g_function_info_invoke in a language binding, you'd have to do something like this: LB wrappers -> GITypes -> ffi before you can do the invocation. It'd be much more sensible to just bypass g_function_info_invoke and call ffi directly, sending in the raw values from the language binding's wrappers.
Created attachment 121841 [details] Non-synthetic test

With the patch:
Performance test: 1.399057 for 1000000 invocations of test1

Without the patch:
Performance test: 1.510977 for 1000000 invocations of test1
Not commenting on the patches, just making an observation that the libffi API seems such that the user should probably call ffi_prep_cif() just once, and it would IMO be a good fit to store the result in GIFunctionInfo.
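The prepare-once pattern suggested here can be sketched without libffi: keep a "prepared" flag next to the cached data inside the info struct, and run the expensive preparation only on the first invocation. This is an illustrative single-threaded sketch (real code in GIFunctionInfo would need locking or atomics, and the fields are hypothetical):

```c
#include <assert.h>
#include <stdbool.h>

typedef struct {
  bool prepared;       /* has the expensive preparation run for this info? */
  int  cached_result;  /* stands in for the prepared ffi_cif */
  int  prepare_count;  /* for demonstration: how many times we prepared */
} function_info;

/* Stands in for ffi_prep_cif: expensive, but deterministic per function. */
static int
expensive_prepare (void)
{
  return 42;
}

/* Invoke path: prepare on first use, reuse the cached result afterwards. */
int
invoke (function_info *info)
{
  if (!info->prepared)
    {
      info->cached_result = expensive_prepare ();
      info->prepare_count++;
      info->prepared = true;
    }
  return info->cached_result;
}
```

However many times invoke() is called on the same info, the preparation runs exactly once, which is what makes storing the result in GIFunctionInfo attractive.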
Created attachment 121845 [details] Non-synthetic test using pybank

pvanhoof@tinc:~/repos/gnome/pybank$ python test.py
0:00:24.526809
pvanhoof@tinc:~/repos/gnome/pybank$ python test.py
0:00:23.420620
pvanhoof@tinc:~/repos/gnome/pybank$

Relevant environment variables for people who want to repeat this test:

declare -x LD_LIBRARY_PATH="./bank/:/opt/gobject-introspection/lib/"
declare -x PYTHONPATH="."
declare -x PKG_CONFIG_PATH="/opt/gobject-introspection/lib/pkgconfig/"
declare -x PWD="/home/pvanhoof/repos/gnome/pybank"
Created attachment 121856 [details] Non-synthetic test using GJS

pvanhoof@tinc:~/repos/gnome/gjs/examples$ /opt/gobject-introspection/bin/gjs-console glib.js
JS LOG: 7548
pvanhoof@tinc:~/repos/gnome/gjs/examples$ /opt/gobject-introspection/bin/gjs-console glib.js
JS LOG: 7632
pvanhoof@tinc:~/repos/gnome/gjs/examples$ cat glib.js
const GLib = imports.gi.GLib;

let i = 0;
while (i < 1000000) {
    GLib.basename ("/");
    i++;
}

i = 0;
let start = (new Date()).getTime();
while (i < 1000000) {
    GLib.basename ("/");
    i++;
}
let end = (new Date()).getTime();
log (end - start);
pvanhoof@tinc:~/repos/gnome/gjs/examples$
Created attachment 121859 [details] GJS test slightly cleaned up

I got: before: 6556 ms, after: 6519 ms.
Why 200 arguments? That seems like a random number to me. 16, or even 8, is probably a sane default. What are the performance changes for 0, 1, 4 and max_args+1 arguments?
Correct me if I'm wrong, but the TLS bits seem suspicious to me. As it only compares the number of arguments, it will fail to notice any difference between foo(char*) and bar(int) argument types. And as the arrays are thread-global, any re-entrant call to g_function_info_invoke() is going to mess things up.
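The re-entrancy hazard is easy to demonstrate without libffi: if a nested call reuses the same thread-global array, it overwrites the outer call's in-flight argument slots (illustrative sketch, not the patch's actual code):

```c
#include <assert.h>

static __thread int shared_args[4];  /* thread-global scratch, like the patch */

/* Writes its argument into the shared array, optionally making a nested
 * call first, then reads the slot back, as ffi_call would. */
int
call_via_shared (int value, int nest)
{
  shared_args[0] = value;
  if (nest)
    call_via_shared (value + 100, 0);  /* nested call clobbers slot 0 */
  return shared_args[0];
}
```

Without nesting the outer call reads back its own value; with a nested call in between, the outer call reads back the inner call's value instead. That is exactly what a re-entrant g_function_info_invoke would do to the argument array.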
Comment #13: Indeed, it's a randomly picked number. About the number of arguments: I noticed that the smaller the number of arguments, the larger the time delta between the two patches. So for larger numbers of arguments the g_alloca has a relatively smaller impact, and for smaller numbers a relatively larger one. I must confess, however, that I didn't measure this a lot (just three tests). I also tested whether a small TLS made a difference compared to a large one: no measurable difference (probably right after pthread_create, a large TLS will take a little bit longer to allocate). The performance for max_args+1 will be the current performance minus the overhead of the if-conditions.

Comment #14: In C there's no method overloading except with varargs, so any call with a different set of arguments than a previous call, for the same non-varargs symbol, is erroneous. Therefore I think that unless gobject-introspection is going to support overloaded C++ methods, just checking the number of arguments and whether the function is varargs is OK. We should perhaps add a flag to GIFunctionInfo marking a function as varargs, in which case we'd need to recalculate the ffi_cif each time; the patch doesn't have such a check yet. I don't think support for overloaded C++ methods is needed in gobject-introspection at the moment. If we ever wanted to support them, we could mark them in a similar way to varargs functions.
A different approach would be to change the API drastically:

gboolean g_function_info_invoke (GIFunctionInfo *info,
                                 GIFunctionParametersInfo *params,
                                 GError **error);

typedef struct {
  ffi_cif cif;
} GIFunctionParametersInfoPriv;

GIFunctionParametersInfo* g_function_parameters_info_new (void);
void g_function_parameters_info_add (GIFunctionParametersInfo *info,
                                     const GArgument arg,
                                     gboolean direction);
void g_function_parameters_info_set_return (GIFunctionParametersInfo *info,
                                            const GArgument arg);
void g_function_parameters_info_prepare (GIFunctionParametersInfo *info,
                                         ffi_cif *cif,
                                         GError **error);

Example:

static GIFunctionParametersInfo *params_cached = NULL;

static void
Prepare_Function (void)
{
  GArgument arg, retarg;
  GIFunctionParametersInfo *params;

  params = g_function_parameters_info_new ();
  arg.v_int = 0;
  retarg.v_int = 0;
  g_function_parameters_info_add (params, arg, TRUE);
  g_function_parameters_info_set_return (params, retarg);
  if (params_cached)
    g_object_unref (params_cached);
  params_cached = params;
}

static void
Exec_Function (GIFunctionInfo *f_info, GError **error)
{
  g_function_info_invoke (f_info, params_cached, error);
}

static void
test (void)
{
  guint i = 0;
  GIFunctionInfo *info = ...

  Prepare_Function ();
  for (i = 0; i < 100; i++) {
    Exec_Function (info, NULL);
  }
}
Comment #14: for the arg_values array, a re-entrant function in one thread will (or might) indeed cause havoc this way. So a function that recursively calls g_function_info_invoke can't work with the TLS trick (you are right). The TLS trick is a small performance improvement compared to caching the ffi_cif anyway. Although I don't like how g_function_info_invoke consumes ~4 or 5 times more stack than a normal function call would; that makes recursively calling g_function_info_invoke tricky anyway (it will consume a lot of memory if the developer does a lot of recursion). I don't think we can really fix this other than by not supporting recursive calls to g_function_info_invoke, though :-\ I must say that recursively calling g_function_info_invoke does sound a little bit insane to me. Whoever comes up with a valid use-case for that one gets a free beer from me at FOSDEM :-)
Note that with the API change in Comment #16, recursive calling would work and wouldn't consume ~4 or 5 times more memory, afaics.

static int reached = 100;
static GIFunctionParametersInfo *params;
static GIFunctionInfo *finfo;
static gboolean prepared = FALSE;
static GError *error = NULL;

static void
Prepare (void)
{
  GArgument arg;

  if (prepared)
    return;
  finfo = g_irepository_find_by_name (rep, "test", "test");
  params = g_function_parameters_info_new ();
  arg.v_int = 0;
  g_function_parameters_info_add (params, arg);
  g_function_parameters_info_add (params, arg);
  prepared = TRUE;
}

static void
being_a_recursive_idiot (void)
{
  Prepare ();
  g_function_info_invoke (finfo, params, &error);
}

static void
test (int a, int b)
{
  if (reached)
    being_a_recursive_idiot ();
  reached--;
}

static void
start_stuff (void)
{
  being_a_recursive_idiot ();
}
Hmm, a correction to Comment #18: g_function_info_invoke would need a "values" collection too; usually you don't want to cache the values the way you cache the types. So in that case it would be similar to the current API with the ffi_cif cache-hack.
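What this correction amounts to: the type signature is per-function and cacheable, while the value pointers are per-call and must be supplied fresh each time. A minimal shape for such a split, with libffi replaced by a trivial stand-in (all names are hypothetical):

```c
#include <assert.h>

/* Cached once per function: the argument *types* (stand-in for ffi_type**
 * plus a prepared ffi_cif). */
typedef struct { int n_args; } signature;

/* Supplied per call: pointers to this call's argument *values*.
 * Stand-in for ffi_call: sums the first n_args ints it is pointed at. */
int
invoke_with (const signature *sig, void **values)
{
  int sum = 0;
  for (int i = 0; i < sig->n_args; i++)
    sum += *(int *) values[i];
  return sum;
}
```

The same cached signature can be reused across calls, and even across threads, because every call brings its own values array; only the immutable type information is shared.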
(In reply to comment #15)
> Comment #14: In C there's no method overloading except with a varargs, so any
> call with different sets of arguments than a previous call, for the same symbol
> that ain't a varargs one, is erroneous. Therefore I think that unless
> gobject-introspection will support overloaded C++ methods that just checking
> the amount of arguments and checking for 'is the function a varargs or not' is
> ok.

It wasn't about the same function with a different number of arguments, but about the same number of arguments with different signatures. But I realize my mistake now: the thread-global array isn't reused for the call and is reset by ffi_prep_cif().

(In reply to comment #17)
> I must say that ... recursively calling g_function_info_invoke does sound a
> little bit insane to me.
>
> Whoever comes up with a valid use-case for that one, gets a free beer from me
> at FOSDEM :-)

To make 'Enter' on the last GtkEntry of a dialog behave just as if the OK button was clicked:

entry.connect('activate', function() {
    dialog.response(Gtk.Response.OK);
});

And since you want to handle the dialog response in just one place:

dialog.connect('response', function(dialog, res) {
    dialog.destroy();
});

Now, hitting Enter will result in g_function_info_invoke("gtk_dialog_response"), which emits a signal, which calls g_function_info_invoke("gtk_widget_destroy"). Add a couple more indirections (calling any functions during signal emission) and you might very easily be triggering re-entrant calls.
Not opposed to this, but it would seem strange to me if TLS were significantly faster than alloca; I mean, what is the stack but thread-local temporary storage? I also strongly agree with comment #7. Besides it being faster to go language -> ffi directly: while using g_function_info_invoke seems easy at first, to do things like array/length handling, multiple value returns, etc., a binding needs a higher-level function invocation layer, and at that point you might as well replace g_function_info_invoke.
The stack is allocated page by page, so if more stack than is currently available is needed, a new page must be allocated (which takes time). A TLS array has a fixed size; it doesn't suffer from other functions influencing the stack's size.
Created attachment 121910 [details] [review] Cache the result of ffi_prep_cif

* Add a performance test to tests/invoke/invoke.c
* Cache the result of ffi_prep_cif in GIBaseInfo; this requires a g_malloc which is freed in g_base_info_unref

With this patch:

...$ /opt/gobject-introspection/bin/gjs-console glib.js
JS LOG: 8235
JS LOG: 8256
JS LOG: 8265
JS LOG: 8117
JS LOG: 8025
...$ cat glib.js
const GLib = imports.gi.GLib;

for (let i = 0; i < 1000000; ++i) {
    GLib.basename ("/");
}

for (let y = 0; y < 5; y++) {
    let start = Date.now();
    for (let i = 0; i < 1000000; ++i) {
        GLib.basename ("/");
    }
    let end = Date.now();
    log (end - start);
}
...$
.../tests/invoke$ make check | grep -e Perf -e All -e ==
Performance test: 1.184156 for 1000000 invocations of test1
Performance test: 1.156640 for 1000000 invocations of test1
==================
All 2 tests passed
==================
.../tests/invoke$

Without this patch:

...$ /opt/gobject-introspection/bin/gjs-console glib.js
JS LOG: 8529
JS LOG: 8525
JS LOG: 8515
JS LOG: 8535
JS LOG: 8547
...$
.../tests/invoke$ make check | grep -e Perf -e All -e ==
Performance test: 1.491755 for 1000000 invocations of test1
Performance test: 1.525176 for 1000000 invocations of test1
==================
All 2 tests passed
==================
.../tests/invoke$
This is a better test for GJS, as GLib.basename() has its own overhead (the body of the function influences the amount of time needed to execute the native code). This test uses the "Everything" test module of GJS and invokes an almost-noop called test_uint64. We notice that the performance improvement is bigger now.

...$ cat test.js
const Everything = imports.gi.Everything;

for (let i = 0; i < 1000000; ++i) {
    Everything.test_uint64(42);
}

for (let y = 0; y < 5; y++) {
    let start = Date.now();
    for (let i = 0; i < 1000000; ++i) {
        Everything.test_uint64(42);
    }
    let end = Date.now();
    log (end - start);
}
...$
...$ /opt/gobject-introspection/bin/gjs-console test.js
JS LOG: 6999
JS LOG: 6942
JS LOG: 7032
JS LOG: 7037
JS LOG: 6943
...$ /opt/gobject-introspection/bin/gjs-console test.js
JS LOG: 6615
JS LOG: 6593
JS LOG: 6690
JS LOG: 6671
JS LOG: 6626
...$
Ping, what is the decision for this bug and patch? (cleaning up my bugs)
I don't have a strong opinion myself, I'd just note again that I think bindings should avoid using g_function_info_invoke from the start for the aforementioned reasons (complex argument handling, efficiency). The right people to weigh in on this would be the gjs people.
Fyi, some groups that want (or will have) to use g_function_info_invoke: GBus (an experiment being done by desrt), GJS, PyBank, and bindings that will dynamically figure out how to iterate a GLib container type like GList, GPtrArray, GHashTable, etc. I started Bug #560061 as a proposal for a collection API which would avoid having to use g_function_info_invoke for such bindings (the binding developer would get a standard API, and standard implementations of the interfaces for GList, GPtrArray, GHashTable, etc. would be made available shortly after that). Also, debuggers like Nemiver and Anjuta might want to utilize g_function_info_invoke to, for example, implement peeking at the value of a property of a GObject while debugging is taking place. I can imagine other possible future consumers, like an equivalent of Object Spaces or Hibernate for GObject, although I think it's quite unlikely that a framework like this will ever be written for GObject/C. Well, you never know. The most important consumers are of course dynamic binding users like script engines, interpreted languages, and jitted or vm-ed languages that prefer not to dlopen() a .so with some glue code and/or that don't want to P/Invoke to the platform by themselves, but instead prefer gi to do this using libffi.
Marking as needinfo, awaiting decision
Actually, NEEDINFO indicates that input from the *reporter* is missing. A NEW or ASSIGNED bug implicitly needs more information from the developer, and I think this is the case here. I am thus reopening. Maybe taking this to a proper mailing list would help to decide on this issue.
[Mass-moving gobject-introspection tickets to its own Bugzilla product - see bug 708029. Mass-filter your bugmail for this message: introspection20150207 ]
After almost 10 years, this is unlikely to happen. Function invocation performance has been improved independently.