GNOME Bugzilla – Bug 575235
gda_connection_update_meta_store() is slow
Last modified: 2011-11-03 21:35:47 UTC
Compared to the equivalent schema functionality in libgda-3.0, gda_connection_update_meta_store() is very slow, at least for PostgreSQL. This is very noticeable in Glom, which takes around 1 minute to call this when opening databases. We try to reduce how often we call it, and we call it for single tables when possible (which is fast), but we can't avoid calling it at least once. I don't know yet what is causing the slowness; KCachegrind is not showing me sensible information. Could it be all network IO? If this really must be this slow then it should be asynchronous, with a callback. Applications shouldn't be forced to block for 1 minute.
I notice that libgda seems to use SQLite (regardless of which provider is used) to store this meta data. I wonder if that's a factor.
I wonder if this is the old syncing issue that Firefox had with SQLite. I hope that all the SQLite work is happening in memory; doing lots of disk IO would make it slow.
Comparing to libgda V3 is unfair, since far less meta data was retrieved back then (the big problem was that it did not handle schemas at all). I tried to make things better by providing two update modes: one where everything is updated (if you pass a NULL GdaMetaContext to gda_connection_update_meta_store()) and one where only a part of the meta data is updated.

I believe things could be improved for PostgreSQL by rewriting/simplifying some of the SELECT statements used when retrieving meta data: as you're probably aware, you can use the pg_catalog.* tables or the information_schema.* tables and views to retrieve meta data, but information_schema.* is slower, so some benefit could be gained by using only the pg_catalog.* tables.

Anyway, the best solution to this problem is (as is done by the gda-sql tool):
* if the meta data is retrieved for the 1st time, then run the update in a thread different from the GUI one, and cache it in a local database
* if a cached version already exists, then use it
* update some parts of the meta data when you know it has changed (which is quick, but can also be done in a non-GUI thread)

(Be aware however that depending on how PostgreSQL was compiled, it may not support multi-threading, which means the GdaConnection will only work for the thread which created it.)

As for SQLite, it does not introduce any significant slowdown.
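A minimal sketch of that threaded approach, assuming the libgda 4.x C API and GLib >= 2.32 for g_thread_new() (the MetaUpdateData structure and the function names below are illustrative, not part of libgda, and the caveat above about providers without thread support still applies):

#include <libgda/libgda.h>

typedef struct {
    GdaConnection *cnc;
    GError *error;
} MetaUpdateData;

/* Runs in a worker thread so the GUI thread is never blocked. */
static gpointer
meta_update_thread (gpointer user_data)
{
    MetaUpdateData *data = user_data;
    /* A NULL GdaMetaContext requests a full meta store update (the slow call). */
    gda_connection_update_meta_store (data->cnc, NULL, &data->error);
    return NULL;
}

/* Called from the GUI thread; returns immediately. Join the thread later to
 * find out when the cached meta data is ready. */
static GThread *
start_meta_update (GdaConnection *cnc, MetaUpdateData *data)
{
    data->cnc = cnc;
    data->error = NULL;
    return g_thread_new ("meta-update", meta_update_thread, data);
}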
I still doubt that there is so much extra meta data that it can explain an order-of-magnitude slowdown. However, I have no useful information. Do you know if it is syncing every change to disk immediately? I have changed the Glom code to get just the table names and the supported field types, as well as getting the per-table meta data when it is needed. That is slightly faster, though getting the list of tables is still surprisingly slow (5 or 6 seconds here). Hopefully that is enough for us.
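For reference, a minimal sketch of restricting the update to the "_tables" meta table and then reading the cached table list back, assuming the libgda 4.x C API (the function name is illustrative and error handling is trimmed):

#include <stdio.h>
#include <libgda/libgda.h>

static void
list_tables (GdaConnection *connection)
{
    GError *error = NULL;
    GdaMetaContext mcontext = {"_tables", 0, NULL, NULL};

    /* Update only the "_tables" meta table (plus whatever depends on it). */
    if (!gda_connection_update_meta_store (connection, &mcontext, &error)) {
        g_warning ("Meta store update failed: %s", error->message);
        g_clear_error (&error);
        return;
    }

    /* Read the cached list of tables back from the local meta store. */
    GdaDataModel *tables;
    tables = gda_connection_get_meta_store_data (connection,
                                                 GDA_CONNECTION_META_TABLES,
                                                 &error, 0);
    if (tables) {
        gda_data_model_dump (tables, stdout);
        g_object_unref (tables);
    }
}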
Getting the list of all tables seems to be slow because it also fetches a lot of (internal?) PostgreSQL tables. We can prevent these from being fetched by filtering on the table's schema. The tables Glom uses are all in the "public" schema. I tried this out, and it significantly speeds up the meta store update, and Glom still seems to have all the information it needs. However, the schema name for the non-internal tables doesn't seem to be the same across database systems: for PostgreSQL it is "public", but for SQLite it is "main". Is there another way to make it update the meta store only for the non-internal tables? Otherwise I think we can just hardcode the corresponding schema for each supported backend in Glom.
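For illustration, a minimal sketch of that hard-coded approach, assuming the libgda 4.x C API; it picks "public" or "main" based on the provider name (the "PostgreSQL" provider-name string and the helper name are assumptions here) and restricts the "_tables" update to that schema via the "table_schema" filter column used in the examples further down:

#include <libgda/libgda.h>

static gboolean
update_tables_in_default_schema (GdaConnection *connection, GError **error)
{
    /* "public" for PostgreSQL, "main" for SQLite, as noted above. */
    const gchar *provider = gda_connection_get_provider_name (connection);
    const gchar *schema =
        (g_strcmp0 (provider, "PostgreSQL") == 0) ? "public" : "main";

    GdaMetaContext mcontext = {"_tables", 1, NULL, NULL};
    mcontext.column_names = g_new (gchar *, 1);
    mcontext.column_names[0] = "table_schema";
    mcontext.column_values = g_new (GValue *, 1);
    g_value_set_string ((mcontext.column_values[0] = gda_value_new (G_TYPE_STRING)),
                        schema);

    /* Only "_tables" rows in the chosen schema (and their dependencies) are updated. */
    gboolean ok = gda_connection_update_meta_store (connection, &mcontext, error);

    gda_value_free (mcontext.column_values[0]);
    g_free (mcontext.column_values);
    g_free (mcontext.column_names);
    return ok;
}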
Have you tried then to restrict the meta store update to the schemas which are not public, as:

GdaMetaContext mcontext = {"_schemata", 1, NULL, NULL};
mcontext.column_names = g_new (gchar *, 1);
mcontext.column_names[0] = "schema_internal";
mcontext.column_values = g_new (GValue *, 1);
g_value_set_boolean ((mcontext.column_values[0] = gda_value_new (G_TYPE_BOOLEAN)), FALSE);
gda_connection_update_meta_store (connection, &mcontext, &error);
Typo correction: "restrict the meta store update to the schemas which are not internal"
I don't think I fully understand: Basically I would like to get information about tables, not schemas, which is why I told it to update the "_tables" table, not the "_schemata" table. Am I supposed to call gda_connection_update_meta_store afterwards again for each non-internal schema? However, the code you posted already takes even longer to execute than just updating "_tables" for all schemas.
The meta store update is done in a way that ensures the database used internally is never left in an inconsistent state. For example, if table B references table A, then when table A is updated, table B is also updated. This means that when the context specifies that the "_tables" table has to be updated, all the tables which reference "_tables" are also updated (and recursively). To illustrate this, you can modify the gda-connection.c file and change the #ifdef from GDA_DEBUG_NO to GDA_DEBUG and you'll see some debug info. For example with:

GdaMetaContext mcontext = {"_schemata", 1, NULL, NULL};
mcontext.column_names = g_new (gchar *, 1);
mcontext.column_names[0] = "schema_internal";
mcontext.column_values = g_new (GValue *, 1);
g_value_set_boolean ((mcontext.column_values[0] = gda_value_new (G_TYPE_BOOLEAN)), FALSE);

you'll get:

UP: GdaMetaContext for table _information_schema_catalog_name: ---
 ->: GdaMetaContext for table _schemata: [schema_internal => FALSE]
DN: GdaMetaContext for table _udt: ---
DN: GdaMetaContext for table _domains: ---
DN: GdaMetaContext for table _tables: ---
DN: GdaMetaContext for table _collations: ---
DN: GdaMetaContext for table _routines: ---
DN: GdaMetaContext for table _character_sets: ---
DN: GdaMetaContext for table _triggers: ---
[...MORE...]

and with:

GdaMetaContext mcontext2 = {"_tables", 2, NULL, NULL};
mcontext2.column_names = g_new (gchar *, 2);
mcontext2.column_names[0] = "table_name";
mcontext2.column_values = g_new (GValue *, 2);
g_value_set_string ((mcontext2.column_values[0] = gda_value_new (G_TYPE_STRING)), "customers");
mcontext2.column_names[1] = "table_schema";
g_value_set_string ((mcontext2.column_values[1] = gda_value_new (G_TYPE_STRING)), "main");

you'll get:

UP: GdaMetaContext for table _information_schema_catalog_name: ---
UP: GdaMetaContext for table _schemata: [schema_name => main]
 ->: GdaMetaContext for table _tables: [table_name => customers] [table_schema => main]
DN: GdaMetaContext for table _triggers: ---
DN: GdaMetaContext for table _table_constraints: ---
DN: GdaMetaContext for table _columns: ---
DN: GdaMetaContext for table _referential_constraints: ---
DN: GdaMetaContext for table _referential_constraints: ---
DN: GdaMetaContext for table _key_column_usage: ---
DN: GdaMetaContext for table _check_column_usage: ---
DN: GdaMetaContext for table _key_column_usage: ---
DN: GdaMetaContext for table _check_column_usage: ---
DN: GdaMetaContext for table _view_column_usage: ---

You can see that in the second case far fewer tables are updated, which is why it's much faster. The current API does not allow one to specify several GdaMetaContext structures at once, so you can't simultaneously restrict the update to non-internal schemas and to the "_tables" table. Maybe I should add that possibility in 4.2, but in the meantime you'll have to hard-code the "public" schema name as you proposed.
Armin, this has made Glom much faster when doing this update. However, I think you may have caused Glom to not show any data immediately after creating from an example (it works when closing and reloading the file), with these errors:

Glom Base_DB::query_execute_select(): Error while executing SQL
SELECT "pg_group"."groname" FROM "pg_group"
Internal error (Database): Internal error: invalid provider handle

Glom Base_DB::query_execute_select(): Error while executing SQL
SELECT "pg_group"."groname" FROM "pg_group"
Internal error (Database): Internal error: invalid provider handle

Glom Base_DB::query_execute_select(): Error while executing SQL
SELECT "pg_group"."groname" FROM "pg_group"
Internal error (Database): Internal error: invalid provider handle

Glom Base_DB::query_execute_select(): Error while executing SQL
SELECT "pg_group"."groname", "pg_group"."grolist" FROM "pg_group" WHERE "pg_group"."groname" = 'glom_developer'
Internal error (Database): Internal error: invalid provider handle

DEBUG: User=glom_default_developer_user _is_ in the developer group on the server.

Glom get_choice_values(): Error while executing SQL
SELECT "artists"."artist_id", "artists"."name" FROM "artists" ORDER BY "artists"."artist_id"

Glom get_choice_values(): Error while executing SQL
SELECT "publishers"."publishers_id", "publishers"."name" FROM "publishers" ORDER BY "publishers"."publishers_id"

Glom Base_DB::query_execute_select(): Error while executing SQL
SELECT "pg_group"."groname" FROM "pg_group"
Internal error (Database): Internal error: invalid provider handle

Yeah, I should file this as a Glom bug really.
I have also noticed this, and fixed it in git master now. The problem was unrelated to the metastore update, though.
Thanks. Armin, I think libgdamm needs more explanation of what this schema parameter is. So far it's just:

/** Update the meta store information for the table named @a table_name
 *
 * This can speed up the update of the meta store if you only need the information
 * for a specific table
 * @param table_name Name of the table where the information is needed
 * @param schema_name Name of the schema @table_name is in, or "" to update
 * all tables with the given name.
 * @return <tt>true</tt> if no error occurred.
 */
bool update_meta_store_table(const Glib::ustring& table_name, const Glib::ustring& schema_name = Glib::ustring());
I improved the documentation for it.
The documentation is still really vague about what this schema parameter actually is or means. There's a TODO in the code about that now though.
Created attachment 141640 [details]
gda_meta_update_tables_with_schema_fails.c

Using these schema names no longer works with libgda in git (at least, the master branch). Here is a test case. It produces this output:

SUGGEST META UPDATE
Returned FALSE: syntax error at or near "SELECT"
LINE 1: ...espace n, pg_roles u WHERE n.nspowner = u.oid AND SELECT cur...
This has been corrected in git, in both the master and LIBGDA_4.0 branches; it was due to an over-hasty copy/paste...
I think we forgot to close this bug.