GNOME Bugzilla – Bug 318742
big tables memory consumption
Last modified: 2007-09-27 10:06:19 UTC
There is a big architectural problem in GDA. In industrial databases we can very often find tables with millions of rows. If I have to read one of these tables, for example to export it to a text file, with the current GDA architecture all rows will be fetched into memory before I can use any of them! So there is a big, useless memory consumption.

Example:
- 1,000,000 records in the select
- 30 fields in the select
- to fetch all records, GDA caches every record read in a big array of GdaRows
- every GdaRow consumes about ~500 bytes (using GdaValue with an average of 16 bytes per value)
- the cache array uses about 4,000,000 bytes to store the pointers to the GdaRows
- the total memory consumption is 4,000,000 + 1,000,000 * 500 = 504,000,000 bytes ~= 480 MB!!!!!

-> GDA is not usable this way!!

Instead we can use a "cursor" approach in which only the current row is allocated, fetched and made available to the client program. Example of code with a Cursor object that stores only the current row:

        Cursor *cursor = gda_connection_open_cursor (connection, command, options...);
        while (gda_cursor_fetch (cursor)) {
                for (col = 0; col < gda_cursor_get_n_columns (cursor); col++) {
                        value = gda_cursor_get_value (cursor, col);
                        /* do something with value */
                }
        }

If somebody needs the cached approach, a DataModel can easily be built on top of the Cursor object, simply caching the fetched rows in a GPtrArray (see the sketch after this comment). This way we have a low level API with the single-current-row Cursor approach and a higher level API with the all-rows-cached DataModel approach.

See: the Python Database API 2.0 specification uses a cursor object with two different methods: fetchone() for row-by-row traversal, and fetchall() to get all rows at once.

Python example - row by row:

        cursor = connection.cursor()
        row = cursor.fetchone()
        while row:
            # do something with row data
            row = cursor.fetchone()
        cursor.close()

Python example - all rows together:

        cursor = connection.cursor()
        rows = cursor.fetchall()
        cursor.close()
        for row in rows:
            # do something with row data

P.S. Sorry for my bad English ;-)
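A minimal sketch of that cached wrapper, assuming the hypothetical Cursor API proposed in the comment above (gda_connection_open_cursor(), gda_cursor_fetch(), gda_cursor_get_n_columns(), gda_cursor_get_value() do not exist in GDA); only gda_value_copy() and the GLib calls are real API:

        /* Hypothetical sketch: a cached "data model" built on top of the
         * proposed Cursor API. None of the gda_cursor_* functions exist;
         * they are the API proposed in this comment. */
        #include <glib.h>
        #include <libgda/libgda.h>

        typedef struct {
                GPtrArray *rows;   /* each element is a GPtrArray of GdaValue* */
                gint       n_cols;
        } CachedModel;

        static CachedModel *
        cached_model_new_from_cursor (Cursor *cursor)
        {
                CachedModel *model = g_new0 (CachedModel, 1);

                model->rows = g_ptr_array_new ();
                model->n_cols = gda_cursor_get_n_columns (cursor);

                /* Fetch every row once and keep a copy: this is exactly the
                 * all-rows-cached behaviour the current providers hard-code,
                 * but here it becomes an optional layer over the cursor. */
                while (gda_cursor_fetch (cursor)) {
                        GPtrArray *row = g_ptr_array_sized_new (model->n_cols);
                        gint col;

                        for (col = 0; col < model->n_cols; col++)
                                g_ptr_array_add (row, gda_value_copy (gda_cursor_get_value (cursor, col)));
                        g_ptr_array_add (model->rows, row);
                }
                return model;
        }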
Which provider have you tested this with?
The cursor functionality is IMHO a requirement for large databases. I'll work towards proposing a post-2.0 API for that purpose. Beware, however, that not all DBMSs provide such a cursor to iterate through a result set.
In the Firebird provider, in the function "gda_firebird_recordset_new", you can see:

        /* Fetch all rows */
        while (fb_sql_fetch_row (fcnc, recset))
                recset->priv->nrows++;

In the ODBC provider, in the function "process_sql_commands", you can see:

        /* get rows */
        while (SQL_SUCCEEDED ((rc = SQLFetch (priv_data->hstmt)))) {
                ...
                gda_data_model_append_values (GDA_DATA_MODEL (recset), value_list);
                ...
        }

In the Oracle provider, in the function "gda_oracle_recordset_get_n_rows", you can see:

        while (result == OCI_SUCCESS) {
                nrows += 1;
                result = OCIStmtFetch (priv_data->hstmt,
                                       priv_data->cdata->herr,
                                       (ub4) 1,
                                       (ub2) OCI_FETCH_NEXT,
                                       (ub4) OCI_DEFAULT);
        }

In the PostgreSQL provider, in the function "gda_postgres_recordset_get_row", you can see:

        row_list = get_row (GDA_DATA_MODEL (model), priv_data, row);
        gda_data_model_hash_insert_row (GDA_DATA_MODEL_HASH (model), row, row_list);

In the MySQL provider, in the function "gda_mysql_recordset_get_row", you can see:

        for (i = fetched_rows; i <= row; i++) {
                row_list = fetch_row (recset, i);
                if (!row_list)
                        return NULL;
                if (GDA_DATA_MODEL_BASE_CLASS (parent_class)->append_row (model, row_list) == FALSE)
                        return NULL;
        }

and so on... We can see that all the providers cache the rows in an internal table, so if we are reading a lot of records sequentially, we end up wasting a lot of memory. I know this is an architectural design problem; to solve it I think we have to re-think the entire architecture of the recordset :-(

Best Regards
I don't think introducing cursors means re-thinking the GdaDataModel API. The GdaDataModel interface needs to be modified to also operate with a cursor, along with a method which tells the possible usage modes of the model, among direct_access (as it is now), cursor_forward_only, and cursor_forward_backward (direct access through a cursor). Then a new kind of model (GdaDataModelCursor?) needs to be written which implements the new cursor data access method, and the providers must use that new model class. Also, when a provider executes a query which returns a recordset, we need to add an options parameter to specify how the data in the recordset will be accessed, as a hint about which kind of data model class the provider needs to return. This way we don't change the API for current applications, and we introduce a new API to handle cursors (see the sketch below).
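A purely illustrative sketch of that proposal; none of these declarations exist in libgda, the enum, the access-mode method and the execution hint only restate the idea above:

        /* Hypothetical declarations illustrating the proposal. */

        typedef enum {
                GDA_DATA_MODEL_ACCESS_DIRECT,                  /* like it is now */
                GDA_DATA_MODEL_ACCESS_CURSOR_FORWARD_ONLY,
                GDA_DATA_MODEL_ACCESS_CURSOR_FORWARD_BACKWARD
        } GdaDataModelAccessMode;

        /* New method on the GdaDataModel interface: report which access
         * modes a given model implementation supports. */
        GdaDataModelAccessMode gda_data_model_get_access_mode (GdaDataModel *model);

        /* When executing a command, the caller hints how it intends to
         * access the result, so the provider can return a cursor-only
         * model instead of loading everything into memory. */
        GdaDataModel *gda_connection_execute_command_full (GdaConnection *cnc,
                                                           GdaCommand *cmd,
                                                           GdaDataModelAccessMode hint,
                                                           GError **error);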
What we can do for the time being is to fix the providers (most of them) that fetch all the rows when opening the recordset. All of them should be doing what the PostgreSQL provider is doing.
The latest API includes support for cursor-based data models (which should not load all their data into memory). For example, when importing data from a CSV file it is possible to specify whether the resulting data model will be accessed using a cursor (i.e., iterated over, using a GdaDataModelIter object). The rule is that some GdaDataModels can be accessed in a random way and some can only be accessed using a cursor (random access data models can of course still be accessed using cursors). When running a SELECT query, you can specify that you only need to access the returned data model using a cursor; if the provider supports it, it should then return a data model which only supports cursor-based access, thus reducing memory consumption. Note however that this is only a preference, and the provider may still return a random access data model. Moreover, at the moment no provider supports this; it now needs to be implemented in each provider.
When there is an example of this use of cursors, we should close this bug.
> Moreover, at the moment no provider supports this; it now needs to be
> implemented in each provider.

Does anyone have any plans to implement this for any providers, and to create an example?
SVN trunk now contains an implementation for PostgreSQL. To request a cursor-only data model when executing a command, do as follows:

        GdaParameterList *options;

        options = gda_parameter_list_new_inline (NULL,
                                                 "ITER_MODEL_ONLY", G_TYPE_BOOLEAN, TRUE,
                                                 NULL);
        model = gda_connection_execute_select_command (cnc, command, options, NULL);
        g_object_unref (options);

and use an iterator to access the contents (using gda_data_model_get_value_at() won't work):

        GdaDataModelIter *iter;

        iter = gda_data_model_create_iter (model);
        gda_data_model_iter_move_next (iter);
        while (gda_data_model_iter_is_valid (iter)) {
                [...]
                gda_data_model_iter_move_next (iter);
        }
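For reference, a complete read loop might look like the following minimal sketch; it assumes gda_data_model_iter_get_value_at() is available in the libgda version used (otherwise the values can be read through the iterator's parameters, since GdaDataModelIter derives from GdaParameterList):

        /* Sketch only: gda_data_model_iter_get_value_at() is assumed API. */
        gint col, n_cols = gda_data_model_get_n_columns (model);

        gda_data_model_iter_move_next (iter);
        while (gda_data_model_iter_is_valid (iter)) {
                for (col = 0; col < n_cols; col++) {
                        const GValue *value;

                        value = gda_data_model_iter_get_value_at (iter, col);
                        /* do something with value */
                }
                gda_data_model_iter_move_next (iter);
        }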
Closing this bug since the infrastructure is in place and now only needs to be implemented in the providers.