Bug 318742 - big tables memory consumption

Status: RESOLVED FIXED
Product: libgda
Classification: Other
Component: general
Version: unspecified
Hardware/OS: Other / All
Priority/Severity: Normal / major
Target Milestone: ---
Assigned To: malerba
QA Contact: gnome-db Maintainers
Depends on:
Blocks:

Reported: 2005-10-13 06:34 UTC by Michele Cremasco
Modified: 2007-09-27 10:06 UTC
See Also:
GNOME target: ---
GNOME version: ---



Description Michele Cremasco 2005-10-13 06:34:16 UTC
There is a big architectural problem in GDA.

In industrial databases we very often find tables with millions of rows.
If I have to read one of these tables, for example to export it to a text
file, with the current GDA architecture all rows are fetched into memory
before I can use any of them!
So a huge amount of memory is wasted for nothing.

Example:
- 1,000,000 records in the select
- 30 fields in the select
- to fetch all records, GDA caches every row read in a big array of GdaRows
- every GdaRow consumes about 500 bytes (30 fields at an average of 16 bytes
per GdaValue, plus overhead)
- the cache array uses about 4,000,000 bytes to store the pointers to the
GdaRows (1,000,000 pointers of 4 bytes each)
- the total memory consumption is 4,000,000 + 1,000,000 * 500 = 504,000,000
bytes ~= 480 MB !!!!!

-> GDA is not usable this way!!

Instead we could use a "cursor" approach, in which only the current row is
allocated, fetched, and made available to the client program.

Example of code with a Cursor object that stores only the current row:

  Cursor *cursor = gda_connection_open_cursor (connection, command, options...);
  while (gda_cursor_fetch (cursor))
  {
    for (col = 0; col < gda_cursor_get_n_column (cursor); col++)
    {
      value = gda_cursor_get_value (cursor, col);
      /* do something with value */
    }
  }

If somebody needs the cached approach, a DataModel can easily be built on
top of the Cursor object, simply caching the fetched rows in a GPtrArray
(see the sketch below).

In this way we have a low-level API with the single-current-row Cursor
approach and a higher-level API with the all-rows-cached DataModel approach.
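
As a rough illustration, such a caching layer could be a thin loop over the
proposed API (a minimal sketch only: the Cursor type and the gda_cursor_*
functions are the hypothetical API suggested above, not existing GDA calls):

  #include <glib.h>

  /* Hypothetical sketch: cache every row fetched from a forward-only
   * cursor into a GPtrArray, giving random access afterwards. */
  GPtrArray *
  cache_all_rows (Cursor *cursor)
  {
      GPtrArray *rows = g_ptr_array_new ();
      gint col, n_cols = gda_cursor_get_n_column (cursor);

      while (gda_cursor_fetch (cursor)) {
          /* Copy the values out, since the cursor only keeps the
           * current row alive until the next fetch. */
          GPtrArray *row = g_ptr_array_sized_new (n_cols);
          for (col = 0; col < n_cols; col++)
              g_ptr_array_add (row,
                               gda_value_copy (gda_cursor_get_value (cursor, col)));
          g_ptr_array_add (rows, row);
      }
      return rows;
  }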

See:
 The Python Database API 2.0 specification uses a cursor object with two
different methods: fetchone() for row-by-row traversal, and fetchall() to get
all rows at once.

 Python example - row by row:

  cursor = connection.cursor()
  row = cursor.fetchone()
  while(row):
     # do something with row data
     row = cursor.fetchone()
  cursor.close()

 Python example - all rows together:

  cursor = connection.cursor()
  rows = cursor.fetchall()
  cursor.close()
  for row in rows:
     # do something with row data

P.S. Sorry for my bad English ;-)
Comment 1 Rodrigo Moya 2005-10-13 11:02:58 UTC
With which provider have you tested this?
Comment 2 malerba 2005-10-13 11:45:52 UTC
The cursor functionality is IMHO a requirement for large databases. I'll work
towards proposing a post-2.0 API for that purpose. Beware, however, that not
all DBMSs provide such a cursor to iterate through a result set.
Comment 3 Michele Cremasco 2005-10-13 14:00:11 UTC
In the Firebird provider, in the function "gda_firebird_recordset_new", you can see:

    /* Fetch all rows */
    while (fb_sql_fetch_row (fcnc, recset))
        recset->priv->nrows++;

In the ODBC provider, in the function "process_sql_commands", you can see:

    /* get rows */
    while( SQL_SUCCEEDED(( rc = SQLFetch ( priv_data->hstmt)))) {
        ...
        gda_data_model_append_values (GDA_DATA_MODEL (recset), value_list);
        ...
    }

In the Oracle provider, in the function "gda_oracle_recordset_get_n_rows", you can see:

    while (result == OCI_SUCCESS) {
        nrows += 1;
        result = OCIStmtFetch (priv_data->hstmt,
                priv_data->cdata->herr,
                (ub4) 1,
                (ub2) OCI_FETCH_NEXT,
                (ub4) OCI_DEFAULT);
        }

In the PostgreSQL provider, in the function "gda_postgres_recordset_get_row", you can see:

    row_list = get_row (GDA_DATA_MODEL (model), priv_data,row);
    gda_data_model_hash_insert_row(GDA_DATA_MODEL_HASH (model),row, row_list);

In the MySQL provider, in the function "gda_mysql_recordset_get_row", you can see:

    for (i = fetched_rows; i <= row; i++)
    {
        row_list = fetch_row (recset, i);
        if (!row_list)
            return NULL;
        if (GDA_DATA_MODEL_BASE_CLASS (parent_class)->append_row(model,
            row_list) == FALSE)
            return NULL;
    }

and so on...
We can see that all providers cache the rows in an internal table.
So if we read a lot of records sequentially, we waste a lot of memory.

I know this is a problem of architectural choice.
To solve it, I think we have to rethink the entire architecture of the
recordset :-(

Best Regards
Comment 4 malerba 2005-10-13 14:49:14 UTC
I don't think introducing cursors means re-thinking the GdaDataModel API. 

The GdaDataModel interface needs to be modified to also operate with a cursor,
along with a method which reports the possible usage modes for the model:
direct_access (as it is now), cursor_forward_only, or cursor_forward_backward
(direct access through a cursor).

Then a new kind of model (GdaDataModelCursor ?) needs to be written which
implements the new cursor data access method, and the providers must use that
new model class.

Also, when a provider executes a query which returns a recordset, we need to
add an option parameter specifying how the data in the recordset will be
accessed, as a hint about which kind of data model class the provider should
return.

This way we don't change the API for the current applications and we introduce a
new API to handle cursors.
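
To make the idea concrete, the new pieces might look roughly like this
(purely illustrative names sketching the proposal above, not an actual or
eventually committed API):

  /* Hypothetical access modes a model could report */
  typedef enum {
      GDA_DATA_MODEL_ACCESS_DIRECT,           /* random access, as today */
      GDA_DATA_MODEL_ACCESS_CURSOR_FORWARD,   /* forward-only cursor */
      GDA_DATA_MODEL_ACCESS_CURSOR_FORWARD_BACKWARD /* direct access through a cursor */
  } GdaDataModelAccessMode;

  /* Hypothetical method on the GdaDataModel interface, telling callers
   * which of the above modes a given model supports */
  GdaDataModelAccessMode gda_data_model_get_access_mode (GdaDataModel *model);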
Comment 5 Rodrigo Moya 2005-10-13 15:03:32 UTC
What we can do for the time being is to fix the (majority of) providers that
fetch all the rows when opening the recordset. All of them should be doing
what the PostgreSQL provider does.
Comment 6 malerba 2006-05-08 12:15:36 UTC
The latest API includes support for cursor-based data models (which should not load all their data into memory). For example, when importing data from a CSV file it is possible to specify that the resulting data model will be accessed using a cursor (also called an iterator: using a GdaDataModelIter object).

The rule is that some GdaDataModels can be accessed in a random way and some can only be accessed using a cursor (random access data models can of course still be accessed using cursors).

When running a SELECT query, you can specify that you only need to access the returned data model using a cursor, and then, if it supports it, the provider should return a data model which only supports cursor-based access, thus reducing memory consumption. Note, however, that this is only a preference, and the provider may still return a random access data model. Moreover, at the moment no provider supports this; it needs to be implemented in each provider.
Comment 7 Murray Cumming 2006-12-28 17:24:56 UTC
When there is an example of this use of cursors, we should close this bug.
Comment 8 Murray Cumming 2007-05-05 12:44:23 UTC
> Moreover at the moment no provider supports that, it needs to be implemented
> in each provider now.

Does anyone have any plans to implement this for any providers, and create an example?
Comment 9 malerba 2007-06-06 21:24:44 UTC
SVN trunk now contains an implementation for PostgreSQL. To request a cursor-only data model when executing a command, do as follows:

GdaParameterList *options;

/* Ask for a data model which only supports cursor (iterator) based access */
options = gda_parameter_list_new_inline (NULL, "ITER_MODEL_ONLY", G_TYPE_BOOLEAN, TRUE, NULL);
model = gda_connection_execute_select_command (cnc, command, options, NULL);
g_object_unref (options);

and use an iterator to access the contents (using gda_data_model_get_value_at() won't work):
GdaDataModelIter *iter;
iter = gda_data_model_create_iter (model);
gda_data_model_iter_move_next (iter);
while (gda_data_model_iter_is_valid (iter)) {
    [...]
    gda_data_model_iter_move_next (iter);
}
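
For what it's worth, the loop can be written more compactly, assuming
gda_data_model_iter_move_next() returns FALSE once the iterator has moved
past the last row (a hedged sketch, not code taken from the commit):

while (gda_data_model_iter_move_next (iter)) {
    /* process the current row here */
    [...]
}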
Comment 10 malerba 2007-09-27 10:06:19 UTC
Closing this bug since the infrastructure is in place and only needs to be implemented in the providers.