Bug 644197 - Importing a 73 MB CSV file (4+ million lines) fails with a memory allocation error
Status: RESOLVED FIXED
Product: Gnumeric
Classification: Applications
Component: import/export Text
Version: git master
Hardware: Other All
Importance: Normal normal
Target Milestone: ---
Assigned To: Morten Welinder
QA Contact: Jody Goldberg
Depends on:
Blocks:
 
 
Reported: 2011-03-08 12:53 UTC by dudley
Modified: 2011-05-01 20:29 UTC
See Also:
GNOME target: ---
GNOME version: ---


Attachments
Patch (12.21 KB, patch)
2011-03-23 21:02 UTC, Morten Welinder
Updated patch (12.65 KB, patch)
2011-03-25 20:20 UTC, Morten Welinder

Description dudley 2011-03-08 12:53:00 UTC
2 GB memory on PC.

GLIB-error **: gmem.c 157: failed to allocate 50331648 bytes
aborting ...

50,331,648 bytes = 48 MiB (c. 50 MB)

Physical memory availability reported as 403 kB, so memory exhaustion is reasonable.

But 5 million records at 2 fields per record (both numeric) is 10 million cells. Assuming you use 10 bytes per cell entry, that is 100 million bytes, which should still not stress 2 GB of RAM.
Comment 1 Morten Welinder 2011-03-08 16:05:21 UTC
10 bytes/cell doesn't come close to the amount of data we need per cell.
It's more like 100 bytes/cell.

But that's still only ~1GB.
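
As a rough check of the figures discussed so far, the sketch below redoes the arithmetic in Python. The 10 and 100 bytes-per-cell values are the estimates given in this report; the function name `estimate_bytes` is only illustrative, and the real per-cell overhead will vary between builds.

```python
def estimate_bytes(rows, cols, bytes_per_cell):
    """Back-of-the-envelope memory estimate for a sheet of numeric cells."""
    return rows * cols * bytes_per_cell


# Figures taken from the report: about 5 million records, 2 fields each.
rows, cols = 5_000_000, 2

low = estimate_bytes(rows, cols, 10)    # reporter's guess: 10 bytes per cell
high = estimate_bytes(rows, cols, 100)  # comment 1: roughly 100 bytes per cell

# Size of the single allocation that failed, from the error message.
failed_alloc = 50_331_648

print(f"10 bytes/cell:  {low / 1e6:.0f} MB")            # 100 MB
print(f"100 bytes/cell: {high / 1e9:.0f} GB")           # 1 GB
print(f"failed request: {failed_alloc / 2**20:.0f} MiB")  # 48 MiB
```

Even the larger estimate fits in 2 GB on paper, which is why the temporary growth during parsing matters here.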

A quick look shows that, on my 64-bit Linux, something happens during
parsing that causes Gnumeric to grow to ~2.4G.  After parsing, we drop
down to something like 1.4G.  I'm pretty sure it's not too hard to fix
the parsing part.

However, Gnumeric is not optimized for sheets this size so it isn't a
pleasant experience.
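
The gap between peak and settled memory use described above can be reproduced in miniature. The following is a toy illustration, not Gnumeric's parser: the function `load_rows` and its two-stage design are assumptions made for the example. It uses Python's `tracemalloc` module to show that temporaries held during parsing can push the high water mark above what remains once loading is done.

```python
import tracemalloc


def load_rows(n_rows):
    """Toy two-stage loader: hypothetical, for illustration only."""
    # Stage 1: keep every field as a temporary string, as a naive parser might.
    raw = [(str(i), str(i * 2)) for i in range(n_rows)]
    # Stage 2: build the final numeric cells while the temporaries still exist.
    cells = [(float(a), float(b)) for a, b in raw]
    # The temporaries are released only after all cells have been built.
    del raw
    return cells


tracemalloc.start()
cells = load_rows(100_000)
current, peak = tracemalloc.get_traced_memory()
tracemalloc.stop()

print(f"after loading: {current / 2**20:.1f} MiB, peak: {peak / 2**20:.1f} MiB")
```

In a loader shaped like this, converting and releasing rows incrementally rather than in two full passes would likely keep the peak closer to the final figure.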
Comment 2 Andreas J. Guelzow 2011-03-08 18:13:29 UTC
Assuming that the file attached to bug 644189 contains the same information (4000000 rows, 2 columns), Gnumeric seems to handle the data just fine: after loading my machine uses a total of 1.4GB and Gnumeric is still responding fine (well, ctrl-end takes two seconds to get to the end of the data and calculations involving a whole column take a (long) while, but that is expected.)
Comment 3 dudley 2011-03-08 19:40:23 UTC
(In reply to comment #2)
> Assuming that the file attached to bug 644189 contains the same information
> (4000000 rows, 2 columns), Gnumeric seems to handle the data just fine: after
> loading my machine uses a total of 1.4GB and Gnumeric is still responding fine
> (well, ctrl-end takes two seconds to get to the end of the data and
> calculations involving a whole column take a (long) while, but that is
> expected.)

It does contain the same information. We tried to get around the dbf import issue by loading into Access 2000, exporting to CSV and reading into Gnumeric. With 2 GB RAM machines we then had the memory error issue.
Comment 4 dudley 2011-03-08 19:42:48 UTC
(In reply to comment #1)
> 10 bytes/cell doesn't come close to the amount of data we need per cell.
> It's more like 100 bytes/cell.
> 
> But that's still only ~1GB.
> 
> A quick look shows that, on my 64-bit Linux, something happens during
> parsing that causes Gnumeric to grow to ~2.4G.  After parsing, we drop
> down to something like 1.4G.  I'm pretty sure it's not too hard to fix
> the parsing part.
> 
> However, Gnumeric is not optimized for sheets this size so it isn't a
> pleasant experience.

We are OK with it not being a pleasant experience! We anticipated that processing would be slow using a spreadsheet, and Gnumeric was the only one (Excel 2010 and OpenOffice were our two alternatives) that could even theoretically accommodate the problem size.
Comment 5 Morten Welinder 2011-03-08 21:13:17 UTC
I've checked in a small change that lowers the memory high water mark to 2G.
Memory usage for a 32-bit build will be slightly lower, but since this puts
severe constraints on the memory allocator, it's anyone's guess what will
happen over on win32.

Loading takes 8 minutes.  Ctrl-Down, once loaded, takes ~30s.  I can't
imagine what you want to do with the data that will not drive you crazy.
Comment 6 Morten Welinder 2011-03-10 21:28:01 UTC
See also bug 644437.  glib is responsible for ~600M.
Comment 7 Morten Welinder 2011-03-23 21:02:16 UTC
Created attachment 184172
Patch

This possible patch creates our own data structure for the cell set.
Comment 8 Morten Welinder 2011-03-25 20:20:01 UTC
Created attachment 184223
Updated patch

Fewer bugs.

(Possibly a duplicate due to bugzilla issues.)
Comment 9 Morten Welinder 2011-05-01 20:29:09 UTC
This problem has been fixed in our software repository. The fix will go into the next software release. Thank you for your bug report.

(Fix requires yet-unreleased glib.)