Bug 498912 - Improve format guessing for csv files
Status: RESOLVED FIXED
Product: Gnumeric
Classification: Applications
Component: import/export Text
Version: git master
Hardware: Other All
Importance: Normal enhancement
Target Milestone: ---
Assigned To: Morten Welinder
QA Contact: Jody Goldberg
Depends on:
Blocks:
 
 
Reported: 2007-11-22 07:03 UTC by Daniel Vianna
Modified: 2008-12-12 19:35 UTC
See Also:
GNOME target: ---
GNOME version: ---


Attachments
CSV (comma separated values) file, original (12.96 KB, text/plain)
2007-11-22 07:09 UTC, Daniel Vianna
Desired result: Gnumeric file (8.61 KB, application/octet-stream)
2007-11-22 07:11 UTC, Daniel Vianna
Test case 1 (48 bytes, text/plain)
2008-12-12 14:05 UTC, Patryk Zawadzki
Test case 2 (66 bytes, text/plain)
2008-12-12 14:08 UTC, Patryk Zawadzki

Description Daniel Vianna 2007-11-22 07:03:16 UTC
I have been manually using the advanced import widget to deal with comma separated values (CSV) files exported from DataQuest A.R.T. (http://www.datasci.com/products/software/dsi_dataquest_art.asp). But I realise Gnumeric should recognise CSV seamlessly. Well, it doesn't. I'm attaching a sample CSV file and the desired result as a Gnumeric file. Tested in Gnumeric 1.7.14.
Comment 1 Daniel Vianna 2007-11-22 07:09:46 UTC
Created attachment 99465
CSV (comma separated values) file, original

Maybe Gnumeric gets confused with the file heading. As you can see, lines 2 to 5 have headings for columns A through D. The number of heading lines before the actual data starts is variable, and is related to the number of data columns.
Comment 2 Daniel Vianna 2007-11-22 07:11:05 UTC
Created attachment 99466
Desired result: Gnumeric file

Notice that the locale of the original file makes it use dots to separate decimals, while in my system I use commas.
Comment 3 Daniel Vianna 2007-11-22 07:12:44 UTC
I just noticed that Gnumeric has an option for reading CSV/TSV files. However, it does not read my file correctly as CSV; the output is wrong. Maybe it would be a good idea to separate those options in the widget?
Comment 4 Jean Bréfort 2007-11-22 09:45:54 UTC
You have two solutions: either use Gnumeric in the C locale, or use the advanced importer, where you can specify the locale of the imported file.
Comment 5 Morten Welinder 2007-11-22 13:14:55 UTC
There is no spec for csv files.  Everybody and their brother have their
own idea about what constitutes one.  The same file can and will be
understood differently by different people, so you are going to have
to give Gnumeric hints about what your file is.

Right now, as Jean said, that means locale or the configurable importer.

This is the first time I have seen "#" comments in a csv file, btw.
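
For files that carry "#" comment lines ahead of the data, one possible pre-processing step is to drop those lines before handing the rest to a CSV parser. The following is only a sketch in Python, not anything Gnumeric does: the function name `read_csv_skipping_comments` and the sample text are hypothetical, and the approach assumes that no quoted field spans multiple lines.

```python
import csv
import io

def read_csv_skipping_comments(text, comment_prefix="#", delimiter=","):
    """Parse CSV text after dropping blank lines and comment lines.

    Assumption: quoted fields never contain line breaks, so the text
    can safely be filtered line by line before parsing.
    """
    data_lines = [
        line for line in text.splitlines()
        if line.strip() and not line.lstrip().startswith(comment_prefix)
    ]
    reader = csv.reader(io.StringIO("\n".join(data_lines)), delimiter=delimiter)
    return list(reader)

# Hypothetical sample, not the file attached to this bug.
sample = "# Subject: A\n# Units: mmHg\nTime,Value\n0,1.5\n1,2.5\n"
rows = read_csv_skipping_comments(sample)
```

The values stay as strings here; converting "1.5" to a number is a separate, locale-dependent step.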
Comment 6 Daniel Vianna 2007-11-22 23:53:22 UTC
Another option would be for the configurable importer to remember the options selected during the last import. That would save me from selecting the same thing each time. The same goes for the starting directory when I try to open a file, although GTK already lets me bookmark the directory of interest, which is handy.

The point here is: I love the configurable importer! I just think it could be set up in a way that would save more time. I bet there are many users who have to repeat the same recipe over and over in order to import third party files into Gnumeric.
Comment 7 Daniel Vianna 2007-11-23 01:38:06 UTC
Suggestion: A button, or some other way, to set the default values in the configurable importer?
Comment 8 Patryk Zawadzki 2008-12-12 14:03:35 UTC
This seems like the appropriate place to mumble some more about the CSV importers. I get megabytes of CSV data provided by large corporations daily and most of the time Gnumeric gets it wrong. The most obvious case is favoring colons over commas even if there is just a single colon in the whole file while all the rows have an equal number of commas.

Attaching two simple test cases (both are applicable to the particular data I work with).
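
The observation that every row has the same number of commas suggests a consistency-based guess: for each candidate separator, count its occurrences per line and prefer the candidate whose count is nonzero and identical on the most lines. The Python sketch below illustrates that idea only; it is not Gnumeric's algorithm, the names `CANDIDATES` and `guess_by_consistency` are hypothetical, the sample is invented rather than taken from the attachments, and quoting is ignored entirely.

```python
from collections import Counter

# Hypothetical candidate list; a real importer might consider others.
CANDIDATES = [",", ";", ":", "\t"]

def guess_by_consistency(text, candidates=CANDIDATES):
    """Return the candidate whose per-line count is most consistent.

    Score = (fraction of lines sharing the most common count, that count).
    Candidates whose most common count is zero are skipped.
    Returns None if no candidate qualifies.
    """
    lines = [line for line in text.splitlines() if line]
    if not lines:
        return None
    best, best_score = None, (0.0, 0)
    for sep in candidates:
        counts = [line.count(sep) for line in lines]
        common_count, freq = Counter(counts).most_common(1)[0]
        if common_count == 0:
            continue
        score = (freq / len(lines), common_count)
        if score > best_score:
            best, best_score = sep, score
    return best

# Invented sample: three rows with two commas each and one stray colon.
sample = "a,b,c\n1,2,3\n4:5,6,7\n"
guess = guess_by_consistency(sample)
```

With this sample the comma appears twice on every line, while the colon is absent from most lines, so the comma wins.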
Comment 9 Patryk Zawadzki 2008-12-12 14:05:38 UTC
Created attachment 124521
Test case 1

A simple file with a 3x3 matrix. Gnumeric decides to split on the colon, which is wrong. oocalc guesses the format correctly.
Comment 10 Patryk Zawadzki 2008-12-12 14:08:10 UTC
Created attachment 124522
Test case 2

Same as test case 1 but using quoted fields. Gnumeric does something completely weird. oocalc correctly guesses the format.
Comment 11 Jean Bréfort 2008-12-12 14:46:21 UTC
Patryk, the default behavior is locale dependent. Opening your files from the command line with locale set to "C" works fine. In a locale where the decimal separator is a comma, I see the same behaviour as you.

When opening a csv file from the menus, you can choose the separator, locale and so on for the file if you select "Text import (configurable)" after clicking the "Advanced" button. You get the same options if you import your data using the "Data/Get External Data/Import Text File..." menu item.
Comment 12 Patryk Zawadzki 2008-12-12 15:07:18 UTC
That's nice but I don't keep a Gnumeric window open all the time. I launch the application by clicking on one of the 12+ MB csv files, wait, curse a lot and then launch oocalc :)

To be honest I've never encountered a localised csv file. We do use the comma as the decimal separator in Poland, but all the apps I'm aware of use the portable C notion when serializing data to disk. I have plenty of files using commas in the fields, but these are properly quoted so that they are not mistaken for field separators.

I don't think it makes much sense to look at LC_* in the case of the second test case. The fields are quoted so the comma between them is certainly not part of a number :)
Comment 13 Morten Welinder 2008-12-12 17:28:58 UTC
I'll get around to this eventually.
Comment 14 Morten Welinder 2008-12-12 19:35:55 UTC
This problem has been fixed in the development version. The fix will be available in the next major software release. Thank you for your bug report.


Method applied to csv files:

1. Look for a line with a double quote in it, preferably as first char.
2. If such a line is found use the character after the matching end quote
   unless that is the end of the line.  In that case, try the character
   before the first quote.
3. If we do not get anything this way, use a ",".
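
The three steps above can be sketched in Python as follows. This is one reading of the description, not the actual Gnumeric code: the names `find_closing_quote` and `guess_separator` are hypothetical, treating a doubled quote ("") as an escaped quote is an assumption, and so is moving on to the next quoted line when one line yields nothing.

```python
def find_closing_quote(line, start):
    """Return the index of the quote that closes the one at `start`.

    Assumption: a doubled quote ("") inside a field is an escaped quote.
    Returns -1 if the quote is never closed on this line.
    """
    i = start + 1
    while i < len(line):
        if line[i] == '"':
            if i + 1 < len(line) and line[i + 1] == '"':
                i += 2  # skip the escaped quote pair
                continue
            return i
        i += 1
    return -1

def guess_separator(text):
    """Guess the field separator using the quote-based steps above."""
    # Step 1: lines containing a double quote, those starting with one first.
    quoted = [line for line in text.splitlines() if '"' in line]
    quoted.sort(key=lambda line: not line.startswith('"'))  # stable sort
    for line in quoted:
        first = line.index('"')
        end = find_closing_quote(line, first)
        if end == -1:
            continue
        # Step 2: character after the closing quote, unless at end of line...
        if end + 1 < len(line):
            return line[end + 1]
        # ...in which case try the character before the first quote.
        if first > 0:
            return line[first - 1]
    # Step 3: nothing found, fall back to a comma.
    return ","

quoted_guess = guess_separator('"a","b","c"\n"1","2","3"\n')
trailing_guess = guess_separator('x;"hello"\n')
plain_guess = guess_separator("a,b\n1,2\n")
```

Because the guess comes from the position of the quotes alone, it does not depend on the locale's decimal separator.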


I don't think 3 is ideal.  We should probably look at the number of commas
vs ";" in that case.
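
A minimal Python sketch of that suggested fallback, assuming a plain count over the whole text and a tie going to the comma (the function name `fallback_separator` is hypothetical):

```python
def fallback_separator(text):
    """Pick ";" only when it is strictly more frequent than ","."""
    commas = text.count(",")
    semicolons = text.count(";")
    return ";" if semicolons > commas else ","

semicolon_case = fallback_separator("a;b;c\n1;2;3\n")
comma_case = fallback_separator("a,b,c\n1,2,3\n")
```

A raw count has an obvious weakness with decimal commas: a line such as 1,5;2,5;3,5 contains three commas and only two semicolons, so a per-line consistency check would probably be needed as well.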