GNOME Bugzilla – Bug 498912
Improve format guessing for csv files
Last modified: 2008-12-12 19:35:55 UTC
I have been importing comma separated values (CSV) files exported from DataQuest A.R.T. (http://www.datasci.com/products/software/dsi_dataquest_art.asp) by hand, using the advanced import widget. But I realise Gnumeric should recognise CSV seamlessly. Well, it doesn't. I'm attaching a sample CSV file and the desired result in a Gnumeric file. Tested in Gnumeric 1.7.14.
Created attachment 99465: CSV (comma separated values) file, original
Maybe Gnumeric gets confused by the file heading. As you can see, lines 2 to 5 have headings for columns A through D. The number of heading lines before the actual data starts is variable, and is related to the number of data columns.
Created attachment 99466: Desired result: Gnumeric file
Notice that the locale of the original file makes it use dots as the decimal separator, while on my system I use commas.
I just noticed that Gnumeric has an option for reading CSV/TSV files. Well, it does not read my file as CSV; the output is wrong. Maybe it would be a good idea to separate those options in the widget?
You have two solutions: either run Gnumeric in the C locale, or use the advanced importer, where you can specify the locale of the imported file.
There is no spec for csv files. Everybody and their brother have their own idea about what constitutes one. The same file can and will be understood differently by different people, so you are going to have to give Gnumeric hints about what your file is. Right now, as Jean said, that means locale or the configurable importer. This is the first time I have seen "#" comments in a csv file, btw.
Another option would be for the configurable importer to remember the options selected in the last import action. That would save me selecting the same thing each time. The same goes for the starting directory when I try to open a file, although GTK already gives me the option of bookmarking the directory of interest, which is handy. The point here is: I love the configurable importer! I just think it could be set up in a way that would save more time. I bet there are many users who have to repeat the recipe over and over in order to import third-party files into Gnumeric.
Suggestion: A button, or way, to set the default values in the configurable importer?
This seems like the appropriate place to mumble some more about the CSV importers. I get megabytes of CSV data provided by large corporations daily and most of the time Gnumeric gets it wrong. The most obvious case is favoring colons over commas even if there is just a single colon in the whole file while all the rows have an equal number of commas. Attaching two simple test cases (both are applicable to the particular data I work with).
Created attachment 124521: Test case 1
A simple file with a 3x3 matrix. Gnumeric decides to split on the colon, which is wrong. oocalc guesses the format correctly.
Created attachment 124522: Test case 2
Same as test case 1 but using quoted fields. Gnumeric does something completely weird. oocalc correctly guesses the format.
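To make the expectation in these two test cases concrete, here is a minimal Python sketch of a consistency heuristic: prefer the candidate separator that appears the same number of times on every line, ignoring anything inside double quotes. This is not Gnumeric's code and not necessarily what oocalc does. The function names, the candidate list, and the sample rows are all hypothetical, since the attachments themselves are not reproduced here.

```python
# Hypothetical candidate separators; not taken from Gnumeric.
CANDIDATES = [",", ";", ":", "\t"]


def count_outside_quotes(line, sep):
    """Count occurrences of sep that are not inside a double-quoted field."""
    in_quotes = False
    count = 0
    for ch in line:
        if ch == '"':
            # A doubled quote ("") toggles twice, so the state is unchanged.
            in_quotes = not in_quotes
        elif ch == sep and not in_quotes:
            count += 1
    return count


def guess_by_consistency(lines, candidates=CANDIDATES):
    """Return the separator with the most consistent per-line count, or None."""
    rows = [ln.rstrip("\r\n") for ln in lines if ln.strip()]
    best, best_key = None, None
    for sep in candidates:
        counts = [count_outside_quotes(row, sep) for row in rows]
        # A separator that is missing from any line cannot split every row.
        if not counts or min(counts) == 0:
            continue
        # Rank identical per-line counts first, then higher counts.
        key = (len(set(counts)) == 1, min(counts))
        if best_key is None or key > best_key:
            best, best_key = sep, key
    return best


# Made-up rows in the spirit of the test cases: one stray colon, commas everywhere.
plain = ["1,2,3", "4,5:5,6", "7,8,9"]
quoted = ['"a,b","c","d"', '"e","f","g"']
print(repr(guess_by_consistency(plain)))
print(repr(guess_by_consistency(quoted)))
```

With input like this, a single colon in one row never wins, because it is absent from the other rows. A comma inside a quoted field is not counted as a separator.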
Patryk, the default behavior is locale dependent. Opening your files from the command line with locale set to "C" works fine. In a locale where the decimal separator is a comma, I see the same behaviour as you. When opening a csv file from the menus, you can choose the separator, locale and so on for the file if you choose "Text import (configurable)" after clicking the "Advanced" button, or you'll get the same if you import your data using the "Data/Get External Data/Import Text File..." menu item.
That's nice but I don't keep a Gnumeric window open all the time. I launch the application by clicking on one of the 12+ MB csv files, wait, curse a lot and then launch oocalc :) To be honest I've never encountered a localised csv file. We do use the comma as decimal separator in Poland but all the apps I'm aware of use the portable C notation when serializing data to disk. I have plenty of files using commas in the fields but these are properly quoted so they are not mistaken for field separators. I don't think it makes much sense to look at LC_* in the case of the second test case. The fields are quoted so the comma between them is certainly not part of a number :)
I'll get around to this eventually.
This problem has been fixed in the development version. The fix will be available in the next major software release. Thank you for your bug report.
Method applied to csv files:
1. Look for a line with a double quote in it, preferably as first char.
2. If such a line is found use the character after the matching end quote unless that is the end of the line. In that case, try the character before the first quote.
3. If we do not get anything this way, use a ",".
I don't think 3 is ideal. We should probably look at the number of commas vs ";" in that case.
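For readers who want to see the three steps spelled out, here is a rough Python sketch of the method as described in this comment. It is an illustration only, not the code that went into Gnumeric. The function names are hypothetical, and two details are assumptions: a doubled quote ("") inside a field is treated as an escaped quote, and only the single preferred line is examined.

```python
def find_closing_quote(line, start):
    """Return the index of the quote matching the one at `start`, or None.

    Assumption: a doubled quote ("") inside the field is an escaped quote.
    """
    i = start + 1
    while i < len(line):
        if line[i] == '"':
            if i + 1 < len(line) and line[i + 1] == '"':
                i += 2  # skip the escaped quote pair
                continue
            return i
        i += 1
    return None


def guess_csv_separator(lines):
    """Guess the field separator using the three steps from the comment."""
    rows = [ln.rstrip("\r\n") for ln in lines]
    quoted = [row for row in rows if '"' in row]

    # Step 1: pick a line with a double quote, preferably as first character.
    chosen = next((row for row in quoted if row.startswith('"')), None)
    if chosen is None and quoted:
        chosen = quoted[0]

    if chosen is not None:
        start = chosen.index('"')
        end = find_closing_quote(chosen, start)
        if end is not None:
            # Step 2: the character after the matching end quote...
            if end + 1 < len(chosen):
                return chosen[end + 1]
            # ...or, at end of line, the character before the first quote.
            if start > 0:
                return chosen[start - 1]

    # Step 3: fall back to a comma.
    return ","


print(repr(guess_csv_separator(['"a";"b"', '"c";"d"'])))
print(repr(guess_csv_separator(["1,2,3", "4,5,6"])))
```

The fallback in step 3 is the weak spot already noted in the comment: a file without any quotes always gets a comma, so comparing the number of commas against the number of ";" would be a natural refinement there.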