GNOME Bugzilla – Bug 549743
gnumeric should understand and ignore the UTF-8 BOM marker
Last modified: 2008-09-01 18:13:06 UTC
Hello, I have an application that exports CSV files. The files are UTF-8, and one requirement is that Excel open them without any problems. To achieve this I need to separate the columns with ';' and prepend the UTF-8 BOM marker, or Excel will assume the file is encoded in latin1. I usually use Gnumeric to try out the files, since I'm developing on Debian. Since I started writing the BOM, gnumeric complains that the file is invalid and bails. The BOM is a non-printable character encoded as the three bytes 0xEF 0xBB 0xBF; more info here: http://www.websina.com/bugzero/kb/unicode-bom.html. It should be understood as 'this file is UTF-8-encoded', and then ignored. Thanks!
When you are opening the csv file, are you choosing utf-8 encoding?
When trying to open the file with 'gnumeric myfile.csv', or when running gnumeric, clicking File->Open and selecting it, I get the error message "Unsupported file format." with no chance to select the encoding.
Created attachment 117561 [details] test file here's the file I'm generating; oocalc opens it correctly, too
It doesn't really make sense to use a BOM in a UTF-8 encoded file. In fact, the link you provided above states clearly that: "UTF-8 can contain a BOM. However, it makes no difference as to the endianness of the byte stream." Yes, they indicate that a BOM could be used as a hint that the file is in Unicode (but they obviously do not say that it indicates the file is in UTF-8, since a BOM is really only useful for UTF-16 or UTF-32). Unfortunately, there is no standard definition of the "csv" file format. So if you create such non-standard files, please open them from within gnumeric, where you are in fact able to choose the encoding!
(In reply to comment #4) > Unfortunately, there is no standard definition of the "csv" file format. So if > you create such non-standard files, please open them from within gnumeric where > you in fact are able to choose the encoding! I tried that, in fact, and gnumeric only told me that it was an invalid file. Creating files like that seems to be the only way to make Excel automatically read the file as UTF-8 instead of latin1, and I am pretty sure, though I have no way of testing it here, that Excel adds such a marker to the UTF-8 CSVs it exports. Since I need the application to open the file automatically when I click it in the browser, I would rather see gnumeric ignore the marker if it finds it than bail with an error. Notice that when the file didn't have the marker, gnumeric still correctly detected that it was UTF-8, so I'm not really asking for the marker to be used to detect UTF-8 encoding, just for it to be ignored so that the file will be opened at all.
I can open the file you attached just fine from within gnumeric by selecting (in the open dialog) csv as the file type and utf-8 as encoding.
Created attachment 117601 [details] [review] Proposed patch: ignore BOM in csv_tsv_probe Use of the byte-order mark is discussed in RFC 3629 (which defines UTF-8) and, according to http://en.wikipedia.org/wiki/Byte_Order_Mark#Usage, is quite common in Windows applications. Thus, it seems reasonable to me to ignore it when probing for CSV. With this patch, gnumeric opens the test file both when it is supplied as a command-line argument and when it is opened through File->Open. Is this OK to commit?
For completeness, I reproduced the reported behaviour with 1.9.1.
I don't think the patch from comment 7 is enough. The patch makes the probe function ignore BOM. Fine. But what about the actual import? We should ignore BOM there too and not stuff it into the first cell. Do we?
The BOM is currently being stuffed in the first cell rather than ignored: for the test file, =dec2hex(unicode(left(A1,1))) evaluates to FEFF.
Created attachment 117632 [details] [review] Proposed patch: Ignore a BOM during actual import
Created attachment 117638 [details] [review] Proposed patch: Ignore a BOM during actual import Fixed braces
Comment on attachment 117638 [details] [review] Proposed patch: Ignore a BOM during actual import I don't think that is in the right place. Are we expecting BOMs for each cell? A quick look suggests stf_parse_general would be the place to look.
Created attachment 117667 [details] [review] Proposed patch: ignore BOM during actual import
Comment on attachment 117667 [details] [review] Proposed patch: ignore BOM during actual import Almost ok: please check first that there are three or more bytes to work with. There seems to be no guarantee that the string is terminated. (It looks like the validate call needs to handle data_end too.)
Created attachment 117702 [details] [review] Proposed patch: ignore BOM during actual import
Looks fine. Go for it.
Changes to understand and ignore the UTF-8 BOM marker when recognising and importing CSV/stf data have been added to the development version of gnumeric through this commit http://svn.gnome.org/viewvc/gnumeric?view=revision&revision=16769 and will be available in the next major software release. Thank you for your bug report.