GNOME Bugzilla – Bug 549743
gnumeric should understand and ignore the UTF-8 BOM marker
Last modified: 2008-09-01 18:13:06 UTC
Hello, I have an application that exports CSV files. The files are UTF-8, and one requirement is that Excel open them without any problems. To achieve this I need to separate the columns with ';' and prepend the UTF-8 BOM marker, or Excel will assume the file is encoded in latin1. I usually use Gnumeric to try out the files, since I'm developing on Debian. Since I started writing the BOM, gnumeric complains that the file is invalid and bails. The BOM is a non-printable character encoded as the three bytes 0xEF 0xBB 0xBF; more info here: http://www.websina.com/bugzero/kb/unicode-bom.html. It should be understood as 'this file is UTF-8-encoded', and then ignored. Thanks!
When you are opening the csv file, are you choosing utf-8 encoding?
When trying to open the file with 'gnumeric myfile.csv', or when running gnumeric, clicking File->Open and selecting it, I get the error message "Unsupported file format." with no chance to select the encoding.
Created attachment 117561 [details] test file here's the file I'm generating; oocalc opens it correctly, too
It doesn't really make sense to use a BOM in a UTF-8 encoded file. In fact, the link you provided above states clearly that: "UTF-8 can contain a BOM. However, it makes no difference as to the endianness of the byte stream." Yes, they indicate that a BOM could be used as a hint that the file is in Unicode (but they obviously do not say that it indicates the file is in UTF-8, since a BOM is really only useful for UTF-16 or UTF-32). Unfortunately, there is no standard definition of the "csv" file format. So if you create such non-standard files, please open them from within gnumeric, where you are in fact able to choose the encoding!
(In reply to comment #4) > Unfortunately, there is no standard definition of the "csv" file format. So if > you create such non-standard files, please open them from within gnumeric where > you in fact are able to choose the encoding! I tried that, in fact, and gnumeric only told me that it was an invalid file. Creating files like that seems to be the only way to make Excel automatically read the file as UTF-8 instead of latin1, and I am pretty sure, though I have no way of testing it here, that Excel adds such a marker to the UTF-8 CSVs it exports. Since I need the application to open the file automatically when I click it in the browser, I would rather see gnumeric ignore the marker if it finds it than bail with an error. Notice that when the file didn't have the marker, gnumeric still correctly detected that it was UTF-8, so I'm not really asking for the marker to be used to detect UTF-8 encoding, just for it to be ignored so that the file will be opened at all.
I can open the file you attached just fine from within gnumeric by selecting (in the open dialog) csv as the file type and utf-8 as encoding.
Created attachment 117601 [details] [review] Proposed patch: ignore BOM in csv_tsv_probe Use of the byte-order mark is discussed in RFC 3629 (which defines UTF-8) and, according to http://en.wikipedia.org/wiki/Byte_Order_Mark#Usage, is quite common in Windows applications. Thus, it seems reasonable to me to ignore it when probing for CSV. With this patch, gnumeric opens the test file both when it is supplied as a command-line argument and when it is opened through File->Open. Is this OK to commit?
For completeness, I reproduced the reported behaviour with 1.9.1.
I don't think the patch from comment 7 is enough. The patch makes the probe function ignore BOM. Fine. But what about the actual import? We should ignore BOM there too and not stuff it into the first cell. Do we?
The BOM is currently being stuffed in the first cell rather than ignored: for the test file, =dec2hex(unicode(left(A1,1))) evaluates to FEFF.
Created attachment 117632 [details] [review] Proposed patch: Ignore a BOM during actual import
Created attachment 117638 [details] [review] Proposed patch: Ignore a BOM during actual import Fixed braces
Comment on attachment 117638 [details] [review] Proposed patch: Ignore a BOM during actual import I don't think that is in the right place. Are we expecting BOMs for each cell? A quick look suggests stf_parse_general would be the place to look.
Created attachment 117667 [details] [review] Proposed patch: ignore BOM during actual import
Comment on attachment 117667 [details] [review] Proposed patch: ignore BOM during actual import Almost ok: please check first that there are three or more bytes to work with. There seems to be no guarantee that the string is terminated. (It looks like the validate call needs to handle data_end too.)
Created attachment 117702 [details] [review] Proposed patch: ignore BOM during actual import
Looks fine. Go for it.
Changes to understand and ignore the UTF-8 BOM marker when recognising and importing CSV/stf data have been added to the development version of gnumeric through this commit http://svn.gnome.org/viewvc/gnumeric?view=revision&revision=16769 and will be available in the next major software release. Thank you for your bug report.