Bug 401588 – gnumeric unsupported file format / failed to find a valid encoding of data!


Summary:	gnumeric unsupported file format / failed to find a valid encoding of data!


Status:	RESOLVED FIXED

Product:	Gnumeric
Classification:	Applications
Component:	import/export Text
Version:	1.6.x
Hardware:	Other Linux

Importance:	Normal normal
Target Milestone:	---
Assigned To:	Morten Welinder
QA Contact:	Jody Goldberg

URL:
Whiteboard:

Depends on:
Blocks:

Reported:	2007-01-28 06:31 UTC by Phil
Modified:	2011-09-13 16:59 UTC

See Also:
GNOME target:	---
GNOME version:	---

Attachments
three line text file that shows the bad behavior (61 bytes, text/tab-separated-values) 2007-01-28 18:44 UTC, Phil
iconv -f ISO-8859-1 -t UTF-8 <brokenrecord.tsv >badfile.utf8 (64 bytes, application/octet-stream) 2007-01-28 18:48 UTC, Phil

Description Phil 2007-01-28 06:31:59 UTC

A single character in an unusual encoding, buried deep within a five-megabyte .tsv (tab-separated values) text file, makes the whole file unusable by gnumeric. Loading the .tsv file in vi and deleting that single character () makes it accessible via
[]$ gnumeric goodfile &

Try these two 1-line files, goodfile and badfile:
A	BB	CCC	DDDD

A	BB	CCC	DDDD

When the character is buried inside a text string, one gets the following error messages (three different ways):

[]$ gnumeric badrecord.tsv & 
gives an unadorned error window with a red X stop sign, the message "Unsupported file format.", and a Close button.

[]$ gnumeric &

and then File -> Open -> FileType: Automatic Detection 
gives the same result.

File -> Open -> FileType: Text Import (configurable)
gives on the command line
Reading file:///(snip)/BrokenRecord.tsv
Reading file:///(snip)/BrokenRecord.tsv

** (gnumeric:23821): WARNING **: This is not good -- failed to find a valid encoding of data!

====== Behavior I would prefer: ======
1. Strange characters input into a text string that is not explicitly modified by the spreadsheet user should remain undisturbed in the text string.

2. Give a warning: "gnumeric may not be able to properly print or display one or more characters in line 25678." This would have saved me a lot of time finding one bad character in 5 megabytes. 

Note: It is OK if gnumeric can't display/print everything properly. 
Often gnumeric is used to quickly store and sort data as an ad hoc database table; pretty output can happen elsewhere.

====== Workarounds I would welcome ======
0. Enlightenment if I need to change a setting somewhere.
1. Some guidance as to what kinds of characters gnumeric considers offensive.
2. A script or rule whereby one could filter incoming text to identify or remove offensive characters.

Thank you for the good work! (phil)
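
For workaround 2 above, a minimal sketch of such a filter, assuming the goal is to locate bytes that are not valid UTF-8. The function name find_non_utf8_lines and the sample bytes are hypothetical illustrations, not part of gnumeric and not the contents of the real file.

```python
def find_non_utf8_lines(data: bytes):
    """Return (line_number, byte_offset_in_line, offending_byte) for every
    line that cannot be decoded as UTF-8. Only the first bad byte of each
    line is reported."""
    problems = []
    # Splitting on the 0x0A byte is safe for UTF-8, because bytes inside a
    # multi-byte sequence are always >= 0x80.
    for lineno, line in enumerate(data.split(b"\n"), start=1):
        try:
            line.decode("utf-8")
        except UnicodeDecodeError as err:
            problems.append((lineno, err.start, line[err.start]))
    return problems


# Hypothetical sample: a lone 0xEF byte inside the second field of line 1.
sample = b"AAA\tBBB\xefBBB\tBBB\nAAA\tBBBBBBBBBB\tCCC\n"
for lineno, offset, byte in find_non_utf8_lines(sample):
    print(f"line {lineno}, byte offset {offset}: 0x{byte:02x}")
```

For a real file, the bytes could be read with open(path, "rb").read() and passed to the same function.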

Comment 1 Phil 2007-01-28 06:50:48 UTC

The above sample actually works with method (3) above.
Save this to a text file; it fails all three ways.

AAA	BBBï·BBB	BBB
AAA	BBBBBBBBBB	CCC

Comment 2 Morten Welinder 2007-01-28 13:44:10 UTC

1. Please *attach* a sample file that causes problems.  I don't trust
   bugzilla to transfer byte-for-byte.

2. What is the output of "locale" on your machine?


Note: there is no such thing as leaving strange characters alone.  If we don't
know what the encoding is, we do not know how to display them.  In fact, we
cannot even determine what characters are tabs and newlines until we figure
out the text encoding.
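
A small illustration of that last point, as a sketch rather than anything taken from Gnumeric: the same bytes contain a tab under one encoding and none under another. The helper count_tabs and the sample text are assumptions made up for this example.

```python
def count_tabs(data: bytes, encoding: str) -> int:
    """Decode the bytes with the given encoding and count tab characters."""
    return data.decode(encoding).count("\t")


# Three characters, none of which is a tab. In UTF-16LE the middle one
# (U+0900) is stored as the two bytes 0x00 0x09.
text = "A\u0900B"
raw = text.encode("utf-16-le")

print(b"\t" in raw)                  # the raw bytes do contain 0x09
print(count_tabs(raw, "utf-16-le"))  # tabs seen when read as UTF-16LE
print(count_tabs(raw, "latin-1"))    # tabs seen when read as ISO-8859-1
```

Read as UTF-16LE there is no field separator in the data; read as ISO-8859-1 the byte 0x09 becomes one, so the column layout depends on the encoding chosen first.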

Comment 3 Morten Welinder 2007-01-28 13:49:31 UTC

Also, please attach the output of...

iconv -f ISO-8859-1 -t UTF-8 <badfile >badfile.utf8
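
For readers without iconv at hand, a rough Python equivalent of that command is sketched below. The function name latin1_to_utf8 and the sample bytes are hypothetical. Every byte from 0x80 to 0xFF in ISO-8859-1 becomes two bytes in UTF-8, so the output grows by one byte per such character; the attachment sizes listed on this bug (61 bytes before, 64 after) would be consistent with three such bytes in the file.

```python
def latin1_to_utf8(data: bytes) -> bytes:
    """Reinterpret the bytes as ISO-8859-1 text and re-encode it as UTF-8.
    This never fails, because every byte value is a valid ISO-8859-1
    character."""
    return data.decode("iso-8859-1").encode("utf-8")


# Hypothetical sample with two bytes above 0x7F.
sample = b"AAA\tBBB\xef\xb7BBB\tBBB\n"
converted = latin1_to_utf8(sample)
print(len(sample), len(converted))
```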

Comment 4 Phil 2007-01-28 18:44:39 UTC

Created attachment 81385
three line text file that shows the bad behavior

Comment 5 Phil 2007-01-28 18:48:17 UTC

Created attachment 81386
iconv -f ISO-8859-1 -t UTF-8 <brokenrecord.tsv >badfile.utf8

here it is!

Comment 6 Morten Welinder 2007-04-12 01:04:01 UTC

This problem has been fixed in the development version of goffice. The fix
will be available in the next major software release. Thank you for your bug
report.

Note: the file will not load from the command line, at least not unless
you select a locale in which the data is valid.  But at least it will
now guess a locale that makes a marginal amount of sense (ISO-8859-1)
and not UTF-8.  And it won't do the "This is not good" thing.
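
The fallback described above can be sketched as follows. This is only an illustration of the idea (try UTF-8 first, fall back to ISO-8859-1); it is not the goffice implementation, and the function name guess_encoding and its candidates parameter are assumptions.

```python
def guess_encoding(data: bytes, candidates=("utf-8", "iso-8859-1")) -> str:
    """Return the first candidate encoding under which the data decodes
    without error. ISO-8859-1 accepts any byte sequence, so placing it
    last makes it a catch-all fallback."""
    for encoding in candidates:
        try:
            data.decode(encoding)
        except UnicodeDecodeError:
            continue
        return encoding
    raise ValueError("no candidate encoding fits the data")


print(guess_encoding(b"AAA\tBBB\tCCC\n"))        # pure ASCII is valid UTF-8
print(guess_encoding(b"AAA\tBBB\xefBBB\tBBB\n"))  # a lone 0xEF is not
```

A guess of this kind only says the bytes are decodable, not that the result is what the author of the file intended, which matches the "marginal amount of sense" caveat above.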

Comment 7 Andreas J. Guelzow 2011-09-13 16:52:17 UTC

Currently go_guess_encoding does not return the length of the written data, so we have to assume that the first NULL terminates that data.

We really need a go_guess_encoding that tells us how much was written. Then we can check for NULLs afterwards.

In this process we may also want to address the need for go_guess_encoding_truncated that is stated in various places in Gnumeric's code.

Comment 8 Andreas J. Guelzow 2011-09-13 16:59:48 UTC

My last comment was in the wrong bug report, sorry...