GNOME Bugzilla – Bug 401588
gnumeric unsupported file format / failed to find a valid encoding of data!
Last modified: 2011-09-13 16:59:48 UTC
A single character of unusual encoding buried deep within a five megabyte .tsv (tab separated values) text file makes the whole file unusable by gnumeric. Loading the .tsv file in vi and deleting the single character () makes it accessible via

[]$ gnumeric goodfile &

Try these two 1-line files, goodfile and badfile:

A BB CCC DDDD
A BB CCC DDDD

While the character is buried inside a text string one gets the following error messages: (3 different ways)

1. []$ gnumeric badrecord.tsv &
   gives unadorned error window with red X stop sign, "Unsupported file format.", and Close button.

2. []$ gnumeric &
   and then File -> Open -> FileType: Automatic Detection gives the same result.

3. File -> Open -> FileType: Text Import (configurable) gives on the command line

   Reading file:///(snip)/BrokenRecord.tsv
   Reading file:///(snip)/BrokenRecord.tsv
   ** (gnumeric:23821): WARNING **: This is not good -- failed to find a valid encoding of data!

====== Behavior I would prefer: ======

1. Strange characters input into a text string that is not explicitly modified by the spreadsheet user should remain undisturbed in the text string.
2. Give a warning: "gnumeric may not be able to properly print or display one or more characters in line 25678." This would have saved me a lot of time finding one bad character in 5 megabytes.

Note: It is OK if gnumeric can't display/print everything properly. Often gnumeric is used to quickly store and sort data as an ad hoc database table; pretty output can happen elsewhere.

====== Workarounds I would welcome ======

0. Enlightenment if I need to change a setting somewhere.
1. Some guidance as to what kinds of characters gnumeric considers offensive.
2. A script or rule whereby one could filter incoming text to identify or remove offensive characters.

Thank you for the good work! (phil)
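For workaround 2 in the report above, here is a minimal sketch of such a filter. It is not part of gnumeric, and it assumes the file is meant to be UTF-8; the function names `find_invalid_utf8` and `strip_invalid_utf8` are made up for illustration. It reports the line number and byte offset of every byte that does not decode, which is roughly the "line 25678" hint the report asks for.

```python
def find_invalid_utf8(data):
    """Return (line_number, byte_offset_in_line, byte_value) for each
    byte at which UTF-8 decoding fails. Assumes the data is meant to be UTF-8."""
    problems = []
    # Splitting on the newline byte is safe for UTF-8: 0x0A never occurs
    # inside a multi-byte sequence.
    for lineno, line in enumerate(data.split(b"\n"), start=1):
        pos = 0
        while pos < len(line):
            try:
                line[pos:].decode("utf-8")
                break  # the rest of the line decodes cleanly
            except UnicodeDecodeError as exc:
                bad = pos + exc.start
                problems.append((lineno, bad, line[bad]))
                pos = bad + 1  # resume scanning just past the bad byte
    return problems


def strip_invalid_utf8(data):
    """Drop every byte sequence that is not valid UTF-8."""
    return data.decode("utf-8", errors="ignore").encode("utf-8")
```

To use it, read the file in binary mode (for example `open(path, "rb").read()`, with `path` pointing at the .tsv file) and pass the bytes to either function. Stripping is lossy, so it may be worth looking at the reported lines before discarding anything.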
The above sample actually works with method (3) above. Save this to a text file; it fails all three ways.

AAA BBBï·BBB BBB AAA BBBBBBBBBB CCC
1. Please *attach* a sample file that causes problems. I don't trust bugzilla to transfer byte-for-byte.
2. What is the output of "locale" on your machine?

Note: there is no such thing as leaving strange characters alone. If we don't know what the encoding is, we do not know how to display them. In fact, we cannot even determine what characters are tabs and newlines until we figure out the text encoding.
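To illustrate the last point, that tabs and newlines cannot be located before the encoding is known, here is a small sketch in Python. It is unrelated to gnumeric's own code, and `split_fields` is a hypothetical helper. The same text encoded as UTF-16 contains the byte 0x09, but splitting the raw bytes on it cuts through the middle of two-byte code units.

```python
def split_fields(raw, encoding):
    """Decode first, then split into rows and tab-separated fields."""
    text = raw.decode(encoding)
    return [row.split("\t") for row in text.splitlines()]


text = "A\tBB\n"
as_utf8 = text.encode("utf-8")       # b"A\tBB\n"
as_utf16 = text.encode("utf-16-le")  # every character takes two bytes

# Splitting the UTF-16 bytes on a raw tab byte leaves stray NUL bytes behind.
naive = as_utf16.split(b"\t")

# Decoding with the right encoding first gives the intended fields.
proper = split_fields(as_utf16, "utf-16-le")
```

Here `naive[0]` is `b"A\x00"` rather than `b"A"`, while `proper` holds the single row `["A", "BB"]`.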
Also, please attach the output of...

iconv -f ISO-8859-1 -t UTF-8 <badfile >badfile.utf8
Created attachment 81385
three line text file that shows the bad behavior
Created attachment 81386
iconv -f ISO-8859-1 -t UTF-8 <brokenrecord.tsv >badfile.utf8

here it is!
This problem has been fixed in the development version of goffice. The fix will be available in the next major software release. Thank you for your bug report. Note: the file will not load from the command line, at least not unless you select a locale in which the data is valid. But at least it will now guess a locale that makes a marginal amount of sense (ISO-8859-1) and not UTF-8. And it won't do the "This is not good" thing.
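The fallback described above (prefer UTF-8, otherwise settle on ISO-8859-1) can be sketched as follows. This is an illustration only, not goffice's actual implementation; the function name `guess_encoding` and its parameters are assumptions. ISO-8859-1 works as a last resort because it assigns a character to every byte value, so decoding never fails, although the result may not be what the author intended.

```python
def guess_encoding(data, candidates=("utf-8",), fallback="iso-8859-1"):
    """Return the first candidate encoding that decodes the data without
    error, or the fallback if none of them does."""
    for encoding in candidates:
        try:
            data.decode(encoding)
            return encoding
        except UnicodeDecodeError:
            continue
    return fallback


sample = b"AAA\tBBB\xb7BBB\tCCC\n"  # 0xB7 on its own is not valid UTF-8
encoding = guess_encoding(sample)   # falls through to "iso-8859-1"
text = sample.decode(encoding)      # 0xB7 becomes U+00B7 (middle dot)
```

A loader built this way would open the sample file instead of rejecting it, at the cost of possibly showing the wrong glyph for the odd byte.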
Currently go_guess_encoding does not return the length of the written data, so we have to assume that the first NUL terminates that data. We really need a go_guess_encoding that tells us how much was written. Then we can check for NULs afterwards. In this process we may also want to address the need for go_guess_encoding_truncated that is stated in various places in Gnumeric's code.
My last comment was in the wrong bug report, sorry...