Bug 311879 - Automatically detect html file named .xls
Status: RESOLVED FIXED
Product: Gnumeric
Classification: Applications
Component: import/export HTML
Version: 1.5.x
Hardware: Other Linux
Importance: Normal enhancement
Target Milestone: ---
Assigned To: Jody Goldberg
Jody Goldberg
Duplicates: 304480
Depends on:
Blocks:
 
 
Reported: 2005-07-28 16:52 UTC by David Stanaway
Modified: 2006-01-25 15:15 UTC
See Also:
GNOME target: ---
GNOME version: ---


Attachments
Output of Microsoft Excel 11 save as web page, save as xml spreadsheet (188.65 KB, application/zip)
2005-09-09 03:06 UTC, David Stanaway

Description David Stanaway 2005-07-28 16:52:58 UTC
Distribution/Version: Debian/Unstable

Some people who send me spreadsheets take advantage of a `feature' of Excel
where an html file named with an .xls extension, or with an Excel spreadsheet
mime type, is interpreted as a spreadsheet.

In Gnumeric, I can successfully use the Text HTML import to open such documents,
but this is not automatically detected.

Is it possible to have the automatic type selection handle the situation where
an html file is mislabeled as an excel file? OOCalc does this, as does Excel.
Comment 1 David Stanaway 2005-07-29 19:16:47 UTC
It looks like there is a mechanism for this,

src/workbook-view.c:931  (1.5.2)

Loop through probe levels starting with FILE_PROBE_FILE_NAME

Maybe there is a way to tweak this to work for this case. Perhaps to not probe
on name first?
Comment 2 Jon Kåre Hellan 2005-08-15 08:47:37 UTC
You are right, this would be useful. The problem is making sure we don't break
anything.

You are almost right about how this could be solved. Matching on file name first
is not a problem, since we also call the format's probe function (Excel in this
case). If this does not report a match, we proceed to probe file contents. If we
had a probe function for html, your file would be recognized. There is a
proposed, very simple probe function in bugzilla: attachment 49187.

I'm not aware of any specification of "html-masquerading-as-xls", so we have to
use heuristics. If the probe function reports a false match, other probe
functions later in the chain will not get a chance to probe, and we would no
longer be able to read content which we now can read. This is why we haven't yet
committed the probe function.
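
To make the ordering concrete, here is a rough sketch in plain C of the probing described above. It is not the code in src/workbook-view.c: the Opener struct, the ProbeLevel enum, detect() and both toy probes are hypothetical names invented for illustration. It assumes an importer without a probe function is simply skipped.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-ins for illustration; not the real Gnumeric types. */
typedef enum { PROBE_FILE_NAME, PROBE_CONTENT, PROBE_LAST } ProbeLevel;

typedef struct {
    const char *id;      /* importer name */
    const char *suffix;  /* file name suffix the importer claims */
    bool (*probe)(const char *data, size_t len);  /* content check, may be NULL */
} Opener;

static bool has_suffix(const char *name, const char *suffix)
{
    size_t n = strlen(name), s = strlen(suffix);
    return n >= s && strcmp(name + n - s, suffix) == 0;
}

/* Toy probe: the 8-byte signature of an OLE2 compound document. */
static bool probe_ole2(const char *data, size_t len)
{
    static const unsigned char magic[8] =
        { 0xD0, 0xCF, 0x11, 0xE0, 0xA1, 0xB1, 0x1A, 0xE1 };
    return len >= sizeof magic && memcmp(data, magic, sizeof magic) == 0;
}

/* Toy probe: deliberately naive, accepts anything starting with "<html". */
static bool probe_html_toy(const char *data, size_t len)
{
    return len >= 5 && memcmp(data, "<html", 5) == 0;
}

/*
 * Try every opener at the file name level first, then at the content level.
 * The first probe that accepts wins, so a false match here hides every
 * opener that would have been asked later.
 */
static const Opener *detect(const Opener *openers, size_t n_openers,
                            const char *name, const char *data, size_t len)
{
    for (int level = PROBE_FILE_NAME; level < PROBE_LAST; level++) {
        for (size_t i = 0; i < n_openers; i++) {
            const Opener *o = &openers[i];
            if (level == PROBE_FILE_NAME && !has_suffix(name, o->suffix))
                continue;  /* name does not match, skip at this level */
            if (o->probe != NULL && o->probe(data, len))
                return o;
        }
    }
    return NULL;
}
```

With an html opener that has no probe, an html file named report.xls falls through both levels and nothing claims it. Give the html opener a probe and the content level picks it up after the Excel probe has declined.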
Comment 3 Jon Kåre Hellan 2005-08-15 08:49:53 UTC
*** Bug 304480 has been marked as a duplicate of this bug. ***
Comment 4 Andreas J. Guelzow 2005-08-15 14:17:35 UTC
Some problems with the proposed probe function:

+	magic = (gchar *) "<table";
+	if (g_strstr_len (ulstr, -1, magic) == ulstr) {
+		res = TRUE;
+	} else  {

this code checks the whole file for the magic string but considers success only
if the magic string is at the very beginning. That's a waste of time. On the
other hand we should at least ignore leading whitespace in the file and make
sure that the found starting table tag is in fact a tag, so there ought to be a
> sometime soon.

+		magic = (gchar *) "<html";
+		if (g_strstr_len (ulstr, -1, magic)) {
+			res = TRUE;
+		} else {

We should probably be checking for <html> rather than just <html

+			magic = (gchar *) "<!DOCTYPE html";
+			if (g_strstr_len (ulstr, -1, magic)) {
+				res = TRUE;
+			}

This can never match! Perhaps we should do:
magic = (gchar *) "<!doctype html";
Comment 5 Morten Welinder 2005-08-15 17:14:18 UTC
Please do not cast constant strings to non-constant types for no good
reason.  Just make "magic" a const type.

Andreas: it looks like only 200 bytes are checked.  That is still a waste of
time, but not anything anyone would notice.

On a higher level, I am not happy with using more and more probes.  It
interferes with the lazy-loading of plugins.
Comment 6 Andreas J. Guelzow 2005-08-15 18:27:05 UTC
Yeah, I missed that only 200 bytes are checked but rather than 

+	magic = (gchar *) "<table";
+	if (g_strstr_len (ulstr, -1, magic) == ulstr) {
+		res = TRUE;
+	} else  {

one should use (if we want to check anywhere in those 200 bytes)

+	magic = "<table";
+	if (g_strstr_len (ulstr, -1, magic) != NULL) {
+		res = TRUE;
+	} else  {

or (if we want to check the beginning only)

+	magic = "<table";
+	if (g_str_has_prefix (ulstr, magic)) {
+		res = TRUE;
+	} else  {

Morten: g_strstr_len even expects const gchar*
and I share the worry about more and more probes.
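
For reference, the review points so far (const strings, a 200 byte window, skipping leading whitespace, making sure the match is really a tag, ignoring case) could be combined roughly as below. This is only a sketch in plain libc rather than GLib, and it is not the attached patch. The names looks_like_html(), tag_at() and starts_with_ci() are made up here. It is also stricter than the patch, since it only accepts a tag at the start of the data and does not look for <html anywhere in the window.

```c
#include <ctype.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define PROBE_WINDOW 200  /* only the first 200 bytes are examined */

/* Case-insensitive prefix test on a length-bounded buffer. */
static bool starts_with_ci(const char *buf, size_t len, const char *prefix)
{
    for (size_t i = 0; prefix[i] != '\0'; i++) {
        if (i >= len ||
            tolower((unsigned char)buf[i]) != tolower((unsigned char)prefix[i]))
            return false;
    }
    return true;
}

/*
 * True if buf starts with the given tag opening and the next character
 * ends the tag name ('>', '/' or whitespace), so "<tablespoon" is not
 * taken for "<table".
 */
static bool tag_at(const char *buf, size_t len, const char *tag)
{
    size_t n = strlen(tag);
    if (n >= len || !starts_with_ci(buf, len, tag))
        return false;
    unsigned char next = (unsigned char)buf[n];
    return next == '>' || next == '/' || isspace(next);
}

/* Heuristic only: does the data begin, after whitespace, with an html tag? */
static bool looks_like_html(const char *buf, size_t len)
{
    static const char *const magic[] = { "<table", "<html", "<!doctype html" };
    size_t i = 0;

    if (len > PROBE_WINDOW)
        len = PROBE_WINDOW;
    while (i < len && isspace((unsigned char)buf[i]))
        i++;  /* ignore leading whitespace */
    for (size_t k = 0; k < sizeof magic / sizeof magic[0]; k++) {
        if (tag_at(buf + i, len - i, magic[k]))
            return true;
    }
    return false;
}
```

Anchoring the match at the start keeps false positives down, which matters because a false match stops later probes from running. The cost is that files with a comment or other content ahead of the first tag are missed.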
Comment 7 David Stanaway 2005-09-09 03:06:43 UTC
Created attachment 51999
Output of Microsoft Excel 11 save as web page, save as xml spreadsheet

This is the output of saving pattern.xls, operators.xls and allignment-test.xls
from the gnumeric samples files from Excel using the save as options, "Web
page" and "XML Spreadsheet"

I have not encountered any XML Spreadsheet being passed about, but the save as
webpage seems to crop up from some XL users out there.

Opening the web pages in Excel preserves formulas and sheet cross-references
(there is an extension to the <td> element, e.g.
<td align=center x:bool="TRUE" x:fmla="=(1=1)">TRUE</td>).

And the operators example shows how it handles a multi sheet workbook.
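
To show what an importer would have to pick out of such a cell, here is a minimal sketch in plain C that pulls a double-quoted attribute value such as x:fmla out of a tag. get_attr() is a hypothetical helper, not anything in Gnumeric. It assumes the value is double-quoted and contains no literal double quote, which may not hold for every file Excel writes.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/*
 * Copy the value of name="..." from tag into out.
 * Returns false if the attribute is absent, unterminated, or does not fit.
 * Pass only the tag text: the search does not stop at '>', so cell text
 * that happens to look like an attribute would also match.
 */
static bool get_attr(const char *tag, const char *name,
                     char *out, size_t out_size)
{
    size_t name_len = strlen(name);
    const char *p = tag;

    while ((p = strstr(p, name)) != NULL) {
        /* the name must follow whitespace and be followed by =" */
        bool starts_clean = p > tag &&
            (p[-1] == ' ' || p[-1] == '\t' || p[-1] == '\n');
        if (starts_clean && p[name_len] == '=' && p[name_len + 1] == '"') {
            const char *val = p + name_len + 2;
            const char *end = strchr(val, '"');
            size_t n;

            if (end == NULL)
                return false;      /* unterminated value */
            n = (size_t)(end - val);
            if (n >= out_size)
                return false;      /* value does not fit */
            memcpy(out, val, n);
            out[n] = '\0';
            return true;
        }
        p += name_len;             /* keep looking after this occurrence */
    }
    return false;
}
```

Applied to the cell above, asking for x:fmla yields =(1=1) and asking for x:bool yields TRUE.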
Comment 8 Jon Kåre Hellan 2006-01-24 12:50:59 UTC
This problem keeps biting people. From "http://mail.gnome.org/archives/gnumeric-list/2006-January/msg00022.html" (David Ronis): 

"A more pressing problem concerns where I work, a windows environment; spreadsheets that are circulated have migrated to a web page html format, something gnumeric doesn't know how to read. Is there a way to do this and if not, is this "feature" in the works?"

I've been in contact with David Ronis, and it turns out that this is another case of html pretending to be xls.
Comment 9 David Ronis 2006-01-24 15:34:34 UTC
I've been bitten by this same problem and hope you add a "magic" number fix.

David

Comment 10 Jon Kåre Hellan 2006-01-25 15:15:47 UTC
Fixed in CVS