GNOME Bugzilla – Bug 311879
Automatically detect html file named .xls
Last modified: 2006-01-25 15:15:47 UTC
Distribution/Version: Debian/Unstable
Some people who send me spreadsheets take advantage of a `feature' of Excel whereby an HTML file named with an .xls extension, or served with an Excel spreadsheet MIME type, is interpreted as a spreadsheet. In Gnumeric, I can successfully use the Text HTML import to open such documents, but the format is not detected automatically. Is it possible to have the automatic type selection handle the situation where an HTML file is mislabeled as an Excel file? OOCalc does this, as does Excel.
It looks like there is a mechanism for this at src/workbook-view.c:931 (1.5.2): a loop through the probe levels, starting with FILE_PROBE_FILE_NAME. Maybe there is a way to tweak this to work for this case, perhaps by not probing on the name first?
You are right, this would be useful. The problem is making sure we don't break anything. You are almost right about how this could be solved. Matching on file name first is not a problem, since we also call the format's probe function (Excel in this case). If this does not report a match, we proceed to probe file contents. If we had a probe function for html, your file would be recognized. There is a proposed, very simple probe function in bugzilla: attachment 49187. I'm not aware of any specification of "html-masquerading-as-xls", so we have to use heuristics. If the probe function reports a false match, other probe functions later in the chain will not get a chance to probe, and we would no longer be able to read content which we now can read. This is why we haven't yet committed the probe function.
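To make the probe order described above concrete, here is a minimal sketch in plain C. It is not Gnumeric's code: the `Opener` struct, `detect_format`, `demo_openers` and both toy probes are hypothetical names invented for illustration, and the only real-world fact relied on is that OLE2 files (which is what genuine .xls files are) begin with the bytes D0 CF 11 E0.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical probe levels, loosely modelled on the FILE_PROBE_* idea. */
typedef enum { PROBE_FILE_NAME, PROBE_CONTENT, PROBE_LAST } ProbeLevel;

/* Hypothetical opener: a name, the extension it claims, a content probe. */
typedef struct {
    const char *name;
    const char *suffix;
    int (*probe)(const char *data);
} Opener;

static int has_suffix(const char *s, const char *suffix) {
    size_t n = strlen(s), m = strlen(suffix);
    return n >= m && strcmp(s + n - m, suffix) == 0;
}

/* Toy Excel probe: genuine .xls files start with the OLE2 signature. */
static int excel_probe(const char *data) {
    return strncmp(data, "\xD0\xCF\x11\xE0", 4) == 0;
}

/* Toy HTML probe: a heuristic, so it can report false matches. */
static int html_probe(const char *data) {
    return strstr(data, "<html") != NULL;
}

const Opener demo_openers[] = {
    { "excel", ".xls",  excel_probe },
    { "html",  ".html", html_probe  },
};

/* Name level: only openers whose extension matches are asked, and their
 * content probe must still agree. Content level: every opener is asked.
 * The first match wins, so a false match hides all later probes. */
const char *detect_format(const Opener *openers, size_t n,
                          const char *filename, const char *data) {
    for (int level = PROBE_FILE_NAME; level < PROBE_LAST; level++) {
        for (size_t i = 0; i < n; i++) {
            if (level == PROBE_FILE_NAME &&
                !has_suffix(filename, openers[i].suffix))
                continue;
            if (openers[i].probe(data))
                return openers[i].name;
        }
    }
    return NULL; /* nothing recognized the file */
}
```

With this table, a file named report.xls whose contents start with `<html>` falls through the name level (the Excel probe rejects it) and is picked up by the HTML probe at the content level. The early return is also why a sloppy heuristic is risky: whichever probe matches first ends the search.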
*** Bug 304480 has been marked as a duplicate of this bug. ***
Some problems with the proposed probe function:

+ magic = (gchar *) "<table";
+ if (g_strstr_len (ulstr, -1, magic) == ulstr) {
+         res = TRUE;
+ } else {

This code searches the whole file for the magic string but counts it as a success only if the magic string is at the very beginning. That's a waste of time. On the other hand, we should at least ignore leading whitespace in the file and make sure that the starting table tag we found is in fact a tag, so there ought to be a > soon after it.

+ magic = (gchar *) "<html";
+ if (g_strstr_len (ulstr, -1, magic)) {
+         res = TRUE;
+ } else {

We should probably be checking for <html> rather than just <html

+ magic = (gchar *) "<!DOCTYPE html";
+ if (g_strstr_len (ulstr, -1, magic)) {
+         res = TRUE;
+ }

This can never match! Perhaps we should do:

magic = (gchar *) "<!doctype html";
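The stricter check asked for here (skip leading whitespace, then require a real tag that is closed by a > soon afterwards) could look roughly like the sketch below. It uses only the C standard library rather than GLib, and the function name `looks_like_tag_at_start` and the 200-byte `TAG_CLOSE_WINDOW` are assumptions made for illustration, not part of the proposed patch.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Assumption: how far past the tag name we look for the closing '>'. */
#define TAG_CLOSE_WINDOW 200

/* Case-insensitive prefix test; a short string simply fails to match. */
static int has_prefix_nocase(const char *s, const char *prefix) {
    for (; *prefix != '\0'; s++, prefix++) {
        if (tolower((unsigned char)*s) != tolower((unsigned char)*prefix))
            return 0;
    }
    return 1;
}

/* Returns 1 if buf, after optional leading whitespace, starts with
 * "<tag_name" followed by '>' or whitespace, and a '>' appears within
 * TAG_CLOSE_WINDOW bytes of the end of the tag name. */
int looks_like_tag_at_start(const char *buf, const char *tag_name) {
    assert(buf != NULL && tag_name != NULL);

    while (isspace((unsigned char)*buf))
        buf++;                        /* ignore leading whitespace */
    if (*buf != '<')
        return 0;
    buf++;
    if (!has_prefix_nocase(buf, tag_name))
        return 0;
    buf += strlen(tag_name);

    /* The name must end here, so "<tablet>" is not taken for "<table>". */
    if (*buf != '>' && !isspace((unsigned char)*buf))
        return 0;

    for (size_t i = 0; i < TAG_CLOSE_WINDOW && buf[i] != '\0'; i++) {
        if (buf[i] == '>')
            return 1;                 /* the tag really is closed */
    }
    return 0;
}
```

Because the comparison is case-insensitive, the same helper accepts `<TABLE border=1>` and, called with the name `!doctype`, a `<!DOCTYPE html ...>` line, without needing a lowercased copy of the buffer.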
Please do not cast constant strings to non-constant types for no good reason. Just make "magic" a const type. Andreas: it looks like only 200 bytes are checked. That is still a waste of time, but not anything anyone would notice. On a higher level, I am not happy with using more and more probes. It interferes with the lazy loading of plugins.
Yeah, I missed that only 200 bytes are checked, but rather than

+ magic = (gchar *) "<table";
+ if (g_strstr_len (ulstr, -1, magic) == ulstr) {
+         res = TRUE;
+ } else {

one should use (if we want to check anywhere in those 200 bytes)

+ magic = "<table";
+ if (g_strstr_len (ulstr, -1, magic) != NULL) {
+         res = TRUE;
+ } else {

or (if we want to check the beginning only)

+ magic = "<table";
+ if (g_str_has_prefix (ulstr, magic)) {
+         res = TRUE;
+ } else {

Morten: g_strstr_len even expects const gchar*, and I share the worry about more and more probes.
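For readers without GLib at hand, the three variants can be compared using standard C equivalents: `strstr` in place of `g_strstr_len (..., -1, ...)` and `strncmp` in place of `g_str_has_prefix`. This is only a sketch, and the function names are made up for the comparison.

```c
#include <string.h>

/* Original patch: search everything, accept only a hit at offset 0.
 * Same answer as a prefix test, but it may scan the whole buffer. */
int match_at_start_via_search(const char *buf, const char *magic) {
    return strstr(buf, magic) == buf;
}

/* First suggestion: accept the magic string anywhere in the buffer. */
int match_anywhere(const char *buf, const char *magic) {
    return strstr(buf, magic) != NULL;
}

/* Second suggestion: compare only the first strlen(magic) bytes. */
int match_prefix(const char *buf, const char *magic) {
    return strncmp(buf, magic, strlen(magic)) == 0;
}
```

`match_at_start_via_search` and `match_prefix` always agree on the result; the difference is that the prefix test stops after at most `strlen(magic)` bytes. `match_anywhere` is the looser test and would, for example, accept a buffer that starts with `<html>` and only later contains `<table`.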
Created attachment 51999: Output of Microsoft Excel 11 save as web page, save as xml spreadsheet
This is the output of saving pattern.xls, operators.xls and allignment-test.xls from the gnumeric sample files from Excel using the save as options "Web page" and "XML Spreadsheet". I have not encountered any XML Spreadsheet being passed about, but the save as web page output seems to crop up from some XL users out there. Opening the web pages in Excel preserves formulas and sheet cross references. (There is an extension to the <td>, e.g.: <td align=center x:bool="TRUE" x:fmla="=(1=1)">TRUE</td>) And the operators example shows how it handles a multi-sheet workbook.
This problem keeps biting people. From "http://mail.gnome.org/archives/gnumeric-list/2006-January/msg00022.html" (David Ronis): "A more pressing problem concerns where I work, a windows environment; spreadsheets that are circulated have migrated to a web page html format, something gnumeric doesn't know how to read. Is there a way to do this and if not, is this "feature" in the works?" I've been in contact with David Ronis, and it turns out that this is another case of html pretending to be xls.
I've been bitten by this same problem and hope you add a "magic" number fix. David
Fixed in CVS