GNOME Bugzilla – Bug 304007
troubles with some russian xls
Last modified: 2011-08-12 22:04:19 UTC
Distribution: Debian 3.1
Package: Gnumeric
Severity: normal
Version: GNOME2.8.3 1.4.x
Gnome-Distributor: Debian
Synopsis: troubles with some russian xls
Bugzilla-Product: Gnumeric
Bugzilla-Component: import/export MS Excel (tm)
Bugzilla-Version: 1.4.x
BugBuddy-GnomeVersion: 2.0 (2.8.1)
Description:
Description of the crash:
Files created in the 1C program (a very popular Russian program) are displayed with the wrong charset.

Steps to reproduce the crash:
1. Export some data from 1C in xls format
2. Open this file in Gnumeric
3. All Cyrillic letters are wrong (as if a Western charset were used instead of cp-1251).
4. After some work with this file, Gnumeric crashes

Expected Results:

How often does this happen?

Additional Information:
Microsoft Excel opens these files without any problem. I attach an example file.

Debugging Information:
Backtrace was generated from '/usr/bin/gnumeric'
(no debugging symbols found)
Using host libthread_db library "/lib/tls/i686/cmov/libthread_db.so.1".
(no debugging symbols found)
`system-supplied DSO at 0xffffe000' has disappeared; keeping its symbols.
[Thread debugging using libthread_db enabled]
[New Thread -1222850496 (LWP 21564)]
0xffffe410 in __kernel_vsyscall ()
Trace 59589
Thread 1 (Thread -1222850496 (LWP 21564))
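The charset symptom from step 3 of the report can be reproduced outside Gnumeric. The sketch below is an illustration in Python, not Gnumeric code. It encodes a Cyrillic string as cp1251 (the encoding the reporter names) and decodes the same bytes as cp1252. The sample string and the choice of cp1252 as the "Western" fallback are assumptions, not taken from the attached file.

```python
# Illustration only: cp1251 bytes misread as a Western codepage.
# The sample string and the cp1252 fallback are assumptions.
text = "Привет"
raw = text.encode("cp1251")      # bytes as a cp1251 writer would store them

garbled = raw.decode("cp1252")   # wrong codepage: Cyrillic becomes accented Latin
correct = raw.decode("cp1251")   # right codepage: the original text comes back

print(garbled)
print(correct)
```

The bytes are identical in both cases; only the table used to interpret them differs, which is why Excel and Gnumeric can disagree about the same file.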
------- Bug moved to this database by unknown@bugzilla.gnome.org 2005-05-13 07:14 UTC ------- Bugreport had an attachment. This cannot be imported to Bugzilla. Contact bugmaster@gnome.org if you are willing to write a patch for this. The original reporter of this bug does not have an account here. Reassigning to the person who moved it here, unknown@bugzilla.gnome.org. Previous reporter was spied@yandex.ru.
Created attachment 46387 [details] xls file example
Created attachment 46388 [details] screenshot from MS Excel
What generated that file ? XL-95 does not render it correctly. 2k/XP renders something similar to your screenshot, but can not export it and reload the result. OOo 2 does even worse than gnumeric.
> What generated that file ?
1C Enterprise, a very popular Russian bookkeeping program (for Windows). I get some documents in this format via e-mail.
> XL-95 does not render it correctly.
What is wrong? AFAIK the Russian version of Excel 95 must render it correctly.
> 2k/XP renders something similar to your screenshot,
;)
> but can not export it and reload the result.
I don't understand you. For me, I can open this file in Excel, do "Save As", and open the result in Excel or Gnumeric.
jody: while it is probably not the cause of the crash, it looks like excel_read_XF should set ->text_dir in all cases.
I fixed that issue. With that, nothing from Purify.
XL95 encoding is mostly correct, but every cell has 'wrap text' enabled, which renders terribly. XL2k/XP encoding looks correct everywhere, but if I save and reload in 2k or XP, the encoding and 'wrap text' are incorrect everywhere.
after six years bloat office 1 : 0 gnumeric
Created attachment 193115 [details] Modern example from '1C Enterprise'
The 'biff5' part of the CLP file was dumped from the file attached to this bug: https://bugs.freedesktop.org/show_bug.cgi?id=33100
Both LibO Calc and Gnumeric fail to convert the Cyrillic text in it. Somehow Calc opens "123.xls" attached here correctly.
"123.xls" uses "0xCC" ('Cyrillic') in the 'charset' field of the 'Font' record. Could gnumeric use it? In addition it would be nice to have a UI to select encoding for biff5 files like we have for text import and configuration option switch(es) to use Codepage record, or Charset from the Font, or force customer encoding etc.
Created attachment 193281 [details] [review] This patch adds setting Codepage based on charset field in FONT record
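For readers unfamiliar with the 'charset' field, here is a minimal sketch in Python of the kind of lookup such a patch needs. The values follow the Windows LOGFONT charset constants. Whether the attached patch uses exactly this table is an assumption, and `codepage_for_charset` is a hypothetical helper name.

```python
# Sketch: map the one-byte 'charset' value of a FONT record to a codepage.
# Values follow the Windows LOGFONT charset constants; the table actually
# used by the patch may differ (assumption).
CHARSET_TO_CODEPAGE = {
    0x00: 1252,   # ANSI_CHARSET (Latin I)
    0x4D: 10000,  # MAC_CHARSET (Apple Roman)
    0x80: 932,    # SHIFTJIS_CHARSET
    0x81: 949,    # HANGUL_CHARSET
    0x86: 936,    # GB2312_CHARSET
    0x88: 950,    # CHINESEBIG5_CHARSET
    0xA1: 1253,   # GREEK_CHARSET
    0xA2: 1254,   # TURKISH_CHARSET
    0xB1: 1255,   # HEBREW_CHARSET
    0xB2: 1256,   # ARABIC_CHARSET
    0xBA: 1257,   # BALTIC_CHARSET
    0xCC: 1251,   # RUSSIAN_CHARSET ('Cyrillic')
    0xDE: 874,    # THAI_CHARSET
    0xEE: 1250,   # EASTEUROPE_CHARSET
}


def codepage_for_charset(charset, fallback):
    """Return the codepage implied by a FONT charset, or `fallback` if unknown."""
    return CHARSET_TO_CODEPAGE.get(charset, fallback)


print(codepage_for_charset(0xCC, 1252))  # Cyrillic font
print(codepage_for_charset(0x01, 1252))  # unknown value keeps the fallback
```

Passing the fallback explicitly keeps the decision about what to do with unknown charsets with the caller, which matters once a Codepage record is also present.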
Created attachment 193305 [details] [review] Patch with fixed codepage for Apple Roman encoding
https://bugzilla.gnome.org/show_bug.cgi?id=535473 is similar to this one and fixed (for old 1C files) by patch in #12.
Review of attachment 193305 [details] [review]:
Please correct me if I am wrong, but it looks to me like the patch uses the font information to set the code page even if the file contains codepage information. If there is a codepage record, that information has to govern. If we are guaranteed that the codepage record comes after the font information we are fine; otherwise we might be replacing valid codepage info with the guess from the font information.
Yes, you are right. That's the reason why I asked how to get the codepage value from read_FONT. We are not guaranteed that the codepage record comes after; in fact it seems to come before the fonts in normal files. I think a better way would be to store the charset within the font and use it later.
Because Excel 5 format is multilingual, the codepage from the font should override codepage record. See example.
Created attachment 193351 [details] Multilingual document
Created attachment 193352 [details] How it should look
(In reply to comment #16)
> Because Excel 5 format is multilingual, the codepage from the font should
> override codepage record. See example.
No, it shouldn't. The current patch overrides the codepage with the charset from each successive font entry. Font entries are grouped together in the 'workbook' substream, so by the time you start to deal with text, the codepage has been set from the charset in the last font record. That works if a document can be handled as "one encoding per document". Your document _sets_ the codepage to 1251 (hence I guess it was not generated by "1C") and mixes Greek, CE and Turkish font records, but it ends with a Russian font record. So the end result would be "everything is Latin I + Cyrillic".
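The "last font record wins" effect can be shown with a small Python sketch. The record order and charset values below are hypothetical test data, and `global_codepage_after_fonts` is an invented stand-in for a reader that keeps one workbook-wide codepage. It is not the patch itself.

```python
# Sketch of a reader that overwrites one workbook-wide codepage for every
# FONT record it meets. Data and names are hypothetical.
CHARSET_TO_CODEPAGE = {0x00: 1252, 0xA1: 1253, 0xA2: 1254, 0xCC: 1251, 0xEE: 1250}


def global_codepage_after_fonts(codepage_record, font_charsets):
    """Return the codepage left over after all FONT records were processed."""
    codepage = codepage_record
    for charset in font_charsets:
        codepage = CHARSET_TO_CODEPAGE.get(charset, codepage)
    return codepage


# The Codepage record says 1251; the fonts are Greek, CE, Turkish, then Russian.
fonts = [0xA1, 0xEE, 0xA2, 0xCC]
final = global_codepage_after_fonts(1251, fonts)

greek_bytes = "αβγ".encode("cp1253")
wrong = greek_bytes.decode(f"cp{final}")  # Greek cell decoded with the leftover codepage

print(final)
print(wrong)
```

Because all text records come after the font block, every cell is decoded with whatever the last font happened to set, regardless of the font the cell actually uses.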
Created attachment 193360 [details] [review] This one seems to handle 'charsets.xls' properly. Read and store 'charset' from FONT record, use it for g_iconv from LABEL record.
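A minimal Python sketch of the approach described here: keep the charset with each font and choose the conversion per label. `Font`, `Label`, and the direct font index are hypothetical simplifications of the FONT/XF/LABEL records, and Python codecs stand in for g_iconv.

```python
# Sketch: store the charset on each font and pick the codec per cell.
# All types and the index scheme are hypothetical simplifications.
from dataclasses import dataclass

CHARSET_TO_CODEC = {0xA1: "cp1253", 0xA2: "cp1254", 0xCC: "cp1251", 0xEE: "cp1250"}


@dataclass
class Font:
    name: str
    charset: int


@dataclass
class Label:
    font_index: int
    raw: bytes


def decode_label(label, fonts, workbook_codec):
    """Decode a label with its font's charset, else with the workbook codepage."""
    charset = fonts[label.font_index].charset
    return label.raw.decode(CHARSET_TO_CODEC.get(charset, workbook_codec))


fonts = [Font("Arial Greek", 0xA1), Font("Arial Cyr", 0xCC), Font("Arial", 0x00)]
labels = [
    Label(0, "αβγ".encode("cp1253")),
    Label(1, "абв".encode("cp1251")),
    Label(2, "abc".encode("cp1252")),
]
decoded = [decode_label(label, fonts, "cp1252") for label in labels]
print(decoded)
```

With the charset stored per font, the order of FONT and Codepage records no longer matters, and a mixed-language sheet decodes cell by cell.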
NEWS and Changelog entries for the last patch?
Created attachment 193662 [details] [review] With NEWS and ChangeLog
I have committed the last patch. What is left from this bug report?
UI to force encoding. Also it would be nice to have a command line option for ssconvert.
UI to specify encoding is bug #535473. Command line option for ssconvert would be closely related to that (so I would also consider it part of bug #535473). So we do not need two such bugs, closing this one.
Created attachment 193731 [details] [review] As per discussion on IRC, store codepage, convert charset in read_FONT, call gnm_font_override_codepage for charset 0. This one seems to convert file from ubuntu#262777 properly.
Review of attachment 193731 [details] [review]: We should not call gnm_font_override_codepage twice but remember the result from the first call.
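The review point, sketched in Python: do the lookup once when the font is read and keep the answer on the font. `override_codepage_for_font` is a hypothetical stand-in for the real lookup, and its name-based rule is invented purely for illustration.

```python
# Sketch: compute a font's codepage once and reuse the stored value.
# override_codepage_for_font() and its rule are hypothetical.
calls = 0


def override_codepage_for_font(font_name):
    """Pretend lookup: guess a codepage from the font name, or return None."""
    global calls
    calls += 1
    return 1251 if font_name.endswith(" Cyr") else None


class Font:
    def __init__(self, name, workbook_codepage):
        # Call the lookup exactly once and remember the outcome.
        override = override_codepage_for_font(name)
        self.codepage = override if override is not None else workbook_codepage


font = Font("Arial Cyr", 1252)
for _ in range(3):
    _ = font.codepage  # later uses read the stored value; no new lookup

print(font.codepage, calls)
```

Storing the result also guarantees that every later use of the font sees the same codepage, even if the lookup were to depend on state that changes during import.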
Review of attachment 193731 [details] [review]: committed with minor changes