GNOME Bugzilla – Bug 796586
QIF import incorrectly converts unicode characters from UTF8 encoded file
Last modified: 2018-06-30 00:11:51 UTC
Created attachment 372686: QIF file containing Czech characters in transaction descriptions. After updating to GnuCash 3.1 from 2.6.x, the following problem appeared. I have a UTF-8 encoded QIF file whose transaction descriptions contain Czech characters (e.g. 'ř'; a sample file is attached). The import proceeds without any error messages, but all the Unicode characters become corrupted; for example, 'Připsaný bonusový úrok' becomes 'PÅ™ipsaný bonusový úrok' (which looks like an ANSI conversion). Transactions imported before the update look fine, so it does not appear to be a database or display issue. My OS is Windows 10. Please let me know if any additional details are needed.
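The corruption described above is consistent with UTF-8 bytes being decoded as a single-byte Windows code page. The following is a minimal Python sketch of that mechanism, not GnuCash code; the choice of CP1252 as the wrong decoder is an assumption made for illustration, and the variable names are hypothetical.

```python
# Illustration only: UTF-8 bytes mis-decoded as Windows-1252.
# Assumption: the importer read the file with a single-byte code page.
original = "Připsaný bonusový úrok"

# In UTF-8, 'ř' (U+0159) is the two-byte sequence 0xC5 0x99.
utf8_bytes = original.encode("utf-8")

# A single-byte decoder turns every byte into its own character,
# so 0xC5 0x99 becomes the pair 'Å' and '™'.
garbled = utf8_bytes.decode("cp1252")

# The damage is reversible as long as no byte was dropped or replaced:
restored = garbled.encode("cp1252").decode("utf-8")
```

Here `garbled` contains the same 'Å™' pair quoted in the report, and `restored` equals `original`, which suggests the bytes in the file were intact and only the decoding step was wrong.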
The results are the same even if the GnuCash interface language is changed to Czech.
Can you edit line 132 of gnucash/import-export/qif-imp/qif-file.scm so that it begins (with-input-from-file #:guess-encoding #t path and see if that fixes it?
Em, sorry, that should be c:\Program Files (x86)\gnucash\share\gnucash\scm\qif-import\qif-file.scm. You'll need to run the editor with admin privs.
The fix you suggested unfortunately produces an error message during import (something like "there is a bug during import"). However, after digging into some specs, I got it working by instead changing line 519 to (line-loop))))) #:encoding "UTF-8"). Unfortunately, (line-loop))))) #:guess-encoding #t) has no effect for some reason.
Ah, right, the keyword argument goes after the thunk. Sorry. #:encoding "UTF-8" was my fallback, but I'm concerned that other sources may use other encodings. Does your QIF have a BOM?
I tried both with and without a BOM. Unless the encoding is specified explicitly (#:encoding "UTF-8"), the result is the same. I don't know exactly how smart the encoding detection algorithm is, but maybe the cause is that not every transaction contains non-English characters; in fact, only some of them do. However, that should not matter when a BOM is present... I also experimented with converting the file to ANSI Windows-1250, and that messes up some characters while others come through fine. Well, ANSI is a mess anyway, and I can hardly imagine anyone sane using it for the Czech alphabet these days.
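For readers unfamiliar with the BOM mentioned above: a UTF-8 BOM is the three-byte prefix 0xEF 0xBB 0xBF at the start of a file. The Python sketch below shows how such a prefix can be detected and stripped. It says nothing about how Guile's #:guess-encoding works internally; the helper names are hypothetical.

```python
import codecs


def has_utf8_bom(data: bytes) -> bool:
    """Return True if the byte string starts with the UTF-8 BOM (EF BB BF)."""
    return data.startswith(codecs.BOM_UTF8)


def decode_utf8(data: bytes) -> str:
    """Decode UTF-8, dropping a leading BOM if one is present.

    The "utf-8-sig" codec behaves like "utf-8" but skips an initial BOM.
    """
    return data.decode("utf-8-sig")


sample = "Připsaný bonusový úrok"
with_bom = codecs.BOM_UTF8 + sample.encode("utf-8")
without_bom = sample.encode("utf-8")
```

Both `decode_utf8(with_bom)` and `decode_utf8(without_bom)` return the original string, so a reader that is told the encoding explicitly does not depend on the BOM at all.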
Guile's default encoding is CP1252 so it probably tried to use that to decode your CP1250 file resulting in misinterpreting some characters. Unfortunately I can easily see an ignorant programmer who's only ever worked with Microsoft products using an ANSI code page instead of UTF-8. Open up a CMD shell and type chcp. It's going to return 1250 unless you've changed the default setting.
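The "some characters are messed up and some are fine" symptom follows from the fact that CP1250 and CP1252 agree on many byte values but not all. A small Python sketch, using the phrase from the original report as the test string (the variable names are hypothetical):

```python
# Illustration only: a CP1250-encoded string decoded as CP1252.
text = "Připsaný bonusový úrok"

cp1250_bytes = text.encode("cp1250")
misread = cp1250_bytes.decode("cp1252")

# Collect the positions where the two code pages disagree.
changed = [(a, b) for a, b in zip(text, misread) if a != b]
# 'ř' is byte 0xF8 in CP1250, which CP1252 maps to 'ø';
# 'ý' (0xFD) and 'ú' (0xFA) sit at the same byte in both pages.
```

For this phrase `misread` is 'Pøipsaný bonusový úrok': only 'ř' changes, while 'ý' and 'ú' survive because both code pages assign them the same byte.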
Further study of the thunk shows that the non-UTF-8 case is already handled in the line-handling code, so I've pushed the #:encoding "UTF-8" fix. It will be in tomorrow's nightly and GnuCash 3.2. Thanks!
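The thread does not show how the line-handling code deals with input that is not UTF-8. Purely as an illustration of one common pattern, and not as a description of GnuCash's implementation, a reader can try strict UTF-8 first and fall back to a legacy code page only when that fails. The function name and the default fallback below are assumptions.

```python
def decode_line(raw: bytes, fallback: str = "cp1252") -> str:
    """Decode one line, preferring UTF-8 and falling back to a legacy code page.

    Hypothetical helper: strict UTF-8 decoding rejects most legacy-encoded
    text that contains non-ASCII bytes, so failure is a usable signal.
    """
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # errors="replace" keeps the import going if a byte is undefined
        # in the fallback code page.
        return raw.decode(fallback, errors="replace")
```

For example, `decode_line("úrok".encode("utf-8"))` returns 'úrok' directly, while `decode_line("úrok".encode("cp1250"), fallback="cp1250")` reaches the same result through the fallback branch, because byte 0xFA cannot start a valid UTF-8 sequence.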
GnuCash bug tracking has moved to a new Bugzilla host. This bug has been copied to https://bugs.gnucash.org/show_bug.cgi?id=796586. Please update any external references or bookmarks.