Bug 796586 – QIF import incorrectly converts unicode characters from UTF8 encoded file

Summary:	QIF import incorrectly converts unicode characters from UTF8 encoded file


Status:	RESOLVED FIXED

Product:	GnuCash
Classification:	Other
Component:	Import - QIF
Version:	3.1
Hardware:	Other Windows

Importance:	Normal normal
Target Milestone:	future
Assigned To:	gnucash-import-maint
QA Contact:	gnucash-import-maint

Reported:	2018-06-14 14:32 UTC by mrzreat
Modified:	2018-06-30 00:11 UTC

Attachments
QIF file containing czech characters in transaction descriptions (1.21 KB, application/octet-stream) 2018-06-14 14:32 UTC, mrzreat

Description mrzreat 2018-06-14 14:32:08 UTC

Created attachment 372686
QIF file containing czech characters in transaction descriptions

After updating to GnuCash 3.1 from 2.6.x, the following problem appeared:
I have a UTF-8 encoded QIF file with transaction descriptions containing Czech characters (e.g. 'ř'; a sample file is attached). The import proceeds without any error messages, but all the non-ASCII characters become corrupted. For example, 'Připsaný bonusový úrok' becomes 'PÅ™ipsanÃ½ bonusovÃ½ Ãºrok' (which looks like the UTF-8 bytes were decoded as an ANSI code page).
Transactions imported before the update look fine, so it does not look like a database or display issue.

My OS is Windows 10, please let me know if any additional details are needed.
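
The corruption described above is what you get when UTF-8 bytes are decoded with a single-byte Windows code page. The following is a minimal Python sketch, not GnuCash code, and it assumes CP1252 is the code page applied by mistake. Under that assumption it reproduces the garbled string from the report:

```python
# Reproduce the reported mojibake: UTF-8 bytes read as Windows-1252.
# Assumption: CP1252 is the code page that was applied by mistake.
original = "Připsaný bonusový úrok"

utf8_bytes = original.encode("utf-8")
garbled = utf8_bytes.decode("cp1252")
print(garbled)

# The damage is reversible as long as every byte survived the round trip.
restored = garbled.encode("cp1252").decode("utf-8")
print(restored)
```

Because the round trip restores the original text, the file itself is intact and only the decoding step on import is wrong.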

Comment 1 mrzreat 2018-06-14 14:57:54 UTC

Same result even if the GnuCash interface language is changed to Czech.

Comment 2 John Ralls 2018-06-14 17:13:13 UTC

Can you edit gnucash/import-export/qif-imp/qif-file.scm Line 132 to say
  (with-input-from-file #:guess-encoding #t path
and see if that fixes it?

Comment 3 John Ralls 2018-06-14 17:15:33 UTC

Em, sorry, that should be c:\Program Files (x86)\gnucash\share\gnucash\scm\qif-import\qif-file.scm. You'll need to run the editor with admin privs.

Comment 4 mrzreat 2018-06-15 09:00:31 UTC

The fix you suggested unfortunately generates an error message during import (something like "there is a bug during import" or similar).
However, after digging into some specs, I managed to make it work for me by changing line 519 instead to 
(line-loop))))) #:encoding "UTF-8")

Unfortunately, (line-loop))))) #:guess-encoding #t) has no effect for some reason

Comment 5 John Ralls 2018-06-15 13:58:01 UTC

Ah, right, after the thunk. Sorry.
#:encoding "UTF-8" was my fallback, but I'm concerned that other sources may use other encodings. Does your QIF have a BOM?

Comment 6 mrzreat 2018-06-15 14:56:07 UTC

I tried both with and without a BOM. Unless the encoding is specified explicitly (#:encoding "UTF-8"), the result is the same.

I don't know how smart the encoding detection algorithm is, but maybe the problem is that only some of the transactions contain non-English characters. That should not matter when a BOM is present, though...

I also experimented with converting the file to ANSI Windows-1250, which garbles some characters while leaving others intact. Well, ANSI is a mess anyway and I can hardly imagine anyone sane using it for the Czech alphabet these days.
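
For reference, a BOM check is a simple prefix comparison, which is why a BOM would be expected to settle the encoding question. The Python sketch below is only an illustration: `sniff_bom` is a hypothetical helper, the QIF sample lines are made up, and it does not claim to mirror what Guile's #:guess-encoding does.

```python
import codecs

def sniff_bom(data: bytes):
    """Return the encoding implied by a leading BOM, or None if there is none."""
    # UTF-32 is checked before UTF-16 because the UTF-32 LE BOM
    # begins with the same two bytes as the UTF-16 LE BOM.
    boms = [
        (codecs.BOM_UTF8, "utf-8-sig"),
        (codecs.BOM_UTF32_LE, "utf-32"),
        (codecs.BOM_UTF32_BE, "utf-32"),
        (codecs.BOM_UTF16_LE, "utf-16"),
        (codecs.BOM_UTF16_BE, "utf-16"),
    ]
    for bom, name in boms:
        if data.startswith(bom):
            return name
    return None

# Made-up QIF fragment, once with and once without a UTF-8 BOM.
sample = "!Type:Bank\nMPřipsaný bonusový úrok\n^\n"
with_bom = codecs.BOM_UTF8 + sample.encode("utf-8")
without_bom = sample.encode("utf-8")

print(sniff_bom(with_bom))
print(sniff_bom(without_bom))
```

The "utf-8-sig" codec strips the BOM while decoding, so the decoded text starts directly with the first QIF line.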

Comment 7 John Ralls 2018-06-15 15:20:21 UTC

Guile's default encoding is CP1252 so it probably tried to use that to decode your CP1250 file resulting in misinterpreting some characters.

Unfortunately I can easily see an ignorant programmer who's only ever worked with Microsoft products using an ANSI code page instead of UTF-8. Open up a CMD shell and type chcp. It's going to return 1250 unless you've changed the default setting.
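
The "some characters are garbled, some are fine" symptom fits a CP1250 file being decoded as CP1252, because the two code pages agree on many byte values but not all. A small Python sketch of that mismatch, assuming those two code pages are the ones involved:

```python
# A CP1250-encoded string decoded with CP1252.
# Assumption: the file is CP1250 and the reader applies CP1252.
text = "Připsaný bonusový úrok"

misread = text.encode("cp1250").decode("cp1252")
print(misread)

# Both code pages are single-byte, so characters line up one to one
# and the damaged ones can be listed by direct comparison.
damaged = sorted({a for a, b in zip(text, misread) if a != b})
print(damaged)
```

Here 'ý' and 'ú' share the same byte value in both code pages and survive, while 'ř' does not.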

Comment 8 John Ralls 2018-06-16 17:46:24 UTC

Further study of the thunk shows that the non-UTF-8 case is already covered in the line-handling code, so I've pushed the #:encoding "UTF-8" fix. It will be in tomorrow's nightly and GnuCash 3.2.

Thanks!
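
The general idea of "read as UTF-8 first, deal with other encodings separately" can be sketched as follows. This is a hypothetical Python helper for illustration only, not the GnuCash line-handling code mentioned above. The name `decode_qif_bytes` and the CP1252 default fallback are assumptions.

```python
def decode_qif_bytes(data: bytes, fallback: str = "cp1252") -> str:
    """Decode as strict UTF-8, falling back to a legacy code page on failure."""
    try:
        # Strict decoding: any byte sequence that is not valid UTF-8 raises.
        return data.decode("utf-8")
    except UnicodeDecodeError:
        # Hypothetical fallback; undecodable bytes become U+FFFD
        # instead of aborting the import.
        return data.decode(fallback, errors="replace")

utf8_result = decode_qif_bytes("Připsaný".encode("utf-8"))
legacy_result = decode_qif_bytes("Připsaný".encode("cp1250"), fallback="cp1250")
print(utf8_result, legacy_result)
```

Trying UTF-8 first is a reasonable order because non-ASCII text in a single-byte code page rarely happens to form valid UTF-8 sequences, so a legacy file usually fails the strict decode and reaches the fallback.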

Comment 9 John Ralls 2018-06-30 00:11:51 UTC

GnuCash bug tracking has moved to a new Bugzilla host. This bug has been copied to https://bugs.gnucash.org/show_bug.cgi?id=796586. Please update any external references or bookmarks.