GNOME Bugzilla – Bug 346535
QIF import with control character in account description creates bad datafile that cannot be reopened
Last modified: 2018-06-29 21:08:59 UTC
I exported a large (1.38 MB) QIF file from Quicken for Mac 2004 and it imported successfully into GnuCash. However, after saving and re-opening the file in GnuCash, all account balances showed zero and some accounts were missing. After much digging I found that one account in the QIF file had a stray 0x05 control character as the first character of the "D" (description) field. The GnuCash QIF importer reported no problem, but it wrote the data file with this 0x05 character intact in the account description field, as in: <act:description><05>desc</act:description> Removing this character allowed the file to open normally. The problem originated with the Quicken export program placing an invalid character in the QIF file, but because GnuCash neither reports an error nor cleans up the string automatically, it gives the impression of having lost all the data and leaves the user guessing what happened.
This has been reported before as bug#344170, which was reportedly fixed in 1.9.8 but existed in earlier versions. Are you really seeing this problem in 1.9.8? Then we're (still) in trouble. In that case, could you attach a (very small) example QIF file that shows this problem? Thanks. Also related: bug#344841
Created attachment 68356: QIF to reproduce bug. Note that it contains a non-printing character (0x05) in the account description field, which is key to reproducing the bug.
What's the actual revision number you're using? Run: gnucash --version Looking at the QIF, it does have a non-printing control character, as you said. But it IS a valid UTF-8 Character, which is why it's let through. The question is why the XML parser barfs on it, and if there's something we can do to strip out those types of characters?
I built it from the provided Gentoo ebuild for 1.9.x:
$ gnucash --version
GnuCash 1.9.8 Built 2006-07-01 from r14384
I'm familiar with XML but not an expert; however, I don't believe 0x05 is an allowed character in XML. See http://www.w3.org/TR/REC-xml/#NT-Char which defines the allowed characters as:
[2] Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
Below 0x20, this allows only tab (#x9), line feed (#xA), and carriage return (#xD).
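For illustration, the Char production quoted above can be written as a small predicate over Unicode code points. This is only a sketch: the name is_xml_char is hypothetical and is not taken from GnuCash or from any XML library.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper (not a GnuCash or libxml2 function):
 * returns true if the code point cp matches the XML 1.0 production
 *   Char ::= #x9 | #xA | #xD | [#x20-#xD7FF]
 *          | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
 */
static bool is_xml_char(uint32_t cp)
{
    return cp == 0x9 || cp == 0xA || cp == 0xD
        || (cp >= 0x20 && cp <= 0xD7FF)
        || (cp >= 0xE000 && cp <= 0xFFFD)
        || (cp >= 0x10000 && cp <= 0x10FFFF);
}
```

With this predicate, is_xml_char(0x05) is false while is_xml_char(0x09) is true, which matches the behaviour reported here: the 0x05 in the description field is not a legal XML character.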
(targeting)
Created attachment 68365: Proposed Patch. I'm wondering if this patch will fix the problem? It changes the way we validate UTF-8 so that the invalid control characters are no longer accepted. The only characters considered valid by g_utf8_validate() that are not considered valid by the (new) gnc_utf8_validate() are characters < 0x20 except 0x09, 0x0A, and 0x0D... So this should be "good enough". I haven't tested it yet (beyond compiling), but I have to go.
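The extra condition described in this comment can be sketched roughly as follows. This is not the attached patch: the function names are hypothetical, and the code assumes the string has already passed ordinary UTF-8 validation (for example with g_utf8_validate()). Under that assumption a plain byte scan is enough, because in valid UTF-8 every byte below 0x20 is a complete single-byte character.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper: true for bytes below 0x20 other than
 * tab (0x09), line feed (0x0A) and carriage return (0x0D). */
static bool is_forbidden_control(unsigned char b)
{
    return b < 0x20 && b != 0x09 && b != 0x0A && b != 0x0D;
}

/* Returns a pointer to the first forbidden control character in the
 * NUL-terminated string s, or NULL if there is none.
 * Assumes s is already known to be valid UTF-8. */
static const char *find_forbidden_control(const char *s)
{
    for (; *s != '\0'; s++) {
        if (is_forbidden_control((unsigned char)*s))
            return s;
    }
    return NULL;
}

/* Removes forbidden control characters from s in place and returns
 * how many bytes were dropped. Same assumption as above. */
static size_t strip_forbidden_controls(char *s)
{
    char *out = s;
    size_t removed = 0;

    for (const char *in = s; *in != '\0'; in++) {
        if (is_forbidden_control((unsigned char)*in))
            removed++;
        else
            *out++ = *in;
    }
    *out = '\0';
    return removed;
}
```

A validator in this style would reject a string when find_forbidden_control() returns non-NULL. The strip variant shows one possible form of the automatic cleanup asked about earlier in this report. Whether the importer should reject or clean such strings is a design choice this sketch does not settle.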
I tested this patch and it seems to solve the problem. Committed as r14466.
GnuCash bug tracking has moved to a new Bugzilla host. This bug has been copied to https://bugs.gnucash.org/show_bug.cgi?id=346535. Please update any external references or bookmarks.