GNOME Bugzilla – Bug 691542
xml not read fully if there are special chars in strings
Last modified: 2018-06-29 23:12:52 UTC
I installed gnucash yesterday (never used it before), started a new gnucash file and entered lots of data (it took some hours), saved as xml and closed gnucash. After reopening the xml file, all transactions after a certain date were missing.

In the xml file the entries are ordered by date, so I searched for that "last good" date and looked at the next entry. This next entry contains strange chars (I assume it gets broken if I copy and paste it here, see attachment for binary file):

<slot:value type="string">2369394493140900 Allos Bio Krï¿melmonster-Kekse (150 g)</slot:value>

This text originally came by copy and paste, directly from the browser to gnucash.

There was a related issue in 2007: http://lists.gnucash.org/pipermail/gnucash-user/2007-February/019334.html

It seems gnucash reads the xml file "as best as it can", and if there is something unexpected it just skips that dataset or stops parsing altogether. This leads to data corruption which the user has to detect. The minimum I would have expected is an error message.

I don't know if this is related to the xml parser or a higher abstraction layer, so I put this in 'General' and not 'XML Backend'.

OS: WinXP SP3, German
Created attachment 233218: excerpt from broken gnucash xml
Ah, it doesn't display, but when I copy and paste the strings into emacs, an extra byte shows up:

    00000000: 4b72 c3af c2bf 1a6d 656c 6d6f 6e73 7465  Kr.....melmonste
    00000010: 722d 4b65 6b73 65                        r-Kekse

That '1a' in byte 6 is an invalid UTF-8 character, which raises an XML error. At least in 2.5.9, it raises an error instead of silently dropping data on the floor, but we need to check the strings to ensure that they're valid before writing them into the file.

BTW, the root cause of the problem is that you tried to copy a string encoded in a Windows code page into a text entry that's expecting UTF-8. The paste routine in Gtk should have transcoded it for you, but it failed for some reason.
Actually, 0x1a is legitimate UTF-8: it's one of the ASCII control characters. What it isn't is a legal XML character, and LibXML converts only the XML-legal control characters (LF, HT, and CR) to entities. Fixed in r23598.
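For anyone who wants to reproduce the distinction, here is a minimal Python sketch. It is not the code from r23598 and not GnuCash's C implementation. It takes the bytes from the hexdump above, shows that they decode cleanly as UTF-8, and shows that an XML parser still rejects the result until the 0x1a is removed. The helper names `strip_xml_illegal` and `parses_as_xml` are made up for this illustration. The character class follows the `Char` production of XML 1.0, and dropping the offending characters (rather than replacing or rejecting them) is just one possible policy.

```python
import re
import xml.etree.ElementTree as ET

# Bytes copied from the hexdump in the comment above.
RAW = bytes.fromhex("4b72c3afc2bf1a6d656c6d6f6e737465722d4b656b7365")

# Anything outside the XML 1.0 "Char" production:
#   #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
_XML_ILLEGAL = re.compile(
    r"[^\x09\x0A\x0D\x20-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)


def strip_xml_illegal(text: str) -> str:
    """Hypothetical helper: drop characters that XML 1.0 does not allow."""
    return _XML_ILLEGAL.sub("", text)


def parses_as_xml(text: str) -> bool:
    """Hypothetical helper: True if text is accepted as element content."""
    try:
        ET.fromstring("<v>" + text + "</v>")
        return True
    except ET.ParseError:
        return False


# Decoding succeeds, so the byte sequence is valid UTF-8, 0x1a included.
text = RAW.decode("utf-8")

print(parses_as_xml(text))                     # False: 0x1a is not legal XML
print(parses_as_xml(strip_xml_illegal(text)))  # True once 0x1a is dropped
```

A check along these lines would have to run on strings before they are written to the file, since the parser can only report the problem after the data has already been saved.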
GnuCash bug tracking has moved to a new Bugzilla host. This bug has been copied to https://bugs.gnucash.org/show_bug.cgi?id=691542. Please update any external references or bookmarks.