Bug 691542 - xml not read fully if there are special chars in strings
Status: RESOLVED FIXED
Product: GnuCash
Classification: Other
Component: Backend - XML
Version: git-master
Hardware: Other Windows
Importance: Normal normal
Target Milestone: ---
Assigned To: gnucash-core-maint
QA Contact: gnucash-core-maint
Depends on:
Blocks:
 
 
Reported: 2013-01-11 11:47 UTC by rawe
Modified: 2018-06-29 23:12 UTC
See Also:
GNOME target: ---
GNOME version: ---


Attachments
excerpt from broken gnucash xml (490 bytes, text/plain)
2013-01-11 11:56 UTC, rawe

Description rawe 2013-01-11 11:47:02 UTC
Just installed gnucash yesterday (never used it before), started a new gnucash file and entered lots of data (took some hours), saved as xml and closed gnucash. After reopening the xml file all transactions after a certain date were missing.


In the XML file the entries are ordered by date, so I searched for that "last good" date and looked at the next entry. This next entry contains strange characters (I assume it gets mangled if I copy and paste it here; see the attachment for the binary file):

      <slot:value type="string">2369394493140900 Allos Bio Krï¿melmonster-Kekse (150 g)</slot:value>


This text originally came by copy&paste, directly from the browser to gnucash.


There was a related issue in 2007: http://lists.gnucash.org/pipermail/gnucash-user/2007-February/019334.html


It seems gnucash reads the XML file "as best as it can", and if it hits something unexpected it just skips that dataset or stops parsing altogether. This leads to data corruption that the user has to detect on their own. The minimum I would have expected is an error message.


I don't know if this is related to the XML parser or a higher abstraction layer, so I put this in 'General' and not 'XML Backend'.


OS: WinXP SP3, german
Comment 1 rawe 2013-01-11 11:56:50 UTC
Created attachment 233218
excerpt from broken gnucash xml
Comment 2 John Ralls 2013-12-14 20:13:54 UTC
Ah, it doesn't display, but when I copy and paste the strings into emacs, an extra byte shows up:
00000000: 4b72 c3af c2bf 1a6d 656c 6d6f 6e73 7465  Kr.....melmonste
00000010: 722d 4b65 6b73 65                        r-Kekse

That '1a' in byte 6 is an invalid UTF-8 character, which raises an XML error. At least in 2.5.9, it raises an error instead of silently dropping data on the floor, but we need to check the strings to ensure that they're valid before writing them into the file.

BTW, the root cause of the problem is that you tried to copy a string encoded in a Windows code page into a text entry that's expecting UTF-8. The paste routine in Gtk should have transcoded it for you, but it failed for some reason.
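
For anyone who wants to reproduce the byte offset from the hex dump in this comment, here is a minimal Python sketch. The helper name `control_byte_offsets` is made up for illustration and is not part of GnuCash; the byte values are copied from the dump.

```python
# Bytes copied from the hex dump in this comment ("Kr" ... "r-Kekse").
raw = bytes.fromhex(
    "4b72c3afc2bf1a6d656c6d6f6e737465"
    "722d4b656b7365"
)


def control_byte_offsets(data: bytes) -> list[int]:
    """Return 0-based offsets of ASCII control bytes other than HT, LF and CR.

    Hypothetical helper for illustration only.
    """
    allowed = {0x09, 0x0A, 0x0D}
    return [i for i, b in enumerate(data) if b < 0x20 and b not in allowed]


print(control_byte_offsets(raw))  # prints [6]
```

The only offending byte is the 0x1a at offset 6, counting from zero, which matches the "byte 6" mentioned in this comment.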
Comment 3 John Ralls 2013-12-22 22:34:21 UTC
Actually, 0x1a is legitimate UTF-8: it's one of the ASCII control characters. What it isn't is valid XML, and LibXML converts only the XML-legal control characters (LF, HT, and CR) to entities.

Fixed in r23598.
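
To illustrate the distinction drawn in this comment, here is a rough Python sketch, not the code from r23598 (which is not shown in this report). It assumes the character range from the XML 1.0 `Char` production and uses Python's standard `xml.etree.ElementTree` parser rather than LibXML. The names `strip_illegal_xml_chars` and `parses` are hypothetical, and the element is written without the `slot:` prefix so that no namespace declaration is needed.

```python
import re
import xml.etree.ElementTree as ET

# Anything outside the XML 1.0 "Char" production:
# #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
_ILLEGAL_XML_CHARS = re.compile(
    r"[^\x09\x0A\x0D\x20-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)


def strip_illegal_xml_chars(text: str) -> str:
    """Drop characters that XML 1.0 does not allow (hypothetical helper)."""
    return _ILLEGAL_XML_CHARS.sub("", text)


def parses(xml_text: str) -> bool:
    """Return True if the standard-library parser accepts the document."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


# The string from the report, with the 0x1a control character written out.
value = "Allos Bio Kr\u00ef\u00bf\x1amelmonster-Kekse (150 g)"

# Encoding succeeds: the string is legitimate UTF-8.
value.encode("utf-8")

broken = f'<value type="string">{value}</value>'
fixed = f'<value type="string">{strip_illegal_xml_chars(value)}</value>'

print(parses(broken))  # the parser rejects the raw 0x1a
print(parses(fixed))   # accepted once the control character is removed
```

Whether to strip such characters, replace them, or refuse the input is a design choice; the sketch strips them only to keep the example short.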
Comment 4 John Ralls 2018-06-29 23:12:52 UTC
GnuCash bug tracking has moved to a new Bugzilla host. This bug has been copied to https://bugs.gnucash.org/show_bug.cgi?id=691542. Please update any external references or bookmarks.