Bug 329202 - 1.8 datafile non-latin1 characters messed up in gnucash-2.0
Status: VERIFIED FIXED
Product: GnuCash
Classification: Other
Component: Backend - XML
Version: git-master
Hardware: Other Linux
Importance: Normal critical
Target Milestone: ---
Assigned To: Andreas Köhler
QA Contact: Chris Lyttle
Duplicates: 147717
Depends on:
Blocks:
Reported: 2006-01-30 13:04 UTC by Kostik
Modified: 2018-06-29 20:56 UTC
See Also:
GNOME target: ---
GNOME version: ---


Attachments
Test file to reproduce bug (2.72 KB, application/xml), 2006-01-30 21:43 UTC, Kostik
GnuCash-1.8.10-Accounts (12.85 KB, image/png), 2006-01-30 21:45 UTC, Kostik
GnuCash-1.8.10-Report (15.70 KB, image/png), 2006-01-30 21:45 UTC, Kostik
GnuCash-1.8.10-Transactions (16.63 KB, image/png), 2006-01-30 21:46 UTC, Kostik
GnuCash-SVN_r12349-Accounts (32.21 KB, image/png), 2006-01-30 21:46 UTC, Kostik
GnuCash-SVN_r12349-Report (32.84 KB, image/png), 2006-01-30 21:47 UTC, Kostik
GnuCash-SVN_r12349-Transactions (40.70 KB, image/png), 2006-01-30 21:48 UTC, Kostik

Description Kostik 2006-01-30 13:04:37 UTC
Distribution/Version: Debian Sarge

GnuCash 1.8 files with koi8-r, koi8-u account names and transaction descriptions are not being displayed correctly in GnuCash SVN_r12349.

This prevents me from using G2 with the file I was using in 1.8.

This bug is a result of bug 310102 (http://bugzilla.gnome.org/show_bug.cgi?id=310102). Details are described there.

To reproduce, run GnuCash 1.8 in koi8-u locale:

------
$ export LANG=ru_RU.koi8-u
$ export LC_ALL=ru_RU.koi8-u
$ gnucash
------

Then add some accounts with koi8-u names. Then add some transactions with koi8-u descriptions. Save the file.

When you open this file in G2, you'll see characters from the iso8859-1 charset instead of the Russian (koi8-u) characters.

ScreenShots and test file are sent to gnucash-devel@gnucash.org list.

BTW, this fact makes GnuCash G2 not forward-compatible!
Comment 1 Christian Stimming 2006-01-30 20:00:59 UTC
https://lists.gnucash.org/pipermail/gnucash-devel/2006-January/016032.html has the screenshots -- they would be better attached here.
Comment 2 Kostik 2006-01-30 21:43:36 UTC
Created attachment 58420
Test file to reproduce bug
Comment 3 Kostik 2006-01-30 21:45:18 UTC
Created attachment 58422
GnuCash-1.8.10-Accounts
Comment 4 Kostik 2006-01-30 21:45:53 UTC
Created attachment 58423
GnuCash-1.8.10-Report
Comment 5 Kostik 2006-01-30 21:46:17 UTC
Created attachment 58424
GnuCash-1.8.10-Transactions
Comment 6 Kostik 2006-01-30 21:46:39 UTC
Created attachment 58425
GnuCash-SVN_r12349-Accounts
Comment 7 Kostik 2006-01-30 21:47:41 UTC
Created attachment 58426
GnuCash-SVN_r12349-Report
Comment 8 Kostik 2006-01-30 21:48:04 UTC
Created attachment 58427
GnuCash-SVN_r12349-Transactions
Comment 9 Christian Stimming 2006-01-31 09:24:42 UTC
As a quick workaround, can you test the following one-liner to convert your 1.8 datafile into utf8?

  perl -pe's/&\#([0-9]+);/chr($1)/ge;' test1.xml | recode koi8-r..utf8 > test-utf8.xml

and see whether the resulting file "test-utf8.xml" is read correctly by gnucash-SVN? 

With your test file I can verify that the resulting file will have utf8 characters, but since I can't read cyrillic I don't know whether these are the correct characters :-)

I know that this needs to be fixed in gnucash, but this one-liner would at least be a usable workaround for early testing.
Comment 10 Kostik 2006-01-31 09:51:24 UTC
Since I don't have recode, I used the following with iconv:

perl -pe's/&\#([0-9]+);/chr($1)/ge;' test1.xml | iconv -f koi8-u -t utf8 > test-utf8.xml

This SOLVED the problem, now cyrillic characters appear correctly.
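
For anyone without recode or iconv, the same workaround can be sketched in Python. This is only an illustration, not GnuCash code: the function name `convert_18_file` is hypothetical, the source encoding has to be supplied by the user, and only references in the range 128..255 are rewritten so that escaped XML markup such as `&#38;` stays intact (the Perl one-liner rewrites every reference).

```python
import re

# Matches decimal character references such as "&#243;".
_CHAR_REF = re.compile(rb"&#([0-9]+);")


def _ref_to_byte(match):
    """Turn a reference in 128..255 back into the single raw byte it stands for."""
    code = int(match.group(1))
    if 128 <= code <= 255:
        return bytes([code])
    return match.group(0)  # leave ASCII and out-of-range references untouched


def convert_18_file(data: bytes, source_encoding: str = "koi8_u") -> bytes:
    """Return UTF-8 bytes for 1.8 file data whose text was in source_encoding.

    source_encoding is an assumption made by the caller; nothing here detects it.
    """
    raw = _CHAR_REF.sub(_ref_to_byte, data)
    return raw.decode(source_encoding).encode("utf-8")


if __name__ == "__main__":
    sample = b"Account1 - &#243;&#222;&#163;&#212;1"
    print(convert_18_file(sample).decode("utf-8"))
```

Passing `source_encoding="koi8_r"` or `"cp1251"` covers files written under those locales; picking the wrong one produces readable-looking but wrong text, so the result still needs to be checked by eye.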
Comment 11 Christian Stimming 2006-02-01 09:17:00 UTC
Thanks for this feedback. At least we have a workaround available. Actually someone some time ago said that libxml2 in SVN-HEAD should be able to detect the encoding of the input file automatically. In that case, the conversion should be done automatically as soon as we get rid of these weird &#123;-style character references. Can you tell us whether a file like this:

  perl -pe's/&\#([0-9]+);/chr($1)/ge;' test1.xml > test-8bit.xml

will also be read correctly into gnucash SVN? (Again I can't tell which cyrillic characters will be the correct ones.) If it is, it means that the libxml2 will automatically do the correct conversion. And is it then written in UTF-8? (well, at least that I can check myself, but I don't have time right now)
Comment 12 Kostik 2006-02-01 10:31:37 UTC
After converting:

perl -pe's/&\#([0-9]+);/chr($1)/ge;' test1.xml > test-8bit.xml

I got a koi8-u file. GnuCash SVN opens it, but there are no accounts and no transactions. In the terminal it complains:

(gnucash:14152): Pango-WARNING **: Invalid UTF-8 string passed to pango_layout_set_text()

This complaint sometimes appears with other, good files too.
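
The warning is what one would expect if raw koi8-u bytes were handed on as though they were UTF-8. A small Python check, illustrative only, shows that the four bytes from the test file's account name are not valid UTF-8, while the same text re-encoded properly is. The helper name `is_valid_utf8` is made up for this sketch.

```python
# Bytes 243, 222, 163, 212: the account-name suffix once the references are expanded.
KOI8_BYTES = bytes([243, 222, 163, 212])


def is_valid_utf8(data: bytes) -> bool:
    """Report whether data decodes as UTF-8 without errors."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


if __name__ == "__main__":
    # Raw 8-bit text, as found in the converted 8-bit file.
    print(is_valid_utf8(KOI8_BYTES))
    # The same text decoded as koi8-u and re-encoded as UTF-8.
    print(is_valid_utf8(KOI8_BYTES.decode("koi8_u").encode("utf-8")))
```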
Comment 13 Christian Stimming 2006-02-02 12:46:31 UTC
Thanks for providing this very last information. This means that contrary to what we thought, libxml2 does *not* convert any non-latin1 non-utf8 file correctly into utf8 automatically. 

One last check: What happens if you edit the original 1.8 data file and insert the encoding attribute in the very first line yourself? I.e. change the very first line into

<?xml version="1.0" encoding="koi8-u"?>

Would that work correctly when opening the file in gnucash2? If it does, then we've got a forward- and backward compatible solution to the problem: gnucash2 would need to (eventually!) keep the encoding of the data file somewhere in its memory; when reading a data file with the encoding="" set, libxml2 will parse it correctly and convert it into utf-8 in memory; when writing a file, gnucash has to make sure that libxml2 will add this encoding="" attribute correctly (need to check how to achieve this in libxml2). Only for the case when there is no encoding="" attribute we would have to make something up -- we could either ask the user, or parse the file twice, first for autodetection of the encoding, then for actually reading it.
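
The read/write round trip proposed here can be tried outside GnuCash. The sketch below uses Python's `xml.etree.ElementTree`, which is built on expat rather than libxml2, so it only illustrates what an encoding declaration buys; it says nothing about how libxml2 has to be called. The element name `name` and both helper functions are made up for the example.

```python
import xml.etree.ElementTree as ET

# A tiny 8-bit document: the declaration names the encoding of the raw bytes.
DOC = b'<?xml version="1.0" encoding="koi8-u"?><name>\xf3\xde\xa3\xd4</name>'


def read_name(doc: bytes) -> str:
    """Parse doc; the parser converts from the declared encoding to Unicode."""
    return ET.fromstring(doc).text


def write_name(text: str, encoding: str) -> bytes:
    """Serialize with an explicit encoding so that a declaration is written out."""
    element = ET.Element("name")
    element.text = text
    return ET.tostring(element, encoding=encoding)


if __name__ == "__main__":
    name = read_name(DOC)
    print(name)
    print(write_name(name, "koi8-u"))
```

Keeping the original encoding name around, as suggested, is what makes the second call possible: the writer needs to be told which encoding to declare and use.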
Comment 14 Christian Stimming 2006-02-02 15:31:12 UTC
Oh, and yet another question: What happens if you open the converted file back in gnucash-1.8? I expect: If you open the utf8-converted file without the encoding="..." attribute then gnucash-1.8 will probably complain; if you open the utf8-converted file, but add encoding="utf-8" in the first line it might work, but what is your actual observation? (Also note that unfortunately the encoding="..." attribute doesn't get written by the current gnucash, which is an error that we try to fix soon.)
Comment 15 Kostik 2006-02-03 09:07:00 UTC
(In reply to comment #13)
> One last check: What happens if you edit the original 1.8 data file and insert
> the encoding attribute in the very first line yourself? I.e. change the very
> first line into
> 
> <?xml version="1.0" encoding="koi8-u"?>
> 
> Would that work correctly when opening the file in gnucash2?

Inserting that line in test1.xml (the original 1.8 file) makes no difference. My guess is: the file is actually in UTF-8, and GnuCash/libxml detects that, but the UTF-8 itself is corrupt due to the incorrect conversion in 1.8.

BUT, inserting that line in test-8bit.xml (a correct koi8-u file) lets G2 open it correctly! After "Save" the file was converted to UTF-8 correctly.

BTW, I've just done some research, and this is what I found:
A line from test1.xml: Account1 - &#243;&#222;&#163;&#212;1
"&#243;" is a decimal character reference. Read as Unicode code point 243, it becomes the UTF-8 bytes 0xC3 0xB3. (243 is the decimal code of the NEEDED character in koi8-u.)

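The mismatch between the two readings of the number 243 can be reproduced in a few lines of Python. This illustrates the byte arithmetic only, not GnuCash internals; the variable names are chosen for the example.

```python
CODE = 243  # the number inside "&#243;"

# An XML parser reads the reference as Unicode code point 243 (U+00F3).
as_code_point = chr(CODE)
utf8_bytes = as_code_point.encode("utf-8")

# The reporter's intent: byte 243 looked up in the koi8-u table instead.
intended = bytes([CODE]).decode("koi8_u")

if __name__ == "__main__":
    print(repr(as_code_point), utf8_bytes.hex())
    print(repr(intended))
```

The first line of output is the Latin-1 letter that G2 displays; the second is the Cyrillic letter the file was supposed to contain.
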
1. So a 1.8 file can't be read normally without conversion. Also, there is no automatic way (without using dictionaries) to determine precisely which codeset the file is in. For example, there are two Russian codesets, koi8-u and cp1251, which both have the same characters but in a different order. Here I would suggest asking the user for the codeset, like MS Office XP does (it lets you select a codeset and shows the result of the conversion so you can see whether you can read it).

2. It's a good idea to record the encoding in the G2 file for the future. There may also be a need for an export from G2 to 1.8 that specifies the codeset it will be used in.
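
The "let the user pick" idea in point 1 can be sketched as follows. This is a hypothetical helper, not the dialog GnuCash ended up with: it decodes the same bytes under each candidate encoding and returns the previews so that a person can choose the readable one. The candidate list is an assumption for the example.

```python
CANDIDATES = ("koi8_u", "cp1251")  # hypothetical shortlist offered to the user


def preview(raw: bytes, encodings=CANDIDATES) -> dict:
    """Decode raw under each candidate so a person can pick the readable result."""
    results = {}
    for name in encodings:
        try:
            results[name] = raw.decode(name)
        except UnicodeDecodeError:
            results[name] = None  # raw is not valid in this encoding
    return results


if __name__ == "__main__":
    sample = bytes([243, 222, 163, 212])  # account-name bytes from the test file
    for name, text in preview(sample).items():
        print(f"{name}: {text}")
```

Both candidates decode without error here, which is exactly why the choice cannot be made automatically from validity alone.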
Comment 16 Kostik 2006-02-03 09:11:21 UTC
(In reply to comment #14)
> Oh, and yet another question: What happens if you open the converted file back
> in gnucash-1.8? I expect: If you open the utf8-converted file without the
> encoding="..." attribute then gnucash-1.8 will probably complain; if you open
> the utf8-converted file, but add encoding="utf-8" in the first line it might
> work, but what is your actual observation? (Also note that unfortunately the
> encoding="..." attribute doesn't get written by the current gnucash, which is
> an error that we try to fix soon.)

GnuCash 1.8 ignores whatever is placed in encoding=""; it always treats the file as being in the same charset as the current locale. As a result, I get two unreadable characters in 1.8 for every UTF-8 character.
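
The doubling can be reproduced by decoding UTF-8 bytes with a single-byte locale encoding. A minimal sketch, assuming a koi8-u locale; the function name `misread_as_locale` is made up for the example.

```python
def misread_as_locale(text: str, locale_encoding: str = "koi8_u") -> str:
    """Decode the UTF-8 bytes of text as if they were in locale_encoding."""
    return text.encode("utf-8").decode(locale_encoding)


if __name__ == "__main__":
    original = "\u0421"  # CYRILLIC CAPITAL LETTER ES, two bytes in UTF-8
    garbled = misread_as_locale(original)
    print(len(original), len(garbled))
```

Each Cyrillic letter occupies two bytes in UTF-8, and a single-byte charset turns every byte into its own character, hence two characters on screen per letter.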
Comment 17 Derek Atkins 2006-02-06 14:42:57 UTC
Yes, comment #16 is a result of 1.8 using libxml1 and (probably) not using the "New" parser.  Unfortunately there isn't much we can do about that at the moment; I suspect we're going to just have to live with the files not being backwards compatible.  :(
Comment 18 Kostik 2006-02-06 15:44:30 UTC
(In reply to comment #17)
> Yes, comment #16 is a result of 1.8 using libxml1 and (probably) not using the
> "New" parser.  Unfortunately there isn't much we can do about that at the
> moment; I suspect we're going to just have to live with the files not being
> backwards compatible.  :(

Opening G2 files back in 1.8 is not a major goal. But it could be a nice feature, e.g. "export to 1.8 with charset...".

For me, opening 1.8 files in G2 is the important part.
Comment 19 Christian Stimming 2006-02-22 09:31:39 UTC
*** Bug 147717 has been marked as a duplicate of this bug. ***
Comment 20 Derek Atkins 2006-03-15 17:42:44 UTC
I'm updating this bug to critical, because it's a data-loss problem when upgrading from 1.8 to 2.0.  We really need a good fix for this, even if it's something where gnucash has an import process to do the conversion.
Comment 21 Andreas Köhler 2006-04-19 18:00:47 UTC
There is a solution in 1.9.5. Please file new bugs if it does not work for you.

It will recognize files without encoding declaration and start a conversion druid, helping you to restore your file, even if it contains mixed encodings.
Comment 22 John Ralls 2018-06-29 20:56:27 UTC
GnuCash bug tracking has moved to a new Bugzilla host. This bug has been copied to https://bugs.gnucash.org/show_bug.cgi?id=329202. Please update any external references or bookmarks.