Bug 729476 – CSV Export/import can't deal with quote Char inside fields

After an evaluation, GNOME has moved from Bugzilla to GitLab. Learn more about GitLab.
No new issues can be reported in GNOME Bugzilla anymore.
To report an issue in a GNOME project, go to GNOME GitLab.
Do not go to GNOME Gitlab for: Bluefish, Doxygen, GnuCash, GStreamer, java-gnome, LDTP, NetworkManager, Tomboy.

Bug 729476 - CSV Export/import can't deal with quote Char inside fields


Summary:	CSV Export/import can't deal with quote Char inside fields


Status:	RESOLVED FIXED

Product:	GnuCash
Classification:	Other
Component:	Import - CSV
Version:	2.6.3
Hardware:	Other Linux

Importance:	Normal normal
Target Milestone:	---
Assigned To:	gnucash-import-maint
QA Contact:	gnucash-import-maint

URL:
Whiteboard:

Depends on:
Blocks:

Reported:	2014-05-03 20:25 UTC by dcdev
Modified:	2018-06-29 23:30 UTC

See Also:
GNOME target:	---
GNOME version:	---

Attachments
Patch some csv bugs (95.67 KB, patch) 2014-09-16 16:02 UTC, Bob	needs-work	Details \| Review
Fixes quoting of quotes, newline and separator (39.69 KB, patch) 2014-09-22 09:17 UTC, Bob	committed	Details \| Review
Changes the line ending of exported files (6.27 KB, patch) 2014-09-22 09:19 UTC, Bob	committed	Details \| Review
Changes the white space in CSV source files so it is more consistent (167.91 KB, patch) 2014-09-22 09:22 UTC, Bob	committed	Details \| Review
Change some CSV strings (6.85 KB, patch) 2014-09-22 09:23 UTC, Bob	committed	Details \| Review

Description dcdev 2014-05-03 20:25:27 UTC

it seems that the following definiton for CSV Files is not yet implemented:

from: http://en.wikipedia.org/wiki/Comma-separated_values#Toward_standardization

"A (double) quote character in a field must be represented by two (double) quote characters."

Follow these steps to reproduce it:

1. create a single account inside a ".gnucash" File, containing a text using the CSV-quote-char (double quote == ASCII 0x22) inside the notes Field, e.g. 

say "Hello World"

2. use "export account tree to CSV ..."; Seperator comma; enter filename "test.csv"

3. delete the account

4. use "import accounts from CSV ..."; chose file "test.csv"; Seperator "comma seperated with quotes" & Number of Rows for the header: "1"

This won't import your account, the reasons are:

- export hasn't converted the note fom 
say "Hello World"
to 
say ""Hello World""

- if you would do this manually, import can't deal with this.

Comment 1 Michalis 2014-05-05 13:37:34 UTC

This bug affects both accounts, and transactions. Double quotes are not handled in neither export, nor import. 
Although the export problem can be easily solved with a small patch, import is not so easy. 

Import of accounts is done through a regex which cannot handle double double quotes. 
Import of transactions is NOT done through a regex, the preview works very correctly with double double quotes, but the actual import fails.

Comment 2 Bob 2014-09-16 16:02:35 UTC

Created attachment 286308 [details] [review]
Patch some csv bugs

This patch should make the CSV Account export and import more comparable to the specification linked above. It should take care off quoting quotes, newlines and the separator character.

Also in the patch is the ability to skip lines for transactions and a fix for the number of rows imported which was restricted to 1000 by the spin button.

Regards,
   Bob

Comment 3 Geert Janssens 2014-09-16 19:13:35 UTC

Comment on attachment 286308 [details] [review]
Patch some csv bugs

Hi bob, welcome back and nice work.

The patch needs some cleaning up though before it can be committed.

1. The single patch implements multiple features:
- proper quote handling around quotes and newlines
- the ability to skip lines in imports
Those two should be implemented in separate commits. All in one commit makes it very hard to follow what change was made for which feature.
There are also undocumented improvements apparently: you distinguish between '\n' and '\r\n' as line separator based on operating system. That also would be better if it was done in a separate commit.

2. At the same time you are improving strings. This is great, but should again go into a separate commit. The reason here is that string changes will go to master while your bugfixes will be applied to both maint *and* master.

3. The glade files have lots of unnecessary changes. I know this is due to how the glade application works. It's notoriously poor in making version control friendly changes. Can you manually go through the glade files please and only pick the real changes ? The "git add -i" command can be tremendously useful for such manual change picking.

4. I also find several white space improvements like changing
gfree( xyz );
into
gfree (xyz);
and variations thereof. Very good, that makes the code more readable. However - I guess the theme should become apparent by now - please do this in a separate commit.

The main purpose is to make the history easily readable for other developers now and in the future. Git comes with several tools to help you clean up your own history before merging it into the upstream branches. I already mentioned "git add -i". Similarly there is also "git rebase -i". If you need help getting started with cleaning up your commits, feel free to ping me on irc or on the mailing list. I'll be happy to help you out.

Also if you don't feel like uploading several patches to bugzilla, you can push your local git branch to a personal repository on github and generate a pull request there. Both ways are fine.

Thanks.

Comment 4 Bob 2014-09-18 08:35:38 UTC

I thought I would not get away with this, I will split out as requested.

Comment 5 Bob 2014-09-22 09:17:34 UTC

Created attachment 286786 [details] [review]
Fixes quoting of quotes, newline and separator

Patches on bugs 731519 and 726888 may need to be applied first for these patches to be applied cleanly.

This patch should make the CSV Account export and import more comparable to the
specification linked above. It should take care off quoting quotes, newlines
and the separator character.

Comment 6 Bob 2014-09-22 09:19:55 UTC

Created attachment 286787 [details] [review]
Changes the line ending of exported files

This patch changes the CSV export line endings to \r\n so it is more comparable to the specification linked above based on file system.

Comment 7 Bob 2014-09-22 09:22:30 UTC

Created attachment 286788 [details] [review]
Changes the white space in CSV source files so it is more consistent

This patch changes the white space in the CSV source files so it is more consistent and readable.

Comment 8 Bob 2014-09-22 09:23:47 UTC

Created attachment 286790 [details] [review]
Change some CSV strings

This patch changes some strings in the CSV import and export files to make them more readable.

Comment 9 Geert Janssens 2014-09-23 11:15:15 UTC

Comment on attachment 286787 [details] [review]
Changes the line ending of exported files

Bob, thank for splitting up the patches.

Just to be sure: isn't the code supposed to use \r\n on Windows and \n the other platforms ?

Or am I misreading your patch ?

Comment 10 Bob 2014-09-23 15:06:18 UTC

Geert,

According to the link above, csv files line endings should be \r\n, so building on windows changes \n to \r\n and on the other platforms I specify \r\n as the line ending. So the exported files should all end up with \r\n as line endings.

I have just checked this on my Windows XP and Linux machine VM's and the exported files both have the same endings.

Bob

Comment 11 Geert Janssens 2014-09-25 13:54:43 UTC

Ok that makes sense. Thanks.

Comment 12 Geert Janssens 2014-09-25 14:10:30 UTC

Comment on attachment 286786 [details] [review]
Fixes quoting of quotes, newline and separator

Bob, another question: what is the _("#eol") field you have added in the tree export file ? For what reason did you add this ?

Comment 13 Bob 2014-09-25 16:10:45 UTC

This was so that I could be sure to find the end of a line read is truly the end of the line and not just a newline in say a notes field, the only way I could think of doing it. This can be seen in a row example below which gets parsed by the regular expression...

BANK,Assets:Current Assets:Checking Account,Checking Account,123456,Checking Account,#3b8aba151476,"line 1
line, 2
line 3",GBP,CURRENCY,T,F,F,#eol

Comment 14 Geert Janssens 2014-09-26 15:01:15 UTC

Thanks for your feedback. I understand this choice better now.

At the same time it challenged me to try and tweak the regex enough so the extra field could be eliminated. I think I managed to do so.

I have committed a follow-up with my result. The trick really is to treat newlines just like any other character and feed the whole file as if it were one line to the regex parser. Then you can let the regex parser filter on newlines just like any other character and have it find match after match in the one big string.

The drawback of my change is that the whole file is read into memory at once. So it consumes a lot more memory than your solution. We'll have to see if there are people that will hit memory limits due to this. I don't expect so though. Account lists are usually not that huge.

Thanks for your work and for the extra challenge :)

Comment 15 Geert Janssens 2014-09-26 15:32:21 UTC

Bob, there was one detail in your regular expressions for which I failed to find the reason:

g_string_assign (info->regexp, "^(?<type>[^;]*);\
	?(?<full_name>\"(?:[^\"]|\"\")*\"|[^;]*);\
	?(?<name>\"(?:[^\"]|\"\")*\"|[^;]*);\
	?(?<code>\"(?:[^\"]|\"\")*\"|[^;]*);\
	?(?<description>\"(?:[^\"]|\"\")*\"|[^;]*);\
	?(?<color>[^;]*);\
	?(?<notes>\"(?:[^\"]|\"\")*\"|[^;]*);\
	?(?<commoditym>\"(?:[^\"]|\"\")*\"|[^;]*);\
	?(?<commodityn>\"(?:[^\"]|\"\")*\"|[^;]*);\
	?(?<hidden>[^;]*);?(?<tax>[^;]*);\
	?(?<place_holder>[^;]*);(?<endofline>[^;]*)$");

(I took your regex with semi_colon separators as example)
After each semicolon you have added a question mark. What purpose do they serve ?

Comment 16 Geert Janssens 2014-09-26 15:33:36 UTC

Comment on attachment 286790 [details] [review]
Change some CSV strings

String changes committed to master, thank you very much!

Comment 17 Bob 2014-09-27 13:15:35 UTC

Geert, I have been unable to find the reference I was using to change the regular expression, so the short answer is I do not know. Seems my changes are causing you additional work, maybe I should stick to one line fixes.

Bob

Comment 18 Geert Janssens 2014-09-27 14:06:43 UTC

(In reply to comment #17)
> Geert, I have been unable to find the reference I was using to change the
> regular expression, so the short answer is I do not know.
That's ok. I thought it may have been an attempt to make fields optional, but I couldn't think of any way that could have worked. That's why I preferred to ask. The expression as it is now seems to work for me in various tests. So I'll leave it at that for now.

> Seems my changes are
> causing you additional work, maybe I should stick to one line fixes.

On the contrary! If you hadn't submitted your patches I would not have seen the opportunity for improvement. I prefer to see it as building on each other's work. Some changes are complicated and working on them together gets us further than each working alone.

And I actually learn a lot by reading your code. That saves me tons of time digging through source code and references manuals. I hope in the same time you and others can learn from the improvements I propose or apply in response you your contributions.

Anyway I consider this bug sufficiently fixed. Thanks again for your contributions.

Comment 19 John Ralls 2018-06-29 23:30:36 UTC

GnuCash bug tracking has moved to a new Bugzilla host. This bug has been copied to https://bugs.gnucash.org/show_bug.cgi?id=729476. Please update any external references or bookmarks.