GNOME Bugzilla – Bug 729476
CSV Export/import can't deal with quote Char inside fields
Last modified: 2018-06-29 23:30:36 UTC
It seems that the following definition for CSV files is not yet implemented, from http://en.wikipedia.org/wiki/Comma-separated_values#Toward_standardization:

"A (double) quote character in a field must be represented by two (double) quote characters."

Follow these steps to reproduce it:
1. Create a single account inside a ".gnucash" file, with text containing the CSV quote character (double quote == ASCII 0x22) in the notes field, e.g. say "Hello World"
2. Use "export account tree to CSV ..."; separator: comma; enter filename "test.csv"
3. Delete the account
4. Use "import accounts from CSV ..."; choose file "test.csv"; separator "comma separated with quotes" & number of rows for the header: "1"

This won't import your account. The reasons are:
- Export hasn't converted the note from say "Hello World" to say ""Hello World""
- If you do this conversion manually, import can't deal with the result.
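The quoting rule cited in the report can be sketched in a few lines of Python. This is only an illustration of the rule, not GnuCash code: the helper name `quote_csv_field` is hypothetical, and wrapping a field in quotes whenever it contains the separator, a quote or a line break is the usual CSV convention rather than something the report itself spells out.

```python
import csv
import io


def quote_csv_field(value, sep=","):
    """Quote one field for CSV output (hypothetical helper, not GnuCash code)."""
    needs_quotes = any(ch in value for ch in (sep, '"', "\n", "\r"))
    if not needs_quotes:
        return value
    # Double every embedded quote, then wrap the whole field in quotes.
    return '"' + value.replace('"', '""') + '"'


note = 'say "Hello World"'
quoted = quote_csv_field(note)
print(quoted)

# Cross-check with Python's own CSV reader, which undoes the doubling.
row = next(csv.reader(io.StringIO("Checking," + quoted + "\r\n")))
print(row)
```

A reader that follows the same rule should hand back the original note unchanged, which is what the cross-check with the standard `csv` module is meant to show.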
This bug affects both accounts and transactions. Double quotes are handled in neither export nor import. Although the export problem can easily be solved with a small patch, import is not so easy. Import of accounts is done through a regex which cannot handle doubled double quotes. Import of transactions is NOT done through a regex: the preview handles doubled double quotes correctly, but the actual import fails.
Created attachment 286308 Patch some csv bugs

This patch should bring the CSV account export and import closer to the specification linked above. It should take care of quoting quotes, newlines and the separator character. Also in the patch are the ability to skip lines for transactions and a fix for the number of rows imported, which was restricted to 1000 by the spin button.

Regards, Bob
Comment on attachment 286308 Patch some csv bugs

Hi Bob, welcome back and nice work. The patch needs some cleaning up though before it can be committed.

1. The single patch implements multiple features:
   - proper quote handling around quotes and newlines
   - the ability to skip lines in imports
   Those two should be implemented in separate commits. All in one commit makes it very hard to follow what change was made for which feature. There are also undocumented improvements apparently: you distinguish between '\n' and '\r\n' as line separator based on operating system. That also would be better if it was done in a separate commit.
2. At the same time you are improving strings. This is great, but should again go into a separate commit. The reason here is that string changes will go to master while your bugfixes will be applied to both maint *and* master.
3. The glade files have lots of unnecessary changes. I know this is due to how the glade application works. It's notoriously poor at making version-control-friendly changes. Can you manually go through the glade files please and only pick the real changes? The "git add -i" command can be tremendously useful for such manual change picking.
4. I also find several white space improvements like changing gfree( xyz ); into gfree (xyz); and variations thereof. Very good, that makes the code more readable. However, I guess the theme should be apparent by now: please do this in a separate commit.

The main purpose is to make the history easily readable for other developers now and in the future. Git comes with several tools to help you clean up your own history before merging it into the upstream branches. I already mentioned "git add -i". Similarly there is also "git rebase -i". If you need help getting started with cleaning up your commits, feel free to ping me on irc or on the mailing list. I'll be happy to help you out.

Also if you don't feel like uploading several patches to bugzilla, you can push your local git branch to a personal repository on github and generate a pull request there. Both ways are fine. Thanks.
I thought I would not get away with this. I will split the patch as requested.
Created attachment 286786 Fixes quoting of quotes, newline and separator

Patches on bugs 731519 and 726888 may need to be applied first for these patches to apply cleanly. This patch should bring the CSV account export and import closer to the specification linked above. It should take care of quoting quotes, newlines and the separator character.
Created attachment 286787 Changes the line ending of exported files

This patch makes exported CSV files use \r\n line endings, in line with the specification linked above. The line ending written by the code depends on the platform.
Created attachment 286788 Changes the white space in CSV source files so it is more consistent

This patch changes the white space in the CSV source files so it is more consistent and readable.
Created attachment 286790 Change some CSV strings

This patch changes some strings in the CSV import and export files to make them more readable.
Comment on attachment 286787 Changes the line ending of exported files

Bob, thanks for splitting up the patches. Just to be sure: isn't the code supposed to use \r\n on Windows and \n on the other platforms? Or am I misreading your patch?
Geert,

According to the link above, CSV file line endings should be \r\n. A Windows build already changes \n to \r\n on output, and on the other platforms I specify \r\n as the line ending explicitly. So the exported files should all end up with \r\n as line endings. I have just checked this on my Windows XP and Linux VMs, and the exported files have the same endings on both.

Bob
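The same idea can be illustrated outside the GnuCash code base. The following Python sketch (the function name `write_csv_lines` and the sample rows are made up for illustration) writes \r\n explicitly and disables the runtime's own newline translation, so the bytes on disk are identical on every platform:

```python
import os
import tempfile


def write_csv_lines(path, lines):
    # newline="" switches off Python's "\n" -> os.linesep translation,
    # so the "\r\n" written below reaches the file unchanged everywhere.
    with open(path, "w", newline="", encoding="utf-8") as handle:
        for line in lines:
            handle.write(line + "\r\n")


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "test.csv")
    write_csv_lines(path, ["type,name", "BANK,Checking Account"])
    # Read back in binary mode to see the exact bytes that were written.
    with open(path, "rb") as handle:
        raw = handle.read()

print(raw)
```

Without `newline=""`, a text-mode write on Windows would turn the explicit "\r\n" into "\r\r\n", which is why the translation has to be handled in exactly one place.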
Ok that makes sense. Thanks.
Comment on attachment 286786 Fixes quoting of quotes, newline and separator

Bob, another question: what is the _("#eol") field you have added in the tree export file? For what reason did you add this?
This was so that I could be sure that the end of a line read is truly the end of the record and not just a newline in, say, a notes field. It was the only way I could think of doing it. This can be seen in the example row below, which gets parsed by the regular expression:

BANK,Assets:Current Assets:Checking Account,Checking Account,123456,Checking Account,#3b8aba151476,"line 1 line, 2 line 3",GBP,CURRENCY,T,F,F,#eol
Thanks for your feedback. I understand this choice better now. At the same time it challenged me to try and tweak the regex enough so the extra field could be eliminated. I think I managed to do so. I have committed a follow-up with my result.

The trick really is to treat newlines just like any other character and feed the whole file as if it were one line to the regex parser. Then you can let the regex parser filter on newlines just like any other character and have it find match after match in the one big string.

The drawback of my change is that the whole file is read into memory at once. So it consumes a lot more memory than your solution. We'll have to see if there are people that will hit memory limits due to this. I don't expect so though. Account lists are usually not that huge.

Thanks for your work and for the extra challenge :)
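The approach described here, matching record after record in one big string, can be sketched in Python as follows. This is a reduced illustration under stated assumptions, not the committed GnuCash expression: it uses only four made-up columns (`type`, `name`, `notes`, `hidden`), a comma separator, and hypothetical helper names `unquote` and `parse_accounts`.

```python
import re

# A field is either quoted (with "" standing for one literal quote, and
# newlines allowed inside) or a run of characters up to the next separator.
FIELD = r'"(?:[^"]|"")*"|[^,\r\n]*'
RECORD = re.compile(
    rf'(?P<type>{FIELD}),(?P<name>{FIELD}),'
    rf'(?P<notes>{FIELD}),(?P<hidden>{FIELD})(?:\r?\n|\Z)'
)


def unquote(field):
    """Strip surrounding quotes and collapse doubled quotes."""
    if len(field) >= 2 and field[0] == '"' and field[-1] == '"':
        return field[1:-1].replace('""', '"')
    return field


def parse_accounts(text, header_rows=1):
    """Scan the whole text at once and return one dict per record."""
    records = [
        {key: unquote(value) for key, value in match.groupdict().items()}
        for match in RECORD.finditer(text)
    ]
    return records[header_rows:]


text = (
    'type,name,notes,hidden\r\n'
    'BANK,Checking Account,"line 1\nline, 2\nsay ""hi""",F\r\n'
    'CASH,Wallet,,T\r\n'
)
accounts = parse_accounts(text)
for account in accounts:
    print(account)
```

Because a newline inside a quoted field is consumed by the quoted alternative, only a newline that follows the last field ends a record, so no end-of-line marker field is needed. As noted above, the price is that the whole text has to be in memory before matching starts.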
Bob, there was one detail in your regular expressions for which I failed to find the reason:

    g_string_assign (info->regexp,
        "^(?<type>[^;]*);\ ?"
        "(?<full_name>\"(?:[^\"]|\"\")*\"|[^;]*);\ ?"
        "(?<name>\"(?:[^\"]|\"\")*\"|[^;]*);\ ?"
        "(?<code>\"(?:[^\"]|\"\")*\"|[^;]*);\ ?"
        "(?<description>\"(?:[^\"]|\"\")*\"|[^;]*);\ ?"
        "(?<color>[^;]*);\ ?"
        "(?<notes>\"(?:[^\"]|\"\")*\"|[^;]*);\ ?"
        "(?<commoditym>\"(?:[^\"]|\"\")*\"|[^;]*);\ ?"
        "(?<commodityn>\"(?:[^\"]|\"\")*\"|[^;]*);\ ?"
        "(?<hidden>[^;]*);?"
        "(?<tax>[^;]*);\ ?"
        "(?<place_holder>[^;]*);"
        "(?<endofline>[^;]*)$");

(I took your regex with semicolon separators as an example.) After each semicolon you have added a question mark. What purpose do they serve?
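For readers following along: in Perl-compatible regex syntax, an escaped space followed by a question mark reads as "an optional space". Whether that was the intent of the patch is not stated in this thread, so the following Python sketch only shows what the construct does in a two-field pattern made up for illustration:

```python
import re

# Two fields separated by ";", with the same "\ ?" after the separator.
pattern = re.compile(r'^(?P<type>[^;]*);\ ?(?P<name>[^;]*)$')

with_space = pattern.match("BANK; Checking")
without_space = pattern.match("BANK;Checking")

# Both inputs yield the same field value; the space after ";" is absorbed.
print(with_space.group("name"), without_space.group("name"))
```

Under this reading, the question mark does not make a field optional. It only tolerates one space between a separator and the next field.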
Comment on attachment 286790 [details] [review] Change some CSV strings String changes committed to master, thank you very much!
Geert,

I have been unable to find the reference I was using to change the regular expression, so the short answer is I do not know. It seems my changes are causing you additional work; maybe I should stick to one-line fixes.

Bob
(In reply to comment #17)
> Geert, I have been unable to find the reference I was using to change the
> regular expression, so the short answer is I do not know.

That's ok. I thought it may have been an attempt to make fields optional, but I couldn't think of any way that could have worked. That's why I preferred to ask. The expression as it is now seems to work for me in various tests. So I'll leave it at that for now.

> Seems my changes are
> causing you additional work, maybe I should stick to one line fixes.

On the contrary! If you hadn't submitted your patches I would not have seen the opportunity for improvement. I prefer to see it as building on each other's work. Some changes are complicated and working on them together gets us further than each working alone. And I actually learn a lot by reading your code. That saves me tons of time digging through source code and reference manuals. I hope that at the same time you and others can learn from the improvements I propose or apply in response to your contributions.

Anyway, I consider this bug sufficiently fixed. Thanks again for your contributions.
GnuCash bug tracking has moved to a new Bugzilla host. This bug has been copied to https://bugs.gnucash.org/show_bug.cgi?id=729476. Please update any external references or bookmarks.