After an evaluation, GNOME has moved from Bugzilla to GitLab. Learn more about GitLab.
No new issues can be reported in GNOME Bugzilla anymore.
To report an issue in a GNOME project, go to GNOME GitLab.
Do not go to GNOME Gitlab for: Bluefish, Doxygen, GnuCash, GStreamer, java-gnome, LDTP, NetworkManager, Tomboy.
Bug 658738 - [PATCH] fixes for several problems in bill/invoice import
[PATCH] fixes for several problems in bill/invoice import
Status: RESOLVED FIXED
Product: GnuCash
Classification: Other
Component: Business
git-master
Other Linux
: Normal normal
: ---
Assigned To: Derek Atkins
Christian Stimming
Depends on:
Blocks:
 
 
Reported: 2011-09-11 10:57 UTC by Rafał Krzewski
Modified: 2018-06-29 23:00 UTC
See Also:
GNOME target: ---
GNOME version: ---


Attachments
Use billing_id consistently (fixed billing_id/billingid/biing_id mixup) (10.05 KB, patch)
2011-09-11 10:58 UTC, Rafał Krzewski
committed Details | Review
Pass document type to gnc_bi_import_fix_bis function and use it to perform correct document owner check: vendor for bill, customer for invoice. (4.95 KB, patch)
2011-09-11 10:58 UTC, Rafał Krzewski
committed Details | Review
Inverted conditions for gnc_search_bill_on_id / gnc_search_invoice_on_id. g_ascii_strcasecmp returns 0 when arguments are equal. (1.38 KB, patch)
2011-09-11 10:59 UTC, Rafał Krzewski
committed Details | Review
Check correct field when testing if date_opened was set for a document. (1.18 KB, patch)
2011-09-11 11:00 UTC, Rafał Krzewski
committed Details | Review
Open invoice in a new tab only if it has not been posted yet. (1.34 KB, patch)
2011-09-11 11:01 UTC, Rafał Krzewski
committed Details | Review
Changed 3rd import regexp to actually accept quotes on all fields, and 4rd to accept commas and quotes, as advertised in the UI. (3.35 KB, patch)
2011-09-11 11:02 UTC, Rafał Krzewski
committed Details | Review
minor formating fix (1.70 KB, patch)
2011-09-11 11:03 UTC, Rafał Krzewski
committed Details | Review
Improved regexps for CSV files with quotes - quotes are allowed but not required on all fields. (5.15 KB, patch)
2011-09-14 23:26 UTC, Rafał Krzewski
committed Details | Review
Non-quoted regexp require complete match. <id> characters restriction removed. Default regexp was brought in line with GUI option #1. (4.38 KB, patch)
2011-09-14 23:28 UTC, Rafał Krzewski
committed Details | Review
A test file for checking regexps, data is semantically invalid but illustrates various separator / quote combinations. (437 bytes, text/csv)
2011-09-14 23:31 UTC, Rafał Krzewski
  Details
A test file illustrating OOo Calc CSV quoting convention. (351 bytes, text/csv)
2011-09-14 23:33 UTC, Rafał Krzewski
  Details
A test file illustrating MS Excel CSV quoting convention. (300 bytes, text/csv)
2011-09-14 23:33 UTC, Rafał Krzewski
  Details
Tweaked alignment of import type radio button group for better consistency and dialog resize behavior. (1.33 KB, patch)
2011-09-15 23:00 UTC, Rafał Krzewski
committed Details | Review
Opening new bills / invoices made configurable from UI. (10.27 KB, patch)
2011-09-15 23:03 UTC, Rafał Krzewski
none Details | Review

Description Rafał Krzewski 2011-09-11 10:57:08 UTC
[PATCH] fixes for several problems in bill/invoice import

I am trying to import a few years of history of my small company into GnuCash. It proves to be quite a challenging task :) Among other things, I need to import several hundred invoices and bills.

I am very glad that bill/invoice import plugin is available. I was able to hack together a tool that inserts/updates customers and vendors data straight into GnuCash XML file (updating TaxTable refcounts and so on) but importing bills and invoices is a lot more complex than this.

Unfortunately I had no luck importing any invoices and bills import had several quirks, so soon enough I found myself checking out GnuCash sources and investigating the problem. I haven't been using C in over a decade, but fortunately the code was straightforward enough for me to follow and tweak.

After getting the import plugin in workable state, I've promptly run into bugs 612957 and 502853, but before I get busy hacking around those (fixing them in a way that will work for everyone under any legislation is way beyond my ambitions here), I'd like to share my fixes for the import plugin.

My initial contribution to this bug report is 7 patches, generated with git format-patch against git trunk, commit 85d9ee123d2025fc6ca0988ff572b12b7b88fc0b (September 1st). Here's what they do:

0001-Use-billing_id-consistently-fixed-billing_id-billing.patch

There is an inconsistency about naming billing_id field. bi_import/gui.c is using billingid (and UI label billing_id, which is incorrect because the single underscore causes i in id to be underlined), whereas bi_import.c is using biing_id - rather self-evident typo.

This can be worked around in the unpatched code, by using a custom regexp with (?<biing_id>[^;]*) field declaration, otherwise billing_id column is not imported.

0002-Pass-document-type-to-gnc_bi_import_fix_bis-function.patch

This is a fix for a bug that turned to by a showstopper for my invoice import attempts. In the current code, no matter the type of documents being imported, owner_id column is checked for existence of corresponding vendor. This can be worked around by adding each customer as a vendor also, but that was not something that I wanted to do.

The patch solves this by passing document type argument to gnc_bi_import_fix_bis, and checking existence of vendor or customer as necessary. I've also fixed some vendor/customer confusion in the comments.

I've touched error messages, but unfortunately I have zero experience with gettext, so I had no idea how to update relevant i18n files. This might need further work.  

0003-Inverted-conditions-for-gnc_search_bill_on_id.patch

This patch fixes a problem with document update attempts. The current code is looking for an existing invoice with a given id when a bill is imported and vice versa. At best, it means that document update functionality will never work, at worst it means that the code will attempt to change existing invoice into a bill etc. 

0004-Check-correct-field-when-testing-if-date_opened-was-.patch

This patch fixes a problem with document opening date never being imported. The existing code is looking at the variable type (ie the document type), which is never "", and also something's wrong with the conditions logic, because contrary to my intuition it appears to be alwasys false.

Without pondering on it too much, I've change it to use strlen() on date_opened variable. Afterwards date_opened field was imported correctly.

0005-Open-invoice-in-a-new-tab-only-if-it-has-not-been-po.patch

The current code opens each imported document in a new tab. I was trying to import hundreds of already posted documents, so I didn't need to have them opened and secondly I suspect that it might have crashed GnuCash thorough resource exhaustion.

This patch adds a condition - a bill/invoice is opened only if it has not been posted. I think it's quite sensible default behavior, but adding an UI control on the import dialog is definitely a correct way to go about that.

0006-Changed-3rd-import-regexp-to-actually-accept-quotes-.patch

The data I was trying to import contained both ; and , characters in description fields, so I had to use quotes in the CSV file. I was quite puzzled to find out the the predefined regexps did not accept quotes at all! The "comma separated with quotes" regexp did not accept commas neither, for that matter.

I've updated the regexps to accept/require quotes on all fields. One could argue that quotes on numeric fields are not necessary, and even on the string fields they could be made optional, so this bit definitely can be improved further.   

The patch depends on patch 0001 above, as it touches the same lines in bi_import/gui.c

0007-minor-formating-fix.patch

This is amendment to patch 0002 - I've noticed that I got the indentation wrong in two places while verifying my changes.

Cheers,
Rafał
Comment 1 Rafał Krzewski 2011-09-11 10:58:07 UTC
Created attachment 196199 [details] [review]
Use billing_id consistently (fixed billing_id/billingid/biing_id mixup)
Comment 2 Rafał Krzewski 2011-09-11 10:58:51 UTC
Created attachment 196200 [details] [review]
Pass document type to gnc_bi_import_fix_bis function and use it to perform correct document owner check: vendor for bill, customer for invoice.
Comment 3 Rafał Krzewski 2011-09-11 10:59:27 UTC
Created attachment 196201 [details] [review]
Inverted conditions for gnc_search_bill_on_id / gnc_search_invoice_on_id. g_ascii_strcasecmp returns 0 when arguments are equal.
Comment 4 Rafał Krzewski 2011-09-11 11:00:24 UTC
Created attachment 196202 [details] [review]
Check correct field when testing if date_opened was set for a document.
Comment 5 Rafał Krzewski 2011-09-11 11:01:56 UTC
Created attachment 196203 [details] [review]
Open invoice in a new tab only if it has not been posted yet.
Comment 6 Rafał Krzewski 2011-09-11 11:02:24 UTC
Created attachment 196204 [details] [review]
Changed 3rd import regexp to actually accept quotes on all fields, and 4rd to accept commas and quotes, as advertised in the UI.
Comment 7 Rafał Krzewski 2011-09-11 11:03:01 UTC
Created attachment 196205 [details] [review]
minor formating fix
Comment 8 Geert Janssens 2011-09-11 14:24:28 UTC
Comment on attachment 196199 [details] [review]
Use billing_id consistently (fixed billing_id/billingid/biing_id mixup)

Committed as r21247. Thank you very much !
Comment 9 Geert Janssens 2011-09-11 14:24:43 UTC
Comment on attachment 196200 [details] [review]
Pass document type to gnc_bi_import_fix_bis function and use it to perform correct document owner check: vendor for bill, customer for invoice. 

Committed as r21248. Thank you very much !
Comment 10 Geert Janssens 2011-09-11 14:25:00 UTC
Comment on attachment 196201 [details] [review]
Inverted conditions for gnc_search_bill_on_id / gnc_search_invoice_on_id. g_ascii_strcasecmp returns 0 when arguments are equal.

Committed as r21249. Thank you very much !
Comment 11 Geert Janssens 2011-09-11 14:25:33 UTC
Comment on attachment 196202 [details] [review]
Check correct field when testing if date_opened was set for a document.

Committed as r21250. Thank you very much !
Comment 12 Geert Janssens 2011-09-11 14:25:52 UTC
Comment on attachment 196203 [details] [review]
Open invoice in a new tab only if it has not been posted yet.

Committed as r21251. Thank you very much !
Comment 13 Geert Janssens 2011-09-11 14:26:16 UTC
Comment on attachment 196204 [details] [review]
Changed 3rd import regexp to actually accept quotes on all fields, and 4rd to accept commas and quotes, as advertised in the UI.

Committed as r21252. Thank you very much !
Comment 14 Geert Janssens 2011-09-11 14:26:32 UTC
Comment on attachment 196205 [details] [review]
minor formating fix

Committed as r21253. Thank you very much !
Comment 15 Geert Janssens 2011-09-11 14:39:58 UTC
Some remarks on your original message:

> 0002-Pass-document-type-to-gnc_bi_import_fix_bis-function.patch
> ...
> I've touched error messages, but unfortunately I have zero experience with
> gettext, so I had no idea how to update relevant i18n files. This might need
> further work.  
> 
The strings you changed will appear in the i18n files the next time they are regenerated. But as the development tree is not being translated until we are coming closer to a future release this doesn't matter right now.

> 0005-Open-invoice-in-a-new-tab-only-if-it-has-not-been-po.patch
> ...
> This patch adds a condition - a bill/invoice is opened only if it has not been
> posted. I think it's quite sensible default behavior, but adding an UI control
> on the import dialog is definitely a correct way to go about that.
> 
True, an option would be nicer. There seemed to be a comment in that direction in the code you changed. Perhaps you should add a similar comment again to remember, unless you intend to provide a patch for such an option also ?

> 0006-Changed-3rd-import-regexp-to-actually-accept-quotes-.patch
> ...
> I've updated the regexps to accept/require quotes on all fields. One could
> argue that quotes on numeric fields are not necessary, and even on the string
> fields they could be made optional, so this bit definitely can be improved
> further.   
I didn't really understand your first part here: are quotes now required on all fields ? Or only on some, but they are accepted on all ?
Comment 16 Rafał Krzewski 2011-09-11 22:21:39 UTC
(In reply to comment #15)

> > This patch adds a condition - a bill/invoice is opened only if it has not been
> > posted. I think it's quite sensible default behavior, but adding an UI control
> > on the import dialog is definitely a correct way to go about that.
> > 
> True, an option would be nicer. There seemed to be a comment in that direction
> in the code you changed. Perhaps you should add a similar comment again to
> remember, unless you intend to provide a patch for such an option also ?

I'll take look at glade and see hard would it be to add the option. I'll get back to you.
 
> > I've updated the regexps to accept/require quotes on all fields. One could
> > argue that quotes on numeric fields are not necessary, and even on the string
> > fields they could be made optional, so this bit definitely can be improved
> > further.   
> I didn't really understand your first part here: are quotes now required on all
> fields ? Or only on some, but they are accepted on all ?

The patched code requires quotes on all fields. That was the simplest thing that worked for my particular dataset, which I am exporting from a relational db. I have a feeling though that this is not what someone exporting CSV from MS Excel or OOo Calc would get.

I can take a swing at writing more flexible regexps, but they sure will be long and hairy... Again, I'll get back to you about this in a few days.

Cheers,
Rafał
Comment 17 Rafał Krzewski 2011-09-12 20:11:06 UTC
I tried improving the quote-supporting regexps, unfortunately without success.

I did a little experiment first, regarding the quoting behavior of MS Excel and OOo Calc. It turns out that Calc quotes everything that looks like a string, ie is not a number or a date. Excel, on the other hand is quoting only those fields that contain a field separator, therefore in a typical file most strings would not be quoted. Yet another CSV source that I have at hand is pgAdmin III. The SQL window has an option for exporting query results to a CSV file with the following quoting options: all fields, only strings, none. When I select 'only strings', date fields get quoted, probably because the actual value that goes to the file is the result of implicit to_char(date_value) function call.

The conclusion is this - to be safe, the import plugin should accept, but not require quotes on all fields.

Now the regexps. My naive attempt "((?<id>[^;]*)|\"(?<id>[^\"]*)\");..." failed miserably:
Error in regular expression ... at char 21: two named subpatterns have the same name

Maybe someone with deeper knowledge of glib regular expressions could do better. On the other hand, I think that regexps are not an adequate tool for handling more complex CSV quoting scenarios: quote / separator mixture, especially if you consider hand-crafted files with errors.
IMO a proper implementation of RFC 4180 requires a finite state machine. And such are most readily created using tools like Flex+Bison. However that's not something I'd volunteer to do.

On the second though, the solution that I've implemented in my patch does not look so bad. The required format is rigid, and therefore not so error prone. The downside is that the users will most likely need to their own text processing to quote all the fields.
Comment 18 Geert Janssens 2011-09-13 20:23:35 UTC
(In reply to comment #17)
> The conclusion is this - to be safe, the import plugin should accept, but not
> require quotes on all fields.
> 
Probably indeed.

> Now the regexps. My naive attempt "((?<id>[^;]*)|\"(?<id>[^\"]*)\");..." failed
> miserably:
> Error in regular expression ... at char 21: two named subpatterns have the same
> name
I think you need two modifications:
1. Add the compiler flag G_REGEX_DUPNAMES to eliminate the above errors. I believe this should be added in bi_import.c, line 117.
2. Guarantee that the two alternate subregexes are truly mutually exclusive, otherwise it's very hard to predict which match the regex will prefer.
That's not yet the case in your attempt. For example "Name";... (with quotes included) will be matched by both (?<id>[^;]*) and \"(?<id>[^\"]*)\"
Yet the resulting backreference is different: it's "Name" in the first case and Name (without the quotes) for the second subpattern.
So how about:
"((?<id>[^;\"]*)|\"(?<id>[^\"]*)\");..."
The first regex will not match if there's a quote anywhere in the field (separated by ; hence ; is a forbidden char as well). The second regex on the other hand will match anything between quotes in a field, if they are there.

Can you try this with your test files ? I don't have any test data for this part of GnuCash myself.
Comment 19 Rafał Krzewski 2011-09-14 23:25:27 UTC
Hello Geert, thanks for your suggestions. G_REGEX_DUPNAMES worked like a charm. 
You are also absolutely right about my expression being ambiguous. I've implemented your change I was able to parse the test files correctly, with both quoted and unqouted files. 

I've run into one more problem: The final sequence

"...;((?<accu_splits>[^;\"]*)|\"(?<accu_splits>[^\"]*)\")" 

matched only an unquoted string. I did a blind shot and added a $ at the very end, thus adding a requiring a complete match - rows with extra fields will be rejected. After this change, both quoted and unquoted string in the last field were picked up correctly. Frankly, I don't understand why it works this way...

To be consistent, I've also added the $ at the end of the previous two regexps. The User's Guide in section 17.1.1. states  "In order for the import to succeed the number of fields must be adhered to, so the trailing commas are important." so I guess this is OK.

I've also removed the allowed characters restriction for the id field. The old regexp had the following one: "^(?<id>[^!#+^,]*);" If I am reading this correctly using characters ! # + or ^ before the first separator would cause the whole line to be rejected. This was an undocumented feature. I understand that it might be convenient for skipping comment lines, or disabling specific lines, but since an import file is usualy used only once, one could just as well delete/filter out the unnecessary lines. Therefore I think that this restrinction provided little value with potentially confusing behaviour.

I've noticed one other thing. The default regexp, that is used when the user does not click any of the format radio-buttons in the GUI was NOT the same as regexp associated with the first, initially selected radio button! The difference was that id had a cascade of optional group clauses, like this:

(field;(field;(field...)?)?)?

at the end. Effectively, only first 11 fields, up to <price> were required, and the remaining 11 fields were optional - any number of fields in the range 11 .. 22 was accepted. Now, I see that it could make sense to ommit last 5 fields as a block - accept only 17 or 22 fields per line. But it was not implemented in any other regexp, and it explicity contradicts the Users's Guide section quoted above.

I'm attaching two patches and my set of test files. Let me know what you think.
Comment 20 Rafał Krzewski 2011-09-14 23:26:51 UTC
Created attachment 196565 [details] [review]
Improved regexps for CSV files with quotes - quotes are allowed but not required on all fields.
Comment 21 Rafał Krzewski 2011-09-14 23:28:43 UTC
Created attachment 196566 [details] [review]
Non-quoted regexp require complete match. <id> characters restriction removed. Default regexp was brought in line with GUI option #1.
Comment 22 Rafał Krzewski 2011-09-14 23:31:49 UTC
Created attachment 196567 [details]
A test file for checking regexps, data is semantically invalid but illustrates various separator / quote combinations.
Comment 23 Rafał Krzewski 2011-09-14 23:33:13 UTC
Created attachment 196568 [details]
A test file illustrating OOo Calc CSV quoting convention.
Comment 24 Rafał Krzewski 2011-09-14 23:33:36 UTC
Created attachment 196569 [details]
A test file illustrating MS Excel CSV quoting convention.
Comment 25 Rafał Krzewski 2011-09-15 23:00:14 UTC
Created attachment 196679 [details] [review]
Tweaked alignment of import type radio button group for better consistency and dialog resize behavior.

The second frame was missing expand=False pack property which caused it to occupy much more space than needed when the dialog was enlarged.
I also checked button alignment to vertical for consistency.
Comment 26 Rafał Krzewski 2011-09-15 23:03:37 UTC
Created attachment 196681 [details] [review]
Opening new bills / invoices made configurable from UI.

UI for selecting new bill / invoice opening mode was added, as discussed in the comments above.

That's all from me right now - I am satisfied with b/i import functionality.
Comment 27 Geert Janssens 2011-09-17 16:53:45 UTC
Comment on attachment 196565 [details] [review]
Improved regexps for CSV files with quotes - quotes are allowed but not required on all fields.

Committed as r21262. Thanks a lot !

Regarding the removal of the comment markers from the id column regex: I think you are right to remove it. Although I find it could be useful to have something in the regexes that strips comment lines, the way it was implemented would not do that at all. So it's better to remove it now. If someone has a need for it again later, he/she can request it, preferably with a patch.
Comment 28 Geert Janssens 2011-09-17 16:54:46 UTC
Comment on attachment 196566 [details] [review]
Non-quoted regexp require complete match. <id> characters restriction removed. Default regexp was brought in line with GUI option #1.

Committed as r21263, thanks a lot !
Comment 29 Geert Janssens 2011-09-17 17:20:05 UTC
Comment on attachment 196679 [details] [review]
Tweaked alignment of import type radio button group for better consistency and dialog resize behavior.

Committed as r21264, thanks a lot !
Comment 30 Geert Janssens 2011-09-17 17:20:29 UTC
Comment on attachment 196681 [details] [review]
Opening new bills / invoices made configurable from UI.

Committed as r21265, thanks a lot !
Comment 31 Geert Janssens 2011-09-17 17:26:51 UTC
Thanks for all your fixes.

I forgot once more to attribute you while committing the first 2 patches. That's because it's so incredibly easy to apply your patches directly with git am and not changing the comments...

Normally in a git environment this would be no problem as there is a data field for author and committer. But all our commits pass via svn, which doesn't have the concept of an author so that information gets lost.

So if you think about it, can you add a small clause to future patches like
"Patch by Rafał Krzewski" ? Me from my side will try to think about checking this before applying the patches as well.
Comment 32 Rafał Krzewski 2011-09-19 21:45:05 UTC
Thanks for handling my patches quickly, and for your constructive feedback!

Re attribution notices in commit messages - I'll make sure to do that, but it's probably worth to mention it on http://wiki.gnucash.org/wiki/Git
Comment 33 Geert Janssens 2011-09-19 22:09:19 UTC
(In reply to comment #32)
> Re attribution notices in commit messages - I'll make sure to do that, but it's
> probably worth to mention it on http://wiki.gnucash.org/wiki/Git

Good idea. If you like you can add something to that page yourself (after registration). It's a wiki, intended to be edited by anyone :)
Comment 34 John Ralls 2018-06-29 23:00:57 UTC
GnuCash bug tracking has moved to a new Bugzilla host. This bug has been copied to https://bugs.gnucash.org/show_bug.cgi?id=658738. Please update any external references or bookmarks.