After an evaluation, GNOME has moved from Bugzilla to GitLab. Learn more about GitLab.
No new issues can be reported in GNOME Bugzilla anymore.
To report an issue in a GNOME project, go to GNOME GitLab.
Do not go to GNOME Gitlab for: Bluefish, Doxygen, GnuCash, GStreamer, java-gnome, LDTP, NetworkManager, Tomboy.
Bug 336211 - QIF Import generates spurious duplicate transactions
QIF Import generates spurious duplicate transactions
Status: VERIFIED FIXED
Product: GnuCash
Classification: Other
Component: Import - QIF
unspecified
Other All
: Normal enhancement
: ---
Assigned To: Derek Atkins
Derek Atkins
Depends on:
Blocks: backport
 
 
Reported: 2006-03-27 16:39 UTC by Mark Johnson
Modified: 2018-06-29 21:00 UTC
See Also:
GNOME target: ---
GNOME version: ---


Attachments
QIF file that reproduces problem (1.82 KB, text/plain)
2006-07-26 03:46 UTC, Mark Johnson
  Details
Proposed patch (13.93 KB, patch)
2008-01-11 04:00 UTC, Charles Day
committed Details | Review
Friendlier diff for patch (-w to ignore whitespace) (9.96 KB, text/plain)
2008-01-11 04:15 UTC, Charles Day
  Details

Description Mark Johnson 2006-03-27 16:39:24 UTC
Please describe the problem:
When importing a single QIF file, many "duplicate" transactions are found.  I 
have checked the QIF file, and found that at least the first candidate 
duplicate, was not, in fact, a duplicate. 
 
Since the single QIF file does not duplicate transactions, the QIF import druid 
should not find any duplicates. 

Steps to reproduce:
1. Import a single QIF 
2. Observe that many duplicates are found. 
3.  
 

Actual results:
Many duplicates are found from a single QIF. 

Expected results:
One does not expect any duplicates to be found. 

Does this happen every time?
Yes. 

Other information:
SVN revision is 13702.
Comment 1 Derek Atkins 2006-03-27 16:46:37 UTC
Unlike OFX, QIF transactions don't have any transaction ID.  Duplicate checking is performed via heuristics.  Does this happen with ANY qif file, or only a particular qif file?
Comment 2 Mark Johnson 2006-03-30 14:37:01 UTC
Still, given a SINGLE qif file being imported, no duplicates should be found.   The file does not start out containing duplicates.  These duplicates are spuriously generated by gnucash.  I have tried it with another QIF file and found gnucash asking me to check huge numbers of duplicates.  So that is at least two QIF files.

I wondered if it could be related to the size of these QIF files.  Both are for frequently used accounts with 7 years of data.  I tried another QIF file with less data.  It still found spurious duplicates.

At the very least, when importing multiple QIFs, if the date, description, and amounts are the same, it is very unlikely that two transactions are not one and the same.
Comment 3 Josh Sled 2006-03-30 14:57:05 UTC
I agree, though I'd raise the identical-by-value threshold a bit to include some non-"empty" description.   And it might just mark them as duplicates automagically -- and perhaps seperate them out from the normal duplicate matching -- in case that heuristic is wrong.  But it's certainly an enhancement at least.
Comment 4 Derek Atkins 2006-03-30 15:34:21 UTC
Quoting Mark in Comment #2:

  Still, given a SINGLE qif file being imported, no duplicates should be found.

That's just not true.  If there are transactions already in the destination account, the importer can find duplicates there.  The duplicate checking is not checking within the qif file -- it's comparing the qif to-be-imported transactions to the list of transactions already in the gnucash account.

a single qif file importing into an empty account should never find any duplicates.  But a single qif file importing into a non-empty account may certainly find duplicates.
Comment 5 Mark Johnson 2006-03-30 19:36:48 UTC
My apologies.  My comment was incomplete.  

I started gnucash with the --nofile option and then imported the single QIF file.  This is the situation in Derek's last paragraph in Comment #4.  I encountered a large number of spurious duplicates in doing this import.
Comment 6 Mark Johnson 2006-07-14 02:32:04 UTC
This continues to be a problem in 2.0.0.  I will try to come up with a reasonably small QIF that reproduces the problem within a week or so.
Comment 7 Mark Johnson 2006-07-26 03:46:17 UTC
Created attachment 69628 [details]
QIF file that reproduces problem

start gnucash with --nofile option.
select File->import->Import QIF
select attached file.
Keep clicking on next in the druid until you come to finding duplicates.  Each transaction has at least one duplicate listed - itself!
Comment 8 Phil Richards 2006-10-21 14:35:48 UTC
Yep, same problem for me.
A thousand or so duplicates for transactions whose only duplicate is identical (in all respects) to the original.  And for which there isn't a second entry in the QIF file.
Comment 9 Charles Day 2008-01-07 22:52:46 UTC
This problem still exists in 2.2.2.  If I start with --nofile, then import a QIF, it finds many duplicate transactions that look exactly like what is being imported. This would seem to me to be nonsense, as how can there be duplicates if I started with --nofile?

From a cursory examination of the code, I believe that the "check duplicates" process is supposed to be checking the new transactions (those being imported) against the old (those existing prior to import). Instead, I think new transactions are being compared to a combined set of both new AND old.

This theory was supported by a few debugging statements that I inserted into the code. I started GnuCash with --nofile and imported a QIF. At the beginning of duplicate checking, the debugging code showed me how many new transactions existed (78) and how many old transactions existed (the same: 78!).
Comment 10 Derek Atkins 2008-01-08 12:58:19 UTC
If you go back in time, I suspect this was introduced when the query code was changed.  It used to use xaccQuerySetAcountGroup() but that was changed at the time because we couldn't (any longer) support that.  Now, however, with the advent of Root accounts, it's possible that we might be able to re-implement that which should solve this problem.
Comment 11 Charles Day 2008-01-08 22:24:29 UTC
Yeah I think you're on to something there... I briefly looked at some of the Account code in C but got interrupted. I found some documentation on the Account and AccountGroup API's but it is (as you suggest) out of date as there was no mention of collections that I could find. I don't mind trying to debug this but the relationships between books, collections, accounts, and root accounts aren't clear to me yet. Is there any relatively up-to-date place to read about this stuff other than the C code itself?
Comment 12 Derek Atkins 2008-01-08 22:29:33 UTC
http://cvs.gnucash.org/docs/HEAD/
Comment 13 Charles Day 2008-01-09 01:19:31 UTC
That was the first place I looked, and the only information I could find about the book and account API's was a link to the out-of-date documentation here: http://code.neil.williamsleesmill.me.uk/texi/gnucash-design.html

That document is still written in terms of account groups. If you know of any more recent stuff, let me know, thanks.
Comment 14 Derek Atkins 2008-01-09 01:28:30 UTC
An "Account Group" is just a set of peer accounts.  It used to be:

AccountGroup
  |
  +- Account / AccountGroup
  |                 |
  |                 +- Account / Account Group
  |                 ...                +- ...
  |
  +- Account / AccountGroup
  ...

Now, the AccountGroup has been collapsed back into the Account, and there's a new "Root Account" that's the top of the tree.
Comment 15 Charles Day 2008-01-09 12:21:03 UTC
I've found the bug. One critical piece of missing information is that for this bug to occur, you must be importing to a brand new file that has NO ACCOUNTS prior to import. If you set up any accounts beforehand, such as through the druid, the import completes normally.

The bug is at line 65 of qif-merge-groups.scm because the call to xaccQueryAddAccountMatch can be passed an empty list of accounts. Therefore the call will fail silently and the query will have no account criteria, causing imported transactions to be able to match themselves.

It occurred to me that if there are no transactions prior to import then there is no need to perform duplicate checking. What is a cheap way in Scheme to determine that an Account (or any of its subaccounts) has at least one split? Build a query? Use gnc_account_foreach_child_until?
Comment 16 Derek Atkins 2008-01-09 14:09:22 UTC
Does it matter if there are transactions at all?  Or only if there are no accounts?  You should be able to tell if there are any accounts in the tree by testing the account list for null?.  Getting the count of transactions would be... harder.
Comment 17 Charles Day 2008-01-09 21:53:48 UTC
To fix this bug, only the presence of accounts matters and I can check that very easily and cheaply.

But for performance, I was thinking of checking for the case where someone has gone through and set up accounts (through the druid, perhaps), but hasn't entered any transactions. In this case there is no need to check for duplicates, but the current QIF importer does it anyway, meaning one pointless query executes per imported transaction. As an experiment, I started GnuCash with --nofile, created the default account tree with the druid, then ran the import on 15 years of data. The totally unnecessary "check duplicates" stage alone took more than TWO HOURS to complete. So the savings can be quite significant.

If written properly, checking for the presence of Splits in existing accounts should be very cheap, i.e. walk the account tree until one Split is seen and then stop. I am wondering how to write this most efficiently in Scheme. Are there existing C functions that could be leveraged, and if so, which ones? I was hoping there would be an easy boolean API call, such as a "gnc_account_tree_has_splits (Account *)" or something but no such luck apparently!

On another subject, I found an additional optimization that would affect ALL imports of QIF. Currently there is a call to gnc-account-get-descendants-sorted INSIDE the transaction loop (around line 65 of qif-merge-groups.scm) which could be done outside the loop instead (one time instead of n times).
Comment 18 Derek Atkins 2008-01-10 13:27:48 UTC
Wow!  Two Hours!  Yeah, we should try to fix that.  There is an API xaccAccountGetSplitList() that you could use.  If there are no transactions then the list would be empty.  So you could do something like:

  (define (has-any-txns acctlist)
    (if (null? acctlist)
         #f
        (let ((splits (xaccAccountGetSplitList (car acctlist))))
          (if (null? splits)
              (has-any-txns (cdr acctlist))
              #t)))))

As for the get-descendents-sorted..  Yeah, that should be done once outside the loop.
Comment 19 Charles Day 2008-01-10 22:12:05 UTC
Putting get-descendants-sorted outside the loop has cut the time by more than half, to one hour... which is great because it will speed up all QIF imports and not just the special cases discussed here.

As far as getting rid of the duplicates checking altogether for those special cases, I think that the code you've suggested above looks like a winner. I'll give it a go.

Hopefully I'll be posting a patch in a few hours.
Comment 20 Charles Day 2008-01-11 04:00:18 UTC
Created attachment 102568 [details] [review]
Proposed patch

Seems to work like a charm... I tested it against:
(a) with no existing accounts, import a QIF
(b) with existing accounts but no transactions, import a QIF
(c) with existing accounts and transactions, import a QIF

For (a) and (b), duplicate checking gets skipped. For (c) the duplicate checking runs as usual.
Comment 21 Charles Day 2008-01-11 04:15:09 UTC
Created attachment 102570 [details]
Friendlier diff for patch (-w to ignore whitespace)

Here is the output of "svn diff -x -w" which is a bit easier to read for us poor humans, because the most of the changes to whitespace have been dropped.

The fix for this bug required wholesale changes to indentation because of additional let and if statements inserted very early in the procedure. There was enough indentation that some lines of codes also had to be split into two or more for readability.
Comment 22 Derek Atkins 2008-01-20 17:36:30 UTC
Applied to trunk as r16873.
Awaiting backport.
Comment 23 Andreas Köhler 2008-01-30 20:28:31 UTC
Applied to branches/2.2 as r16893 for GnuCash 2.2.4.
Thanks a lot!
Comment 24 John Ralls 2018-06-29 21:00:32 UTC
GnuCash bug tracking has moved to a new Bugzilla host. This bug has been copied to https://bugs.gnucash.org/show_bug.cgi?id=336211. Please update any external references or bookmarks.