GNOME Bugzilla – Bug 336211
QIF Import generates spurious duplicate transactions
Last modified: 2018-06-29 21:00:32 UTC
Please describe the problem: When importing a single QIF file, many "duplicate" transactions are found. I have checked the QIF file, and found that at least the first candidate duplicate, was not, in fact, a duplicate. Since the single QIF file does not duplicate transactions, the QIF import druid should not find any duplicates. Steps to reproduce: 1. Import a single QIF 2. Observe that many duplicates are found. 3. Actual results: Many duplicates are found from a single QIF. Expected results: One does not expect any duplicates to be found. Does this happen every time? Yes. Other information: SVN revision is 13702.
Unlike OFX, QIF transactions don't have any transaction ID. Duplicate checking is performed via heuristics. Does this happen with ANY qif file, or only a particular qif file?
Still, given a SINGLE qif file being imported, no duplicates should be found. The file does not start out containing duplicates. These duplicates are spuriously generated by gnucash. I have tried it with another QIF file and found gnucash asking me to check huge numbers of duplicates. So that is at least two QIF files. I wondered if it could be related to the size of these QIF files. Both are for frequently used accounts with 7 years of data. I tried another QIF file with less data. It still found spurious duplicates. At the very least, when importing multiple QIFs, if the date, description, and amounts are the same, it is very unlikely that two transactions are not one and the same.
I agree, though I'd raise the identical-by-value threshold a bit to include some non-"empty" description. And it might just mark them as duplicates automagically -- and perhaps seperate them out from the normal duplicate matching -- in case that heuristic is wrong. But it's certainly an enhancement at least.
Quoting Mark in Comment #2: Still, given a SINGLE qif file being imported, no duplicates should be found. That's just not true. If there are transactions already in the destination account, the importer can find duplicates there. The duplicate checking is not checking within the qif file -- it's comparing the qif to-be-imported transactions to the list of transactions already in the gnucash account. a single qif file importing into an empty account should never find any duplicates. But a single qif file importing into a non-empty account may certainly find duplicates.
My apologies. My comment was incomplete. I started gnucash with the --nofile option and then imported the single QIF file. This is the situation in Derek's last paragraph in Comment #4. I encountered a large number of spurious duplicates in doing this import.
This continues to be a problem in 2.0.0. I will try to come up with a reasonably small QIF that reproduces the problem within a week or so.
Created attachment 69628 [details] QIF file that reproduces problem start gnucash with --nofile option. select File->import->Import QIF select attached file. Keep clicking on next in the druid until you come to finding duplicates. Each transaction has at least one duplicate listed - itself!
Yep, same problem for me. A thousand or so duplicates for transactions whose only duplicate is identical (in all respects) to the original. And for which there isn't a second entry in the QIF file.
This problem still exists in 2.2.2. If I start with --nofile, then import a QIF, it finds many duplicate transactions that look exactly like what is being imported. This would seem to me to be nonsense, as how can there be duplicates if I started with --nofile? From a cursory examination of the code, I believe that the "check duplicates" process is supposed to be checking the new transactions (those being imported) against the old (those existing prior to import). Instead, I think new transactions are being compared to a combined set of both new AND old. This theory was supported by a few debugging statements that I inserted into the code. I started GnuCash with --nofile and imported a QIF. At the beginning of duplicate checking, the debugging code showed me how many new transactions existed (78) and how many old transactions existed (the same: 78!).
If you go back in time, I suspect this was introduced when the query code was changed. It used to use xaccQuerySetAcountGroup() but that was changed at the time because we couldn't (any longer) support that. Now, however, with the advent of Root accounts, it's possible that we might be able to re-implement that which should solve this problem.
Yeah I think you're on to something there... I briefly looked at some of the Account code in C but got interrupted. I found some documentation on the Account and AccountGroup API's but it is (as you suggest) out of date as there was no mention of collections that I could find. I don't mind trying to debug this but the relationships between books, collections, accounts, and root accounts aren't clear to me yet. Is there any relatively up-to-date place to read about this stuff other than the C code itself?
http://cvs.gnucash.org/docs/HEAD/
That was the first place I looked, and the only information I could find about the book and account API's was a link to the out-of-date documentation here: http://code.neil.williamsleesmill.me.uk/texi/gnucash-design.html That document is still written in terms of account groups. If you know of any more recent stuff, let me know, thanks.
An "Account Group" is just a set of peer accounts. It used to be: AccountGroup | +- Account / AccountGroup | | | +- Account / Account Group | ... +- ... | +- Account / AccountGroup ... Now, the AccountGroup has been collapsed back into the Account, and there's a new "Root Account" that's the top of the tree.
I've found the bug. One critical piece of missing information is that for this bug to occur, you must be importing to a brand new file that has NO ACCOUNTS prior to import. If you set up any accounts beforehand, such as through the druid, the import completes normally. The bug is at line 65 of qif-merge-groups.scm because the call to xaccQueryAddAccountMatch can be passed an empty list of accounts. Therefore the call will fail silently and the query will have no account criteria, causing imported transactions to be able to match themselves. It occurred to me that if there are no transactions prior to import then there is no need to perform duplicate checking. What is a cheap way in Scheme to determine that an Account (or any of its subaccounts) has at least one split? Build a query? Use gnc_account_foreach_child_until?
Does it matter if there are transactions at all? Or only if there are no accounts? You should be able to tell if there are any accounts in the tree by testing the account list for null?. Getting the count of transactions would be... harder.
To fix this bug, only the presence of accounts matters and I can check that very easily and cheaply. But for performance, I was thinking of checking for the case where someone has gone through and set up accounts (through the druid, perhaps), but hasn't entered any transactions. In this case there is no need to check for duplicates, but the current QIF importer does it anyway, meaning one pointless query executes per imported transaction. As an experiment, I started GnuCash with --nofile, created the default account tree with the druid, then ran the import on 15 years of data. The totally unnecessary "check duplicates" stage alone took more than TWO HOURS to complete. So the savings can be quite significant. If written properly, checking for the presence of Splits in existing accounts should be very cheap, i.e. walk the account tree until one Split is seen and then stop. I am wondering how to write this most efficiently in Scheme. Are there existing C functions that could be leveraged, and if so, which ones? I was hoping there would be an easy boolean API call, such as a "gnc_account_tree_has_splits (Account *)" or something but no such luck apparently! On another subject, I found an additional optimization that would affect ALL imports of QIF. Currently there is a call to gnc-account-get-descendants-sorted INSIDE the transaction loop (around line 65 of qif-merge-groups.scm) which could be done outside the loop instead (one time instead of n times).
Wow! Two Hours! Yeah, we should try to fix that. There is an API xaccAccountGetSplitList() that you could use. If there are no transactions then the list would be empty. So you could do something like: (define (has-any-txns acctlist) (if (null? acctlist) #f (let ((splits (xaccAccountGetSplitList (car acctlist)))) (if (null? splits) (has-any-txns (cdr acctlist)) #t))))) As for the get-descendents-sorted.. Yeah, that should be done once outside the loop.
Putting get-descendants-sorted outside the loop has cut the time by more than half, to one hour... which is great because it will speed up all QIF imports and not just the special cases discussed here. As far as getting rid of the duplicates checking altogether for those special cases, I think that the code you've suggested above looks like a winner. I'll give it a go. Hopefully I'll be posting a patch in a few hours.
Created attachment 102568 [details] [review] Proposed patch Seems to work like a charm... I tested it against: (a) with no existing accounts, import a QIF (b) with existing accounts but no transactions, import a QIF (c) with existing accounts and transactions, import a QIF For (a) and (b), duplicate checking gets skipped. For (c) the duplicate checking runs as usual.
Created attachment 102570 [details] Friendlier diff for patch (-w to ignore whitespace) Here is the output of "svn diff -x -w" which is a bit easier to read for us poor humans, because the most of the changes to whitespace have been dropped. The fix for this bug required wholesale changes to indentation because of additional let and if statements inserted very early in the procedure. There was enough indentation that some lines of codes also had to be split into two or more for readability.
Applied to trunk as r16873. Awaiting backport.
Applied to branches/2.2 as r16893 for GnuCash 2.2.4. Thanks a lot!
GnuCash bug tracking has moved to a new Bugzilla host. This bug has been copied to https://bugs.gnucash.org/show_bug.cgi?id=336211. Please update any external references or bookmarks.