Why not use the raw_csv hexdigest to detect duplicates? #134

Open
msmart opened this issue Oct 6, 2019 · 5 comments

msmart commented Oct 6, 2019

I import CSV files which will produce duplicate postings (in terms of desc, credit, debit).

With the --skip-dupes option set, these legitimate postings are wrongly skipped as duplicates. The problem could be resolved by using the md5sum of the raw_csv line.

This observation is also noted at:

icsv2ledger/icsv2ledger.py

Lines 558 to 560 in 55629c7

# We also record this - in future we may use it to avoid duplication
#self.md5sum = hashlib.md5(self.raw_csv.encode('utf-8')).hexdigest()
self.md5sum = hashlib.md5(','.join(x.strip() for x in (self.date,self.desc,self.credit,self.debit,self.credit_account)).encode('utf-8')).hexdigest()

Why not un-comment that line? Or provide an option to use the raw_csv input as the skip-dupes criterion?
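
For illustration, here's a minimal sketch of the collision (the rows, field values, and account name are made up; only the hash construct mirrors the snippet quoted above):

import hashlib

# Hypothetical rows from one statement: same date, desc, and amount,
# differing only in a reference column that the field-based hash ignores.
row_a = '06/10/2019,COFFEE SHOP,,3.50,REF-001'
row_b = '06/10/2019,COFFEE SHOP,,3.50,REF-002'

def field_md5(date, desc, credit, debit, credit_account):
    # Mirrors the construct currently used in icsv2ledger.py
    fields = (date, desc, credit, debit, credit_account)
    return hashlib.md5(','.join(x.strip() for x in fields).encode('utf-8')).hexdigest()

# Both rows reduce to the same field values, so their hashes collide
# and the second posting is skipped under --skip-dupes...
h_a = field_md5('06/10/2019', 'COFFEE SHOP', '', '3.50', 'Assets:Checking')
h_b = field_md5('06/10/2019', 'COFFEE SHOP', '', '3.50', 'Assets:Checking')
assert h_a == h_b

# ...whereas hashing the raw lines keeps them distinct.
assert (hashlib.md5(row_a.encode('utf-8')).hexdigest()
        != hashlib.md5(row_b.encode('utf-8')).hexdigest())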


antler5 commented Jul 23, 2021

Hi! First of all, thanks to everyone who's contributed to this project! It's been invaluable. Anyways.
I also ran into this issue, believed that the original CSV line would be checked, and have done some digging.

The change was introduced back in 2017 by mondjef, in a pull request intended to avoid false positives caused by confusing otherwise-identical transactions that affect different source accounts, while unintentionally introducing a new class of false positive of its own.

"Changed duplication detection to use MD5sum field and included source account so that duplication detection is more accurate."
#100

"The source account is included to avoid false postives on generic transaction descriptions when the source account is different and thus should not be considered a duplicate."
89adf85#diff-b335630551682c19a781afebcf4d07bf978fb1f8ac04c6bf87428ed5106870f5L295-R298

  • One could solve this by including whatever unique fields differentiate the raw CSV lines in the desc option of .icsv2ledgerrc, but the identification of unique transactions shouldn't rely on those settings, nor ignore fields that weren't included, in the first place.

  • An improved MD5Sum construct would include the entire CSV line as well as the unique self.credit_account field, which is (I believe?) the only property that can't be found in the CSV itself.

  • Here's a tricky one: any change to the way MD5Sums are currently constructed will cause existing MD5 hashes to change (i.e. no longer match). This is my least favorite part of having moved away from just hashing raw_csv to begin with.

    • We could check for both formats and prompt for migration (just filling in the new hash).
    • We could leave the MD5 as-is and adjust the dupe detection to check both the hash and the CSV comments. Several of the fields used to construct the MD5Sum would still become redundant, but this might be less disruptive to existing entries.

So, using both self.raw_csv and self.credit_account to make the MD5 hash:

# self.md5sum = hashlib.md5(self.raw_csv.encode('utf-8')).hexdigest()
# self.md5sum = hashlib.md5(','.join(x.strip() for x in (self.date,self.desc,self.credit,self.debit,self.credit_account)).encode('utf-8')).hexdigest()
self.md5sum = hashlib.md5(','.join(x.strip() for x in (self.raw_csv,self.credit_account)).encode('utf-8')).hexdigest()

Or adjusting the dupe detection to check raw_csv too:

# I don't usually write Python, so this might not work
# if (options.skip_dupes or options.confirm_dupes) and entry.md5sum in md5sum_hashes:
if (options.skip_dupes or options.confirm_dupes) and (entry.md5sum in md5sum_hashes) and (entry.raw_csv in csv_comments):

The first option makes the most sense, but I'm nagged by the lack of a migration path. It would be sweet to check whether a transaction's hash matches the old MD5 construct and prompt to update it. I don't know if it bothers anyone else, and it's still the right move for a better piece of software, so I'll submit a pull request anyway. I don't have time to do anything fancy.

Just to be sure: are there any other fields like self.credit_account (i.e. not found in the CSV itself) that should be included in the MD5Sum?
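
For what it's worth, a rough sketch of that dual-format check (entry, known_hashes, and the function names are hypothetical; only the two hash constructs come from this thread):

import hashlib

def old_md5(entry):
    # Current construct: selected fields plus the source account.
    fields = (entry.date, entry.desc, entry.credit, entry.debit, entry.credit_account)
    return hashlib.md5(','.join(x.strip() for x in fields).encode('utf-8')).hexdigest()

def new_md5(entry):
    # Proposed construct: the raw CSV line plus the source account.
    fields = (entry.raw_csv, entry.credit_account)
    return hashlib.md5(','.join(x.strip() for x in fields).encode('utf-8')).hexdigest()

def is_known_dupe(entry, known_hashes):
    # Accept either format; a match on the old format is the cue to
    # prompt the user and rewrite the stored hash as new_md5(entry).
    return new_md5(entry) in known_hashes or old_md5(entry) in known_hashes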

antler5 pushed a commit to antler5/icsv2ledger that referenced this issue Aug 1, 2021
Update to quentinsf#100,
see quentinsf#134.

Note that this will cause current transactions to become unique.
Appropriate adjustments have been made to existing tests.

antler5 commented Aug 1, 2021

Re-submitted #139 as #140 because I was naively working on master and, as it happens, GitHub doesn't let you change your default branch, only rename it.

areynoua commented

We should also consider a problem of another nature: my bank has the unpleasant habit of changing the format of its CSV exports (even though it is the 7th largest bank in the world in some rankings). I added two options to control checksum generation: the first is the list of fields to consider; the second is a regex that removes part of the desc field. I use this second option to strip what the bank appends to the communication, which varies from one export to another. The modification is minimal; you can see it here: https://github.com/areynoua/icsv2ledger/commit/3ca0ff5722083f41baabbbe34ff2af04ebc79dc1.
(There are also two other commits on my fork which may be of interest; judge for yourselves.)
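
In rough outline the idea is something like this (the option names, field list, and regex below are illustrative stand-ins; the real code is in the commit linked above):

import hashlib
import re

# Illustrative stand-ins for the two new settings:
csum_fields = ('date', 'desc', 'credit', 'debit')     # fields to checksum
csum_desc_regex = r'\s*EXPORT \d{2}/\d{2}/\d{4}$'     # variable text the bank appends

def checksum(entry):
    parts = []
    for name in csum_fields:
        value = getattr(entry, name).strip()
        if name == 'desc':
            # Drop the part of the communication that changes between
            # exports, so the same transaction always hashes the same.
            value = re.sub(csum_desc_regex, '', value)
        parts.append(value)
    return hashlib.md5(','.join(parts).encode('utf-8')).hexdigest()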

areynoua commented

I was in too much of a hurry to share the change before I could test it (how awful!). The commit mentioned above contains errors, but they are easy to fix: https://github.com/areynoua/icsv2ledger/commit/34fb3e698741b5d336076ce7508283dcefe9c8dc


antler5 commented Aug 10, 2022

That is an important consideration, and a strong argument for basing the hash on portions of the extracted data after all. I just happen to work with an archive of monthly statements, which aren't affected by format changes quite as badly. Another pathological example is the only somewhat unlikely case of two identical transactions without fine-grained datetime or account-balance fields to differentiate them, which raises questions about the applicability of duplicate detection and the workflows users want supported.
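
In miniature (made-up rows), that pathological case looks like this; no choice of hash input can separate the two postings:

import hashlib

# Two genuinely distinct purchases that export as byte-identical rows:
# no timestamp, no running balance, nothing to tell them apart.
rows = ['10/08/2022,COFFEE SHOP,,3.50',
        '10/08/2022,COFFEE SHOP,,3.50']

digests = {hashlib.md5(r.encode('utf-8')).hexdigest() for r in rows}
assert len(digests) == 1  # even raw_csv hashing collides here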
