Why not use the raw_csv hexdigest to detect duplicates? #134

Open
msmart opened this issue Oct 6, 2019 · 5 comments

msmart commented Oct 6, 2019

I import CSV files which will produce duplicate postings (in terms of desc, credit, debit).

With the --skip-dupes option set, these legitimate postings are wrongly skipped as duplicates. The problem could be resolved by using the md5sum of the raw_csv line.

This observation is also noted at:

icsv2ledger/icsv2ledger.py

Lines 558 to 560 in 55629c7

# We also record this - in future we may use it to avoid duplication
#self.md5sum = hashlib.md5(self.raw_csv.encode('utf-8')).hexdigest()
self.md5sum = hashlib.md5(','.join(x.strip() for x in (self.date,self.desc,self.credit,self.debit,self.credit_account)).encode('utf-8')).hexdigest()

Why not un-comment that line? Or provide an option to use the raw_csv input as the skip-dupes criterion?
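
For illustration, here's a minimal sketch of the collision (the rows, field values, and account name are made up; only the hash construct mirrors the snippet quoted above):

import hashlib

# Hypothetical rows from one statement: same date, desc, and amount,
# differing only in a reference column that the field-based hash ignores.
row_a = '06/10/2019,COFFEE SHOP,,3.50,REF-001'
row_b = '06/10/2019,COFFEE SHOP,,3.50,REF-002'

def field_md5(date, desc, credit, debit, credit_account):
    # Mirrors the construct currently used in icsv2ledger.py
    fields = (date, desc, credit, debit, credit_account)
    return hashlib.md5(','.join(x.strip() for x in fields).encode('utf-8')).hexdigest()

# Both rows reduce to the same field values, so their hashes collide
# and the second posting is skipped under --skip-dupes...
h_a = field_md5('06/10/2019', 'COFFEE SHOP', '', '3.50', 'Assets:Checking')
h_b = field_md5('06/10/2019', 'COFFEE SHOP', '', '3.50', 'Assets:Checking')
assert h_a == h_b

# ...whereas hashing the raw lines keeps them distinct.
assert (hashlib.md5(row_a.encode('utf-8')).hexdigest()
        != hashlib.md5(row_b.encode('utf-8')).hexdigest())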


antler5 commented Jul 23, 2021

Hi! First of all, thanks to everyone who's contributed to this project! It's been invaluable. Anyways.
I also ran into this issue, believed that the original CSV line would be checked, and have done some digging.

The change was introduced back in 2017 by mondjef, in a pull request intended to avoid false positives caused by confusing otherwise-identical transactions that affect different source accounts, while unintentionally introducing a new class of false positive of its own.

"Changed duplication detection to use MD5sum field and included source account so that duplication detection is more accurate."
#100

"The source account is included to avoid false postives on generic transaction descriptions when the source account is different and thus should not be considered a duplicate."
89adf85#diff-b335630551682c19a781afebcf4d07bf978fb1f8ac04c6bf87428ed5106870f5L295-R298

  • One could solve this by including whatever unique fields differentiate the raw CSV lines in the desc option of .icsv2ledgerrc, but the identification of unique transactions shouldn't rely on those settings, nor ignore fields that weren't included, in the first place.

  • An improved MD5Sum construct would include the entire CSV line as well as the unique self.credit_account field, which is (I believe?) the only property that can't be found in the CSV itself.

  • Here's a tricky one: any change to the way MD5Sums are currently constructed will cause existing MD5 hashes to change (i.e. no longer match). This is my least favorite part of having moved away from just hashing raw_csv to begin with.

    • We could check for both formats and prompt for migration (just filling in the new hash).
    • We could leave the MD5 as-is and adjust the dupe detection to check both the hash and the CSV comments. Several of the fields used to construct the MD5Sum would still become redundant, but this might be less disruptive to existing entries.

So, using both self.raw_csv and self.credit_account to make the MD5 hash:

# self.md5sum = hashlib.md5(self.raw_csv.encode('utf-8')).hexdigest()
# self.md5sum = hashlib.md5(','.join(x.strip() for x in (self.date,self.desc,self.credit,self.debit,self.credit_account)).encode('utf-8')).hexdigest()
self.md5sum = hashlib.md5(','.join(x.strip() for x in (self.raw_csv,self.credit_account)).encode('utf-8')).hexdigest()

Or adjusting the dupe detection to check raw_csv too:

# I don't usually write Python, so this might not work
# if (options.skip_dupes or options.confirm_dupes) and entry.md5sum in md5sum_hashes:
if (options.skip_dupes or options.confirm_dupes) and (entry.md5sum in md5sum_hashes) and (entry.raw_csv in csv_comments):

The first option makes the most sense, but I'm nagged by the lack of a migration path. It would be sweet to check whether a transaction's hash matches the old MD5 construct and prompt to update it. I don't know if it bothers anyone else, and it's still the right move for a better piece of software, so I'll submit a pull request anyway. I don't have time to do anything fancy.

Just to be sure: are there any other fields like self.credit_account (i.e. not found in the CSV itself) that should be included in the MD5Sum?
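
For what it's worth, a rough sketch of that dual-format check (entry, known_hashes, and the function names are hypothetical; only the two hash constructs come from this thread):

import hashlib

def old_md5(entry):
    # Current construct: selected fields plus the source account.
    fields = (entry.date, entry.desc, entry.credit, entry.debit, entry.credit_account)
    return hashlib.md5(','.join(x.strip() for x in fields).encode('utf-8')).hexdigest()

def new_md5(entry):
    # Proposed construct: the raw CSV line plus the source account.
    fields = (entry.raw_csv, entry.credit_account)
    return hashlib.md5(','.join(x.strip() for x in fields).encode('utf-8')).hexdigest()

def is_known_dupe(entry, known_hashes):
    # Accept either format; a match on the old format is the cue to
    # prompt the user and rewrite the stored hash as new_md5(entry).
    return new_md5(entry) in known_hashes or old_md5(entry) in known_hashes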

antler5 pushed a commit to antler5/icsv2ledger that referenced this issue Aug 1, 2021
Update to quentinsf#100,
see quentinsf#134.

Note that this will cause current transactions to become unique.
Appropriate adjustments have been made to existing tests.

antler5 commented Aug 1, 2021

Re-submitted #139 as #140 because I was naively working on master and, as it happens, GitHub doesn't let you change your default branch, only rename it.

areynoua commented

We should also consider a problem of another nature: my bank has the unpleasant habit of changing the format of its CSV exports (even though it is the 7th largest bank in the world in some rankings). I added two options to control checksum generation: the first is the list of fields to consider; the second is a regex that removes part of the desc field. I use this second option to strip what the bank appends to the communication, which varies from one export to another. The modification is minimal; you can see it here: https://github.com/areynoua/icsv2ledger/commit/3ca0ff5722083f41baabbbe34ff2af04ebc79dc1.
(There are also two other commits on my fork which may be of interest; judge for yourselves.)
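
In rough outline the idea is something like this (the option names, field list, and regex below are illustrative stand-ins; the real code is in the commit linked above):

import hashlib
import re

# Illustrative stand-ins for the two new settings:
csum_fields = ('date', 'desc', 'credit', 'debit')     # fields to checksum
csum_desc_regex = r'\s*EXPORT \d{2}/\d{2}/\d{4}$'     # variable text the bank appends

def checksum(entry):
    parts = []
    for name in csum_fields:
        value = getattr(entry, name).strip()
        if name == 'desc':
            # Drop the part of the communication that changes between
            # exports, so the same transaction always hashes the same.
            value = re.sub(csum_desc_regex, '', value)
        parts.append(value)
    return hashlib.md5(','.join(parts).encode('utf-8')).hexdigest()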

areynoua commented

I was in too much of a hurry to share the change before I could test it (how awful!). The commit mentioned above contains errors, but they are easy to fix: https://github.com/areynoua/icsv2ledger/commit/34fb3e698741b5d336076ce7508283dcefe9c8dc


antler5 commented Aug 10, 2022

That is an important consideration, and a strong argument for basing the hash on portions of the extracted data after all. I just happen to work with an archive of monthly statements, which aren't affected by format changes quite as badly. Another pathological example is the only somewhat unlikely case of two identical transactions without fine-grained datetime or account-balance fields to differentiate them, which raises questions about the applicability of duplicate detection and the workflows users want supported.
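
In miniature (made-up rows), that pathological case looks like this; no choice of hash input can separate the two postings:

import hashlib

# Two genuinely distinct purchases that export as byte-identical rows:
# no timestamp, no running balance, nothing to tell them apart.
rows = ['10/08/2022,COFFEE SHOP,,3.50',
        '10/08/2022,COFFEE SHOP,,3.50']

digests = {hashlib.md5(r.encode('utf-8')).hexdigest() for r in rows}
assert len(digests) == 1  # even raw_csv hashing collides here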
