Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix(#2562): Prevent transaction deduplication for imported transactions #2618

Open
wants to merge 12 commits into
base: master
Choose a base branch
from

Conversation

ttlgeek
Copy link
Contributor

@ttlgeek ttlgeek commented Apr 16, 2024

Hello,

After looking into the issue, it seems that the transaction deduplication is due to the two transactions having the same amount and account and the second transaction happened within 7 days of the first one.

After reading the import transaction code, the fuzzy matching matching uses the 7 days method to try to match similar transactions and merge them.

Proposed solution: prevent the fuzzy matching from matching imported transactions (where import_id is not null) to prevent this behavior.

@MatissJanis what do you think ?

@trafico-bot trafico-bot bot added the 🔍 Ready for Review Pull Request is not reviewed yet label Apr 16, 2024
@github-actions github-actions bot changed the title fix(#2562): Prevent transaction deduplication for imported transactions [WIP] fix(#2562): Prevent transaction deduplication for imported transactions Apr 16, 2024
Copy link

netlify bot commented Apr 16, 2024

Deploy Preview for actualbudget ready!

Name Link
🔨 Latest commit b54fab0
🔍 Latest deploy log https://app.netlify.com/sites/actualbudget/deploys/663b56fa5374fd000809d798
😎 Deploy Preview https://deploy-preview-2618.demo.actualbudget.org
📱 Preview on mobile
Toggle QR Code...

QR Code

Use your smartphone camera to open QR code link.

To edit notification comments on pull requests, go to your Netlify site configuration.

@trafico-bot trafico-bot bot added 🚧 WIP Still work-in-progress, please don't review and don't merge and removed 🔍 Ready for Review Pull Request is not reviewed yet labels Apr 16, 2024
Copy link
Contributor

github-actions bot commented Apr 16, 2024

Bundle Stats — desktop-client

Hey there, this message comes from a GitHub action that helps you and reviewers to understand how these changes affect the size of this project's bundle.

As this PR is updated, I'll keep you updated on how the bundle size is impacted.

Total

Files count Total bundle size % Changed
9 4.71 MB 0%

Changeset

No files were changed

View detailed bundle breakdown

Added

No assets were added

Removed

No assets were removed

Bigger

No assets were bigger

Smaller

No assets were smaller

Unchanged

Asset File Size % Changed
static/js/indexeddb-main-thread-worker-e59fee74.js 13.5 kB 0%
static/js/resize-observer.js 18.37 kB 0%
static/js/usePreviewTransactions.js 790 B 0%
static/js/BackgroundImage.js 122.29 kB 0%
static/js/AppliedFilters.js 21.16 kB 0%
static/js/narrow.js 59.81 kB 0%
static/js/wide.js 262.38 kB 0%
static/js/ReportRouter.js 1.23 MB 0%
static/js/index.js 3 MB 0%

Copy link
Contributor

github-actions bot commented Apr 16, 2024

Bundle Stats — loot-core

Hey there, this message comes from a GitHub action that helps you and reviewers to understand how these changes affect the size of this project's bundle.

As this PR is updated, I'll keep you updated on how the bundle size is impacted.

Total

Files count Total bundle size % Changed
1 1.2 MB → 1.2 MB (+24 B) +0.00%
Changeset
File Δ Size
packages/loot-core/src/server/accounts/sync.ts 📈 +24 B (+0.11%) 22.13 kB → 22.16 kB
View detailed bundle breakdown

Added

No assets were added

Removed

No assets were removed

Bigger

Asset File Size % Changed
kcab.worker.js 1.2 MB → 1.2 MB (+24 B) +0.00%

Smaller

No assets were smaller

Unchanged

No assets were unchanged

@@ -492,7 +492,7 @@ export async function reconcileTransactions(
// fields.
fuzzyDataset = await db.all(
`SELECT id, is_parent, date, imported_id, payee, category, notes, reconciled FROM v_transactions
WHERE date >= ? AND date <= ? AND amount = ? AND account = ?`,
WHERE date >= ? AND date <= ? AND amount = ? AND account = ? AND imported_id IS NULL`,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💭 thought: ‏this is an interesting proposal, but I think it would break an existing use-case.

There are folks that do both bank-sync + manual imports (don't ask me why.. I couldn't explain that. It's just my observation from the questions on discord). So for these folks the bank-sync imports would work fine with your patch. However, then the subsequent manual csv imports would break as they would create duplicates.

(also semi-related note - lately I've been considering enabling the "import" button even for accounts that have bank-sync enabled)

There is a new addition to the import modal to skip the reconciliation logic entirely. It's a checkbox you can turn on. Would using that solve your problem?

Copy link
Contributor

@jlsjonas jlsjonas Apr 29, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I might be able to shed some light on the reasons:

  1. not everyone will be interacting with actual every single month & several banks heavily restrict how far syncs go back, so there might be gaps with just sync (me, sometimes :) )
  2. some people might not trust the syncing (from either Actual or their bank's side) enough and want to manually import their year overview (f.e.)

(the checkbox would work perfect for my use-case, as I rely on it to avoid duplicates)

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is a new addition to the import modal to skip the reconciliation logic entirely. It's a checkbox you can turn on.

Disabling all reconciliation / dedup is too aggressive of a fix for #2562 .

In cases where people trust the data they're importing, and they trust that the imported IDs will correctly match, they just want to avoid magic heuristics from silently deleting their transactions.

@jacopo-j
Copy link

jacopo-j commented Apr 20, 2024

Hi, I took a look at this PR (context: I'm the author of #2562). I understand MatissJanis' point, and I believe the issue could be approached in a different way to avoid breaking existing use cases.

Do not completely exclude transactions that have a imported_id != NULL from the deduplication logic. Instead, skip the deduplication of two transactions only if both transactions have an imported_id set and the IDs are different.

(On a side note, I believe reconciled transactions should be excluded from deduplication altogether, instead.)

I think this would both solve the issue I reported and avoid breaking the other use case. Hopefully it makes sense.

@MatissJanis
Copy link
Member

Do not completely exclude transactions that have a imported_id != NULL from the deduplication logic. Instead, skip the deduplication of two transactions only if both transactions have an imported_id set and the IDs are different.

I think this would work. Can't come up with any counter arguments for this at the moment.

@strazto
Copy link

strazto commented May 7, 2024

Instead, skip the deduplication of two transactions only if both transactions have an imported_id set and the IDs are different.

This sounds good to me.
When I import from an OFX file, what I'm expecting is an option to disable deduplication heuristics, not to disable deduplication entirely.

@pmoon00
Copy link

pmoon00 commented May 10, 2024

@ttlgeek Just wondering if you were going to continue this work as suggested by the comments above. If not I can give it a go. It would be my first contribution so would probably take longer, but got to start somewhere. Let me know what your plans are.

@strazto
Copy link

strazto commented May 16, 2024

@pmoon00

@ttlgeek Just wondering if you were going to continue this work as suggested by the comments above. If not I can give it a go.

Looks like for whatever reason he's not coming back to it anytime soon. Possibly he's deployed his forked branch to his own instance and is happy with its' behaviour.

It would be my first contribution so would probably take longer, but got to start somewhere.

This PR / branch is probably still good basis for your work btw - Take the work here and then apply @MatissJanis + @jacopo-j 's comments (Should mainly be a change to the SQL query that Matiss commented on) and you shouldn't have to do too much to get approved.

@strazto
Copy link

strazto commented May 16, 2024

In fact, I may also have a go at this over my lunch break now

@strazto
Copy link

strazto commented May 16, 2024

@jacopo-j re-reading your comment

Do not completely exclude transactions that have a imported_id != NULL from the deduplication logic.

Instead, skip the deduplication of two transactions only if both transactions have an imported_id set and the IDs are different.

I think @ttlgeek 's PR already does what you're suggesting, it's just not obvious from the default diff.
Expanding the context of the diff yields the following:

// First, match with an existing transaction's imported_id. This
// is the highest fidelity match and should always be attempted
// first.
if (trans.imported_id) {
match = await db.first(
'SELECT * FROM v_transactions WHERE imported_id = ? AND account = ?',
[trans.imported_id, acctId],
);
if (match) {
hasMatched.add(match.id);
}
}
// If it didn't match, query data needed for fuzzy matching
if (!match) {
// Look 7 days ahead and 7 days back when fuzzy matching. This
// needs to select all fields that need to be read from the
// matched transaction. See the final pass below for the needed
// fields.
fuzzyDataset = await db.all(
`SELECT id, is_parent, date, imported_id, payee, category, notes, reconciled FROM v_transactions
WHERE date >= ? AND date <= ? AND amount = ? AND account = ?`,

To rephrase your specification:

Do not completely exclude transactions that have a imported_id != NULL from the deduplication logic.
Instead, skip the deduplication of two transactions only if both transactions have an imported_id set and the IDs are different.

if tx_in has imported_id:
  if tx_in.imported_id corresponds to existing_tx:
    dedup using IDs
    exit
    
# In the above, both transactions have imported ID, but the IDs match, so no skip

# The following logic must hold for the the rest of this flow:
# tx_in has no imported_id
# OR
# tx_in's imported_id does not match (ids are different)

if an existing_tx has no imported_id:
  dedup using heuristics
  exit

# (
# tx_in has no imported_id
# OR
# tx_in.imported_id does not match (ids are different)
#  )
# AND existing_tx has no imported_id 

This behaviour is consistent with @ttlgeek 's PR.

Copy link
Member

@MatissJanis MatissJanis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was trying to poke holes in the logic, but couldn't find any issues. Thanks for correcting me!

I'm happy to merge this, but I'd like another maintainer to review it since it's a risky change to the reconciliation logic.

@MatissJanis MatissJanis changed the title [WIP] fix(#2562): Prevent transaction deduplication for imported transactions fix(#2562): Prevent transaction deduplication for imported transactions May 16, 2024
@trafico-bot trafico-bot bot added ✅ Approved Pull Request has been approved and can be merged and removed 🚧 WIP Still work-in-progress, please don't review and don't merge labels May 16, 2024
@strazto
Copy link

strazto commented May 17, 2024

I was trying to poke holes in the logic, but couldn't find any issues. Thanks for correcting me!

This morning I thought about it and had a feeling I may be overlooking something.
It's definitely tricky inverting the logic between what the SQL statements are doing and the idea of "when to skip".

I think possibly there's an issue with this being the skip case:

# (
# tx_in has no imported_id
# OR
# tx_in.imported_id does not match (ids are different)
#  )
# AND existing_tx has no imported_id 

Because that includes tx_in.imported_id == NULL && existing_tx.imported_id == NULL which is NOT a skip case.
Your instinct to double check is valid.

I think the change is small, however, we can just add another condition to ttlgeek's original SQL modification.

In terms of style, I think we should have linebreaks that delineate the logic more clearly in the SQL statements, and comments,

`SELECT id, is_parent, date, imported_id, payee, category, notes, reconciled FROM v_transactions
   WHERE date >= ? AND date <= ? AND amount = ? AND account = ? 
   -- Only perform it if either ID is null
   AND ( imported_id IS NULL OR ? IS NULL)`, //  tx_in.imported_id should be last parameter

@pmoon00
Copy link

pmoon00 commented May 17, 2024

I was trying to poke holes in the logic, but couldn't find any issues. Thanks for correcting me!

This morning I thought about it and had a feeling I may be overlooking something.

It's definitely tricky inverting the logic between what the SQL statements are doing and the idea of "when to skip".

I think possibly there's an issue with this being the skip case:


# (

# tx_in has no imported_id

# OR

# tx_in.imported_id does not match (ids are different)

#  )

# AND existing_tx has no imported_id 

Because that includes tx_in.imported_id == NULL && existing_tx.imported_id == NULL which is NOT a skip case.

Your instinct to double check is valid.

I think the change is small, however, we can just add another condition to ttlgeek's original SQL modification.

In terms of style, I think we should have linebreaks that delineate the logic more clearly in the SQL statements, and comments,

`SELECT id, is_parent, date, imported_id, payee, category, notes, reconciled FROM v_transactions

   WHERE date >= ? AND date <= ? AND amount = ? AND account = ? 

   -- Only perform it if either ID is null

   AND ( imported_id IS NULL OR ? IS NULL)`, //  tx_in.imported_id should be last parameter

Would it be clearer if the AND imported_id IS NULL predicate is only included if the tx_in.imported_id is set/exists?

I'm not sure what the conventions around dynamic SQL predicates are in this project though.

@strazto
Copy link

strazto commented May 17, 2024

Would it be clearer if the AND imported_id IS NULL predicate is only included if the tx_in.imported_id is set/exists?

I'm not sure what the conventions around dynamic SQL predicates are in this project though.

Different branching logic overall might be preferable and more readable, however my assumption is that dynamic SQL should be avoided.
It's more likely to lead to bugs and could potentially become a vector for injection vulns (not very likely in this case).

It's true that this statement slightly violates DRY with the code branch that checks that tx_in.imported_id is null, but I don't think it's a big problem.

AND ( imported_id IS NULL OR ? IS NULL)

@strazto
Copy link

strazto commented May 17, 2024

Hi @MatissJanis @pmoon00 I've taken a stab at this in my own PR #2770 which bases off of @ttlgeek 's branch (though without all the master -> feature merge commits because I think those are ugly).

The only difference is the logic of the SQL statement, which now allows the dedup to occur if an existing transaction OR the input transaction are null.
There's also a formatting difference in the SQL statement (I've re-ordered it and added linebreaks + comments to logically separate the fuzzy logic from the null conditioning )

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
✅ Approved Pull Request has been approved and can be merged
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

6 participants