Skipped event with concurrency enabled #556

Open
pirvudoru opened this issue Apr 17, 2024 · 8 comments

Comments

@pirvudoru

I have the following setup:

  • Use PostgreSQL as the event store.
  • Have 150 events in the events table.
  • Create a TestHandler with concurrency = 5.
  • Make TestHandler crash at event 40.
  • Register the TestHandler in the supervisor with restart: :temporary.

Observed behavior:

  • error is logged
  • 2 pids died
  • the handler entry in the subscriptions table has last_seen = 150
  • when restarting the app/handler, everything appears fine, but the failed event is skipped

What is the course of action to avoid skipping events? What happens if I deploy while the handlers are processing events? A deploy may also kill handler processes, and after a restart events are loaded based on the last_seen value in the event store's subscriptions table.
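
The reported symptom can be reproduced with a toy model. The Python sketch below is not Commanded or EventStore code. It assumes, as a guess at the failure mode rather than a confirmed description of the library, that the subscription checkpoint is simply the highest event number any handler instance has acknowledged. The name `naive_checkpoint` and the event numbering are made up for illustration.

```python
def naive_checkpoint(event_numbers, failing):
    """Toy model: the checkpoint is the highest acknowledged event number.

    `failing` holds event numbers whose handler crashes before acknowledging.
    """
    handled = []
    last_seen = 0
    for number in event_numbers:
        if number in failing:
            continue  # handler crashed, so this event is never acknowledged
        handled.append(number)
        last_seen = max(last_seen, number)  # checkpoint jumps over the gap
    return handled, last_seen


handled, last_seen = naive_checkpoint(range(1, 151), failing={40})

# After a restart, only events above the checkpoint are delivered again.
redelivered = [n for n in range(1, 151) if n > last_seen]

print(last_seen)      # 150
print(40 in handled)  # False
print(redelivered)    # []
```

Under this assumption the checkpoint reaches 150 even though event 40 was never handled, so nothing is redelivered after a restart.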

@drteeth
Contributor

drteeth commented Apr 18, 2024

Are you able to reproduce this easily? Could you share an example or test case?

Are you sure the side effect of TestHandler handling that event didn't happen? Maybe the event was handled by one of the other 4 instances of your handler?

@drteeth
Contributor

drteeth commented Apr 18, 2024

Here is an example project that replicates what you are seeing:

https://github.com/drteeth/commanded_issue_556

Run the tests to see the failure.

My question is how should this work? Should the remaining handlers stop? Should the instance that failed start from 5 when it starts back up?

@pirvudoru
Author

I'm thinking the behavior for the partition of the failed handler should be the same as for a single-threaded handler, i.e. it does not process any further events. Other handlers can continue processing the events corresponding to their partitions.

When fixing the code issue and restarting the failed partition handler, it should resume from where it left off.

@drteeth
Contributor

drteeth commented Apr 19, 2024

Can you confirm for me that the failure in the handler is permanent? Meaning this is not a temporary failure, no amount of retries is going to fix it? Am I assuming correctly here?

> Other handlers can continue processing the events corresponding to their partitions.

What would this mean though?

If I had 2 concurrent handlers, the first processing odd events, the second even ones, when the first one encounters an error and ultimately dies, should the second one only process even events still? Should it take over from the failed one? Presumably not as it would also die.
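
To make the odd/even scenario concrete, here is a small Python model of partitioned delivery. It assumes events are routed by `event_number % concurrency`, which is only an illustration; how Commanded actually assigns events to handler instances is not described in this thread. `partition_for` and `deliver` are hypothetical names.

```python
def partition_for(event_number, concurrency):
    """Assumed routing rule: modulo on the event number."""
    return event_number % concurrency


def deliver(event_numbers, concurrency, halted):
    """Split events into per-partition queues.

    Events that belong to a halted partition are not handed to another
    partition; they are collected as pending instead.
    """
    queues = {p: [] for p in range(concurrency) if p not in halted}
    pending = []
    for number in event_numbers:
        partition = partition_for(number, concurrency)
        if partition in halted:
            pending.append(number)
        else:
            queues[partition].append(number)
    return queues, pending


# Two handlers; the one for odd events (partition 1) has died.
queues, pending = deliver(range(1, 9), concurrency=2, halted={1})
print(queues)   # {0: [2, 4, 6, 8]}
print(pending)  # [1, 3, 5, 7]
```

In this model the surviving handler keeps receiving only even events, and the odd events wait until their partition is restarted.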

@drteeth
Contributor

drteeth commented Apr 22, 2024

The event store tracks the last seen event per subscription, not per partition. This is mostly to support dynamic partition sizing where you can adjust the concurrency over time.

If an event cannot be processed within one partition, then the last seen event checkpoint should not move past that event. On restart, the same problematic event should be retried, along with any later events that may already have been processed by other partitions. This is the at-least-once guarantee, which means events may be processed more than once.
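
One way to picture the checkpoint rule described above is as the end of the longest gap-free run of acknowledged events. The Python sketch below is an illustrative model under that assumption, not the event store's implementation; `safe_checkpoint` and `events_to_replay` are hypothetical helpers.

```python
def safe_checkpoint(acked, start=0):
    """Return the highest N such that every event in start+1..N is acknowledged."""
    checkpoint = start
    while checkpoint + 1 in acked:
        checkpoint += 1
    return checkpoint


def events_to_replay(acked, last_event, start=0):
    """Events delivered again after a restart: everything above the checkpoint."""
    checkpoint = safe_checkpoint(acked, start)
    return list(range(checkpoint + 1, last_event + 1))


# 150 events, every one acknowledged except the failing event 40.
acked = set(range(1, 151)) - {40}

print(safe_checkpoint(acked))             # 39
replay = events_to_replay(acked, last_event=150)
print(replay[0], replay[-1], len(replay))  # 40 150 111
```

With event 40 unacknowledged, the checkpoint stays at 39, so a restart replays events 40 through 150. Handlers therefore need to tolerate seeing events 41 to 150 a second time.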

My example uses the in-memory store. I think it should use the PG adapter to be worth anything. I'll try to update it to see if that changes things.

@drteeth
Contributor

drteeth commented Apr 22, 2024

Updated my example to use the PG adapter and to expect the subscription not to move past the failed event.

drteeth added a commit to drteeth/commanded that referenced this issue Apr 22, 2024
@drteeth
Contributor

drteeth commented Apr 22, 2024

@slashdotdash here's a failing test for the issue: drteeth@e43c899

@pirvudoru
Author

> Can you confirm for me that the failure in the handler is permanent? Meaning this is not a temporary failure, no amount of retries is going to fix it? Am I assuming correctly here?

Yes, I confirm the failure is permanent, meaning it would need a code change to process the event successfully.

> > Other handlers can continue processing the events corresponding to their partitions.
>
> What would this mean though?
>
> If I had 2 concurrent handlers, the first processing odd events, the second even ones, when the first one encounters an error and ultimately dies, should the second one only process even events still? Should it take over from the failed one? Presumably not as it would also die.

It would probably be a nice-to-have to keep processing the partitions where it can.

Sorry for the late reply, you're moving too fast.
