Skipped event with concurrency enabled #556

pirvudoru · 2024-04-17T10:47:52Z

I have the following setup:

Use PostgreSQL as the event store.
Have 150 events in the events table.
Create a TestHandler with concurrency = 5.
Make TestHandler crash at event 40.
Register TestHandler in the supervisor with restart: :temporary.

Observed behavior:

An error is logged.
2 pids died.
The handler entry in the subscriptions table has last_seen = 150.
After restarting the app/handler everything appears fine, but the failed event has been skipped.

What is the course of action to avoid skipping events? And what happens if I deploy while the handlers are processing events? A deploy may also kill pids, and events are then loaded based on the last_seen value from the event store's subscriptions table.

drteeth · 2024-04-18T03:09:12Z

Are you able to reproduce this easily? Could you share an example or test case?

Are you sure TestHandler's side effect didn't run for that event? Maybe it was run by one of the other 4 instances of your handler.

drteeth · 2024-04-18T04:45:39Z

Here is an example project that replicates what you are seeing:

https://github.com/drteeth/commanded_issue_556

Run the tests to see the failure.

My question is how should this work? Should the remaining handlers stop? Should the instance that failed start from 5 when it starts back up?

pirvudoru · 2024-04-19T10:25:52Z

I'm thinking the behavior for the partition of the failed handler should be the same as for a single-threaded handler, i.e. it does not process any further events. Other handlers can continue processing the events corresponding to their partitions.

When fixing the code issue and restarting the failed partition handler, it should resume from where it left off.

drteeth · 2024-04-19T19:17:30Z

Can you confirm for me that the failure in the handler is permanent? Meaning this is not a temporary failure, and no amount of retries is going to fix it? Am I assuming correctly here?

> Other handlers can continue processing the events corresponding to their partitions.

What would this mean though?

If I had 2 concurrent handlers, the first processing odd events and the second even ones, then when the first one encounters an error and ultimately dies, should the second one still process only even events? Should it take over from the failed one? Presumably not, as it would also die.
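To make the odd/even scenario concrete, here is a minimal sketch of the "failed partition stalls, the others keep going" option. It is only a model: routing by event number modulo the concurrency is an assumption chosen to match the odd/even example, not how Commanded assigns partitions, and `partition_for` and `simulate` are hypothetical names.

```python
def partition_for(event_number, concurrency):
    # Hypothetical routing: event number modulo the number of handlers.
    return event_number % concurrency


def simulate(event_numbers, concurrency, failing_event):
    """Deliver events in order; the partition that hits failing_event stalls."""
    handled = {p: [] for p in range(concurrency)}
    stalled = None
    for n in event_numbers:
        p = partition_for(n, concurrency)
        if p == stalled:
            continue  # the dead handler processes nothing further
        if n == failing_event:
            stalled = p  # permanent failure: this handler dies here
            continue
        handled[p].append(n)
    return handled, stalled


# Two handlers, ten events, event 5 can never be processed.
handled, stalled = simulate(range(1, 11), concurrency=2, failing_event=5)
print(stalled)     # partition 1 (the odd events) is stalled
print(handled[1])  # odd events handled before the failure
print(handled[0])  # even events, unaffected by the failure
```

In this model the even handler never picks up odd events, so nothing is reprocessed by a handler that would crash on the same event.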

drteeth · 2024-04-22T14:31:58Z

The event store tracks the last seen event per subscription, not per partition. This is mostly to support dynamic partition sizing where you can adjust the concurrency over time.

If an event cannot be processed within one partition, then the last seen event checkpoint should not move past that event. On restart, the same problematic event should be retried, along with any later events that may already have been processed by other partitions. This is the at-least-once guarantee, which means events may be processed more than once.
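The expected checkpoint behaviour can be sketched as follows. This is a simplified model under stated assumptions, not the event store's implementation: `SubscriptionCheckpoint` and `run` are hypothetical names, events are routed by event number modulo the concurrency purely for illustration, and the numbers (150 events, concurrency 5, failure at event 40) are taken from the original report.

```python
class SubscriptionCheckpoint:
    """One last_seen value for the whole subscription (hypothetical model)."""

    def __init__(self, last_seen=0):
        self.last_seen = last_seen
        self._acked = set()  # acknowledged events beyond last_seen

    def ack(self, event_number):
        self._acked.add(event_number)
        # Advance only across a contiguous run of acknowledged events,
        # so the checkpoint can never jump past an unacknowledged one.
        while self.last_seen + 1 in self._acked:
            self.last_seen += 1
            self._acked.remove(self.last_seen)


def run(event_numbers, concurrency, failing_event):
    checkpoint = SubscriptionCheckpoint()
    dead_partitions = set()
    processed = []
    for n in event_numbers:
        partition = n % concurrency  # assumed routing, for illustration only
        if partition in dead_partitions:
            continue
        if n == failing_event:
            dead_partitions.add(partition)  # handler crashes, no ack is sent
            continue
        processed.append(n)
        checkpoint.ack(n)
    return checkpoint.last_seen, processed


last_seen, processed = run(range(1, 151), concurrency=5, failing_event=40)
print(last_seen)  # 39

# On restart, everything after last_seen is delivered again.
replayed = list(range(last_seen + 1, 151))
already_done = [n for n in replayed if n in processed]
print(len(already_done))  # events that would be handled a second time
```

The checkpoint stays at 39 rather than 150, so event 40 is retried after a restart, and the later events already handled by the surviving partitions are delivered a second time. Handlers therefore need to tolerate duplicates.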

My example uses the in-memory store. I think it should use the PG adapter to be worth anything. I'll try to update it to see if that changes things.

drteeth · 2024-04-22T16:21:06Z

Updated my example to use the PG adapter and to expect the subscription not to move past the failed event.

drteeth · 2024-04-22T23:48:16Z

@slashdotdash here's a failing test for the issue: drteeth@e43c899

pirvudoru · 2024-04-23T16:05:53Z

> Can you confirm for me that the failure in the handler is permanent? Meaning this is not a temporary failure, and no amount of retries is going to fix it? Am I assuming correctly here?

Yes, I confirm the failure is permanent, meaning it would need a code change to process the event successfully.

> Other handlers can continue processing the events corresponding to their partitions.
>
> What would this mean though?
>
> If I had 2 concurrent handlers, the first processing odd events and the second even ones, then when the first one encounters an error and ultimately dies, should the second one still process only even events? Should it take over from the failed one? Presumably not, as it would also die.

It is probably a nice-to-have to continue processing the other partitions where possible.

Sorry for the late reply, you're moving too fast.

drteeth added a commit to drteeth/commanded that referenced this issue Apr 22, 2024

Add failing test for commanded#556

e43c899
