Enable safe retry of aggregate commands in the event of (perceived) failures #2554

jvanheesch · 2023-01-09T13:34:01Z

Proposal

This is a draft proposal. It aims to enable safe retry of aggregate commands in the event of (perceived) failures, without changing current default behavior (i.e. opt-in).

Context

Command failures are often transient (e.g. service down -> NoHandlerForCommandException). As such, it makes sense to introduce the concept of a RetryScheduler.
Problem: retrying commands is not always a safe operation, as the initial "failed" command may already have produced side effects, in which case a redelivery might wrongfully produce those side effects a second time (this is particularly common in case of COMMAND_TIMEOUT)
To enable safe retries, the command handler (CH) must be idempotent
Message handlers can be made idempotent at the infrastructure layer by filtering out duplicate messages

The problem and the proposal are demonstrated in the following repository: https://github.com/jvanheesch/axon-command-idempotency

Approach

When a command results in an aggregate event, store the command id (Message.getIdentifier()) in the event
When sourcing an aggregate from events (+ snapshot), keep track of command ids
Upon receiving a command: if the command has already resulted in an aggregate event (i.e. command id is stored in some event or snapshot in the event stream), bypass @CommandHandler

Limitations

This approach cannot be used to make the following CHs idempotent:

CH constructor
external CH

I'm not entirely happy about this, but I do believe the benefits largely outweigh these limitations. For CH constructors, this problem can be mitigated by including a @TargetAggregateIdentifier in the command message and selecting an event store implementation that enforces (aggregate_identifier, sequence_number) uniqueness (e.g. when using axon server, duplicate creation commands result in OUT_OF_RANGE: [AXONIQ-2000] Invalid sequence number 0 for aggregate a1, expected 1).

Final thoughts

There is definitely room for improvement. Some current questions/concerns:

AggregateSnapshotter stores ALL previous correlationIds (for idempotent aggregates), which presumably makes snapshot size unbounded. Is such naive implementation sufficient, or is a more complex strategy (retention policy) required?
Is return null; the most appropriate way of handling duplicate commands? Would some kind of exception make more sense?
The value of the idempotent attribute of the @Aggregate annotation is not always taken into account. This value should be available to every piece of code that instantiates an AnnotatedAggregate. AggregateConfigurer passes this value to EventSourcingRepository and as such, every AnnotatedAggregate returned by EventSourcingRepository will have the correct value for its idempotent property. There are however several other places where an AnnotatedAggregate is sourced (from DomainEventStream or state) that currently do not have access to this newly added property, e.g.:
- AggregateSnapshotter does not have access to this value, and resorts to providing the method shouldAddCorrelationIds(), enabling SpringAggregateSnapshotter to mitigate this problem
- CommandHandlerInvoker.DisruptorRepository does not have access to this value, and as such DisruptorCommandBus will not filter out redeliveries
- GenericJpaRepository does not have access to this value. On top of that, additional changes would be required to take this value into account, as GenericJpaRepository currently does not store (and hence cannot restore) command ids.
The default value of idempotent should be globally configurable, eliminating the need for idempotent = true in every @Aggregate annotation

I believe safe retries would greatly improve command failure handling, and I hope they will be available some day. I'd be happy to improve upon this draft, but would like to hear your thoughts first.

CLAassistant · 2023-01-09T13:34:07Z

All committers have signed the CLA.

smcvb · 2023-01-12T13:11:13Z

First and foremost, thanks for making an effort to provide a pull request, @jvanheesch!

I have taken some time to go over this pull request and its description.
Based on that, I think this seems useful for Axon Framework to provide.

However, as it stands, I do not have sufficient time to do a thorough enough review of the changes you've made.
Hence, I will come back to this pull request once I'm able too.
In the meantime, I have added other peers from the team to provide their two cents.

smcvb · 2023-04-21T10:33:31Z

You've gone the stretch to construct a pull request with a very thorough explanation (which is very helpful).
Although on the 12th of January, I pointed out I was rather busy, I must admit I forgot about this PR a little...
I think that's unfair toward your efforts, so my apologies for keeping you waiting, @jvanheesch.

However, I've just spent some time on your implementation, and I have a recommendation that may solve all predicaments.
If you would set the idempotency flag in the AggregateModel, you should be able to access it from the Snapshotter and DisruptorRepository.
Furthermore, I think you'll be able to maintain the current support by utilizing the AggregateModel.

The only misser left with that suggestion would be the GenericJpaRepository.
However, as you've already described, this will take some additional effort anyhow.
Honestly, I actually think it's fine to leave it as an afterthought, as it'll minimize the scope of this PR, simplifying the review process.

Aside from AggregateModel suggestion, I've got quite some more detailed pointers like documentation, naming, and placement of certain operations.
Conceptually speaking, the use of the messageIdentifier of the CommandMessage is, in my opinion, the most straightforward way to implement command handling idempotency.
With some adjustments to the API (I'm thinking of the Aggregate interface implemented by the AnnotatedAggregate), I think we could even allow it for external command handlers.

Now, let me conclude a little:

Again, my apologies for coming back to this so late. I do hope it didn't obstruct you in any fashion.
If you do have the time, would you mind trying out the AggregateModel suggestion? I'm curious whether there are any roadblocks to that idea that I didn't think of.
I'm going to gauge within the Axon Framework team what the rest thinks about this approach. Anticipate either responses from the team members or a combined response from me in the near future.

jvanheesch added 4 commits January 5, 2023 23:05

add property AnnotatedAggregate.idempotent

b1362a9

make AnnotatedAggregate source correlationIds if idempotent

1c29e5f

add ability to preserve correlationIds when using snapshots

69e9d6c

make AnnotatedAggregate ignore duplicate command message if idempotent

8de17c2

jvanheesch changed the title ~~PR idempotency~~ Enable safe retry of aggregate commands in the event of (perceived) failures Jan 9, 2023

smcvb requested review from a team, gklijs, CodeDrivenMitch, smcvb and abuijze and removed request for a team January 12, 2023 13:08

smcvb assigned jvanheesch Jan 12, 2023

smcvb added Type: Feature Use to signal an issue is completely new to the project. Priority 3: Could Low priority. Issues that are nice to have but have a straightforward workaround. Status: In Progress Use to signal this issue is actively worked on. labels Jan 12, 2023

abuijze marked this pull request as draft March 29, 2024 07:40

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Enable safe retry of aggregate commands in the event of (perceived) failures #2554

Enable safe retry of aggregate commands in the event of (perceived) failures #2554

jvanheesch commented Jan 9, 2023

CLAassistant commented Jan 9, 2023 •

edited

smcvb commented Jan 12, 2023

smcvb commented Apr 21, 2023

Enable safe retry of aggregate commands in the event of (perceived) failures #2554

Are you sure you want to change the base?

Enable safe retry of aggregate commands in the event of (perceived) failures #2554

Conversation

jvanheesch commented Jan 9, 2023

Proposal

Context

Approach

Limitations

Final thoughts

CLAassistant commented Jan 9, 2023 • edited

smcvb commented Jan 12, 2023

smcvb commented Apr 21, 2023

CLAassistant commented Jan 9, 2023 •

edited