Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Enable safe retry of aggregate commands in the event of (perceived) failures #2554

Draft
wants to merge 4 commits into
base: master
Choose a base branch
from

Conversation

jvanheesch
Copy link

Proposal

This is a draft proposal. It aims to enable safe retry of aggregate commands in the event of (perceived) failures, without changing current default behavior (i.e. opt-in).

Context

  • Command failures are often transient (e.g. service down -> NoHandlerForCommandException). As such, it makes sense to introduce the concept of a RetryScheduler.
  • Problem: retrying commands is not always a safe operation, as the initial "failed" command may already have produced side effects, in which case a redelivery might wrongfully produce those side effects a second time (this is particularly common in case of COMMAND_TIMEOUT)
  • To enable safe retries, the command handler (CH) must be idempotent
  • Message handlers can be made idempotent at the infrastructure layer by filtering out duplicate messages

The problem and the proposal are demonstrated in the following repository: https://github.com/jvanheesch/axon-command-idempotency

Approach

  1. When a command results in an aggregate event, store the command id (Message.getIdentifier()) in the event
  2. When sourcing an aggregate from events (+ snapshot), keep track of command ids
  3. Upon receiving a command: if the command has already resulted in an aggregate event (i.e. command id is stored in some event or snapshot in the event stream), bypass @CommandHandler

Limitations

This approach cannot be used to make the following CHs idempotent:

  • CH constructor
  • external CH

I'm not entirely happy about this, but I do believe the benefits largely outweigh these limitations. For CH constructors, this problem can be mitigated by including a @TargetAggregateIdentifier in the command message and selecting an event store implementation that enforces (aggregate_identifier, sequence_number) uniqueness (e.g. when using axon server, duplicate creation commands result in OUT_OF_RANGE: [AXONIQ-2000] Invalid sequence number 0 for aggregate a1, expected 1).

Final thoughts

There is definitely room for improvement. Some current questions/concerns:

  • AggregateSnapshotter stores ALL previous correlationIds (for idempotent aggregates), which presumably makes snapshot size unbounded. Is such naive implementation sufficient, or is a more complex strategy (retention policy) required?
  • Is return null; the most appropriate way of handling duplicate commands? Would some kind of exception make more sense?
  • The value of the idempotent attribute of the @Aggregate annotation is not always taken into account. This value should be available to every piece of code that instantiates an AnnotatedAggregate. AggregateConfigurer passes this value to EventSourcingRepository and as such, every AnnotatedAggregate returned by EventSourcingRepository will have the correct value for its idempotent property. There are however several other places where an AnnotatedAggregate is sourced (from DomainEventStream or state) that currently do not have access to this newly added property, e.g.:
    • AggregateSnapshotter does not have access to this value, and resorts to providing the method shouldAddCorrelationIds(), enabling SpringAggregateSnapshotter to mitigate this problem
    • CommandHandlerInvoker.DisruptorRepository does not have access to this value, and as such DisruptorCommandBus will not filter out redeliveries
    • GenericJpaRepository does not have access to this value. On top of that, additional changes would be required to take this value into account, as GenericJpaRepository currently does not store (and hence cannot restore) command ids.
  • The default value of idempotent should be globally configurable, eliminating the need for idempotent = true in every @Aggregate annotation

I believe safe retries would greatly improve command failure handling, and I hope they will be available some day. I'd be happy to improve upon this draft, but would like to hear your thoughts first.

@CLAassistant
Copy link

CLAassistant commented Jan 9, 2023

CLA assistant check
All committers have signed the CLA.

@jvanheesch jvanheesch changed the title PR idempotency Enable safe retry of aggregate commands in the event of (perceived) failures Jan 9, 2023
@smcvb smcvb requested review from a team, gklijs, CodeDrivenMitch, smcvb and abuijze and removed request for a team January 12, 2023 13:08
@smcvb smcvb added Type: Feature Use to signal an issue is completely new to the project. Priority 3: Could Low priority. Issues that are nice to have but have a straightforward workaround. Status: In Progress Use to signal this issue is actively worked on. labels Jan 12, 2023
@smcvb
Copy link
Member

smcvb commented Jan 12, 2023

First and foremost, thanks for making an effort to provide a pull request, @jvanheesch!

I have taken some time to go over this pull request and its description.
Based on that, I think this seems useful for Axon Framework to provide.

However, as it stands, I do not have sufficient time to do a thorough enough review of the changes you've made.
Hence, I will come back to this pull request once I'm able too.
In the meantime, I have added other peers from the team to provide their two cents.

@smcvb
Copy link
Member

smcvb commented Apr 21, 2023

You've gone the stretch to construct a pull request with a very thorough explanation (which is very helpful).
Although on the 12th of January, I pointed out I was rather busy, I must admit I forgot about this PR a little...
I think that's unfair toward your efforts, so my apologies for keeping you waiting, @jvanheesch.

However, I've just spent some time on your implementation, and I have a recommendation that may solve all predicaments.
If you would set the idempotency flag in the AggregateModel, you should be able to access it from the Snapshotter and DisruptorRepository.
Furthermore, I think you'll be able to maintain the current support by utilizing the AggregateModel.

The only misser left with that suggestion would be the GenericJpaRepository.
However, as you've already described, this will take some additional effort anyhow.
Honestly, I actually think it's fine to leave it as an afterthought, as it'll minimize the scope of this PR, simplifying the review process.

Aside from AggregateModel suggestion, I've got quite some more detailed pointers like documentation, naming, and placement of certain operations.
Conceptually speaking, the use of the messageIdentifier of the CommandMessage is, in my opinion, the most straightforward way to implement command handling idempotency.
With some adjustments to the API (I'm thinking of the Aggregate interface implemented by the AnnotatedAggregate), I think we could even allow it for external command handlers.

Now, let me conclude a little:

  1. Again, my apologies for coming back to this so late. I do hope it didn't obstruct you in any fashion.
  2. If you do have the time, would you mind trying out the AggregateModel suggestion? I'm curious whether there are any roadblocks to that idea that I didn't think of.
  3. I'm going to gauge within the Axon Framework team what the rest thinks about this approach. Anticipate either responses from the team members or a combined response from me in the near future.

@abuijze abuijze marked this pull request as draft March 29, 2024 07:40
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Priority 3: Could Low priority. Issues that are nice to have but have a straightforward workaround. Status: In Progress Use to signal this issue is actively worked on. Type: Feature Use to signal an issue is completely new to the project.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants