Aggregate snapshotter stops working #2770

Open
unisteven opened this issue Jun 27, 2023 · 1 comment
Labels
Status: Information Required Use to signal this issue is waiting for information to be provided in the issue's description. Type: Question Use to signal the issue is a question of how the project works and thus does not require development

Comments

@unisteven

Basic information

  • Axon Framework version: 4.5.8
  • JDK version: 17
  • Complete executable reproducer if available (e.g. GitHub Repo): Not available, code is not open source.

Steps to reproduce

I cannot share our codebase since it is closed source, but here's the scenario:

We have a microservices setup with Spring Boot on the Axon Framework. We have been running this in production for about a year and we have seen this setup working throughout that period.

In our code base there is one big aggregate that handles a lot of events (100K+), which takes a while to event-source. To speed this up, we added the default aggregate snapshotter:

@Bean
public SpringAggregateSnapshotterFactoryBean snapshotter() {
    var springAggregateSnapshotterFactoryBean = new SpringAggregateSnapshotterFactoryBean();
    springAggregateSnapshotterFactoryBean.setExecutor(Executors.newSingleThreadExecutor());
    return springAggregateSnapshotterFactoryBean;
}

@Bean
public SnapshotTriggerDefinition aggregateSnapshotTrigger(AggregateSnapshotter snapshotter) {
    // Despite its name, snapshotThresholdInSeconds is used as an event count (10 in our case).
    return new EventCountSnapshotTriggerDefinition(snapshotter, snapshotThresholdInSeconds);
}

Expected behaviour

We would expect the aggregate snapshotter to create a snapshot (every 10 events in our case) and to use the latest snapshot to rebuild the aggregate from the state it was in when that snapshot was taken.

Actual behaviour

This works fine for a while, but once our database gets filled with loads of events (100K+) it will stop working entirely and throw the following error:

An attempt to create and store a snapshot resulted in an exception. Exception summary: An event for aggregate [SINGLETON] at sequence [1791] was already inserted

What we have tried

  • We deleted the snapshot from the database, but the same exception is still thrown.
  • When we query our database for a record with the same sequence number, no results are found.

Once we reset the entire database it will start working again (for a while).

What could be causing such an error, and what would be a possible solution?

unisteven added the Type: Bug label Jun 27, 2023
@smcvb
Member

smcvb commented Jun 27, 2023

Cause

I think this happens because you have a massive aggregate, @unisteven.

The fact it's big very likely means it handles numerous commands, potentially at the same time.
Due to this, there's a window of opportunity in which several snapshots are created concurrently.

This wouldn't impact snapshot creation necessarily, as they should be unique based on the sequence.
However, purely event sourcing an aggregate will already trigger snapshot creation.
So, if a command is handled and the aggregate is sourced, but no events are published, you would end up with a snapshot that targets the same sequence number as an earlier one.
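
To make that hunch concrete, here is a minimal toy model in plain Java. It is not Axon Framework code: `ToySnapshotStore` and its `store` method are hypothetical names, and the unique key on aggregate identifier plus sequence number is an assumption about how such a store would reject duplicates.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical stand-in for a snapshot table with a unique
// (aggregateId, sequence) key. Not part of Axon Framework.
final class ToySnapshotStore {

    private final Map<String, String> rows = new ConcurrentHashMap<>();

    // Returns true if the snapshot was stored, false if a row with the
    // same aggregate identifier and sequence number already exists.
    boolean store(String aggregateId, long sequence, String payload) {
        String key = aggregateId + "#" + sequence;
        return rows.putIfAbsent(key, payload) == null;
    }
}
```

In this model, two snapshot attempts made while the aggregate sits at the same sequence number collide: the first `store("SINGLETON", 1791, ...)` succeeds and the second is rejected, which is the kind of conflict the exception message hints at.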

How this idea correlates with "...once our database gets filled with loads of events (100K+) it will stop working entirely..." isn't clear to me though.
Furthermore, that sentence makes it sound like you periodically clear out a production event store, which isn't recommended to begin with.

Solution

As I am working on hunches here, I'd be hard-pressed to give a solution.
Being able to see the implementation, redacted as necessary in your scenario, would help of course.
Otherwise, we can do a back-and-forth here.
Although that may take longer.

One thought that crosses my mind is introducing a form of snapshot warm-up service (as described in this issue).
Through that, you can construct the snapshot outside of the main application loop when you need to do so from scratch.
In doing so you wouldn't impede production too much.

Other pointers

I do have a couple of other recommendations concerning the scenario you describe:

  1. A snapshot trigger definition that creates a snapshot every 10 events is frequent. Typical values for the count are 100 to 150. Most databases should be more than fine retrieving one snapshot and anywhere between 0 and 150 events efficiently enough. Furthermore, snapshotting also impedes your setup, so not doing it too frequently helps your application in general.
  2. An aggregate called "SINGLETON" raises alarm bells, to be honest. The name suggests you're dealing with one big aggregate to rule a certain portion of the entire application. Or, a system-aggregate as I'd like to call it. Such a system aggregate may form a bottleneck for the application, as commands need to wait before they can interact with the "SINGLETON." If at all possible, I would steer away from such a design.
  3. An aggregate with 100k+ events doesn't strike me as a massive predicament. Yet. I have seen systems where the aggregate grew to 1.5 million events, which definitely causes issues when it needs to be reconstructed. The aforementioned warm-up service becomes a necessity for such cases. A more streamlined approach is to have an end to any aggregate instance. So, if the referenced SINGLETON aggregate is going to live forever, I would recommend adjusting the design to be able to switch to, for example, a new instance eventually.
  4. Axon Framework is already on release 4.8.0 (as of last week). I would recommend upgrading to a more recent version to (1) benefit from the bug fixes and (2) benefit from the new features.
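
To put rough numbers on the first pointer, the trade-off can be sketched in plain Java. This is a back-of-the-envelope model rather than Axon Framework code: the class and method names are hypothetical, and it assumes exactly one snapshot is written each time the event count reaches a multiple of the threshold.

```java
// Hypothetical helper for comparing snapshot thresholds.
// Not part of Axon Framework.
final class SnapshotThresholdMath {

    // Snapshots written while appending totalEvents events,
    // assuming one snapshot per `threshold` events.
    static long snapshotsWritten(long totalEvents, int threshold) {
        return totalEvents / threshold;
    }

    // Events that must be replayed on top of the latest snapshot
    // when loading the aggregate; always below `threshold`.
    static long eventsReplayedAfterLatestSnapshot(long totalEvents, int threshold) {
        return totalEvents % threshold;
    }
}
```

Under these assumptions, 100,000 events with a threshold of 10 produce 10,000 snapshot writes, while a threshold of 150 produces 666 writes and leaves at most 149 events to replay per load.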

Issue management

I am not overly certain this is a bug in Axon Framework.
As such, I've replaced the label with "Question."
If I am proven wrong, the label will obviously be reverted.

smcvb added the Status: Information Required and Type: Question labels and removed the Type: Bug label Jun 27, 2023