Aggregate snapshotter stops working #2770

unisteven · 2023-06-27T09:24:16Z

Basic information

Axon Framework version: 4.5.8
JDK version: 17
Complete executable reproducer if available (e.g. GitHub Repo): Not available, code is not open source.

Steps to reproduce

I cannot share our codebase since it is closed source, but here's the scenario:

We have a microservices setup with Spring Boot on the Axon Framework. We have been running this in production for about a year and we have seen this setup working throughout that period.

In our code base there is a big aggregate that handles a lot of events (100K+), and this takes a while to event-source. To speed this up, we added the default aggregate snapshotter:

@Bean
public SpringAggregateSnapshotterFactoryBean snapshotter() {
    var springAggregateSnapshotterFactoryBean = new SpringAggregateSnapshotterFactoryBean();
    // Snapshots are created asynchronously on a single dedicated thread.
    springAggregateSnapshotterFactoryBean.setExecutor(Executors.newSingleThreadExecutor());
    return springAggregateSnapshotterFactoryBean;
}

@Bean
public SnapshotTriggerDefinition aggregateSnapshotTrigger(AggregateSnapshotter snapshotter) {
    // The second argument is an event-count threshold (10 in our case), despite the variable's name.
    return new EventCountSnapshotTriggerDefinition(snapshotter, snapshotThresholdInSeconds);
}

Expected behaviour

We would expect the aggregate snapshotter to create a snapshot (every 10 events in our case) and use it to rebuild the aggregate from the state it was in when the snapshot was taken.

Actual behaviour

This works fine for a while, but once our database fills up with events (100K+), it stops working entirely and throws the following error:

An attempt to create and store a snapshot resulted in an exception. Exception summary: An event for aggregate [SINGLETON] at sequence [1791] was already inserted

What we have tried

We have tried deleting the snapshot from the database, but it still gives the same exception.
When we check whether a record with the same sequence number exists in our database, no results are found.

Once we reset the entire database, it starts working again (for a while).

What could be causing such an error, and what would be a possible solution?


smcvb · 2023-06-27T09:54:53Z

Cause

I think this happens because you have a massive aggregate, @unisteven.

The fact it's big very likely means it handles numerous commands, potentially at the same time.
Due to this, there's a window in which several snapshots can be created concurrently.

This wouldn't impact snapshot creation necessarily, as they should be unique based on the sequence.
However, purely event sourcing an aggregate will already trigger snapshot creation.
So, if a command is handled and the aggregate is sourced, but no events are published, you would end up with a snapshot that will go to the same spot.
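To make the suspected collision concrete, here is a minimal sketch, not Axon Framework code. It assumes the snapshot store enforces uniqueness on the combination of aggregate identifier and sequence number; the class `SnapshotCollisionSketch` and its `store` method are hypothetical stand-ins for that table.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/**
 * Hypothetical in-memory stand-in for a snapshot table with a unique
 * (aggregateId, sequence) key. This is an assumption for illustration only.
 */
public class SnapshotCollisionSketch {

    private final Map<String, String> rows = new ConcurrentHashMap<>();

    /** Returns true if the snapshot was stored, false if that key was already taken. */
    public boolean store(String aggregateId, long sequence, String payload) {
        return rows.putIfAbsent(aggregateId + "#" + sequence, payload) == null;
    }

    public static void main(String[] args) {
        SnapshotCollisionSketch store = new SnapshotCollisionSketch();

        // Sourcing the aggregate at sequence 1791 triggers a first snapshot.
        boolean first = store.store("SINGLETON", 1791, "state");

        // A later load that publishes no new events snapshots the same sequence again.
        boolean second = store.store("SINGLETON", 1791, "state");

        System.out.println(first + " " + second); // prints "true false"
    }
}
```

In a relational store the second attempt would surface as a unique-constraint violation rather than a `false` return value, which would match the shape of the reported exception.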

How this idea correlates with "...once our database gets filled with loads of events (100K+) it will stop working entirely..." isn't clear to me though.
Furthermore, that sentence makes it sound like you periodically clear out a production event store, which isn't recommended to begin with.

Solution

As I am working on hunches here, I'd be hard-pressed to give a solution.
Being able to see the implementation, redacted as necessary in your scenario, would help of course.
Otherwise, we can do a back-and-forth here.
Although that may take longer.

One thought that crosses my mind is introducing a form of snapshot warm-up service (as described in this issue).
Through that, you can construct the snapshot outside of the main application loop when you need to do so from scratch.
In doing so, you wouldn't impede production too much.

Other pointers

I do have a couple of other recommendations concerning the scenario you describe:

A snapshot trigger definition that creates a snapshot every 10 events is frequent. Typical values for the count are 100 to 150. Most databases should be more than fine retrieving one snapshot and anywhere between 0 and 150 events efficiently enough. Furthermore, snapshotting also impedes your setup, so not doing it too frequently helps your application in general.
An aggregate called "SINGLETON" raises alarm bells, to be honest. The name suggests you're dealing with one big aggregate to rule a certain portion of the entire application. Or, a system-aggregate as I'd like to call it. Such a system aggregate may form a bottleneck for the application, as commands need to wait before they can interact with the "SINGLETON." If at all possible, I would steer away from such a design.
An aggregate with 100k+ events doesn't strike me as a massive predicament. Yet. I have seen systems where the aggregate grew to 1.5 million events, which definitely does cause issues when it needs to be reconstructed. The aforementioned warm-up service becomes a necessity for such cases. A more streamlined approach is to have an end to any aggregate instance. So, if the referenced SINGLETON aggregate is going to live forever, I would recommend adjusting the design to be able to switch to, for example, a new instance eventually.
Axon Framework is already on release 4.8.0 (as of last week). I would recommend upgrading to a more recent version to (1) benefit from the bug fixes and (2) benefit from the new features.
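The trade-off behind the threshold recommendation can be put in numbers. The sketch below is a simplified model, assuming a snapshot is written exactly once every `threshold` events; the class and method names are hypothetical and do not come from Axon Framework.

```java
/**
 * Simplified model of an event-count snapshot trigger. It assumes one
 * snapshot per `threshold` events, which is an approximation for illustration.
 */
public class SnapshotThresholdSketch {

    /** Number of snapshots written over the aggregate's lifetime so far. */
    public static long snapshotsWritten(long totalEvents, long threshold) {
        return totalEvents / threshold;
    }

    /** Events that must be replayed on top of the latest snapshot when loading. */
    public static long eventsToReplay(long totalEvents, long threshold) {
        return totalEvents % threshold;
    }

    public static void main(String[] args) {
        long totalEvents = 100_000;
        for (long threshold : new long[] {10, 100, 150}) {
            System.out.printf("threshold=%d snapshots=%d worstCaseReplay=%d%n",
                    threshold,
                    snapshotsWritten(totalEvents, threshold),
                    threshold - 1); // at most threshold - 1 events sit after the last snapshot
        }
    }
}
```

Under this model, 100,000 events produce 10,000 snapshot writes at a threshold of 10 but only 1,000 at a threshold of 100, while the worst-case replay on load grows from 9 to 99 events.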

Issue management

I am not overly certain this is a bug in Axon Framework.
As such, I've replaced the label with "Question."
If I am proven wrong, the label will obviously be reverted.

unisteven added the "Type: Bug" label Jun 27, 2023

smcvb assigned unisteven Jun 27, 2023
