Event Processors Hanging #3015

Open
alelkhoury opened this issue Apr 22, 2024 · 9 comments
Labels: Status: Information Required, Type: Bug

Comments

alelkhoury commented Apr 22, 2024

We've been trying to upgrade a service from Spring Boot v2.7, JDK 11, and Axon v4.5.15 to Spring Boot 3.2, JDK 17, and Axon 4.9.3.

We had a problem with XStream since it's not directly supported with JDK 17, and we didn't want to use JVM flags as a workaround. We decided to archive the old events in our system (~6M events), start with a new PostgreSQL database, use Jackson as a serializer, and keep our read model.

We have encountered a situation that causes the Axon GapAwareTrackingToken to fail to advance and process new events. It’s worth noting that we did not face this problem before the upgrade.

I've already asked this question on Axon Discuss: https://discuss.axoniq.io/t/event-processors-hanging-in-axon-framework-v4-9-0/5453

Basic information

  • Axon Framework version: v4.9.3/v4.9.0/v4.7.6
  • JDK version: 17
  • Event Store: PostgreSQL v13.1
  • Configuration: No special configuration, we're using the default configs
  • Complete executable reproducer if available (e.g. GitHub Repo): NA

Expected behavior

The event processors should advance as usual and process new events the same way they were working before the upgrade.

Actual behavior

The application works normally for about a week, until we reach ~90,000 events in the event store; then the application freezes and stops processing any events. However, commands and queries are still dispatched and handled normally.

We can see that the event is persisted correctly in the event store, but it's not handled.

We enabled debug-level logging for Axon Framework, but we can't find any error messages.

The application starts processing events again when we restart it, but it will freeze again after a couple of hours.

We tried to downgrade to v4.9.0 then to v4.7.6 and started with a new database with each version, but we still faced the same problem. Eventually, we had to revert the upgrade.

Steps to reproduce

  • Use JDK 17, Spring Boot 3.2, and Axon v4.9.3
  • Use PostgreSQL as an event store
  • Use JacksonSerializer
  • Deploy 4 instances of the service
  • Try to reach ~90,000 events in a short period

Additional Info

It's true that we have 4 instances of the application, but when we look at the token entry table, we can see that the workload is not distributed across all instances: only one event stream is split into 6 segments, with event processing happening on 3 instances of the application, while all the other processors are owned by the same instance.

[screenshot of the token entry table]

Also, here's one tracking token data:

{"index":120277,"gaps":[117157,117158,117265,117366,117367,117372,117406,117645,117697,117725,117762,117808,117880,117884,117930,117957,117964,117996,118000,118210,118218,118224,118488,118490,118610,118612,118615,118734,118737,118854,118857,118858,119068,119070,119133,119152,119153,119155,119183,119195,119196,119236,119239,119258,119276,119279,119326,119348,119363,119379,119380,119415,119424,119448,119450,119458,119460,119469,119685,119834,119882,119913,119914,119933,119978,119979,120033,120135,120136,120137,120138,120139,120140,120141,120142,120143,120144,120145,120146,120147,120148,120149,120150,120151,120190,120191,120192,120193,120194,120195,120196,120197,120198,120199,120200,120201,120232,120233,120234,120235,120236,120237,120238,120239,120240,120241,120242,120243,120244,120245,120246,120247,120248,120249,120250,120251]}

PostgreSQL TOAST Problem

We run vacuumlo every day to remove any orphaned large objects in the database.

PostgreSQL Metrics

We noticed that one of the queries sometimes takes a very long time (max execution time ~30 minutes), and we suspect it might be causing the issue.

[screenshot of PostgreSQL query duration metrics]

Also, we noticed that the update token entry query is executed ~1.5M times in less than a day; is this normal behavior?

[screenshot of query execution counts]

And here are the average query durations:

[screenshot of average query durations]

Questions

  • What can cause the stoppage of event processing, and why does the application work after restarting it?
  • Is it a bug in the GapAwareTrackingToken?
  • Could the issue be related to using PostgreSQL as an event store? If so, why wasn't it occurring with the older version?
  • Is it an infra problem? If so, why wasn't it occurring with the older version?
  • Are there any additional configurations we should consider from our side?
  • How can we debug these types of issues effectively?
  • What do the gaps represent in this context? Is there any documentation about this topic?

Let me know if you need any additional info.

Thank you

@alelkhoury added the Type: Bug label on Apr 22, 2024

alelkhoury commented Apr 23, 2024

After downgrading again to Axon v4.5.15, we observed a significant improvement in query time consumption when analyzing PostgreSQL statistics for the same period. For instance, for the select token entry query, the maximum duration decreased from ~30 minutes to ~5 minutes, the number of executions increased from 714 to ~900,000, and the average duration decreased from 13 seconds to 15 milliseconds.

[screenshot of PostgreSQL query statistics after the downgrade]

With Spring Boot 3.2 and Axon v4.9.x, we noticed that queries are using "for no key update" instead of "for update" because of a change introduced in Hibernate v6.x: hibernate/hibernate-orm@de21820
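For illustration (a simplified sketch, not the exact SQL Hibernate emits; the column names assume Axon's default token_entry schema), the change means a pessimistic write lock on the token entry is now rendered differently on PostgreSQL:

    -- Hibernate 5.x typically rendered the pessimistic write lock as:
    SELECT * FROM token_entry WHERE processor_name = ? AND segment = ? FOR UPDATE;

    -- Hibernate 6.x renders the same lock as:
    SELECT * FROM token_entry WHERE processor_name = ? AND segment = ? FOR NO KEY UPDATE;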

Has Axon Framework considered this change, and could it be causing our issue?

smcvb commented Apr 25, 2024

Main response and request

First and foremost, I want to commend you on the sheer amount of information you're sharing with us, @alelkhoury.
Being as specific as you are goes a long way toward helping us deduce what the predicament is.

Checking your description, I briefly paused to check whether you might accidentally be using Axon BOM 4.9.3 instead of Axon Framework 4.9.3.
However, reading your Discuss thread, it became clear you're really using 4.9.3 of Axon Framework.
That rules out that your problem would be fixed by issue #2977.

Nonetheless, I assume the chances are high the predicament might've been caused by a memory optimization we've done on GapAwareTrackingTokens in #2936, introduced in Axon Framework 4.9.1.
If this is the case, you ideally shouldn't see the issues you're facing when using Axon Framework 4.9.0 or below.
What isn't entirely clear to me is whether you've validated if the problem occurs with other versions of Axon Framework.
Hence, I would like to ask you if you can update your application to 4.9.0 of Axon Framework to check if the issue persists.

If it does, I think it would be extremely helpful if you could share some thread dumps with us.
As you describe, command and query handling still proceeds as expected.
Furthermore, there's no (debug) logging at all stating anything's amiss.
Hence, I anticipate the issue lies somewhere within the Event Processing threads getting stuck for whatever reason.
With a thread dump, we should be able to better deduce where it gets stuck.

Questions

What can cause the stoppage of event processing, and why does the application work after restarting it?

I anticipate your threads hang in some process. Hence the request for the thread dump, to validate whether that's the case.

Is it a bug in the GapAwareTrackingToken?

Just might be. But to be frank, I cannot be sure at this moment in time.

Could the issue be related to using PostgreSQL as an event store? If so, why wasn't it occurring with the older version?

This question can be answered with both a yes and a no.
Yes, because PostgreSQL is an RDBMS, and an RDBMS is not optimized to be an off-the-shelf Event Store (that is what you would use EventStore or Axon Server for).
No, because Axon Framework, although it works better with a dedicated Event Store, should work with an RDBMS as well.

Is it an infra problem? If so, why wasn't it occurring with the older version?

I am guessing this is not the issue at this stage.

Are there any additional configurations we should consider from our side?

I would very much like you to share the full configuration of your Event Processors.
So, please share which Event Processor(s) you're using, their segment counts, thread counts, batch sizes, and any other parameters you may have set.
Any of those may help us better understand the situation you're in.

How can we debug these types of issues effectively?

I think your best bet for getting some form of stack trace of the hanging Event Processors is the aforementioned thread dumps.

What do the gaps represent in this context? Is there any documentation about this topic?

The gaps should represent the possibility that the commit and insert order of events differ, causing the globalIndex to contain gaps.

Another scenario that may cause gaps is when the sequence generator for the domain_event_entry table is reused by other tables maintained in the same database.
In that case, the sequence used to set the globalIndex has gaps for the domain_event_entry table, as other tables use, and thus increment, the same sequence generator.

Thirdly, it may be caused by you/your team actually deleting events from the domain_event_entry table.
Thus, they're man-made gaps.

Or (lastly), what I've also noticed recently with the shift from Javax to Jakarta is that Hibernate's default increment for sequences is 50.
If that is what's happening (which seems to be the case judging by the end of the gaps collection in the token you're sharing, though sadly not by the start of it), then it would be best to adjust the sequence generator there.
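If you want to verify this on the database side, a quick check could look like this (a sketch; the sequence and table names are assumptions based on Axon's default schema and may differ in your setup):

    -- List sequences and their increments; an increment_by of 50 points at Hibernate's default allocation size.
    SELECT schemaname, sequencename, increment_by
    FROM pg_sequences
    WHERE sequencename ILIKE '%domain_event%' OR sequencename ILIKE '%hibernate%';

    -- Compare with the most recent global indices of stored events to see whether they jump in steps of ~50.
    SELECT global_index FROM domain_event_entry ORDER BY global_index DESC LIMIT 10;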

Side notes

With the above said, I want to react to a couple of pointers you've made throughout your description.

Segment Distribution

It's true that we have 4 instances of the application, but when we look at the token entry table, we can see that the workload is not distributed across all instances: only one event stream is split into 6 segments, with event processing happening on 3 instances of the application, while all the other processors are owned by the same instance.

The segments enable distribution, but they do not distribute themselves over the Event Processor instances.
To achieve that, you will need to balance the segments over the available instances.
This is best achieved by a separate tool that is aware of your Axon Framework applications.

A free-to-use option would be AxonIQ Console: an Axon Framework management platform we released at the start of this year.
Its Event Processors page allows you to trigger all the operations you can perform on an Event Processor, and it knows how to distribute these among the nodes.

If you prefer not to use AxonIQ Console, you can use the APIs on the StreamingEventProcessor directly to release segments.
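As a rough sketch of that second approach (assuming a Spring Boot setup where Axon's EventProcessingConfiguration can be injected; the processor name and segment id are placeholders to fill in):

    import org.axonframework.config.EventProcessingConfiguration;
    import org.axonframework.eventhandling.StreamingEventProcessor;
    import org.springframework.stereotype.Component;

    @Component
    public class SegmentReleaser {

        private final EventProcessingConfiguration eventProcessing;

        public SegmentReleaser(EventProcessingConfiguration eventProcessing) {
            this.eventProcessing = eventProcessing;
        }

        // Releases the given segment of the named processor so another instance can claim it.
        public void release(String processorName, int segmentId) {
            eventProcessing.eventProcessor(processorName, StreamingEventProcessor.class)
                           .ifPresent(processor -> processor.releaseSegment(segmentId));
        }
    }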

PostgreSQL

We run vacuumlo every day to remove any orphaned large objects in the database.

This is an extremely smart move to make, and an extremely sad side effect of the OID columns PostgreSQL creates for columns of type LOB/CLOB/BLOB.
Note that you can also remove this default OID column for large objects by disabling the TOAST logic entirely there.

This blog we've written explains how you can achieve that.
Note that the blog's written based on Hibernate 5.
So, as you're moving to Hibernate 6, some of the import statements should be adjusted.
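As a quick sanity check (a sketch; table and column names assume Axon's default JPA schema), you can verify whether the payload columns are indeed backed by PostgreSQL large objects:

    -- Columns reported with data_type 'oid' are stored as large objects,
    -- which is what makes the vacuumlo housekeeping necessary.
    SELECT table_name, column_name, data_type
    FROM information_schema.columns
    WHERE table_name IN ('domain_event_entry', 'snapshot_event_entry', 'token_entry')
      AND column_name IN ('payload', 'meta_data', 'token');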

Long running queries

We noticed that one of the queries sometimes takes a very long time (max execution time ~30 minutes), and we suspect it might be causing the issue.

This is a problematic side effect of the requirement to keep track of gaps in the TrackingToken.
Any RDBMS solution has a window in which the commit and insert order may differ.
Because of this, an event with a higher global sequence number may be inserted while one with a lower global sequence number is still pending.
Since the global sequence number determines the ordering for Event Processing, you thus end up with a gap caused by the RDBMS.

This predicament is one of the many reasons why we've constructed Axon Server.
So, if your company allows it, know that all the described issues around the GapAwareTrackingToken are non-existent if you move to Axon Server.

It simply cannot have any gaps in the events, thanks to its optimization for event storage.

@smcvb added the Status: Information Required label on Apr 25, 2024

abuijze commented Apr 25, 2024

Hi, just a few checks to understand what the exact difference is between the old situation and the new.

When you say updated to Axon 4.9.3, is that the version of the BOM that you use, or the version of Axon Framework core modules?

Do you have any configuration for the processors, or is it all left to defaults?

alelkhoury commented

Hello Steven,

Thank you for your detailed response.

Main response and request

Nonetheless, I assume the chances are high the predicament might've been caused by a memory optimization we've done on GapAwareTrackingTokens in #2936, introduced in Axon Framework 4.9.1. If this is the case, you ideally shouldn't see the issues you're facing when using Axon Framework 4.9.0 or below. What isn't entirely clear to me, is whether you've validated if the problem occurs with other versions of Axon Framework. Hence, I would like to ask you if you can update your application to 4.9.0 of Axon Framework to check if the issue persists.

We're using the Axon Spring Boot Starter dependency. Initially, we upgraded to v4.9.3, which caused this issue. Then, we downgraded to v4.9.0 and started with a new database, only to find that we faced the same issue again. Subsequently, we downgraded to 4.7.6 and started with a new database again, but we still encountered the same issue.

If it does, I think it would be extremely helpful if you could share some thread dumps with us. As you describe, command and query handling still proceeds as expected. Furthermore, there's no (debug) logging at all stating anything's amiss. Hence, I anticipate the issue lies somewhere within the Event Processing threads getting stuck for whatever reason. With a thread dump, we should be able to better deduce where it gets stuck.

Unfortunately, we rolled back the upgrade and didn't generate thread dumps of the service while it was not processing events.
However, we're trying to reproduce the issue in a separate environment, but it will take some time.

Once we are able to reproduce it, we will generate the thread dumps and share them with you.

I would very much like you to share the full configuration of your Event Processors.
So, please share which Event Processor(s) you're using, their segment counts, thread counts, batch sizes, and any other parameters you may have set.

All processor configurations are left to defaults, so I believe we're using the Streaming Event Processor. Only one of the event streams is split into 6 segments to be able to parallelize the load.

Here's the full configuration that we have:

    @Autowired
    public void addInterceptor1(Configurer configurer, IReadRepository readRepository) {
        configurer.eventProcessing().registerTrackingEventProcessor("Aprojection1")
                .registerHandlerInterceptor("Aprojection1", configuration -> new FilterInterceptor(readRepository));
    }

    @Autowired
    public void addInterceptor2(Configurer configurer, ARepository readRepository) {
        configurer.eventProcessing().registerTrackingEventProcessor("AProjection2")
                  .registerHandlerInterceptor("AProjection2", configuration -> new FilterInceptor(readRepository));
    }

    @Autowired
    public void addLoggingInterceptor(Configurer configurer, LoggingContextScanner contextScanner) {
        configurer.eventProcessing().registerDefaultHandlerInterceptor((configuration, s) -> new EventLoggingInterceptor(contextScanner));
    }

    @Autowired
    public void addInterceptor3(Configurer configurer, ARepository readRepository) {
        configurer.eventProcessing().registerTrackingEventProcessor("Aprojection3")
                .registerHandlerInterceptor("Aprojection3", configuration -> new interceptor(readRepository));
    }

    @Autowired
    public void configureProcessor(Configurer configurer) {
        configurer.eventProcessing().registerTrackingEventProcessor("processorName1", org.axonframework.config.Configuration::eventStore,
            configuration -> TrackingEventProcessorConfiguration.forParallelProcessing(2).andInitialSegmentsCount(6)
                                                                .andInitialTrackingToken(StreamableMessageSource::createHeadToken));
    }

This is a problematic side effect of the requirement to keep gaps in the TrackingToken

With Spring Boot 3.2 and Axon v4.9.x, we noticed that queries are using "for no key update" instead of "for update" because of a change introduced in Hibernate v6.x: hibernate/hibernate-orm@de21820

Has Axon Framework considered this change, and could it be causing our issue?

alelkhoury commented

Hello Allard, thank you for your reply.

When you say updated to Axon 4.9.3, is that the version of the BOM that you use, or the version of Axon Framework core modules?

We're using the Axon Spring Boot Starter dependency directly, v4.9.3.

Do you have any configuration for the processors, or is it all left to defaults?

All processor configurations are left to defaults, so I believe we're using the Streaming Event Processor. Only one of the event streams is split into 6 segments to be able to parallelize the load.

Here's the full configuration that we have:

    @Autowired
    public void addInterceptor1(Configurer configurer, IReadRepository readRepository) {
        configurer.eventProcessing().registerTrackingEventProcessor("Aprojection1")
                .registerHandlerInterceptor("Aprojection1", configuration -> new FilterInterceptor(readRepository));
    }

    @Autowired
    public void addInterceptor2(Configurer configurer, ARepository readRepository) {
        configurer.eventProcessing().registerTrackingEventProcessor("AProjection2")
                  .registerHandlerInterceptor("AProjection2", configuration -> new FilterInceptor(readRepository));
    }

    @Autowired
    public void addLoggingInterceptor(Configurer configurer, LoggingContextScanner contextScanner) {
        configurer.eventProcessing().registerDefaultHandlerInterceptor((configuration, s) -> new EventLoggingInterceptor(contextScanner));
    }

    @Autowired
    public void addInterceptor3(Configurer configurer, ARepository readRepository) {
        configurer.eventProcessing().registerTrackingEventProcessor("Aprojection3")
                .registerHandlerInterceptor("Aprojection3", configuration -> new interceptor(readRepository));
    }

    @Autowired
    public void configureProcessor(Configurer configurer) {
        configurer.eventProcessing().registerTrackingEventProcessor("processorName1", org.axonframework.config.Configuration::eventStore,
            configuration -> TrackingEventProcessorConfiguration.forParallelProcessing(2).andInitialSegmentsCount(6)
                                                                .andInitialTrackingToken(StreamableMessageSource::createHeadToken));
    }

abuijze commented Apr 30, 2024

I remember a similar issue where the cause was the change in how Hibernate deals with auto-increment values since Hibernate 6. Steven already commented on this, but maybe it was overlooked:

Or (lastly), what I've also noticed recently with the shift from Javax to Jakarta is that Hibernate's default increment for sequences is 50.

Given that the increment is 50 by default, you get an immense number of gaps, which the GapAwareTrackingToken was never designed for. Setting the increment to 1 is best for event storage.

alelkhoury commented

Hello Allard, thanks for bringing this up.

Yes, you're right, but I'm not aware of a way to change the default increment value via application properties when using PostgreSQL. Could you please share whether there's a way to do it with application properties? Otherwise, I assume I need to do it manually by altering the sequence for now, right?

abuijze commented May 1, 2024

The easiest way to define specific ORM behavior when you can't access the classes themselves is to use an orm.xml file. It's part of the JPA specification. Hibernate's documentation on this is rather minimal, but DataNucleus has a reference page that could be helpful: https://www.datanucleus.org/products/accessplatform_6_0/jakarta/metadata_xml.html

It's been a while since I've had to edit an orm.xml file (as you can imagine, I avoid using JPA/relational databases as an event store), so I can't easily give you the exact configuration for reconfiguring the sequence generator. Make sure you indicate that the orm.xml uses metadata-complete="false" (the default), so that it still takes the annotations into account.

It might be that Hibernate already considers the sequence to be created and won't alter its configuration when you just add the orm.xml override. In that case, you'll need to alter the sequence using a SQL statement.
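If it comes to that, the statement could look like this (a sketch; the sequence name is an assumption that depends on how your schema was generated, so verify it first, for example via pg_sequences):

    -- Make the database sequence advance by 1 instead of 50.
    ALTER SEQUENCE domain_event_entry_seq INCREMENT BY 1;
    -- Keep the JPA side in sync: the generator's allocation size must match this increment
    -- (e.g. via the orm.xml override discussed above), otherwise Hibernate and the sequence will disagree.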

alelkhoury commented

Thanks for sharing the info. We're trying to reproduce and debug the issue in a separate environment. I will try this approach and get back to you soon.
