Event Processors Hanging #3015
After downgrading again to Axon v4.5.15, we observed a significant improvement in query times when analyzing PostgreSQL statistics for the same period. For instance, for the select token entry query, the maximum duration decreased from ~30 minutes to ~5 minutes, the number of executions increased from 714 to ~900,000, and the average duration decreased from 13 seconds to 15 milliseconds. With Spring Boot 3.2 and Axon v4.9.x, we noticed that queries are using "for no key update" instead of "for update" because of a change introduced in Hibernate v6.x: hibernate/hibernate-orm@de21820. Has Axon Framework considered this change, and could it be causing our issue?
Main response and request

First and foremost, I want to commend you on the sheer amount of information you're sharing with us, @alelkhoury. Reading your description, I first wanted to validate that you're not accidentally referring to the Axon BOM 4.9.3 instead of Axon Framework 4.9.3. Nonetheless, I assume the chances are high the predicament was caused by a memory optimization we made. If the hang occurs again, I think it would be extremely helpful if you could share some thread dumps with us.

Questions
I anticipate your threads hang in some process; hence the request for the thread dump, to validate whether that's the case.
Just might be. But to be frank, I cannot be sure at this moment in time.
This question can be answered with both a yes and a no.
I am guessing this is not the issue at this stage.
I would very much like you to share the full configuration of your Event Processors.
I think your best bet to get some form of a stack trace of the hanging Event Processors is the aforementioned thread dumps.
The gaps should represent the possibility that the commit and insert order of events differ, causing the token to observe gaps. Another scenario that may cause gaps is when the sequence generator for the event table skips values. Thirdly, it may be caused by you/your team actually deleting events from the event store. Or (lastly), what I've also noticed recently with the shift from Javax to Jakarta, is that Hibernate's default increment for sequences is 50.

Side notes

With the above said, I want to react to a couple of pointers you've made throughout your description.

Segment Distribution
The segments enable distribution, but they do not distribute themselves over the Event Processor instances. A free-to-use option is AxonIQ Console: an Axon Framework management platform we released at the start of this year. If you prefer not to use AxonIQ Console, you can use the Framework's APIs directly.

PostgreSQL
This is an extremely smart move to make, and an extremely sad side effect of the TOAST mechanism in PostgreSQL. This blog we've written explains how you can achieve that.

Long running queries
This is a problematic side effect of the requirement to keep gaps in the token. This predicament is one of the reasons, among many, why we've constructed Axon Server: simply because it cannot have any gaps in the events, through its optimization for event storage.
Hi, just a few checks to understand what the exact difference is between the old situation and the new. When you say you updated to Axon 4.9.3, is that the version of the BOM that you use, or the version of the Axon Framework core modules? Do you have any configuration for the processors, or is it all left to defaults?
Hello Steven, Thank you for your detailed response.
We're using the Axon Spring Boot Starter dependency. Initially, we upgraded to v4.9.3, which caused this issue. Then, we downgraded to v4.9.0 and started with a new database, only to find that we faced the same issue again. Subsequently, we downgraded to 4.7.6 and started with a new database again, but we still encountered the same issue.
Unfortunately, we rolled back the upgrade and didn't generate the thread dumps of the service when it was not processing events. Once we are able to reproduce it, we will generate the thread dumps and share them with you.
All processor configurations are left to defaults, so I believe we're using the Streaming Event Processor. Only one of the event streams is split into 6 segments to be able to parallelize the load. Here's the full configuration that we have:
With Spring Boot 3.2 and Axon v4.9.x, we noticed that queries are using "for no key update" instead of "for update" because of a change introduced in Hibernate v6.x: hibernate/hibernate-orm@de21820. Has Axon Framework considered this change, and could it be causing our issue?
Hello Allard, thank you for your reply.
We're using the Axon Spring Boot Starter dependency v4.9.3 directly.
All processor configurations are left to defaults, so I believe we're using the Streaming Event Processor. Only one of the event streams is split into 6 segments to be able to parallelize the load. Here's the full configuration that we have:
I remember a similar issue where the cause was the change in how Hibernate deals with auto-increment values since Hibernate 6. Steven already commented on this, but maybe this was overlooked:
Given the increment is 50 by default, you get an immense amount of gaps, which the GapAwareTrackingToken was never designed for. Setting the increment to 1 is best for event storage.
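To illustrate the effect, here's a minimal Python sketch (not Axon code) of how a sequence increment of 50 inflates the gap set that a GapAwareTrackingToken-style cursor has to track:

```python
def gaps_for(indices):
    """Return the global-index values missing between min and max observed."""
    observed = set(indices)
    return [i for i in range(min(observed), max(observed)) if i not in observed]

# With an increment of 1, consecutive events leave no gaps.
increment_1 = [1, 2, 3, 4, 5]
assert gaps_for(increment_1) == []

# With an increment of 50, five events span indices 1..201,
# and every skipped value is tracked as a gap.
increment_50 = [1 + 50 * n for n in range(5)]  # 1, 51, 101, 151, 201
assert len(gaps_for(increment_50)) == 196      # 4 blocks of 49 skipped values
```

The token must remember every one of those skipped indices until it can rule them out, which is why a large increment degrades its performance so quickly.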
Hello Allard, thanks for bringing this up. Yes, you're right, but I'm not aware of a way to change the default increment value using application properties when using PostgreSQL. Could you please share if there's a way to do it using application properties? Otherwise, I assume I need to do it manually by altering the sequence for now, right?
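For the manual route, the alteration would be a single statement along these lines. Note that the sequence name below is an assumption; check your schema (e.g. with \ds in psql) for the name Hibernate actually generated:

```sql
-- Sequence name is assumed; verify the actual name in your schema first.
ALTER SEQUENCE domain_event_entry_seq INCREMENT BY 1;
```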
The easiest way to define specific ORM behavior if you can't access the classes themselves is to use an ORM.xml file. It's part of the JPA specification. Hibernate's documentation on this is rather minimal, but Datanucleus has a reference page that could be helpful: https://www.datanucleus.org/products/accessplatform_6_0/jakarta/metadata_xml.html It's been a while since I've had to edit an ORM.xml file (as you can imagine, I avoid using JPA/relational databases as an event store), so I can't easily give you the exact configuration to alter to reconfigure the sequence generator. Make sure you indicate that the ORM.xml is a "metadata-complete=false (default)" configuration, so that it still takes the annotations into account. It might be that Hibernate already considers the sequence to be created, and won't alter its configuration when you just add the ORM.xml override. In that case, you'll need to alter the sequence using a SQL statement.
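A minimal sketch of such an override, assuming a hypothetical generator name and sequence name; both must match what is actually resolved for your event entry entities:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Hypothetical sketch: the generator and sequence names below are
     assumptions and must match the ones your entities actually use. -->
<entity-mappings xmlns="https://jakarta.ee/xml/ns/persistence/orm"
                 version="3.0">
  <sequence-generator name="domainEventSequence"
                      sequence-name="domain_event_entry_seq"
                      allocation-size="1"/>
</entity-mappings>
```

Leaving out the metadata-complete attribute keeps its default of false, so the annotation-based mappings remain in effect alongside this override.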
Thanks for sharing the info. We're trying to reproduce and debug the issue in a separate environment. I will try this approach and get back to you soon.
We've been trying to upgrade a service from Spring Boot v2.7, JDK 11, and Axon v4.5.15 to Spring Boot 3.2, JDK 17, and Axon 4.9.3.
We had a problem with XStream since it's not directly supported with JDK 17, and we didn't want to use JVM flags as a workaround. We decided to archive the old events in our system (~6M events), start with a new PostgreSQL database, use Jackson as a serializer, and keep our read model.
We have encountered a situation that causes the Axon GapAwareTrackingToken to fail to advance and process new events. It’s worth noting that we did not face this problem before the upgrade.
I've already asked this question on Axon Discuss: https://discuss.axoniq.io/t/event-processors-hanging-in-axon-framework-v4-9-0/5453
Basic information
Expected behavior
The event processors should advance as usual and process new events the same way they were working before the upgrade.
Actual behavior
The application works normally for ~1 week, until we reach ~90,000 events in the event store; then the application freezes and stops processing any events. However, commands and queries are still dispatched and handled normally.
We can see that the event is persisted correctly in the event store, but it's not handled.
We enabled debug logging for the Axon Framework, but we can't find any error messages.
The application starts processing events again when we restart it, but it will freeze again after a couple of hours.
We tried downgrading to v4.9.0 and then to v4.7.6, starting with a new database for each version, but we still faced the same problem. Eventually, we had to revert the upgrade.
Steps to reproduce
Additional Info
It's true that we have 4 instances of the application, but when we look at the token entry table, we can see that the workload is not distributed across all instances: only one event stream is split into 6 segments, with event processing happening on 3 instances of the application, while the other processors are all owned by the same instance.
Also, here's one tracking token data:
{"index":120277,"gaps":[117157,117158,117265,117366,117367,117372,117406,117645,117697,117725,117762,117808,117880,117884,117930,117957,117964,117996,118000,118210,118218,118224,118488,118490,118610,118612,118615,118734,118737,118854,118857,118858,119068,119070,119133,119152,119153,119155,119183,119195,119196,119236,119239,119258,119276,119279,119326,119348,119363,119379,119380,119415,119424,119448,119450,119458,119460,119469,119685,119834,119882,119913,119914,119933,119978,119979,120033,120135,120136,120137,120138,120139,120140,120141,120142,120143,120144,120145,120146,120147,120148,120149,120150,120151,120190,120191,120192,120193,120194,120195,120196,120197,120198,120199,120200,120201,120232,120233,120234,120235,120236,120237,120238,120239,120240,120241,120242,120243,120244,120245,120246,120247,120248,120249,120250,120251]}
PostgreSQL TOAST Problem
We run vacuumlo every day to remove any orphaned large objects in the database.
PostgreSQL Metrics
We noticed that one of the queries sometimes takes too much time (max execution time ~30 mins), and we suspect that it might be causing the issue.
Also, we noticed that the update token entry query is executed ~1.5M times in less than a day; is this normal behavior?
And here are the average query durations:
Questions
Let me know if you need any additional info.
Thank you