Bug: ASO pod gets OOMKilled #3934
Comments
After the OOMKill, did it get into a crashloop? Do you have any logs from that time, or know why it was crashlooping? Was it OOMKilled again?
Did raising the limits via the hook fix the issue?
Yes. We bumped the limit to 1024Mi and rolled out the change to all clusters.
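For reference, here is a minimal sketch of the kind of patch we apply through the post-renderer. The deployment name and the assumption that the manager container is the first container are based on a default ASO v2 install, not copied verbatim from our manifests:

```yaml
# kustomization.yaml consumed by the Helm post-renderer
# (the post-renderer writes the rendered chart output to all.yaml, then runs `kustomize build`)
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - all.yaml
patches:
  - target:
      kind: Deployment
      name: azureserviceoperator-controller-manager   # assumed default name
    patch: |-
      # bump the memory limit of the first (manager) container
      - op: replace
        path: /spec/template/spec/containers/0/resources/limits/memory
        value: 1024Mi
```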
I extracted some logs from our LAW (Log Analytics Workspace). Containers were hitting OOM during initialization, when the workers were being started (it looks like ASO is using controller-runtime), but the logs don't contain any useful indication. These are the last lines in 2 of the OOMed containers:
What CRD configuration do you have? I'm wondering if our default memory request/limit values are becoming inadequate for the growing number of supported resource kinds/versions.
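For context, a hedged sketch of how the installed CRD set is typically scoped in the ASO v2 Helm chart, assuming the `crdPattern` value; the group names below are purely illustrative, not the reporter's actual list:

```yaml
# values.yaml (illustrative only; substitute the groups you actually use)
crdPattern: "resources.azure.com/*;keyvault.azure.com/*;storage.azure.com/*"
```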
We use a fairly small set of CRDs:
What puzzled me was the failure during upgrades specifically on AKS.
But we only caught the problem in our dev environment on AKS 🤷‍♂️
The fact that it's only in your dev environment is very intriguing. Does your dev cluster have any additional ASO CRDs installed that aren't in the set you listed? One of the design decisions we made around CRD management is to err on the side of never removing CRDs that are already installed, because deleting a CRD also deletes all of its custom resources. If you'd previously configured ASO in your dev cluster with a larger set of CRDs, those extra CRDs would still be installed and watched, which could explain the higher memory usage there.
No, we try to keep a firm grip on what is present in the clusters:
This is a very reasonable design decision 👏, erring on the conservative side to prevent inadvertently cleaning up CRDs and then watching the GC controller delete all the CRs 😅. There is a funny little story behind how we acquired this knowledge 😬. We plan to add more monitoring tools to the clusters, and maybe get a granular graph of memory usage during pod startup. We'll update the issue when we have it.
If you're willing to share the name of your dev cluster and a rough time when you've seen this happen (ideally recently), we can also look from the service side and see if anything funky is going on. It probably won't be as good as graphs of memory usage from your end, but we might be able to glean something. If you're not comfortable sharing that information publicly on GitHub, you can also message it to us privately on Kubernetes Slack (I'm matthchr and @theunrepentantgeek is Bevan Arps on k8s Slack), or just not tell us at all and we can wait for better graphs on your end (while we also look into making the limits configurable).
Version of Azure Service Operator
v2.4.0
Describe the bug
During startup, ASO pods get `OOMKilled` and then crashloop.
To Reproduce
Not exactly clear; the operator does not have a very significant load. We run multiple environments and only one of them experienced this.
The steady-state memory usage is over the request, but under the limit.
Expected behavior
Would expect the pod to not get OOMKilled.
Alternatively, expose the limits through the Helm chart. We had to use a post-renderer to patch them.
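As a sketch of what that could look like if the chart exposed it (a hypothetical `resources` values key, shown only to illustrate the ask, not something the current chart is confirmed to support):

```yaml
# values.yaml override (hypothetical key; values are examples)
resources:
  requests:
    memory: 256Mi
  limits:
    memory: 1024Mi
```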
Screenshots
Memory working set on the nodes in the node pool is around 50%, so there is no issue with the underlying VMs.