
Bug: ASO pod gets OOMKilled #3934

Open
AndreiBarbuOz opened this issue Apr 11, 2024 · 8 comments
Labels
good-first-issue, waiting-on-user-response

Comments

@AndreiBarbuOz

Version of Azure Service Operator
v2.4.0

Describe the bug
During startup, ASO pods get OOMKilled:

$ kubectl get events -A | grep -i killed
default                       7m31s       Warning   OOMKilling                node/aks-nodepool-36894423-vmss000001                                             (combined from similar events): Memory cgroup out of memory: Killed process 4010806 (aso-controller) total-vm:1688296kB, anon-rss:522204kB, file-rss:57916kB, shmem-rss:0kB, UID:65532 pgtables:1504kB oom_score_adj:984
default                       55m         Warning   OOMKilling                node/aks-nodepool-36894423-vmss000001                                             Memory cgroup out of memory: Killed process 3948607 (aso-controller) total-vm:1488108kB, anon-rss:521664kB, file-rss:58428kB, shmem-rss:0kB, UID:65532 pgtables:1392kB oom_score_adj:984

and then crashloop

To Reproduce
Not exactly clear; the operator does not have a very significant load. We run multiple environments and only one of them experienced this.
The steady state is over the requested memory, but under the limit:

$ kubectl top pods -A
NAMESPACE                     NAME                                                       CPU(cores)   MEMORY(bytes)
azureserviceoperator-system   azureserviceoperator-controller-manager-5bd7f6f5d5-7gjt9   5m           316Mi
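For context, the request/limit that this 316Mi steady state sits between can be read off the deployment directly. This is illustrative: the deployment name is inferred from the pod name above, so adjust if yours differs.

$ kubectl get deployment -n azureserviceoperator-system azureserviceoperator-controller-manager \
    -o jsonpath='{.spec.template.spec.containers[*].resources}'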

Expected behavior
Would expect the pod not to get OOMKilled.
Alternatively, expose the resource limits through the Helm chart; we had to use a post-renderer to patch them (see the sketch below).
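For illustration, a post-renderer can be a small script that pipes the rendered chart through kustomize with a patch that raises the limit. This is only a sketch: the deployment name is inferred from the pod name above, the memory values are placeholders, and the patch replaces the whole resources block of the first container.

kustomization.yaml:

    resources:
      - all.yaml
    patches:
      - target:
          kind: Deployment
          name: azureserviceoperator-controller-manager
        patch: |-
          - op: add
            path: /spec/template/spec/containers/0/resources
            value:
              requests:
                memory: 512Mi
              limits:
                memory: 1024Mi

post-renderer.sh:

    #!/bin/sh
    # Helm pipes the fully rendered manifests to stdin; whatever is written to stdout gets applied.
    cat > all.yaml
    kustomize build .

The release is then installed or upgraded with helm upgrade ... --post-renderer ./post-renderer.sh.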

Screenshots
Nodes in the node pool memory working set are in the 50% range, so there is no issue with the underlying VM

[screenshot: node pool memory working set]

@matthchr
Member

and then crashloop

After the OOMKill it got into a crashloop? Have any logs from that time or know why it was crashlooping? Was it OOMKilled again?
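For reference, the log of the last terminated container can usually be recovered even while the pod is crashlooping (pod name reused from the kubectl top output above, so illustrative):

$ kubectl logs --previous -n azureserviceoperator-system azureserviceoperator-controller-manager-5bd7f6f5d5-7gjt9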

@matthchr
Member

Did raising the limits via the hook fix the issue?

@AndreiBarbuOz
Author

AndreiBarbuOz commented Apr 12, 2024

Did raising the limits via the hook fix the issue?

Yes, we bumped the limit to 1024Mi and rolled out the change to all clusters.

Have any logs from that time or know why it was crashlooping?

I extracted some logs from our LAW (Log Analytics workspace). Containers were hitting OOM during initialization, when workers were being started (it looks like ASO uses controller-runtime), but the logs don't contain any useful indication. These are the last lines in 2 of the OOMed containers:

I0411 15:49:42.219812 1 controller.go:178] "msg"="Starting EventSource" "logger"="controllers" "source"="kind source: *v1api20220131previewstorage.FederatedIdentityCredential"
I0411 15:49:42.219837 1 controller.go:178] "msg"="Starting EventSource" "logger"="controllers" "source"="kind source: *v1.ConfigMap"
I0411 15:49:42.219849 1 controller.go:186] "msg"="Starting Controller" "logger"="controllers"
I0411 15:49:42.342226 1 controller.go:220] "msg"="Starting workers" "logger"="controllers" "worker count"=1
I0411 15:44:11.364598 1 controller.go:178] "msg"="Starting EventSource" "logger"="controllers" "source"="kind source: *v1api20211101storage.NamespacesTopic"
I0411 15:44:11.364613 1 controller.go:186] "msg"="Starting Controller" "logger"="controllers"
I0411 15:44:11.364667 1 controller.go:178] "msg"="Starting EventSource" "logger"="controllers" "source"="kind source: *v1api20211101storage.NamespacesTopicsSubscriptionsRule"
I0411 15:44:11.364694 1 controller.go:186] "msg"="Starting Controller" "logger"="controllers"

@theunrepentantgeek
Member

What configuration do you have for crdPattern, which specifies which of the available CRDs are made available by ASO? Do you have it set to * (all CRDs)?

I'm wondering if our default memory request/limit values are becoming inadequate for the growing number of supported resource kinds/versions.

@matthchr added the waiting-on-user-response and good-first-issue labels and removed the needs-triage 🔍 label on Apr 15, 2024
@AndreiBarbuOz
Author

We use a fairly small set of CRDs:

$ helm get values -n azureserviceoperator-system aso2
USER-SUPPLIED VALUES:
crdPattern: resources.azure.com/*;containerservice.azure.com/*;keyvault.azure.com/*;managedidentity.azure.com/*;eventhub.azure.com/*;servicebus.azure.com/*;authorization.azure.com/*;cache.azure.com/*;documentdb.azure.com/*;containerregistry.azure.com/*
createAzureOperatorSecret: false

What puzzled me was the failure during upgrades specifically on AKS.
We use Kind clusters as PR gates where we perform in-place upgrades of controllers, including ASO, and we had zero failures across tens of executions, validating various scenarios:

  • installing ASO v2.4.0 and then adding new CRDs
  • installing ASO v2.4.0 and then upgrading to v2.5.0 without adding any new CRDs
  • installing ASO v2.5.0 and then adding new CRDs

but we only caught the problem in our dev environment on AKS 🤷‍♂️

@theunrepentantgeek
Member

The fact it's only in your dev environment is very intriguing.

Does your dev cluster have any additional ASO CRDs installed that aren't listed in crdPattern?

One of the design decisions we made around crdPattern is that we never orphan ASO CRDs that are already present in the cluster. Effectively, crdPattern specifies a minimum set of supported CRDs.

If you'd previously configured ASO in your dev cluster with a larger set of CRDs (maybe even crdPattern: *), then ASO would be finding those on startup and initializing itself for those too. This can be seen in the startup logs for ASO, and might explain higher memory use in that environment.
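One quick way to check is to list the API groups of the ASO CRDs that are actually installed and compare them against crdPattern. The pipeline below is just an illustrative one-liner that strips the resource names and dedupes the groups:

$ kubectl get crds -o name | grep 'azure.com' | cut -d/ -f2 | sed 's/^[^.]*\.//' | sort -u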

@AndreiBarbuOz
Author

AndreiBarbuOz commented Apr 17, 2024

Does your dev cluster have any additional ASO CRDs installed that aren't listed in crdPattern?

No, we try to keep a firm grip on what is present in the clusters:

 $ kubectl api-resources | grep azure | wc -l
42

One of the design decisions we made around crdPattern is that we never orphan ASO CRDs that are already present in the cluster. Effectively, crdPattern specifies a minimum set of supported CRDs.

This is a very reasonable design decision 👏: erring on the conservative side prevents inadvertently cleaning up CRDs and then watching the GC controller delete all the CRs 😅. There is a funny little story behind how we acquired this knowledge 😬

We plan to add more monitoring tools to the clusters, and maybe get a granular graph of memory usage during pod startup. We'll update the issue when we have it.
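As a stopgap before proper monitoring is in place, a crude startup profile can be taken by polling per-container usage while the pod comes up. This assumes metrics-server is available, and its scrape interval limits the resolution, so short spikes may still be missed:

$ while true; do kubectl top pod -n azureserviceoperator-system --containers --no-headers; sleep 5; done

If Prometheus/cAdvisor metrics are available, container_memory_working_set_bytes filtered to the azureserviceoperator-system namespace gives the same picture with history.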

@matthchr
Member

If you're willing to share with us the name of your dev cluster and a rough time when you've seen this happen (ideally recently), we can also look from the service side and see if anything funky is going on. It probably won't be as good as memory-usage graphs from your end, but we might be able to glean something.

If you're not comfortable sharing that information publicly on GitHub, you can also message it to us privately on Kubernetes Slack (I'm matthchr and @theunrepentantgeek is Bevan Arps on k8s Slack), or just not tell us at all and we can wait for better graphs on your end (while also looking into making the limits configurable).
