
Bug: ASO pod gets OOMKilled #3934

Open
AndreiBarbuOz opened this issue Apr 11, 2024 · 8 comments
Labels
good-first-issue, waiting-on-user-response

Comments

@AndreiBarbuOz

Version of Azure Service Operator
v2.4.0

Describe the bug
During startup, ASO pods get OOMKilled:

$ kubectl get events -A | grep -i killed
default                       7m31s       Warning   OOMKilling                node/aks-nodepool-36894423-vmss000001                                             (combined from similar events): Memory cgroup out of memory: Killed process 4010806 (aso-controller) total-vm:1688296kB, anon-rss:522204kB, file-rss:57916kB, shmem-rss:0kB, UID:65532 pgtables:1504kB oom_score_adj:984
default                       55m         Warning   OOMKilling                node/aks-nodepool-36894423-vmss000001                                             Memory cgroup out of memory: Killed process 3948607 (aso-controller) total-vm:1488108kB, anon-rss:521664kB, file-rss:58428kB, shmem-rss:0kB, UID:65532 pgtables:1392kB oom_score_adj:984

and then crashloop

To Reproduce
Not exactly clear; the operator does not have a very significant load. We run multiple environments and only one of them experienced this.
The steady state is over the requested memory, but under the limit:

$ kubectl top pods -A
NAMESPACE                     NAME                                                       CPU(cores)   MEMORY(bytes)
azureserviceoperator-system   azureserviceoperator-controller-manager-5bd7f6f5d5-7gjt9   5m           316Mi
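For context, the request/limit that this 316Mi steady state sits between can be read off the deployment directly. This is illustrative: the deployment name is inferred from the pod name above, so adjust if yours differs.

$ kubectl get deployment -n azureserviceoperator-system azureserviceoperator-controller-manager \
    -o jsonpath='{.spec.template.spec.containers[*].resources}'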

Expected behavior
Would expect the pod not to get OOMKilled.
Alternatively, expose the resource limits through the Helm chart; we had to use a post-renderer to patch them (see the sketch below).
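For illustration, a post-renderer can be a small script that pipes the rendered chart through kustomize with a patch that raises the limit. This is only a sketch: the deployment name is inferred from the pod name above, the memory values are placeholders, and the patch replaces the whole resources block of the first container.

kustomization.yaml:

    resources:
      - all.yaml
    patches:
      - target:
          kind: Deployment
          name: azureserviceoperator-controller-manager
        patch: |-
          - op: add
            path: /spec/template/spec/containers/0/resources
            value:
              requests:
                memory: 512Mi
              limits:
                memory: 1024Mi

post-renderer.sh:

    #!/bin/sh
    # Helm pipes the fully rendered manifests to stdin; whatever is written to stdout gets applied.
    cat > all.yaml
    kustomize build .

The release is then installed or upgraded with helm upgrade ... --post-renderer ./post-renderer.sh.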

Screenshots
Nodes in the node pool memory working set are in the 50% range, so there is no issue with the underlying VM

[screenshot: node pool memory working set]

@matthchr
Member

and then crashloop

After the OOMKill it got into a crashloop? Have any logs from that time or know why it was crashlooping? Was it OOMKilled again?
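For reference, the log of the last terminated container can usually be recovered even while the pod is crashlooping (pod name reused from the kubectl top output above, so illustrative):

$ kubectl logs --previous -n azureserviceoperator-system azureserviceoperator-controller-manager-5bd7f6f5d5-7gjt9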

@matthchr
Member

Did raising the limits via the hook fix the issue?

@AndreiBarbuOz
Author

AndreiBarbuOz commented Apr 12, 2024

Did raising the limits via the hook fix the issue?

Yes, we bumped the limit to 1024Mi and rolled out the change to all clusters.

Have any logs from that time or know why it was crashlooping?

I extracted some logs from our LAW (Log Analytics workspace). Containers were hitting OOM during initialization, when workers were being started (it looks like ASO uses controller-runtime), but the logs don't contain any useful indication. These are the last lines in 2 of the OOMed containers:

I0411 15:49:42.219812 1 controller.go:178] "msg"="Starting EventSource" "logger"="controllers" "source"="kind source: *v1api20220131previewstorage.FederatedIdentityCredential"
I0411 15:49:42.219837 1 controller.go:178] "msg"="Starting EventSource" "logger"="controllers" "source"="kind source: *v1.ConfigMap"
I0411 15:49:42.219849 1 controller.go:186] "msg"="Starting Controller" "logger"="controllers"
I0411 15:49:42.342226 1 controller.go:220] "msg"="Starting workers" "logger"="controllers" "worker count"=1
I0411 15:44:11.364598 1 controller.go:178] "msg"="Starting EventSource" "logger"="controllers" "source"="kind source: *v1api20211101storage.NamespacesTopic"
I0411 15:44:11.364613 1 controller.go:186] "msg"="Starting Controller" "logger"="controllers"
I0411 15:44:11.364667 1 controller.go:178] "msg"="Starting EventSource" "logger"="controllers" "source"="kind source: *v1api20211101storage.NamespacesTopicsSubscriptionsRule"
I0411 15:44:11.364694 1 controller.go:186] "msg"="Starting Controller" "logger"="controllers"

@theunrepentantgeek
Member

What configuration do you have for crdPattern, which specifies which of the available CRDs are made available by ASO? Do you have it set to * (all CRDs)?

I'm wondering if our default memory request/limit values are becoming inadequate for the growing number of supported resource kinds/versions.

@matthchr added the waiting-on-user-response and good-first-issue labels and removed the needs-triage 🔍 label on Apr 15, 2024
@AndreiBarbuOz
Author

We use a fairly small set of CRDs:

$ helm get values -n azureserviceoperator-system aso2
USER-SUPPLIED VALUES:
crdPattern: resources.azure.com/*;containerservice.azure.com/*;keyvault.azure.com/*;managedidentity.azure.com/*;eventhub.azure.com/*;servicebus.azure.com/*;authorization.azure.com/*;cache.azure.com/*;documentdb.azure.com/*;containerregistry.azure.com/*
createAzureOperatorSecret: false

What puzzled me was the failure during upgrades specifically on AKS.
We use Kind clusters as PR gates where we perform in-place upgrades of controllers, including ASO, and we had zero failures across tens of executions, validating various scenarios:

  • installing ASO v2.4.0 and then adding new CRDs
  • installing ASO v2.4.0 and then upgrading to v2.5.0 without adding any new CRDs
  • installing ASO v2.5.0 and then adding new CRDs

but we only caught the problem in our dev environment on AKS 🤷‍♂️

@theunrepentantgeek
Member

The fact it's only in your dev environment is very intriguing.

Does your dev cluster have any additional ASO CRDs installed that aren't listed in crdPattern?

One of the design decisions we made around crdPattern is that we never orphan ASO CRDs that are already present in the cluster. Effectively, crdPattern specifies a minimum set of supported CRDs.

If you'd previously configured ASO in your dev cluster with a larger set of CRDs (maybe even crdPattern: *), then ASO would be finding those on startup and initializing itself for those too. This can be seen in the startup logs for ASO, and might explain higher memory use in that environment.
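One quick way to check is to list the API groups of the ASO CRDs that are actually installed and compare them against crdPattern. The pipeline below is just an illustrative one-liner that strips the resource names and dedupes the groups:

$ kubectl get crds -o name | grep 'azure.com' | cut -d/ -f2 | sed 's/^[^.]*\.//' | sort -u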

@AndreiBarbuOz
Author

AndreiBarbuOz commented Apr 17, 2024

Does your dev cluster have any additional ASO CRDs installed that aren't listed in crdPattern?

No, we try to keep a firm grip on what is present in the clusters:

 $ kubectl api-resources | grep azure | wc -l
42

One of the design decisions we made around crdPattern is that we never orphan ASO CRDs that are already present in the cluster. Effectively, crdPattern specifies a minimum set of supported CRDs.

This is a very reasonable design decision 👏: erring on the conservative side prevents inadvertently cleaning up CRDs and then watching the GC controller delete all the CRs 😅. There is a funny little story behind how we acquired this knowledge 😬

We plan to add more monitoring tools to the clusters, and maybe get a granular graph of memory usage during pod startup. We'll update the issue when we have it.
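As a stopgap before proper monitoring is in place, a crude startup profile can be taken by polling per-container usage while the pod comes up. This assumes metrics-server is available, and its scrape interval limits the resolution, so short spikes may still be missed:

$ while true; do kubectl top pod -n azureserviceoperator-system --containers --no-headers; sleep 5; done

If Prometheus/cAdvisor metrics are available, container_memory_working_set_bytes filtered to the azureserviceoperator-system namespace gives the same picture with history.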

@matthchr
Member

If you're willing to share with us the name of your dev cluster and a rough time when you've seen this happen (ideally recently), we can also look from the service side and see if anything funky is going on. It probably won't be as good as memory-usage graphs from your end, but we might be able to glean something.

If you're not comfortable sharing that information publicly on GitHub, you can also message it to us privately on Kubernetes Slack (I'm matthchr and @theunrepentantgeek is Bevan Arps on k8s Slack), or just not tell us at all and we can wait for better graphs on your end (while also looking into making the limits configurable).
