
OOM error when lots of information is pulled by context aware policies #716

Closed
1 task done
flavio opened this issue Mar 22, 2024 · 6 comments

flavio (Member) commented Mar 22, 2024

Is there an existing issue for this?

  • I have searched the existing issues

Current Behavior

Currently we create a kube-rs reflector object for each type of context-aware resource requested by a policy. The reflector keeps a copy of those Kubernetes resources in memory and keeps it in sync with the state of the cluster.
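
For context, the reflector setup described above looks roughly like this with kube-rs (a minimal sketch, not the actual Policy Server code; `Namespace` is just a placeholder resource, and the `kube` "runtime" feature, `tokio`, `futures`, and `anyhow` are assumed dependencies):

```rust
use futures::StreamExt;
use k8s_openapi::api::core::v1::Namespace;
use kube::{
    runtime::{reflector, watcher, WatchStreamExt},
    Api, Client,
};

#[tokio::main]
async fn main() -> anyhow::Result<()> {
    let client = Client::try_default().await?;
    let api: Api<Namespace> = Api::all(client);

    // The reflector store keeps a full copy of every watched object in memory.
    let (reader, writer) = reflector::store::<Namespace>();

    // The reflector feeds watch events into the store and keeps it in sync
    // with the cluster: memory usage grows with the number and size of objects.
    let stream = reflector(writer, watcher(api, watcher::Config::default()))
        .applied_objects();

    tokio::spawn(async move {
        futures::pin_mut!(stream);
        // The stream must be polled for the store to stay up to date.
        while stream.next().await.is_some() {}
    });

    // Policies can then read the cached objects without extra API calls.
    reader.wait_until_ready().await?;
    println!("cached namespaces: {}", reader.state().len());
    Ok(())
}
```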

The more resources are pulled, the more memory is consumed by the Policy Server process. We had reports of Policy Server being killed by the kernel OOM killer because it was consuming too much memory.

That happened with the following resources being pulled from Kubernetes:

  • Namespaces: 3200
  • Ingresses: 10500
  • ClusterRoleBindings: 200
  • RoleBindings: 11000

Expected Behavior

The Policy Server should not be killed by the OOM killer. There should be no need to tune the memory limits of the Pod.

Steps To Reproduce

No response

Environment

  • Kubewarden 1.11

Anything else?

No response

flavio (Member, Author) commented May 23, 2024

Kubewarden 1.13.0-RC1 is out, and it ships with the fix for this bug.

We have reduced the memory spike that happens when resources are fetched from the cluster for the first time. Memory usage at rest has also been drastically reduced.

The key point, however, is that there's no silver bullet. When operating inside of a big cluster with lots of resources (like the numbers in the issue's description), there's not much we can do to reduce the initial spike. The spike happens when we fetch all these resources for the first time and put them into our reflectors.

When deploying a Policy Server inside of such a cluster, the administrator must take this spike into account when calculating the resource limits (specifically the memory limit) of the Pods.

According to our load tests, after the initial spike the memory usage goes down and remains stable, which prevents the policy-server processes from being flagged as offenders by the kernel's OOM killer.

flavio (Member, Author) commented May 23, 2024

@fabriziosestito can you share some numbers here? This is going to be useful when writing the blog post.

fabriziosestito (Contributor) commented

Benchmark data

10000 RoleBindings

k6 load testing (load_all policy)
[policy-server branch](https://github.com/fabriziosestito/policy-server/tree/feat/sqlite-cache)

| | HTTP request duration (avg) | Max RSS under load | Idle RSS |
| --- | --- | --- | --- |
| No fixes | 436.15 ms | 1.4 GB | 1.2 GB |
| kube-rs buffer fix + jemalloc | 233.663 ms | 1.2 GB | 264 MB |
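
For reference, the "jemalloc" part of that row refers to swapping the binary's global allocator. In Rust that is typically done along these lines (a minimal sketch assuming the `tikv-jemallocator` crate; the actual policy-server change may differ):

```rust
// Assumed dependency in Cargo.toml: tikv-jemallocator = "0.5"
use tikv_jemallocator::Jemalloc;

// Route every allocation in the binary through jemalloc instead of the
// system allocator.
#[global_allocator]
static GLOBAL: Jemalloc = Jemalloc;

fn main() {
    // jemalloc tends to return freed memory to the OS more eagerly, which
    // helps the resident set size drop again after the initial spike.
    let buffer: Vec<u8> = vec![0u8; 64 * 1024 * 1024];
    drop(buffer);
}
```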

Kube-rs memory burst reduction

Watching 10000 RoleBindings:

ListWatch strategy:

| | With Store | Without Store |
| --- | --- | --- |
| Patched | ~58 MB | ~13 MB |
| Upstream | ~92 MB | ~76 MB |

fabriziosestito (Contributor) commented

Also see kube-rs/kube#1494 (comment)


clux commented May 23, 2024

The PR merged into kube should help reduce the spike by a good factor (thank you!), as should Kubernetes InitialStreamingLists once it stabilises (though it will probably be a few years before you can rely on that in publicly distributed software).

In the meantime, here's a drive-by comment: you might have another optimisation path available to you now, depending on how you structure things.

If you use metadata_watcher to fill your reflector stores, you could use the stores for basic bookkeeping but call api.get(x) on the object before sending it to a user's validator. This can work if Kubernetes updates the objects more frequently than you send them to the validators. It's explained (somewhat tersely) in kube.rs/optimization#watching-metadata-only.
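
A rough sketch of that pattern with kube-rs (metadata-only reflector plus an on-demand GET); `ConfigMap` is just a placeholder resource and this is not Kubewarden's actual implementation:

```rust
use futures::StreamExt;
use k8s_openapi::api::core::v1::ConfigMap;
use kube::{
    core::PartialObjectMeta,
    runtime::{metadata_watcher, reflector, watcher, WatchStreamExt},
    Api, Client,
};

#[tokio::main]
async fn main() -> anyhow::Result<()> {
    let client = Client::try_default().await?;
    let api: Api<ConfigMap> = Api::default_namespaced(client);

    // The store only holds object *metadata*, which is far smaller than the
    // full objects and keeps the reflector's memory footprint low.
    let (reader, writer) = reflector::store::<PartialObjectMeta<ConfigMap>>();
    let stream = reflector(
        writer,
        metadata_watcher(api.clone(), watcher::Config::default()),
    )
    .applied_objects();

    tokio::spawn(async move {
        futures::pin_mut!(stream);
        while stream.next().await.is_some() {}
    });
    reader.wait_until_ready().await?;

    // When a validator actually needs an object, fetch the full version on
    // demand instead of keeping every full object cached.
    for cached in reader.state() {
        if let Some(name) = cached.metadata.name.as_deref() {
            let full: ConfigMap = api.get(name).await?;
            println!("fetched full object for {:?}", full.metadata.name);
        }
    }
    Ok(())
}
```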

flavio (Member, Author) commented May 31, 2024

Closing as fixed: we've seen positive results from the tests run by us and by some users with 1.13.0-RC1.

flavio closed this as completed May 31, 2024