OOMKilled as number of constraints grew #3357

Open
walking-appa opened this issue Apr 18, 2024 · 3 comments
Labels
bug Something isn't working

Comments

walking-appa commented Apr 18, 2024

What steps did you take and what happened:
Our controller manager replicas are using a lot of memory as our number of constraints has grown. We want to know whether needing these resource limits is expected.

What did you expect to happen:
Lower resource usage.

Anything else you would like to add:
Seeing these error logs:

{"level":"error","ts":1713474646.0517645,"logger":"webhook","msg":"error executing query","hookType":"validation","error":"serving context canceled, abo │

│ rting request","stacktrace":"[github.com/open-policy-agent/gatekeeper/v3/pkg/webhook.(*validationHandler).Handle\n\t/go/src/github.com/open-policy-agent/](http://github.com/open-policy-agent/gatekeeper/v3/pkg/webhook.(*validationHandler).Handle%5Cn%5Ct/go/src/github.com/open-policy-agent/) │

│ gatekeeper/pkg/webhook/policy.go:188\[nsigs.k8s.io/controller-runtime/pkg/webhook/admission.(*Webhook).Handle\n\t/go/src/github.com/open-policy-agent/gat](http://nsigs.k8s.io/controller-runtime/pkg/webhook/admission.(*Webhook).Handle%5Cn%5Ct/go/src/github.com/open-policy-agent/gat) │

│ ekeeper/vendor/[sigs.k8s.io/controller-runtime/pkg/webhook/admission/webhook.go:169\nsigs.k8s.io/controller-runtime/pkg/webhook/admission.(*Webhook).Serv](http://sigs.k8s.io/controller-runtime/pkg/webhook/admission/webhook.go:169%5Cnsigs.k8s.io/controller-runtime/pkg/webhook/admission.(*Webhook).Serv) │

│ eHTTP\n\t/go/src/[github.com/open-policy-agent/gatekeeper/vendor/sigs.k8s.io/controller-runtime/pkg/webhook/admission/http.go:98\ngithub.com/prometheus/c](http://github.com/open-policy-agent/gatekeeper/vendor/sigs.k8s.io/controller-runtime/pkg/webhook/admission/http.go:98%5Cngithub.com/prometheus/c) │

│ lient_golang/prometheus/promhttp.InstrumentHandlerInFlight.func1\n\t/go/src/[github.com/open-policy-agent/gatekeeper/vendor/github.com/prometheus/client_](http://github.com/open-policy-agent/gatekeeper/vendor/github.com/prometheus/client_) │

│ golang/prometheus/promhttp/instrument_server.go:60\nnet/http.HandlerFunc.ServeHTTP\n\t/usr/local/go/src/net/http/server.go:2122\[ngithub.com/prometheus/c](http://ngithub.com/prometheus/c) │

│ lient_golang/prometheus/promhttp.InstrumentHandlerCounter.func1\n\t/go/src/[github.com/open-policy-agent/gatekeeper/vendor/github.com/prometheus/client_g](http://github.com/open-policy-agent/gatekeeper/vendor/github.com/prometheus/client_g) │

│ olang/prometheus/promhttp/instrument_server.go:147\nnet/http.HandlerFunc.ServeHTTP\n\t/usr/local/go/src/net/http/server.go:2122\[ngithub.com/prometheus/c](http://ngithub.com/prometheus/c) │

│ lient_golang/prometheus/promhttp.InstrumentHandlerDuration.func2\n\t/go/src/[github.com/open-policy-agent/gatekeeper/vendor/github.com/prometheus/client_](http://github.com/open-policy-agent/gatekeeper/vendor/github.com/prometheus/client_) │

│ golang/prometheus/promhttp/instrument_server.go:109\nnet/http.HandlerFunc.ServeHTTP\n\t/usr/local/go/src/net/http/server.go:2122\nnet/http.(*ServeMux).S │

│ erveHTTP\n\t/usr/local/go/src/net/http/server.go:2500\nnet/http.serverHandler.ServeHTTP\n\t/usr/local/go/src/net/http/server.go:2936\nnet/http.(*conn).s │

│ erve\n\t/usr/local/go/src/net/http/server.go:1995"}

Environment:

  • Gatekeeper version: 3.13.0
  • Kubernetes version: 1.27.3 (AKS)
  • CPU Limit: 3
  • Memory Limit: 4Gi
  • Constraints: 87
walking-appa added the bug (Something isn't working) label on Apr 18, 2024
maxsmythe (Contributor) commented:

There are a number of possible causes for increased RAM usage.

What are your constraint and constraint template counts? (You mention 87 constraints, but it's not clear how that relates to the number of constraint templates.)

Has Gatekeeper been configured to cap the number of concurrent inbound requests? (If not, each additional concurrent request consumes additional RAM.)

Are you using referential data? If so, has the data set grown? (Referential data is cached in RAM.)

To get a rough sense of whether RAM usage comes from at-rest conditions vs. satisfying serving requirements, I suggest temporarily disabling calls to the validating/mutating webhooks and seeing where RAM usage settles (see the sketch below).
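A rough sketch of how that could be done. The webhook configuration names and the gatekeeper-system namespace below assume a default Gatekeeper install; verify the names in your cluster before deleting anything:

# Assumed default names; check what actually exists first.
kubectl get validatingwebhookconfigurations,mutatingwebhookconfigurations | grep gatekeeper

# Temporarily delete the webhook configurations so no admission traffic reaches Gatekeeper.
kubectl delete validatingwebhookconfiguration gatekeeper-validating-webhook-configuration
kubectl delete mutatingwebhookconfiguration gatekeeper-mutating-webhook-configuration

# Watch where controller-manager memory settles with no serving load (requires metrics-server).
kubectl top pods -n gatekeeper-system

# Restore the webhooks afterwards by re-applying your install manifests or re-running your Helm upgrade.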

If at-rest usage is still high, the likely culprit is referential data.

If request volume is high, I'd suspect either uncapped parallelism or poorly optimized Rego in a constraint template.

Another option is to compare webhook memory usage to audit pod memory usage (see the commands below). If audit looks healthier, the problem is likely request volume (QPS).
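For example, something like the following sketch; the namespace and the control-plane label values assume a default install and may differ in your cluster:

# Compare memory usage of the serving (webhook) pods vs. the audit pod.
kubectl top pods -n gatekeeper-system -l control-plane=controller-manager   # webhook/serving pods
kubectl top pods -n gatekeeper-system -l control-plane=audit-controller     # audit pod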

The error you are highlighting is consistent with a Go context being canceled due to the OOMKill (i.e. the OOMKill causes the error, not the other way around).


walking-appa (Author) commented Apr 19, 2024

The constraint template count is the same as the constraint count.

We are using referential data, and removing Pods from it made things better for both audit and the webhook; both were consuming a lot of memory.
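For reference, the referential-data cache is driven by the entries under spec.sync.syncOnly in Gatekeeper's Config resource; the Pod entry that gets removed looks roughly like the sketch below (default Config name and namespace assumed, and your Config may list different kinds):

# Inspect what is currently being synced into the referential-data cache.
kubectl get configs.config.gatekeeper.sh config -n gatekeeper-system -o yaml
# A Pod entry under spec.sync.syncOnly roughly like this is what caches Pods in RAM:
#   sync:
#     syncOnly:
#       - group: ""
#         version: "v1"
#         kind: "Pod"
kubectl edit configs.config.gatekeeper.sh config -n gatekeeper-system   # delete the Pod entry to stop syncing Pods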

I also noticed that it ended up needing more resources in a smaller multi-tenant cluster than in another, larger cluster, so I'd still like to cap the number of requests. How can I do that?

Also, despite no longer being OOMKilled, I am still seeing the error I mentioned above, even though the webhook continues to work as expected end to end.

maxsmythe (Contributor) commented:

"context canceled" can mean the caller's context was canceled (usually due to request timeout). I think it usually says something different than "serving context canceled", though maybe the framework changed its error text.

In any case, --max-serving-threads is the flag you want for capping the number of simultaneous calls to Rego; by default it caps at GOMAXPROCS:

var maxServingThreads = flag.Int("max-serving-threads", -1, "cap the number of threads handling non-trivial requests, -1 caps the number of threads to GOMAXPROCS. Defaults to -1.")

Tuning that value and/or raising the CPU limit or the number of serving webhook pods may help with timeouts (assuming that's what the context-canceled errors are).
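A sketch of one way to set the flag, assuming the default deployment layout from a standard install (deployment gatekeeper-controller-manager with a single container that already has an args list); the value 4 is purely illustrative, not a recommendation:

# Append the flag to the webhook deployment's container args.
kubectl -n gatekeeper-system patch deployment gatekeeper-controller-manager \
  --type=json \
  -p='[{"op":"add","path":"/spec/template/spec/containers/0/args/-","value":"--max-serving-threads=4"}]'

# Confirm the flag landed.
kubectl -n gatekeeper-system get deployment gatekeeper-controller-manager \
  -o jsonpath='{.spec.template.spec.containers[0].args}'

If you installed via Helm, make the equivalent change through your chart values instead so the next upgrade doesn't revert it.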
