New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ValidatingAdmissionPolicy: Cost estimator for extended library was not registered #124542
Comments
This is unfortunate. Now that this feature is GA, adding the per-expression cost has the potential to break existing users. I don't see per-expression cost mentioned in the docs: https://github.com/kubernetes/website/blob/main/content/en/docs/reference/access-authn-authz/validating-admission-policy.md I do see per-expression cost mentioned in the KEP but nowhere else. I agree we should consider keeping this as-is and not adding the per-expression cost.. |
A per-policy (per-binding?) limit protects the apiserver from expensive policies. I suppose a per-expression limit puts an upper bound on the cost of the wasted work of evaluating a policy that fails due to its runtime budget? It is unfortunate that a single expression making 29 secondary authorization checks will incur the cost of 28 checks before exhausting its budget on the final check. With a per-expression limit, a determined policy author could still spread their evaluation across multiple variable or validation expressions, but it would be much harder to do by mistake. |
I opened #124569 to add integration tests for runtime cost, specifically around the intentionally-large costs we chose for secondary authorization checks, and even the per-policy limit is not reached with 29 checks (I expected the limit to be floor(10M/350k) or 28 checks). It looks like our
If I wire the |
@benluddy Thank you for the investigation! kubernetes/staging/src/k8s.io/apiextensions-apiserver/pkg/apiserver/schema/cel/compilation.go Lines 232 to 235 in bae8300
There is a gap in the existing test suite. When we test cost in cost_test.go, we wired the cost estimator specifically:
When we test runtime cost budget in
Adding test case in
The concern I have for adding the env option now is the backward compatibility since it is a GA feature. It might break existing expression. We might have to have a ratcheting mechanism in place. Maybe ignore the previous existing and not changed expression for exceeding cost limit err? |
In addition to fixing the unit test, resolving this definitely needs to include at least some integration and possibly E2E tests. Secondary authorization is the easy way to exercise it since its cost was designed to allow a certain number of checks per expression. |
I'm not sure I agree... I don't think the ability to make the API server do unbounded work is acceptable. A ratcheting approach to keep existing persisted VAPs from getting stuck would make sense, but this should be ratcheted down for new VAPs edit: well, not completely unbounded, I guess... the overall limit still applies... but I'd still push for converging this to cost limits |
Our understanding of the issue changed since that comment (the issue has been edited), I think everyone agrees that we need to fix this now.
If we had in fact been in the situation where only per-policy limits are enforced, we would have then run into the problem that per-policy limits are only checked between expression evaluations (i.e. the evaluation is not interrupted if the per-policy limit is reached partway). I think on reflection that situation would have been more dangerous than the issue we do have. |
The trick is that it happens in runtime where cel is calculating the cost and reject based on the limit we set... SO simply ratcheting with checking if it is existing expression might not be easy in this case.. |
hmm... what if we fix the cost wiring in 1.31 / 1.30.x / 1.29.x with a gated opt-out escape hatch if an admin has existing use they need to audit / resolve first? is there a reasonable way they could check if there are existing VAPs that would exceed correct cost? |
Something like: 1.30.1 : Add a feature gate "AllowOverCostValidatingAdmissionPolicy", set "Deprecated", and on-by-default (maybe off by default?) ? The sooner we set that gate as deprecated, the less likely we are to break people, but no matter what there's a chance Is it possible to always add a cost indicator to status? Or to validate the cost at CREATE/UPDATE rather than runtime? |
Since the cost depends on the object being admitted (you might be expanding a macro over a list field, for example), I don't think it can be done cheaply in the general case. It should be possible to decide whether a given expression is "definitely not affected" versus "possibly affected" by seeing if the expression references any library functions whose cost should differ from the default cost. For policies that might be impacted, would it be possible to send per-object, per-operation dry run requests? It could look sort of like a storage version migration. |
A deprecating feature gate suggested by @thockin above sounds like a working solution moving forward. We should encourage admins to turn off the flag as early as possible. The cost error is in runtime, setting validationaction to |
Brainstorming on the ratcheting idea suggested above.. could we somehow set a The idea is that we ratchet by allowing grandfather in ALL resources written to storage before the fix to this (1.30.1 or 1.31 or whatever we choose) to continue bypass the fixed cost code. I'd love to do better and identify what specific resources actually need the ratcheting, but if that's not possible, maybe this is? |
I like the per object approach, though depending on a new API field means <=1.30 won't be protected at all. I really really want a way to protect clusters at least from >= 1.30.x One possibility would be modeling the strictness as a condition in VAP status, which already exists in the schema. It's a little strange to populate a condition in status on creation server-side, but there's some precedent in CRD. A possible sketch of the behavior:
|
Unfortunately this issue is going to affect all places in K8s using CEL through base environment except for CRD validation rules. The features which are affected the most would be the GAed features like ValidatingAdmissionPolicy and MatchConditions. The above approach works great for VAP but I am not so sure about the matchConditions added in Webhook. I don't think we have status for webhook currently... |
What happened?
Thanks for @benluddy who raised the issue and investigated!
We didn't register the cost estimator for extended library hence the cost calculation is inaccurate for expression including extended library.
What did you expect to happen?
The cost should reflect the extended library
/cc @jpbetz @benluddy
/sig api-machinery
/triage accepted
How can we reproduce it (as minimally and precisely as possible)?
An example would be:
The cost of the above expression exceeds the per expression limit but didn't fail.
Anything else we need to know?
No response
Kubernetes version
Cloud provider
OS version
Install tools
Container runtime (CRI) and version (if applicable)
Related plugins (CNI, CSI, ...) and versions (if applicable)
The text was updated successfully, but these errors were encountered: