RFC: KEP-4381: DRA: avoid kubelet API version dependency with REST proxy #4615

Open · wants to merge 1 commit into master

Conversation

@pohly (Contributor) commented May 3, 2024

By tunneling REST requests from drivers through kubelet to the API server it becomes possible to update drivers and the control plane to a newer release than the kubelet because the kubelet doesn't need to encode/decode the resource.k8s.io API types that are used by the drivers. This simplifies cluster upgrades.
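To make the idea concrete, here is a minimal sketch of the kubelet-side forwarding this proposes. The message, package, and function names are invented for this sketch and do not come from the KEP; the point is only that kubelet moves opaque bytes, so it never has to understand which resource.k8s.io version the driver speaks.

```go
// Illustration only: names below are hypothetical, not part of the KEP or any Kubernetes API.
package restproxy

import (
	"bytes"
	"context"
	"io"
	"net/http"
)

// RESTRequest is what a DRA driver would send over the gRPC tunnel.
type RESTRequest struct {
	Method string // e.g. "PUT"
	Path   string // e.g. "/apis/resource.k8s.io/<version>/resourceslices/<name>"
	Body   []byte // serialized object, opaque to kubelet
}

// RESTResponse carries the API server's answer back to the driver.
type RESTResponse struct {
	StatusCode int
	Body       []byte
}

// forward proxies one request using kubelet's own API server client. Because the
// body is passed through as raw bytes, kubelet never decodes the resource.k8s.io
// types, which is what removes the version dependency.
func forward(ctx context.Context, client *http.Client, apiServer string, req RESTRequest) (*RESTResponse, error) {
	httpReq, err := http.NewRequestWithContext(ctx, req.Method, apiServer+req.Path, bytes.NewReader(req.Body))
	if err != nil {
		return nil, err
	}
	resp, err := client.Do(httpReq)
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()
	body, err := io.ReadAll(resp.Body)
	if err != nil {
		return nil, err
	}
	return &RESTResponse{StatusCode: resp.StatusCode, Body: body}, nil
}
```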


Keeping kubelet at some old release while upgrading the control and DRA drivers
is desirable and officially supported by Kubernetes. To support the same when
using DRA, the kubelet now provides a REST proxy via gRPC that can be used by
Contributor:

This seems generally useful for a range of version skew scenarios between on-node agents and the apiserver, as well as security controls.

I'm slightly concerned as to the scalability of this approach (kubelet acting as gateway), but the upsides involved for managing the impact of per-node agents are very interesting.

Member:

> I'm slightly concerned as to the scalability of this approach (kubelet acting as gateway)

OTOH, kube-apiserver will be rate-limiting requests anyway, so I doubt that kubelet would really be the gatekeeper here; those requests would probably be throttled server-side in any case.

remain the same.

The alternative is to keep supporting older revisions for the sake of the kubelet.
<<[/UNRESOLVED]>>
pohly (author):

I think it's fine if kubelet uses a certain version of the API for its own purposes. We can ensure that the apiserver provides a version of the API that is needed by an older kubelet as far back as we want to support version skew.

For example, suppose we switch to v1alpha3 in 1.31, a breaking change. Then in 1.32, we introduce v1beta1. Normally we would remove v1alpha3, but to try out version skew support, we can keep v1alpha3. Ideally this won't be hard as we don't plan any big changes for the API in 1.32. Then kubelet from v1.31 can be combined with drivers and control plane from v1.32 where both are using v1beta1.

By tunneling REST requests from drivers through kubelet to the API server it
becomes possible to update drivers and the control plane to a newer release
than the kubelet because the kubelet doesn't need to encode/decode the
resource.k8s.io API types that are used by the drivers. This simplifies cluster
upgrades.
Comment on lines +2100 to +2101
convenience because for a PUT of a ResourceSlice, the driver has to be trusted
to not create an object for some other driver on the node. This cannot be
Member:

This is a downside of the proxy path, right? The unstructured / dynamic client aggregation approach could have checked the driver association field, right?

pohly (author):

Correct.

However, the "unstructured / dynamic client" then has other downsides. Right now, it can check this because we assume that all future ResourceSlices will be of the form "nodeName + driverName + other fields". Without knowing anything about "other fields", it's hard to write a controller that synchronizes the content in the ResourceSlice objects with the desired content - not impossible, but harder.

The other downside, if we consider apiserver traffic, is the use of JSON for the request bodies (in both directions) instead of protobuf.

Member:

Why do we want kubelet in the middle here if drivers are going to be doing more than write-only status reporting? If they are watching / getting / reading / reconciling status, that seems more like a normal client to me. Is the only reason for the node proxy so they can piggy-back on kubelet credentials and get authorized?

pohly (author):

The advantage is indeed that the node authorizer continues to work.

Member:

but... not really the way we want, since it lets devices stomp each other's status?

pohly (author):

Drivers on the same node have to trust each other. They often run with elevated privileges, so they can already do much harm locally even without this shared access to ResourceSlice objects of the node.

That's still better than having to trust all drivers anywhere in the cluster.

Member:

I feel like this is the ideal scenario for us to build out the authz capabilities of node bound SA tokens (possibly combined with #4600). We have a good story for node agents that want to perform write requests via VAP+SA node ref. Having something similar to that for reads seems broadly useful. Building an entire gRPC proxy as a one-off for DRA seems like the wrong approach.

pohly (author):

The REST proxy could be reused. But I agree, figuring out how to do this through normal auth mechanisms is the better approach. That would be out-of-scope for this KEP, though.

Instead, I'll put in some wording along the lines of:

  • DRA kubelet plugins must connect to the API server themselves to publish their own ResourceSlice objects and to read ResourceClaim objects.
  • Limiting access of the plugin is the responsibility of whoever deploys the driver. Some methods and best practices may get documented separately.

Does that sound okay? Then I'll close this PR and create a different one with that approach.

Member:

@pohly I added this to the SIG Auth meeting for May 22nd. Hopefully we can hash out a concrete path forward.

pohly (author):

It gets even more interesting with some use cases that people have brought up for DRA: a DRA driver for a NIC runs on the node and must publish network-related information, ideally to the ResourceClaim status. It must do that itself because kubelet might not know what that information is due to version skew.

The node authorizer can limit write access to ResourceClaims that are actually in use on the node because it builds the graph (node -> pod -> claim). In contrast to ResourceSlice, there isn't a specific field in a ResourceClaim that can be checked.

[CSI](https://github.com/container-storage-interface/spec/blob/master/spec.md),
with “volume” replaced by “resource” and volume specific parts removed.

#### REST proxy
Member:

/sig api-machinery auth
/cc @jpbetz @deads2k @enj

For visibility to a kube-apiserver <-- kubelet <-- client proxied API surface proposal

The read part of this is similar in some ways to what was proposed in https://github.com/kubernetes/enhancements/tree/master/keps/sig-node/4188-kubelet-pod-readiness-api#proposed-api

  1. the reads / watches are passed through to kube-apiserver with the kubelet's credentials
  2. the API is a complete proxy, so client-go is usable (though tunneled over gRPC, which I'm not sure we've done before)
  3. the API allows drivers to write through to kube-apiserver using the kubelet's credentials

A lot of the discussion around that proposal seems relevant to this one (sig-arch 2023-08-24, https://docs.google.com/document/d/1BlmHq5uPyBUDlppYqAAzslVbAO8hilgjqZUTaNXUhKM/edit#bookmark=id.vud1o04xj4iv, https://www.youtube.com/watch?v=toN7t_y4zCk&t=22m45s)

pohly (author):

> The read part of this is similar in some ways to what was proposed in https://github.com/kubernetes/enhancements/tree/master/keps/sig-node/4188-kubelet-pod-readiness-api#proposed-api

One key difference is that I am not proposing to add a new kubelet socket. I was worried about security implications around that. Instead, kubelet connects to plugins the same way as before and takes requests through one stream per plugin which is kept open by each plugin.
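One way to read that, purely as illustration and continuing the hypothetical types and forward helper from the sketch after the PR description: kubelet dials the plugin socket as it already does, and over the resulting long-lived bidirectional stream the roles are reversed, with the plugin pushing REST requests and kubelet answering them. The stream interface below stands in for whatever gRPC stream type would eventually be generated; it is not a real API.

```go
// Hypothetical kubelet-side loop for the per-plugin stream; "tunnel" is a stand-in
// for a generated bidirectional gRPC stream type (not yet defined by any proto).
type tunnel interface {
	Recv() (*RESTRequest, error) // next REST request pushed by the plugin
	Send(*RESTResponse) error    // answer sent back to the plugin
}

// serveTunnel blocks on the stream that the plugin keeps open, forwards each
// request to the API server with kubelet's credentials (see forward above), and
// writes the response back. One such loop would run per registered plugin.
func serveTunnel(ctx context.Context, client *http.Client, apiServer string, t tunnel) error {
	for {
		req, err := t.Recv()
		if err != nil {
			return err // plugin closed the stream or kubelet is shutting down
		}
		resp, err := forward(ctx, client, apiServer, *req)
		if err != nil {
			resp = &RESTResponse{StatusCode: http.StatusBadGateway}
		}
		if err := t.Send(resp); err != nil {
			return err
		}
	}
}
```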

Contributor:

Given a desire to read/write effectively from the kube-apiserver and the development of validatingadmissionpolicy + serviceaccount node claims + downward API for node name that can be used to self-restrict reads (for convenience) and enforce write requirements (must be from the same node), why is the proxy a better and more secure choice?

pohly (author):

I'm not familiar enough with that alternative to do a comparison. Can you point me to documentation?

The ask is to ensure that a DRA driver deployed as a daemonset gets RBAC permissions which allow it to read/create/update ResourceSlice objects where the "nodeName" field is the same as the node on which the pod is running.
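For contrast, here is a rough sketch of what that direct-access pattern looks like from the driver side with plain client-go, assuming the resource.k8s.io/v1alpha2 layout (where nodeName and driverName are top-level ResourceSlice fields, as in Kubernetes 1.30) and a made-up driver name; the node name arrives via the downward API. RBAC alone can grant the verbs used here, but it cannot express "only where nodeName equals my node", which is exactly the gap being discussed.

```go
// Sketch of a DRA driver publishing its own ResourceSlice directly, without the
// kubelet proxy. Driver name and object name are hypothetical.
package main

import (
	"context"
	"fmt"
	"os"

	resourcev1alpha2 "k8s.io/api/resource/v1alpha2"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func main() {
	nodeName := os.Getenv("NODE_NAME")   // injected via the downward API
	const driverName = "gpu.example.com" // hypothetical driver

	cfg, err := rest.InClusterConfig() // the daemonset pod's service account
	if err != nil {
		panic(err)
	}
	client, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		panic(err)
	}

	slice := &resourcev1alpha2.ResourceSlice{
		ObjectMeta: metav1.ObjectMeta{Name: nodeName + "-" + driverName},
		NodeName:   nodeName,   // the field an authorization check would have to pin to this node
		DriverName: driverName, // ideally also restricted, see the driver-name discussion below
		// Actual device/model data omitted in this sketch.
	}
	created, err := client.ResourceV1alpha2().ResourceSlices().Create(context.Background(), slice, metav1.CreateOptions{})
	if err != nil {
		panic(err)
	}
	fmt.Println("published ResourceSlice", created.Name)
}
```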

pohly (author):

For ResourceClaim, the ask is to ensure that the pod only gets read permission for ResourceClaims referenced by pods which have been scheduled to the node.

pohly (author):

@deads2k: I looked through https://kubernetes.io/docs/reference/access-authn-authz/validating-admission-policy and https://kubernetes.io/docs/tasks/configure-pod-container/configure-service-account/. It's not clear to me how I can connect the identity of the driver pod with the node that it is running on, and that identity with rules that restrict what that pod can access.

The node authorizer uses hand-crafted Go code for this.

Member:

See kubernetes/kubernetes#124711 for the VAP+SA example David is referring to.

Note that admission only covers writes, so reads cannot be restricted in this way.

pohly (author):

This seems to be the key part:
`object.metadata.name == request.userInfo.extra["authentication.kubernetes.io/node-name"][0]`

Does object provide access to any field?

Is it possible to extend what gets added to the credentials? If there was a dra.kubernetes.io/driver-name, then the same check could be done for the driverName field in ResourceSlice. That would be better than what the REST proxy can do.

I'm fine with dropping the read check on ResourceClaim. The node authorizer also has a "loophole" there because it doesn't limit what data gets returned by watches.
