New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
KEP-3063 and KEP-4381: DRA: "Structured Parameters" as base, "classic" DRA as extension #4585
Conversation
a3851ba
to
8da56b7
Compare
We decided to focus on structured parameters as the part of DRA which will get moved to beta first. "Classic" DRA (PodSchedulingContext, support for control plane controllers) will remain in alpha. It needs a separate feature gate. Therefore the description of DRA in the two KEPs gets shifted around such that the "structured parameters" KEP is a stand-alone, complete description of everything needed for using DRA like that. The original KEP is now an extension of that and describes PodSchedulingContext and the interaction with a DRA driver control plane controller. Immediate allocation is defined there because it does not make much sense for structured parameters. Mostly this was just a matter of copying text around and updating the wording. Changes that go beyond those editorial changes are: - title changes - new diagram for "structured parameters" - ResourceClass.StructuredParameters in 4381 became ResourceClass.ControlPlaneController in 3063 (different default, breaking change!) - new DRAControlPlaneController feature gate for everything in 3063 - removal of the kubelet v1alpha2 gRPC interface (kubernetes/kubernetes#124316). - updated description of ephemeral ResourceClaim namimg (did not match the current implementation with generated names) - added NodeListAndWatchResources kubelet gRPC call
8da56b7
to
3159f89
Compare
// | ||
// Resource drivers have a unique name in forward domain order | ||
// (acme.example.com). | ||
DriverName string |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, we don't need this anymore for structured parameters. As the description says, this determines which control plane controller to use, but there is none when using structured parameters.
Unless someone can think of some use for this, I'll move it to the other KEP.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
This also replaces the ControlPlaneController
boolean, leaving just one field in ResourceClass that determines whether to use a DRA driver controller for allocation. I chose ControllerName
instead of DriverName
to make it clearer that this field enables something related to the "Control Plane Controller" KEP - but as usual, naming is tenative....
// Setting this field is optional. If null, all nodes are candidates. | ||
// +optional | ||
SuitableNodes *core.NodeSelector | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wild idea... None of the fields above in the ResourceClass are absolutely required for structured parameters. Should the ResourceClass field in a ResourceClaim field also be optional?
IMHO we should keep it mandatory. When it is mandatory, cluster admins can be sure that users are using one of the classes that were set up by the admin. The admin can use that power to restrict access (SuitableNodes filter) or to set some parameters consistently (ParametersRef).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instead, we can make ResourceClaim.ResourceClassName
optional and add a default, like literally default
. The users don't need to bother with figuring out which classes might exist in any given cluster and admins who want to influence allocation can do it reliably.
The implication for DRA drivers is that they don't need to provide a ResourceClass as part of their install instructions.
I actually like this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is the difference between your first and second comment? In the first you say it should be mandatory, and in the second you say it should be optional. Are you talking about a different ResourceClaim
in each comment (i.e. the actual object vs. the field in the pod spec)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same field (ResourceClaim.ResourceClassName
), different alternatives: either keeping it mandatory or making it optional with a default.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If there is heterogeneity in the cluster, isn't the case admins have to use ReseourceClassName?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So instead of requiring a ResourceClassName
, you're suggesting to do anything useful we either need to:
- Include an explicit
ResourceClaimParameters
(which can then be used to figure out which driver is in charge of providing resources for the claim), or; - Actually provide a
ResourceClassName
, where the admin will now be required to set a defaultResourceClaimParameters
(in lieu of setting a requiredDriverName
field).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@asm582: in the first example it is not clear what using "gpu.nvidia.com" does. It depends on the defaults associated with that class.
If we decide to make ResourceClassName
optional with a way to have a default class, then there several options:
- Fixed name for the default resource class ("default" ?).
- Use a field or annotation to mark one of many classes as the default, similar to storage classes.
This storage class API with annotation is unusual, a field is probably preferred. I don't know whether making a storage class default configurable like that is actually considered a good UX - probably worth asking the SIG Storage folks.
@klueska: basically yes. Another variation would be to allow "ResourceClassName set, no default parameters" and treat it like the empty claim.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's keep this simple:
- ResourceClassName may be empty.
- If no parameters are provided (neither in claim nor class), nothing needs to be allocated. Same for empty claim.
- No default ResourceClass.
I updated the text with this proposal: https://github.com/kubernetes/enhancements/compare/4fe1b58595ea013e08a4ede9930e7a5d244e9e68..d028f775ae50f60c66bfdcb7b28a8aedb4a6563b
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is fine here. But I think we need to iterate on these kinds of decisions as we build out the prototype. The results of that will eventually land in here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@klueska: Are you also okay with the ResourceClassName update above?
@johnbelamaric: I add an UNRESOLVED to the API section. It includes some of the examples of things that we might want to change, but doesn't need to be complete - the important part is to have that marker.
Let's LGTM and approve this update so that we can move on.
Structured parameters don't use the ResourceClass driver name for anything. All that matters are the driver names in the individual requests and/or filters.
4fe1b58
to
d028f77
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So, I am ok with this as a re-org of existing functionality. But we should add an UNRESOLVED tag letting people know that the API is under active development. I expect a number of changes to come out of that work.
basis. Because the new API is going to be implemented independently of the | ||
existing device plugin support, there's little risk of breaking stable APIs. | ||
|
||
* Provide an abstraction layer for resource requests, i.e., something like a |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we are going to do this for 1.31. But I am reviewing just "re-organization" here. The details like this will change as we approach 1.31 KEP deadlines.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Exactly that was the intend. If someone wants to make "abstraction layer for resource requests", they'll have to prepare a KEP update with details how to do it.
// Setting this field is optional. If null, all nodes are candidates. | ||
// +optional | ||
SuitableNodes *core.NodeSelector | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is fine here. But I think we need to iterate on these kinds of decisions as we build out the prototype. The results of that will eventually land in here.
47a4d66
to
e61fc51
Compare
/approve |
Reorganizing the KEPs this way makes alof of sense. It flips the way things were originally designed by turning the structured parameters way of doing DRA into the "default" and the opaque parameters way of doing DRA into an extension. This is consistent with the current thinking on how we want to present DRA going forward. By making this change we can more easily update these newly restructured KEPs with any future extensions in a clear and consistent manner. /approve /cc @mrunalp for final approval from sig-node |
@klueska: GitHub didn't allow me to request PR reviews from the following users: approval, from, for, final. Note that only kubernetes members and repo collaborators can review this PR, and authors cannot review their own PRs. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: johnbelamaric, klueska, pohly The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
/hold In case that @mrunalp wants to have a look (approval from John was enough?). |
Looks like I was too slow with my hold. I think that's fine, if there's anything controversial in it we can always update again. |
Oops. Yeah, I am not a SIG Node approver but I am root in the enhancements repo. Forgot to hold it, sorry about that. |
Sorry, just catching up here, this organization sounds good. Thanks! |
One-line PR description: "Structured Parameters" as base, "classic" DRA as extension
Issue links:
We decided to focus on structured parameters as the part of DRA which will get moved to beta first. "Classic" DRA (PodSchedulingContext, support for control plane controllers) will remain in alpha. It needs a separate feature gate.
Therefore the description of DRA in the two KEPs gets shifted around such that the "structured parameters" KEP is a stand-alone, complete description of everything needed for using DRA like that.
The original KEP is now an extension of that and describes PodSchedulingContext and the interaction with a DRA driver control plane controller. Immediate allocation is defined there because it does not make much sense for structured parameters.
Mostly this was just a matter of copying text around and updating the wording. Changes that go beyond those editorial changes are:
title changes
new diagram for "structured parameters"
ResourceClass.StructuredParameters in 4381 and ResourceClass.DriverName became ResourceClass.ControllerName in 3063 (different default, breaking change!); structured parameters work without a driver name in the resource class.
Add ResourceClass.DefaultClaimParameters in 4381, which allows vendors to define what a claim with just a class name requests.
ResourceClaim.ResourceClassName is optional for structured parameters.
new DRAControlPlaneController feature gate for everything in 3063
removal of the kubelet v1alpha2 gRPC interface (DRA: remove support for v1alpha2 kubelet gRPC API kubernetes#124316).
updated description of ephemeral ResourceClaim naming (did not match the current implementation with generated names)
added NodeListAndWatchResources kubelet gRPC call