Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

dra scheduler: ensure that we never have nil claim/class parameters #123828

Merged
merged 2 commits into from Mar 11, 2024

Conversation

klueska
Copy link
Contributor

@klueska klueska commented Mar 8, 2024

What type of PR is this?

/kind bug

What this PR does / why we need it:

This PR adds logic to define default claim/class parameters that will allow
claim allocation to proceed even if an end user doesn't provide claim
parameters themselves.

Without this change, the scheduler was crashing in newClaimController() in pkg/scheduler/framework/plugins/dynamicresources/structuredparameters.go

The code in newClaimController() assumes that claim parameters are not nil.
Furthermore it assumes that there is at least one DriverRequest populated in
order to allocate any resources to a claim.

However, we shouldn't force a user to supply a ResourceClaimParameters
object if they don't want to. We also shouldn't force a driver implementor to
write a controller just to associate a static ResourceClaimParameters object
with every claim.

In this PR, the "default" claim parameters objects are hard-coded in the
scheduler code. This won't work going forward as we add new structured
models though. What we probably want is a field in the ResourceClass which
points to a default ResourceClaimPrameters object that should be used if one
is not supplied as part of the claim.

None

@k8s-ci-robot k8s-ci-robot added release-note-none Denotes a PR that doesn't merit a release note. kind/bug Categorizes issue or PR as related to a bug. size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Mar 8, 2024
@k8s-ci-robot
Copy link
Contributor

This issue is currently awaiting triage.

If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added the needs-priority Indicates a PR lacks a `priority/foo` label and requires one. label Mar 8, 2024
@klueska
Copy link
Contributor Author

klueska commented Mar 8, 2024

/assign @pohly

@k8s-ci-robot k8s-ci-robot added sig/node Categorizes an issue or PR as relevant to SIG Node. sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. and removed do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Mar 8, 2024
@klueska
Copy link
Contributor Author

klueska commented Mar 9, 2024

I'm realizing this is not sufficient. I think we need a fully populated "default" claim parameters object that requests a single resource with no configuration parameters.

@k8s-ci-robot k8s-ci-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. labels Mar 9, 2024
@klueska klueska force-pushed the non-nil-parameters branch 2 times, most recently from 603f049 to e69c531 Compare March 9, 2024 01:24
@klueska klueska changed the title dra scheduler: ensure that we never have nil Claim/Class parameters dra scheduler: ensure that we never have nil claim/class parameters Mar 9, 2024
@klueska
Copy link
Contributor Author

klueska commented Mar 9, 2024

I have currently hard-coded a "default" claim parameters object in the scheduler code. This won't work going forward as we add new structured models though. What we probably want is a field in the ResourceClass which points to a default ResourceClaimPrameters object that should be used if one is not supplied as part of the claim.

@pohly
Copy link
Contributor

pohly commented Mar 9, 2024

This PR adds logic to define default claim/class parameters that will allow
claim allocation to proceed even if an end user doesn't provide claim
parameters themselves.

I wasn't sure whether we want that. The E2E tests use selector: "true", just as in your default. But you are right, this is unnecessarily complex for users. And of course it shouldn't crash... 😅

What we probably want is a field in the ResourceClass which points to a default ResourceClaimPrameters object that should be used if one is not supplied as part of the claim.

... because without claim parameters, the scheduler doesn't know the claim is for a resource using model "named resources" or some future "some-other-resources". Makes sense.

As a stop-gap solution this is okay.

/lgtm
/assign @alculquicondor

For approval.

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Mar 9, 2024
@k8s-ci-robot
Copy link
Contributor

LGTM label has been added.

Git tree hash: da955e7c2a17de521426b6f9513850323e009ed1

@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Mar 9, 2024
@ravisantoshgudimetla
Copy link
Contributor

ravisantoshgudimetla commented Mar 9, 2024

I'm looking into implementing a custom driver and noticed this issue. I like the defaulting idea, if you folks are fine with it, I can take care of it.

@pohly
Copy link
Contributor

pohly commented Mar 11, 2024

@klueska: we now have the second commit from #123831 also in this PR. Can you take it out again?

@klueska
Copy link
Contributor Author

klueska commented Mar 11, 2024

I reversed the order so that the other one can merge without further approval. It just needs a new LGTM (and to be added to the milestone of course).

@pohly
Copy link
Contributor

pohly commented Mar 11, 2024

@ravisantoshgudimetla: let's discuss in #123858.

Previously we were returning the error string from 'err' (which is nil), when
we should have been returning it from result.Error. Without this it is hard to
debug issues with NodeUnprepareResources.

Signed-off-by: Kevin Klues <kklues@nvidia.com>
for _, f := range classParameters.Filters {
if f.DriverName == driverName && f.ResourceFilterModel.NamedResources != nil {
filter = f.ResourceFilterModel.NamedResources
break
}
}
controller, err := namedresourcesmodel.NewClaimController(filter, perDriver.requests)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Was the existing behavior equivalent in this case?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks the same to me. classParameters is now guaranteed to not be nil (a case that I had considered, in contrast to nil claim parameters), so the if check is not needed anymore. If it had been nil before, it's now empty, so the for loop doesn't do anything, just as before.

Without this, the scheduler was crashing in newClaimController() in
pkg/scheduler/framework/plugins/dynamicresources/structuredparameters.go

The code in newClaimController() assumes that the parameters are not nil.
Furthermore it assumes that there is at least one DriverRequest populated in
order to allocate any resources to a claim.

This PR adds logic to define default claim/class parameters that will allow
allocation to proceed even if an end user doesn't provide any class or claim
parameters themselves.

Signed-off-by: Kevin Klues <kklues@nvidia.com>
// match the default to how the resources of this
// class are being advertized in a ResourceSlice.
NamedResources: &resourcev1alpha2.NamedResourcesRequest{
Selector: "true",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't like that we would have to parse this selector. I would imagine that a case of "no parameters" is fairly common, so we should optimize for by understanding nil.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The problem is that the scheduler doesn't inherently know which resource request model a driver is using to advertise its resources. We only know this by parsing this object and seeing which field is set (in this case NamedResources).

As such, I think a better approach would be to allow a cluster admin to define a default ResourceClaimParameters object and link it to a resource class. However, this is a larger chunk of work (and we are past the code freeze), so I proposed adding this PR as a stop-gap for 1.30.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The problem is that the scheduler doesn't inherently know which resource request model a driver is using.

Does it matter? The point is that it passes the "filter", regardless of the model.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll let @pohly comment on that as I'm not intimately familiar with all of the code in this plugin. All I know is some default should be set and the default should allow sharing. Whether that's through an explicit object or a branch off of detecting nil I'll leave up to Patrick.

Copy link
Contributor

@pohly pohly Mar 11, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like the approach in this PR better because the rest of the code currently expects to determine the model and thus the resources based on the parameters. Making it handle nil adds more special cases.

There's also a clear path towards what we really want (configurable default parameters). We need those because detecting the model may not be enough, we also need to figure out what the filters for a claim with no parameters should be. This depends on the driver and/or cluster, not the model.

#123858 is tracking this next step.

@alculquicondor
Copy link
Member

/lgtm
/approve
Please have a test case for this before the freeze

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Mar 11, 2024
@k8s-ci-robot
Copy link
Contributor

LGTM label has been added.

Git tree hash: a55bff3202752fa89898e36a5714630b01d5ba17

@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: alculquicondor, klueska

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Mar 11, 2024
@pohly
Copy link
Contributor

pohly commented Mar 11, 2024

I am adding an E2E test for this. Let's prioritize adding the fix first.

@pohly
Copy link
Contributor

pohly commented Mar 11, 2024

/cc @kubernetes/release-team

Another bug fix for a new feature in 1.30, please add the milestone.

@k8s-ci-robot
Copy link
Contributor

@pohly: GitHub didn't allow me to request PR reviews from the following users: kubernetes/release-team.

Note that only kubernetes members and repo collaborators can review this PR, and authors cannot review their own PRs.

In response to this:

/cc @kubernetes/release-team

Another bug fix for a new feature in 1.30, please add the milestone.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@katcosgrove
Copy link

/milestone v1.30

@k8s-ci-robot k8s-ci-robot added this to the v1.30 milestone Mar 11, 2024
@k8s-ci-robot k8s-ci-robot merged commit 3ec6a38 into kubernetes:master Mar 11, 2024
17 checks passed
@bart0sh bart0sh added this to Triage in SIG Node PR Triage Mar 11, 2024
@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. release-note-none Denotes a PR that doesn't merit a release note. and removed release-note-none Denotes a PR that doesn't merit a release note. release-note Denotes a PR that will be considered when it comes time to generate release notes. labels Mar 13, 2024
@bart0sh bart0sh moved this from Triage to Done in SIG Node PR Triage Mar 13, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. area/kubelet cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/bug Categorizes issue or PR as related to a bug. lgtm "Looks good to me", indicates that a PR is ready to be merged. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. release-note-none Denotes a PR that doesn't merit a release note. sig/node Categorizes an issue or PR as relevant to SIG Node. sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. size/M Denotes a PR that changes 30-99 lines, ignoring generated files.
Projects
Development

Successfully merging this pull request may close these issues.

None yet

6 participants