Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DRA: structured parameters #123516

Merged
merged 18 commits into from Mar 8, 2024

Conversation

pohly
Copy link
Contributor

@pohly pohly commented Feb 26, 2024

What type of PR is this?

/kind feature

What this PR does / why we need it:

"Structured parameters" are a new approach for handling scheduling and autoscaling with dynamic resource allocation: drivers publish information in a format defined and understood by Kubernetes, the scheduler handles the claim allocation without communicating with some DRA driver controller. The Cluster Autoscaler can use the information to simulate scale up and down.

Which issue(s) this PR fixes:

Related-to: kubernetes/enhancements#4381

Special notes for your reviewer:

To keep the PR manageable, the new functionality gets introduced in stages with one commit building on top of the previous:

  • in-tree API types without any specific structured parameter model
  • the "named resources" structured parameter model: this commit can serve as reference for the future where a new model needs to be added
  • node authorizer + admission
  • scheduler
  • scheduler_perf integration test
  • kubelet

Does this PR introduce a user-facing change?

Dynamic Resource Allocation: DRA drivers may now use "structured parameters" to let the scheduler handle claim allocation.

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:

- [KEP]:  https://github.com/kubernetes/enhancements/issues/4381

@k8s-ci-robot
Copy link
Contributor

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@k8s-ci-robot k8s-ci-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. release-note Denotes a PR that will be considered when it comes time to generate release notes. kind/feature Categorizes issue or PR as related to a new feature. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. labels Feb 26, 2024
@k8s-ci-robot k8s-ci-robot added area/apiserver area/code-generation area/kubelet area/test kind/api-change Categorizes issue or PR as related to adding, removing, or otherwise changing an API sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. sig/apps Categorizes an issue or PR as relevant to SIG Apps. sig/auth Categorizes an issue or PR as relevant to SIG Auth. sig/node Categorizes an issue or PR as relevant to SIG Node. sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. sig/storage Categorizes an issue or PR as relevant to SIG Storage. sig/testing Categorizes an issue or PR as relevant to SIG Testing. and removed do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Feb 26, 2024
@pohly pohly changed the title DRA: structured parameters WIP: DRA: structured parameters Feb 26, 2024
@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Feb 27, 2024
@bart0sh bart0sh added this to WIP in SIG Node PR Triage Feb 27, 2024
This adds support for semantic version comparison to the CEL support in the
"named resources" structured parameter model. For example, it can be used to
check that an instance supports a certain API level.

To minimize the risk, the new "semver" type is only defined in the CEL
environment for DRA expressions, not in the base library. See
kubernetes#123664 for a PR which
adds it to the base library.

Validation of semver strings is done with the regular expression from
semver.org. The actual evaluation at runtime then uses semver/v4.
While currently those objects only get published by the kubelet for node-local
resources, this could change once we also support network-attached
resources. Dropping the "Node" prefix enables such a future extension.

The NodeName in ResourceSlice and StructuredResourceHandle then becomes
optional. The kubelet still needs to provide one and it must match its own node
name, otherwise it doesn't have permission to access ResourceSlice objects.
This should better run with multiple nodes, it's more realistic that way.
Storing a modified claim with allocation and the original resource version in
the assume cache was not reliable: if an update was received, it replaced the
modified claim and the resource that was reserved for the claim might have been
used for some other claim.

To fix this, the in-flight claims are now stored in the map instead of just a
boolean and the status stored there overrides whatever is in the assume cache.

Logging got extended to diagnose this problem better. It started to occur in
E2E tests after splitting the claim update so that first the finalizer is set
and then the status, because setting the finalizer triggered an update.
This finishes the shuffling around of test scenarios so that all of them which
make sense with structured parameters are also executed with those.
There are two approaches for making new versioned CEL features available in the
release where they get introduced:
- Always use the environment for "StoredExpressions".
- Use an older version (typically 1.0) and only bump it up later.

The second approach was used before, so this is now also done here.
@pohly pohly force-pushed the dra-structured-parameters branch from 9b453d1 to 6a361e1 Compare March 7, 2024 21:27
@thockin
Copy link
Member

thockin commented Mar 7, 2024

I'll approve for API, who hold the kubelet LGTM?

/approve

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Mar 7, 2024
@k8s-ci-robot
Copy link
Contributor

k8s-ci-robot commented Mar 7, 2024

@pohly: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
pull-kubernetes-e2e-capz-windows-master 8ea1a9a link false /test pull-kubernetes-e2e-capz-windows-master
pull-kubernetes-linter-hints 6a361e1 link false /test pull-kubernetes-linter-hints

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@pohly
Copy link
Contributor Author

pohly commented Mar 7, 2024

@klueska is okay with merging, see #123516 (comment).

The wording isn't exactly LGTM, but I know that he doesn't want any further changes in this PR as far as kubelet is concerned.

@pohly
Copy link
Contributor Author

pohly commented Mar 7, 2024

@MaryamTavakkoli wrote:

/milestone v1.30

but that wasn't sufficient:

You must be a member of the kubernetes/milestone-maintainers GitHub team to set the milestone. If you believe you should be able to issue the /milestone command, please contact your Milestone Maintainers Team and have them propose you as an additional delegate for this responsibility.

Can one of the @kubernetes/milestone-maintainers add the milestone?

The exception has been granted.

@sanchita-07
Copy link
Member

/milestone v1.30

@k8s-ci-robot k8s-ci-robot added this to the v1.30 milestone Mar 8, 2024
@dims
Copy link
Member

dims commented Mar 8, 2024

i'll cast the lgtm based on @klueska 's comments above and with my sig-node reviewer hat on.

/approve
/lgtm

thanks @pohly

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Mar 8, 2024
@k8s-ci-robot
Copy link
Contributor

LGTM label has been added.

Git tree hash: df4d9cba0ea326b3ede6b6bc49a86e4fdcc62ca0

@dims
Copy link
Member

dims commented Mar 8, 2024

/skip

@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: alculquicondor, dims, klueska, pohly, soltysh, thockin

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot merged commit 7ea3d02 into kubernetes:master Mar 8, 2024
19 checks passed
SIG Node CI/Test Board automation moved this from PRs - Needs Approver to Done Mar 8, 2024
SIG Node PR Triage automation moved this from WIP to Done Mar 8, 2024
"resource": "resourceslices",
"responseKind": {
"group": "",
"kind": "ResourceSlice",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we update the KEP to capture the name change and other deviations?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, absolutely.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@cici37 cici37 added the triage/accepted Indicates an issue or PR is ready to be actively worked on. label Apr 11, 2024
@k8s-ci-robot k8s-ci-robot removed the needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. label Apr 11, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. area/apiserver area/code-generation area/dependency Issues or PRs related to dependency changes area/kubectl area/kubelet area/release-eng Issues or PRs related to the Release Engineering subproject area/test cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/api-change Categorizes issue or PR as related to adding, removing, or otherwise changing an API kind/feature Categorizes issue or PR as related to a new feature. lgtm "Looks good to me", indicates that a PR is ready to be merged. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. release-note Denotes a PR that will be considered when it comes time to generate release notes. sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. sig/apps Categorizes an issue or PR as relevant to SIG Apps. sig/auth Categorizes an issue or PR as relevant to SIG Auth. sig/cli Categorizes an issue or PR as relevant to SIG CLI. sig/cluster-lifecycle Categorizes an issue or PR as relevant to SIG Cluster Lifecycle. sig/instrumentation Categorizes an issue or PR as relevant to SIG Instrumentation. sig/node Categorizes an issue or PR as relevant to SIG Node. sig/release Categorizes an issue or PR as relevant to SIG Release. sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. sig/storage Categorizes an issue or PR as relevant to SIG Storage. sig/testing Categorizes an issue or PR as relevant to SIG Testing. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. triage/accepted Indicates an issue or PR is ready to be actively worked on.
Projects
Status: Done
Status: Closed / Done
Status: Done
Development

Successfully merging this pull request may close these issues.

None yet