Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

KEP 4382: add semantic model for dynamic resource allocation with simple counter #4385

Closed
wants to merge 2 commits into from

Conversation

pohly
Copy link
Contributor

@pohly pohly commented Jan 5, 2024

  • One-line PR description: initial draft

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Jan 5, 2024
@k8s-ci-robot k8s-ci-robot added the kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory label Jan 5, 2024
@k8s-ci-robot k8s-ci-robot added sig/node Categorizes an issue or PR as relevant to SIG Node. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Jan 5, 2024
@bart0sh bart0sh added this to Triage in SIG Node PR Triage Jan 7, 2024
Copy link
Contributor

@elezar elezar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @pohly. I have taken an initial pass. I assume that this KEP doesn't discuss the addition of the NodeResourceCapacity because this would already be implemented as part of addressing #4384.

with Dynamic Resource Allocation (DRA).

This KEP defines a numeric model that roughly corresponds in functionality to
extended resources.
Copy link
Contributor

@elezar elezar Jan 9, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
extended resources.
countable extended resources.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we know why the count is int64, in scenarios like timesharing full GPUs are allocated for a predetermined time for such use cases will making count type float make sense?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

First, it's for feature parity with extended resources. Those are also integers.

Second, I'm wary of rounding errors. With integers, it's guaranteed that a + b + c = x and x - a -b - c = 0. Not so with floats.

NodeName string `json:"nodeName"`

// DriverName is the name of the driver on that node.
DriverName string `json:"driverName"`
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can there be multiple drivers on a specific node. Does this handle allocations by multiple drivers?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The NodeResourceCapacity is per-node, per-driver, per-resource-instance. This field is necessary for the numeric model to determine which capacity gets freed when this allocation is removed.


// InstanceID is the unique ID chosen by that driver for the resource
// instance that was picked for the claim.
InstanceID types.UID `json:"instanceID"`
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To clarify: If a DRA plugin assigns multiple underlying devices (e.g. GPUs) to a claim (i.e. Count > 1) this does not map to a single resource. Does that mean that either:

  1. The plugin needs to keep track of the association between InstanceID and the individual resources
  2. This field needs to be repeatable?

If we are considering 1 above. Where is this state persisted so that this remains usable when things get deleted as below.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's responsibility of the DRA kubelet plugin to generated the instance ID and then map this field back to whatever the underlying hardware IDs are.

this remains usable when things get deleted as below.

"as below" - which section are you referring to?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am refering to:

This ensures that the claim
is usable also when class or claim parameters get deleted. It also avoids
having to grant the kubelet read permission for arbitrary parameter resources.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That comment is mostly about ClassParameters and ClaimParameters: those are there for the DRA driver so that it can extract its configuration parameters.

The InstanceID needs to be checked by the DRA driver in case that it advertises more than one instance per node.

The numeric model implementation also uses it to match the allocated claims against the available resources in case that the kube-scheduler gets restarted and the numeric model needs to reconstruct which resources are in use.

# The following PRR answers are required at alpha release
# List the feature gate name and the components for which it must be enabled
feature-gates:
- name: DRACounterNumericModel
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Question: Are feature gate names basic strings? What about DRANumericModel.Counter or even DRANumericModel/Counter to indicate the heirarchy?

Here I assume that this depends on DRANumbicModel being enabled as well, or can these be toggled individually?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think they are supposed to be plain strings. I'm not sure whether it's enforced anywhere, but none of the existing feature names have punctuation.

The DRANumericParameters feature must be enabled for DRACounterNumericModel to have an effect.

@bart0sh bart0sh moved this from Triage to Needs Reviewer in SIG Node PR Triage Jan 9, 2024
@pohly pohly changed the title KEP 4382: add counter numeric model for dynamic resource allocation KEP 4382: add semantic model for dynamic resource allocation with simple counter Jan 31, 2024
The types no longer get encoded/decoded based on their builtin TypeMeta, so
that can be removed. Having a separate Parameters type is cleaner and avoids
exposing fields like ObjectMeta which make no sense when requesting capacity.
@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: pohly
Once this PR has been reviewed and has the lgtm label, please assign johnbelamaric for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@pohly
Copy link
Contributor Author

pohly commented Feb 6, 2024

/close

Will be replaced by a more complete semantic model that'll be defined directly in #4384, aka KEP #4381.

SIG Node PR Triage automation moved this from Needs Reviewer to Done Feb 6, 2024
@k8s-ci-robot
Copy link
Contributor

@pohly: Closed this PR.

In response to this:

/close

Will be replaced by a more complete semantic model that'll be defined directly in #4384, aka KEP #4381.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory sig/node Categorizes an issue or PR as relevant to SIG Node. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files.
Projects
Archived in project
Development

Successfully merging this pull request may close these issues.

None yet

4 participants