New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
KEP 4382: add semantic model for dynamic resource allocation with simple counter #4385
Conversation
pohly
commented
Jan 5, 2024
- One-line PR description: initial draft
- Issue link: DRA: semantic model: simple counter #4382
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
with Dynamic Resource Allocation (DRA). | ||
|
||
This KEP defines a numeric model that roughly corresponds in functionality to | ||
extended resources. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
extended resources. | |
countable extended resources. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do we know why the count is int64, in scenarios like timesharing full GPUs are allocated for a predetermined time for such use cases will making count type float make sense?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
First, it's for feature parity with extended resources. Those are also integers.
Second, I'm wary of rounding errors. With integers, it's guaranteed that a + b + c = x
and x - a -b - c = 0
. Not so with floats.
NodeName string `json:"nodeName"` | ||
|
||
// DriverName is the name of the driver on that node. | ||
DriverName string `json:"driverName"` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can there be multiple drivers on a specific node. Does this handle allocations by multiple drivers?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The NodeResourceCapacity is per-node, per-driver, per-resource-instance. This field is necessary for the numeric model to determine which capacity gets freed when this allocation is removed.
|
||
// InstanceID is the unique ID chosen by that driver for the resource | ||
// instance that was picked for the claim. | ||
InstanceID types.UID `json:"instanceID"` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To clarify: If a DRA plugin assigns multiple underlying devices (e.g. GPUs) to a claim (i.e. Count > 1
) this does not map to a single resource. Does that mean that either:
- The plugin needs to keep track of the association between
InstanceID
and the individual resources - This field needs to be repeatable?
If we are considering 1 above. Where is this state persisted so that this remains usable when things get deleted as below.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's responsibility of the DRA kubelet plugin to generated the instance ID and then map this field back to whatever the underlying hardware IDs are.
this remains usable when things get deleted as below.
"as below" - which section are you referring to?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am refering to:
This ensures that the claim
is usable also when class or claim parameters get deleted. It also avoids
having to grant the kubelet read permission for arbitrary parameter resources.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That comment is mostly about ClassParameters
and ClaimParameters
: those are there for the DRA driver so that it can extract its configuration parameters.
The InstanceID
needs to be checked by the DRA driver in case that it advertises more than one instance per node.
The numeric model implementation also uses it to match the allocated claims against the available resources in case that the kube-scheduler gets restarted and the numeric model needs to reconstruct which resources are in use.
# The following PRR answers are required at alpha release | ||
# List the feature gate name and the components for which it must be enabled | ||
feature-gates: | ||
- name: DRACounterNumericModel |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Question: Are feature gate names basic strings? What about DRANumericModel.Counter
or even DRANumericModel/Counter
to indicate the heirarchy?
Here I assume that this depends on DRANumbicModel
being enabled as well, or can these be toggled individually?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think they are supposed to be plain strings. I'm not sure whether it's enforced anywhere, but none of the existing feature names have punctuation.
The DRANumericParameters
feature must be enabled for DRACounterNumericModel
to have an effect.
The types no longer get encoded/decoded based on their builtin TypeMeta, so that can be removed. Having a separate Parameters type is cleaner and avoids exposing fields like ObjectMeta which make no sense when requesting capacity.
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: pohly The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
@pohly: Closed this PR. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |