Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

RFC: KEP-4381: DRA: support vendor-independent attributes #4614

Open
wants to merge 6 commits into
base: master
Choose a base branch
from

Conversation

pohly
Copy link
Contributor

@pohly pohly commented May 2, 2024

While it isn't really feasible for GPUs, other kinds of hardware might be described via some standardized attributes. When removing the explicit driver name from requests and filters it becomes possible for users to ask for an instance where some standardized attribute is set without knowing in advance which driver is going to provide it.

The downside is that CEL expressions now must get applied to all instances, whether they are from the "expected" driver or some other one. It becomes the responsibility of the CEL expression to filter out unknown instances.

While it isn't really feasible for GPUs, other kinds of hardware might be
described via some standardized attributes. When removing the explicit driver
name from requests and filters it becomes possible for users to ask for an
instance where some standardized attribute is set without knowing in advance
which driver is going to provide it.

The downside is that CEL expressions now must get applied to all instances,
whether they are from the "expected" driver or some other one. It becomes the
responsibility of the CEL expression to filter out unknown instances.
@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label May 2, 2024
@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: pohly
Once this PR has been reviewed and has the lgtm label, please assign derekwaynecarr for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory sig/node Categorizes an issue or PR as relevant to SIG Node. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels May 2, 2024
@@ -697,43 +699,51 @@ generatedFrom:
apiGroup: dra.example.com
uid: foobar-uid

vendorParameters:
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was at the wrong level before. I'm just fixing this example here.

# Each entry here is a request for one resource. Entries could have
# their own vendor parameters (not shown here).
subRequests:
- namedResources:
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Before, "requests" inside "driverRequests" made some sense because all "requests" were related to a single driver. Now the only remaining reason is that we need a place to put vendor parameters that are meant for more than one individual resource request.

@klueska: do you think that we can perhaps flatten this after all? Then a claim would have only "requests" where each request has vendor parameters and the selector.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, I'm having a hard time paging this back in. What are you proposing exactly? Can you show me a before and after?

Copy link
Contributor Author

@pohly pohly May 5, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Before is what this KEP has:

requests:
- vendorParameters: <parameters for all sub-requests>
  subRequests:
  - vendorParameters: <parameters for device A>
    namedResources:
      <device A>
  - vendorParameters: <parameters for device B>
    namedResources:
      <device B>

If we were to drop one level, that would become:

requests:
- vendorParameters: <parameters for device A>
  namedResources:
     <device A>
- vendorParameters: <parameters for device B>
   namedResources:
     <device B>

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With one level removed, "common" parameters would need to be repeated for each device.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or set once and then applied to all.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can drop one level of indirection by moving the list of sub-requests into the model. When combined with renamed "named resources" model to "device" model (see Slack), the result looks fairly natural to me.

Here I am also using a one-of-many for the parameters because instead of "vendor" setup parameters we might at some point also have "in-tree" setup parameters.

In addition, I am including the driver name for the vendor parameters. That enables validation with an admission webhook and providing setup parameters for different drivers, which can happen now when the CEL expression allows it

requests:
- setup:
    vendor:
      driverName: cards.dra.example.com
      parameters: <parameters for devices A and B>
  devices:
  - selector: <selector for device A>
    setup:
      vendor:
        driverName: cards.dra.example.com
        parameters: <parameters for device A>
  - selector: selector for device B>
    setup:
      vendor:
        driverName: cards.dra.example.com
        parameters: <parameters for device B>

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Even better, we don't even need the top-level "requests".

See the update that I just pushed.

# their own vendor parameters (not shown here).
subRequests:
- namedResources:
driverName: cards.dra.example.com
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the key conceptual change: the driver name is no longer required at the level above "named resources". Instead, instances of "the driver I want to use" are selected via the CEL expression.

This driverName is just there to simplify the names of attributes in the CEL expression. I'm not sure whether this is important enough to justify the extra complexity.

@bart0sh bart0sh added this to Needs Reviewer in SIG Node PR Triage May 4, 2024
Comment on lines +600 to +601
- name: type
string: GPU # All named resources with this type have the following attributes.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this implying that type becomes a special attribute that the scheduler / other abstractions need to know exists?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No. It's defined by the vendor and exists inside the (in this example) "cards.dra.example.com" namespace.

attributes.quantity["memory"].isGreaterThan(quantity("32Gi"))
```
namedResources:
- attributeSuffix: cards.dra.example.com
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't quite follow this. Does this mean that when i define my resource slice that I have to name my attributes with a particular suffix to match here? Or is this the driver name appended to the end of the attributes I define in my resourceslice?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Attribute names get expanded in two places:

  • In ResourceSlice, the driver name gets appended.
  • In a selector, either the user already uses the fully-qualified name or the "attributeSuffix" gets appended.

I am open for dropping the "attributeSuffix" field. It helps with keeping CEL expressions shorter, but it also makes the API harder to understand.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can also drop the automatism for ResourceSlice and make it a validation requirement that attribute names are fully-qualified. The ResourceSlice object then becomes larger, but more readable.

Copy link
Contributor

@klueska klueska May 6, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I definitely don't want to be that verbose when generating a resource slice. It also feels weird (at least to me) to encode the driver name in the attribute string at all (regardless of the level at which its happening). Would an alternative be to have an optional field for driver at the same level as name/<attribute-type> in cases where you want to match on exactly that driver?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I definitely don't want to be that verbose when generating a resource slice.

So we keep what I have for ResourceSlice now, with short names that inherit the driver name as suffix.

It also feels weird (at least to me) to encode the driver name in the attribute string at all

The problem is that we want to enable selectors that work across different vendors. The attributes then could come from a k8s.io scope (namespace?) or from the scopes of two different vendors, if that is what the user wants: "select a card of vendor A with these attributes or a card of vendor B with these attributes".

Would an alternative be to have an optional field for driver at the same level as name/ in cases where you want to match on exactly that driver?

Field where? The CEL expression is a single string.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I get what you meant. We can rename "attributeSuffix" to "driverName" and change the meaning to "first filter by that name, then check the CEL expression". Should be more efficient, too.

Update pushed.

@pohly pohly changed the title RFC: KEP-4381: DRA: support abstract attributes RFC: KEP-4381: DRA: support vendor-independent attributes May 6, 2024
# At the moment, only setup parameters defined by a vendor
# are supported. In-tree definition of common setup parameters
# might get added in the future.
setup:
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NamedResources *NamedResourcesRequest
// The list may be empty, in which case the claim can be allocated without
// using any instances.
NamedResources *[]NamedResourcesRequest
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Argh!

The gogo protobuf code generator fails for this. It generates len(NamedResources) which does not compile because NamedResources is a pointer. Not sure yet how to resolve this.

I can wrap []NamedResourcesRequest in a struct, but the the YAML and JSON end up with one additional indirection.

This is not a show-stopper for "vendor-independent attributes", but it throws a monkey wrench into the attempt to flatten the YAML at the same time.

Copy link
Contributor Author

@pohly pohly May 8, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How does this look?

requests:
- device:
     selector: <CEL expression for named resources>
  config:
    vendor:
      driverName: cards.dra.example.com
      parameters: <per-request parameters>

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or

request:
- device:
    selector: <CEL expression for named resources>
  config:
    vendor:
      driverName: cards.dra.example.com
      parameters: <per-request parameters>

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Both is assuming that "named resources" gets renamed to "device". For now, it would have "namedResource".

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we have a convention to use plural form for lists?

I'm wondering about request (imperative) vs requests (noun, plural).

@k8s-ci-robot
Copy link
Contributor

@pohly: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
pull-enhancements-verify d42bfb2 link true /test pull-enhancements-verify

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

pohly added a commit to pohly/wg-device-management that referenced this pull request May 13, 2024
This proposal takes the existing KEP as base and includes:
- vendor-independent classes and attributes (kubernetes/enhancements#4614)
- optional allocation (kubernetes/enhancements#4619)
- inline parameters (kubernetes/enhancements#4613)
- management access (kubernetes/enhancements#4611)
- renaming "named resources" to "devices" wherever it makes sense and is
  user-facing (Slack discussion)
- MatchAttributes (from k8srm-prototype)
- OneOf (from k8srm-prototype)

`pkg/api` currently builds, but the rest doesn't. None of the YAML examples
have been updated yet.
pohly added a commit to pohly/wg-device-management that referenced this pull request May 13, 2024
This proposal takes the existing KEP as base and includes:
- vendor-independent classes and attributes (kubernetes/enhancements#4614)
- optional allocation (kubernetes/enhancements#4619)
- inline parameters (kubernetes/enhancements#4613)
- management access (kubernetes/enhancements#4611)
- renaming "named resources" to "devices" wherever it makes sense and is
  user-facing (Slack discussion)
- MatchAttributes (from k8srm-prototype)
- OneOf (from k8srm-prototype)

`pkg/api` currently builds, but the rest doesn't. None of the YAML examples
have been updated yet.
pohly added a commit to pohly/wg-device-management that referenced this pull request May 13, 2024
This proposal takes the existing KEP as base and includes:
- vendor-independent classes and attributes (kubernetes/enhancements#4614)
- optional allocation (kubernetes/enhancements#4619)
- inline parameters (kubernetes/enhancements#4613)
- management access (kubernetes/enhancements#4611)
- renaming "named resources" to "devices" wherever it makes sense and is
  user-facing (Slack discussion)
- MatchAttributes (from k8srm-prototype)
- OneOf (from k8srm-prototype)

`pkg/api` currently builds, but the rest doesn't. None of the YAML examples
have been updated yet.
pohly added a commit to pohly/wg-device-management that referenced this pull request May 13, 2024
This proposal takes the existing KEP as base and includes:
- vendor-independent classes and attributes (kubernetes/enhancements#4614)
- optional allocation (kubernetes/enhancements#4619)
- inline parameters (kubernetes/enhancements#4613)
- management access (kubernetes/enhancements#4611)
- renaming "named resources" to "devices" wherever it makes sense and is
  user-facing (Slack discussion)
- MatchAttributes (from k8srm-prototype)
- OneOf (from k8srm-prototype)

`pkg/api` currently builds, but the rest doesn't. None of the YAML examples
have been updated yet.
pohly added a commit to pohly/wg-device-management that referenced this pull request May 16, 2024
This proposal takes the existing KEP as base and includes:
- vendor-independent classes and attributes (kubernetes/enhancements#4614)
- optional allocation (kubernetes/enhancements#4619)
- inline parameters (kubernetes/enhancements#4613)
- management access (kubernetes/enhancements#4611)
- renaming "named resources" to "devices" wherever it makes sense and is
  user-facing (Slack discussion)
- MatchAttributes (from k8srm-prototype)
- OneOf (from k8srm-prototype)

`pkg/api` currently builds, but the rest doesn't. None of the YAML examples
have been updated yet.
pohly added a commit to pohly/wg-device-management that referenced this pull request May 16, 2024
This proposal takes the existing KEP as base and includes:
- vendor-independent classes and attributes (kubernetes/enhancements#4614)
- optional allocation (kubernetes/enhancements#4619)
- inline parameters (kubernetes/enhancements#4613)
- management access (kubernetes/enhancements#4611)
- renaming "named resources" to "devices" wherever it makes sense and is
  user-facing (Slack discussion)
- MatchAttributes (from k8srm-prototype)
- OneOf (from k8srm-prototype)

`pkg/api` currently builds, but the rest doesn't. None of the YAML examples
have been updated yet.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory sig/node Categorizes an issue or PR as relevant to SIG Node. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Projects
Status: Needs Reviewer
SIG Node PR Triage
Needs Reviewer
Development

Successfully merging this pull request may close these issues.

None yet

3 participants