
dynamic resource allocation #3063

Open
25 of 34 tasks
pohly opened this issue Nov 30, 2021 · 119 comments

Labels
  • lead-opted-in: Denotes that an issue has been opted in to a release
  • sig/node: Categorizes an issue or PR as relevant to SIG Node.
  • sig/scheduling: Categorizes an issue or PR as relevant to SIG Scheduling.
  • stage/alpha: Denotes an issue tracking an enhancement targeted for Alpha status

Comments

@pohly
Contributor

pohly commented Nov 30, 2021

Enhancement Description

@k8s-ci-robot k8s-ci-robot added the needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. label Nov 30, 2021
@pohly
Contributor Author

pohly commented Nov 30, 2021

/assign @pohly
/sig node

@k8s-ci-robot k8s-ci-robot added sig/node Categorizes an issue or PR as relevant to SIG Node. and removed needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Nov 30, 2021
@ahg-g
Member

ahg-g commented Dec 20, 2021

do we have a discussion issue on this enhancement?

@pohly
Contributor Author

pohly commented Jan 10, 2022

@ahg-g: with discussion issue you mean a separate issue in some repo (where?) in which arbitrary comments are welcome?

No, not at the moment. I've also not seen that done elsewhere before. IMHO at this point the open KEP PR is a good place to collect feedback and questions. I also intend to come to the next SIG-Scheduling meeting.

@ahg-g
Member

ahg-g commented Jan 10, 2022

@ahg-g: with discussion issue you mean a separate issue in some repo (where?) in which arbitrary comments are welcome?

Yeah, this is what I was looking for, the issue would be under k/k repo.

No, not at the moment. I've also not seen that done elsewhere before.

That is actually the common practice: one starts a feature request issue where the community discusses initial ideas and the merits of the request (look for issues with the label kind/feature). That is what I would expect in the discussion link.

IMHO at this point the open KEP PR is a good place to collect feedback and questions. I also intend to come to the next SIG-Scheduling meeting.

But the community has no idea what this is about yet, so it is better to have an issue discussing "What would you like to be added?" and "Why is this needed?" beforehand. Also, meetings are attended by fairly small groups of contributors, so having an issue tracking the discussion is important IMO.

@pohly
Contributor Author

pohly commented Jan 10, 2022

In my work in SIG-Storage I've not seen much use of such a discussion issue. Instead I had the impression that the usage of "kind/feature" is discouraged nowadays.

https://github.com/kubernetes/kubernetes/issues/new?assignees=&labels=kind%2Ffeature&template=enhancement.yaml explicitly says

Feature requests are unlikely to make progress as issues. Please consider engaging with SIGs on slack and mailing lists, instead. A proposal that works through the design along with the implications of the change can be opened as a KEP.

This proposal was discussed with various people beforehand; now we are in the formal KEP phase. But I agree, it is hard to provide a good link to those prior discussions.

@ahg-g
Member

ahg-g commented Jan 10, 2022

We use that in sig-scheduling, and it does serve as a very good place for initial rounds of discussion; discussions on Slack and in meetings are hard to reference, as you pointed out.

I still have no idea what this is proposing, and I may not attend the next sig meeting for example...

@gracenng gracenng added the tracked/yes Denotes an enhancement issue is actively being tracked by the Release Team label Jan 17, 2022
@gracenng gracenng added this to the v1.24 milestone Jan 17, 2022
@gracenng
Member

gracenng commented Jan 30, 2022

Hi @pohly! 1.24 Enhancements team here.
Checking in as we approach enhancements freeze, less than a week away, at 18:00 PT on Thursday, Feb 3rd.
Here’s where this enhancement currently stands:

  • Updated KEP file using the latest template has been merged into the k/enhancements repo. KEP-3063: dynamic resource allocation #3064
  • KEP status is marked as implementable for this release with latest-milestone: 1.24
  • KEP has a test plan section filled out.
  • KEP has up to date graduation criteria.
  • KEP has a production readiness review that has been completed and merged into k/enhancements.

The status of this enhancement is currently tracked as at risk.
Thanks!

@gracenng gracenng added tracked/no Denotes an enhancement issue is NOT actively being tracked by the Release Team and removed tracked/yes Denotes an enhancement issue is actively being tracked by the Release Team labels Feb 4, 2022
@gracenng
Member

gracenng commented Feb 4, 2022

The Enhancements Freeze is now in effect and this enhancement is removed from the release.
Please feel free to file an exception.

/milestone clear

@k8s-ci-robot k8s-ci-robot removed this from the v1.24 milestone Feb 4, 2022
@gracenng
Member

gracenng commented Mar 1, 2022 via email

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label May 30, 2022
@kerthcet
Member

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label May 31, 2022
@dchen1107 dchen1107 added the stage/alpha Denotes an issue tracking an enhancement targeted for Alpha status label Jun 9, 2022
@dchen1107 dchen1107 added this to the v1.25 milestone Jun 9, 2022
@Priyankasaggu11929 Priyankasaggu11929 added tracked/yes Denotes an enhancement issue is actively being tracked by the Release Team and removed tracked/no Denotes an enhancement issue is NOT actively being tracked by the Release Team labels Jun 10, 2022
@marosset
Contributor

Hello @pohly 👋, 1.25 Enhancements team here.

Just checking in as we approach enhancements freeze on 18:00 PST on Thursday June 16, 2022.

Note: this enhancement is targeting stage alpha for 1.25 (correct me if otherwise).

Here's where this enhancement currently stands:

  • KEP file using the latest template has been merged into the k/enhancements repo.
  • KEP status is marked as implementable
  • KEP has an updated, detailed test plan section filled out
  • KEP has up to date graduation criteria
  • KEP has a production readiness review that has been completed and merged into k/enhancements.

It looks like #3064 will address everything in this list.

Note: the status of this enhancement is marked as at risk. Please keep the issue description up to date with the appropriate stages as well. Thank you!

@marosset
Contributor

Hello @pohly 👋, just a quick check-in again, as we approach the 1.25 enhancements freeze.

Please plan to get #3064 reviewed and merged before enhancements freeze on Thursday, June 23, 2022 at 18:00 PT.

Note: the current status of the enhancement is at risk. Thank you!

@johnbelamaric
Member

So when a deployment is used, all pods reference the same ResourceClaim? Then all

Yeah, I think you're right, this doesn't quite work and templates are probably the fix.

The goal, as you said, is to avoid pod scheduling context, not templates really.

I still think it's possible to create a scoped down but still useful API that accomplishes that.

@johnbelamaric
Member

Who are those folks? This seems very speculative to me.

Today people solve this by grabbing the whole node and/or running privileged pods. This API avoids that, allowing an administrator to pre-allocate resources via the node-side (privileged) drivers without requiring the user pod to have those privileges. Those would be the users of this initial API.

@thockin
Member

thockin commented Jan 30, 2024

Before we automate something (i.e., add templating and automatic scheduling), we need the manual flow to work

This is a pretty big statement. I worry that the things we need for manual selection may be the things we DON'T WANT for automation. Giving the user too much control can be an attractive nuisance which makes the "real" solution harder. Pre-provisioned volumes are a lesson I took to heart.

  • Users evaluate their workloads and decide on the set of resources needed on each node.
  • Users manually select the set of nodes on which they expect those workloads to run, and label those sets of nodes.
  • Users pre-provision the resources on those nodes using ResourceClaims that specify those nodes.
  • Users create deployments or other workload controller resources to provision the pods, using a nodeSelector to map those to the set of nodes with the pre-provisioned resources.

In this model, what is the value of ResourceClaims above simple Device Plugins? Or maybe I misunderstand the proposal? I read this as: label nodes with GPUs, use node selectors, use device plugins to get access to GPUs. Which I think is what people already do, right?
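For concreteness, that existing pattern looks roughly like this (just a sketch; the `example.com/gpu` resource name and the node label are placeholders for whatever the device plugin and cluster actually expose):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-workload
spec:
  # Pin the pod to the manually labeled GPU nodes.
  nodeSelector:
    example.com/gpu-node: "true"
  containers:
  - name: app
    image: registry.example.com/app:latest
    resources:
      limits:
        # Counted extended resource advertised by a device plugin.
        example.com/gpu: 1
```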

IOW it uses node-management as a substitute for (coarse) resource management.

I am certainly seeing a trend of people running (effectively) one workload pod per node, which fits this model. If we tidy that up and codify it (even just saying "here's a pattern you should not feel bad about"), does it relieve some of the pressure?

@sftim
Contributor

sftim commented Jan 30, 2024

A manual (external) flow could work for things that you can attach to nodes, especially if they are then dedicated to the thing they attach to. Device plugins provide the ability for the thing you attach to work, but they don't attach it for you.

Something like https://cloud.google.com/compute/docs/gpus/add-remove-gpus: something triggers making a ResourceClaim, and automation fulfils it by attaching a GPU to a VM. NFD runs and labels the node. There is probably a device plugin in this story; anyway, we end up with a node that has some GPU capacity available.

Next, someone runs a Job that selects for that kind of GPU, and it works. However, what's possibly missing at this stage is the ability to reserve that GPU resource from the claim and avoid it being used for other Pods.

If we want to run a second Pod, maybe we're able to attach a second GPU, avoiding the one-pod-per-node problem. We aren't doing DRA but we have helped some stories and narrowed what still needs delivering.

@pohly
Contributor Author

pohly commented Jan 30, 2024

Pre-provisioned volumes are a lesson I took to heart.

Those were also what came to my mind when I read @johnbelamaric's outline. Pre-provisioned volumes have been replaced by CSI volume provisioning, but now that also is stuck having to support the complicated "volume binding" between PVC and PV. The result is that there are still race conditions that can lead to leaked volumes. Let's not repeat this for DRA.

@sftim
Contributor

sftim commented Jan 30, 2024

Here's why I think it could be different.

  • with storage, you can set up the volume manually (which, to be fair, the control plane doesn't stop you from doing even with newer mechanisms)
  • the PVC has to handle the case where something has provided the PV already

However:

  • I don't think anyone is proposing the equivalent model for general resources.

Let's say you're attaching GPUs to a node, and you make a ResourceClaim to specify that there should be 2 GPUs attached. The helper finds there's already one GPU manually or previously attached. How about we specify that this is not only undefined behaviour, but that we expect drivers to taint the node if they see it? No need to have complex logic around manual and automatic provisioning; something is Wrong and the node might not be any good now.

If the helper is told to add 2 GPUs for a total of 2 attached GPUs and finds 0 attached GPUs: great! We get nodes with GPUs, no need to taint, and other consequences such as NFD and device plugin registration can all happen.
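As a concrete sketch of that "taint the node" outcome (the taint key here is purely a hypothetical placeholder), the driver could set something like:

```yaml
# Excerpt of a Node manifest after the driver detects an unexpected,
# manually attached device.
spec:
  taints:
  - key: gpu.example.com/unexpected-device
    value: "true"
    effect: NoSchedule
```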

Does that work?

@johnbelamaric
Member

This is a pretty big statement. I worry that the things we need for manual selection may be the things we DON'T WANT for automation. Giving the user too much control can be an attractive nuisance which makes the "real" solution harder. Pre-provisioned volumes are a lesson I took to heart.

Fair point. But I think the trouble comes in more when you don't have a clear sense of ownership, that is, with mixed manual and automated flows. If the automated flows have full ownership (overriding anything the user may have done), that problem largely goes away.

In this model, what is the value of ResourceClaims above simple Device Plugins? Or maybe I misunderstand the proposal? I read this as: label nodes with GPUs, use node selectors, use device plugins to get access to GPUs. Which I think is what people already do, right?

I am not sure there is one, except that we are rebuilding a model that can be further extended to full support. Another alternative may be to extend those existing mechanisms rather than invent a new one.

My main goal with spitballing some alternatives is to see if we can deliver incremental scope in a useful but digestible way. My thinking is:

  1. I can solve my use case with a combination of basic primitives and client-side tooling, but it requires me to know a lot of stuff about resources in my nodes.
  2. I can reduce the client side tooling with an API that incorporates basic client-side patterns I am using.
  3. I can automate scheduling by encoding what I know about resources in my nodes and exposing that information to the control plane.

2 and 3 may have to go together, but I was hoping to find a solution to 1. With this approach, the functionality delivered in 1 can be done earlier because it is simpler, and it does not need to materially change as we implement 2 and 3, reducing risk.

Admittedly I may be cutting things up wrong...but I think there is merit to the approach.

@pohly
Contributor Author

pohly commented Jan 30, 2024

  1. I can solve my use case with a combination of basic primitives and client-side tooling, but it requires me to know a lot of stuff about resources in my nodes.

This can be done with the current API minus PodSchedulingContext and without adding numeric parameters: write a DRA driver kubelet plugin and a corresponding controller, then use immediate allocation with some driver-specific claim parameters that select the node (by name or by labels). There's no need for an API extension specifically for this mode of operation.
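As a rough illustration only (field names follow the alpha resource.k8s.io API of that time; the parameters object and how it selects a node are entirely driver-defined and hypothetical here):

```yaml
apiVersion: resource.k8s.io/v1alpha2
kind: ResourceClaim
metadata:
  name: gpu-on-node-a
spec:
  resourceClassName: example-gpu-class
  # Allocate as soon as the claim exists, independent of pod scheduling,
  # so no PodSchedulingContext is involved.
  allocationMode: Immediate
  # Driver-specific parameters; selecting a node by name or labels is
  # up to the driver that interprets them.
  parametersRef:
    apiGroup: gpu.example.com
    kind: GpuClaimParameters
    name: node-a-gpu
```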

Perhaps some hardware vendor will even write such a DRA driver for you, if some of their customers want to use it like this - I can't speak for either of them. This is why the KEP has "collect feedback" as one of the graduation criteria.

This might even give you option 2 and 3. I am not sure whether I grasp the difference between them 🤷

@johnbelamaric
Member

This might even give you option 2 and 3. I am not sure whether I grasp the difference between them 🤷

I am not sure there is one!

@thockin
Member

thockin commented Jan 30, 2024

Immediate provisioning doesn't ensure that the node on which the resource was provisioned can fit the pod's other dimensions, but maybe that's OK? The advantage of simple, stupid, counted resources is that the scheduler has all the information it needs about all requests.

What I am trying to get to, and I think John is aiming at the same goal (not that you're not, Patrick :), is to say what is the biggest problem people really experience today, and how can we make that better? A year ago I would have said it was GPU sharing. Now I understand (anecdotally) that sharing is far less important than simply getting out of the way for people who want to use whole GPUs.

Here's my uber-concern: k8s is the distillation of more than a decade of real-world experience running serving workloads (primarily) on mostly-fungible hardware. The game has changed, and we don't have a decade of experience. Anything we do right now has better than even odds of being wrong within a short period of time. The hardware is changing. Training vs. inference is changing. Capacity is crunched everywhere, and there's a sort of "gold rush" going on.

What can we do to make life better for people? How can we help them improve their efficiency and their time to market? Everything else seems secondary.

I ACK that this is GPU-centric and that DRA does more than just GPUs.

@pohly
Contributor Author

pohly commented Jan 30, 2024

I'm okay with promoting "core DRA minus PodSchedulingContext" to beta in 1.30, if that helps someone, somewhere.

@klueska, @byako : I'll punt this to you.

Just beware that it would mean that we need to rush another KEP for "PodSchedulingContext" for 1.30 and add a feature gate for that. I'm a bit worried that we would be stretching ourselves too thin if we do that, and we would also skip all of the usual "gather feedback" steps for beta. I'd much rather focus on numeric parameters...

@pohly
Contributor Author

pohly commented Jan 30, 2024

The advantage of simple, stupid, counted resources is that the scheduler has all the information it needs about all requests.

That sounds like "numeric parameters", which we cannot promote to beta in 1.30. When using "numeric parameters", PodSchedulingContext is indeed not needed.

@thockin
Member

thockin commented Jan 30, 2024

Right. The numeric parameters approach keeps emerging as a potentially better path forward, but I really don't think we know enough to proclaim any of this as beta. If there are things we can do that are smaller and more focused, which would solve some of the problems, I am eager to explore that.

If we were starting from scratch RIGHT NOW, with no baggage, what would we be trying to achieve in 30?

@johnbelamaric
Member

Right. The numeric parameters approach keeps emerging as a potentially better path forward, but I really don't think we know enough to proclaim any of this as beta. If there are things we can do that are smaller and more focused, which would solve some of the problems, I am eager to explore that.

Yeah. I concede: nothing in beta in 1.30. That was my original position anyway, but I thought perhaps if we scoped something way down but kept continuity, we could do it. But it's clearly a no-go.

If we were starting from scratch RIGHT NOW, with no baggage, what would we be trying to achieve in 30?

Great question; that is what I am looking for, along the lines of an MVP. What we really need to do is go back to the use cases for that, which is pretty hard on this tight timeline. The better option may be to ask "what would we be trying to achieve" without the "in 30", and defer that question to 31. In the meantime, maybe we make some incremental steps in the direction we need in 30 based on what we know so far, something like @pohly is saying here plus numerical models.

@alculquicondor
Member

alculquicondor commented Jan 30, 2024

I'm okay with promoting "core DRA minus PodSchedulingContext" to beta in 1.30, if that helps someone, somewhere.

Can you expand on this? What becomes core DRA?
Are we going to end up in the same inconsistent situation that the topology manager is in? The one where kube-scheduler sends a Pod to a Node and the kubelet just rejects it.

@johnbelamaric
Member

I'm okay with promoting "core DRA minus PodSchedulingContext" to beta in 1.30, if that helps someone, somewhere.

Can you expand on this? What becomes core DRA?

Are we going to end up in the same inconsistent situation that the topology manager is in? The one where kube-scheduler sends a Pod to a Node and the kubelet just rejects it.

I think it's clear at this point that nothing is going beta in 1.30. We will make sure to avoid the issue you are describing. I think we should err on the side of "failing to schedule" rather than "scheduling and failing".

@salehsedghpour
Contributor

/milestone clear

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label May 22, 2024
@pohly
Contributor Author

pohly commented May 22, 2024

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label May 22, 2024
@SergeyKanzhelev
Member

/label lead-opted-in
/milestone v1.31

this will be worked on, correct?

@k8s-ci-robot k8s-ci-robot added this to the v1.31 milestone May 23, 2024
@k8s-ci-robot k8s-ci-robot added the lead-opted-in Denotes that an issue has been opted in to a release label May 23, 2024
@johnbelamaric
Member

Yes, though what we will be doing is moving the "classic DRA" under a separate feature gate, and the "structured parameters DRA" will be under the primary feature gate.
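For illustration, in kubelet configuration terms that split might look something like this (DynamicResourceAllocation is the existing gate; the name of the separate gate for "classic DRA" is only an assumption here and may differ):

```yaml
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
featureGates:
  # Primary gate: structured parameters DRA.
  DynamicResourceAllocation: true
  # Hypothetical separate gate for the "classic DRA" flow driven by a
  # control-plane controller; the actual gate name may differ.
  DRAControlPlaneController: true
```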
