dynamic resource allocation #3063
/assign @pohly |
do we have a discussion issue on this enhancement? |
@ahg-g: by "discussion issue" do you mean a separate issue in some repo (where?) in which arbitrary comments are welcome? No, not at the moment. I've also not seen that done elsewhere before. IMHO, at this point the open KEP PR is a good place to collect feedback and questions. I also intend to come to the next SIG-Scheduling meeting. |
Yeah, this is what I was looking for; the issue would be under the k/k repo.
That is actually the common practice: one starts a feature request issue where the community discusses initial ideas and the merits of the request (look for issues with the label kind/feature).
But the community has no idea what this is about yet, so it is better to have an issue that discusses "What would you like to be added?" and "Why is this needed?" beforehand. Also, meetings are attended by fairly small groups of contributors, so having an issue tracking the discussion is important IMO. |
In my work in SIG-Storage I've not seen much use of such a discussion issue. Instead, I had the impression that the usage of "kind/feature" is discouraged nowadays; https://github.com/kubernetes/kubernetes/issues/new?assignees=&labels=kind%2Ffeature&template=enhancement.yaml explicitly says as much.
This proposal was discussed with various people beforehand, and now we are in the formal KEP phase. But I agree, it is hard to provide a good link to those prior discussions. |
We use that in sig-scheduling, and it does serve as a very good place for initial rounds of discussion; discussions on Slack and in meetings are hard to reference, as you pointed out. I still have no idea what this is proposing, and I may not attend the next sig meeting, for example... |
Hi @pohly! 1.24 Enhancements team here.
The status of this enhancement is tracked as at risk. |
The Enhancements Freeze is now in effect and this enhancement is removed from the release. /milestone clear |
Hi Patrick, `tracked/yes` will be applied when the KEP is merged and all requirements for Enhancement Freeze are met. We are definitely keeping an eye on this! Feel free to ping me once it's merged.

> On Tue, Mar 1, 2022 at 11:17 AM Patrick Ohly wrote:
> @gracenng: an exception was requested and granted for this enhancement to move to GA in 1.24:
> https://groups.google.com/g/kubernetes-sig-release/c/sUpd2H1wxnk/m/lL_I6GT-BwAJ
> Can you label it again as "tracked/yes"?
|
The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs. This bot triages issues and PRs according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:
- Mark this issue or PR as fresh with /remove-lifecycle stale
- Close this issue or PR with /close
- Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community. /lifecycle stale |
/remove-lifecycle stale |
Hello @pohly 👋, 1.25 Enhancements team here. Just checking in as we approach the enhancements freeze at 18:00 PST on Thursday, June 16, 2022. For note, this enhancement is targeting stage alpha. Here's where this enhancement currently stands:
It looks like #3064 will address everything in this list. For note, the status of this enhancement is marked as at risk. |
Yeah, I think you're right: this doesn't quite work, and templates are probably the fix. The goal, as you said, is to avoid PodSchedulingContext, not templates. I still think it's possible to create a scoped-down but still useful API that accomplishes that. |
Today people solve this by grabbing the whole node and/or running privileged pods. This API avoids that: it allows an administrator to pre-allocate resources via the node-side (privileged) drivers, without requiring the user pod to have those privileges. Those would be the users of this initial API.
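For illustration, here is roughly what that could look like against the alpha resource.k8s.io API — a sketch only, with made-up driver, class, and image names, not something taken from the KEP itself:

```yaml
# Sketch: an administrator pre-allocates a claim; an unprivileged pod
# then consumes it. Group/version and fields follow the v1alpha2 DRA
# API and may change; gpu.example.com is a hypothetical driver/class.
apiVersion: resource.k8s.io/v1alpha2
kind: ResourceClaim
metadata:
  name: pre-allocated-gpu
  namespace: team-a
spec:
  resourceClassName: gpu.example.com
  allocationMode: Immediate          # allocate without waiting for a pod
---
apiVersion: v1
kind: Pod
metadata:
  name: gpu-consumer
  namespace: team-a
spec:
  resourceClaims:
  - name: gpu
    source:
      resourceClaimName: pre-allocated-gpu   # reference the existing claim
  containers:
  - name: app
    image: registry.example.com/app:latest   # placeholder image
    resources:
      claims:
      - name: gpu      # the pod itself needs no special privileges
```
|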
This is a pretty big statement. I worry that the things we need for manual selection may be the things we DON'T WANT for automation. Giving the user too much control can be an attractive nuisance which makes the "real" solution harder. Pre-provisioned volumes are a lesson I took to heart.
In this model, what is the value of ResourceClaims above simple Device Plugins? Or maybe I misunderstand the proposal? I read this as: label nodes with GPUs, use node selectors, use device plugins to get access to GPUs. Which I think is what people already do, right? IOW, it uses node management as a substitute for (coarse) resource management. I am certainly seeing a trend of people running (effectively) one workload pod per node, which fits this model. If we tidy that up and codify it (even just saying "here's a pattern you should not feel bad about"), does it relieve some of the pressure?
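For reference, the status-quo pattern described here needs nothing beyond what exists today; a sketch with made-up label and resource names (`example.com/gpu` stands in for whatever a device plugin advertises):

```yaml
# Sketch: select a GPU node via a label and request the device through
# an extended resource exposed by a device plugin.
apiVersion: v1
kind: Pod
metadata:
  name: training-job
spec:
  nodeSelector:
    accelerator: a100-80gb   # hypothetical node label, e.g. set by NFD
  containers:
  - name: train
    image: registry.example.com/train:latest   # placeholder image
    resources:
      limits:
        example.com/gpu: 1   # counted resource from a device plugin
```
|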
A manual (external) flow could work for things that you can attach to nodes, especially if they are then dedicated to the thing they attach to. Device plugins make the attached thing usable, but they don't attach it for you. Something like https://cloud.google.com/compute/docs/gpus/add-remove-gpus: something triggers making a ResourceClaim, and automation fulfils it by attaching a GPU to a VM. NFD runs and labels the node. There is probably a device plugin in this story; in any case, we end up with a node that has some GPU capacity available. Next, someone runs a Job that selects for that kind of GPU, and it works. However, what's possibly missing at this stage is the ability to reserve that GPU for the claim and keep it from being used by other Pods. If we want to run a second Pod, maybe we're able to attach a second GPU, avoiding the one-pod-per-node problem. We aren't doing DRA, but we have helped some stories and narrowed what still needs delivering. |
Those were also what came to my mind when I read @johnbelamaric's outline. Pre-provisioned volumes have been replaced by CSI volume provisioning, but CSI is now stuck having to support the complicated "volume binding" between PVC and PV. The result is that there are still race conditions that can lead to leaked volumes. Let's not repeat this for DRA. |
Here's why I think it could be different.
However:
Let's say you're attaching GPUs to a node, and you make a ResourceClaim to specify that there should be 2 GPUs attached. The helper finds there's already one GPU manually/previously attached. How about we specify that this is not only undefined behaviour, but that we expect drivers to taint the node if they see it. No need for complex logic around mixed manual and automatic provisioning; something is wrong and the node might not be any good now. If the helper is told to add 2 GPUs for a total of 2 attached GPUs and finds 0 attached GPUs: great! We get nodes with GPUs, no need to taint, and other consequences such as NFD and device plugin registration can all happen. Does that work?
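As a sketch of what that tainting could look like (the key name is made up; nothing in the KEP prescribes this):

```yaml
# Hypothetical taint a driver could apply when it finds hardware that
# it did not attach itself.
apiVersion: v1
kind: Node
metadata:
  name: worker-1
spec:
  taints:
  - key: gpu.example.com/unexpected-device
    effect: NoSchedule
```
|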
Fair point. But I think the trouble comes in more when you don't have a clear sense of ownership - that is, with mixed manual and automated flows. If the automated flows have full ownership (overriding anything the user may have done), that concern largely goes away.
I am not sure there is one, except that we are rebuilding a model that can be further extended to the full support. Another alternative may be to extend those existing mechanisms, rather than invent a new one. My main goal with spitballing some alternatives is to see if we can deliver incremental scope in a useful but digestible way. My thinking is:
2 and 3 may have to go together, but I was hoping to find a solution to 1. With this approach, the functionality delivered in 1 can be done earlier because it is simpler, and it does not need to materially change as we implement 2 and 3, reducing risk. Admittedly I may be cutting things up wrong... but I think there is merit to the approach. |
This can be done with the current API minus PodSchedulingContext. Perhaps some hardware vendor will even write such a DRA driver for you, if some of their customers want to use it like this - I can't speak for either of them. This is why the KEP has "collect feedback" as one of the graduation criteria. This might even give you options 2 and 3. I am not sure whether I grasp the difference between them 🤷
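For context, this is the object the scoped-down variant would drop - the alpha PodSchedulingContext used to negotiate a node between the scheduler and the drivers. A sketch with made-up names, following the v1alpha2 field shape:

```yaml
# What "minus PodSchedulingContext" removes: the per-pod negotiation
# object between scheduler and DRA drivers (v1alpha2 shape).
apiVersion: resource.k8s.io/v1alpha2
kind: PodSchedulingContext
metadata:
  name: gpu-consumer      # matches the pod's name and namespace
  namespace: team-a
spec:
  selectedNode: worker-1  # the scheduler's current proposal
  potentialNodes:         # candidates for drivers to comment on
  - worker-1
  - worker-2
```
|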
I am not sure there is one! |
Immediate provisioning doesn't ensure that the node on which the resource was provisioned can fit the pod's other dimensions, but maybe that's OK? The advantage of simple, stupid, counted resources is that the scheduler has all the information it needs about all requests.

What I am trying to get to, and I think John is aiming at the same goal (not that you're not, Patrick :), is: what is the biggest problem people really experience today, and how can we make that better? A year ago I would have said it was GPU sharing. Now I understand (anecdotally) that sharing is far less important than simply getting out of the way of people who want to use whole GPUs.

Here's my uber-concern: k8s is the distillation of more than a decade of real-world experience running serving workloads (primarily) on mostly fungible hardware. The game has changed, and we don't have a decade of experience. Anything we do right now has better-than-even odds of being wrong within a short period of time. The hardware is changing. Training vs. inference is changing. Capacity is crunched everywhere, and there's a sort of "gold rush" going on.

What can we do to make life better for people? How can we help them improve their efficiency and their time to market? Everything else seems secondary. I ACK that this is GPU-centric and that DRA does more than just GPUs. |
I'm okay with promoting "core DRA minus PodSchedulingContext" to beta in 1.30, if that helps someone, somewhere. @klueska, @byako: I'll punt this to you. Just beware that it would mean rushing another KEP for PodSchedulingContext into 1.30 and adding a feature gate for it. I'm a bit worried that we are stretching ourselves too thin if we do that, and we also skip all of the usual "gather feedback" steps for beta. I'd much rather focus on numeric parameters... |
That sounds like "numeric parameters", which we cannot promote to beta in 1.30. When using "numeric parameters", PodSchedulingContext is indeed not needed. |
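To make the "numeric parameters" idea concrete: instead of asking a driver at scheduling time, drivers would publish their inventory up front and the scheduler would do the matching itself. A very rough sketch, using the shape of the structured-parameters proposal that was in flight for 1.30 - the kind, fields, and names here are assumptions and likely to change:

```yaml
# Sketch: a driver publishes node-local resources numerically so the
# scheduler can allocate without a round-trip to the driver.
apiVersion: resource.k8s.io/v1alpha2
kind: ResourceSlice
metadata:
  name: worker-1-gpu.example.com
nodeName: worker-1
driverName: gpu.example.com    # hypothetical driver
namedResources:
  instances:
  - name: gpu-0
    attributes:
    - name: memory
      quantity: 80Gi
  - name: gpu-1
    attributes:
    - name: memory
      quantity: 80Gi
```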
Right. The numeric parameters approach keeps emerging as a potentially better path forward, but I really don't think we know enough to proclaim any of this as beta. If there are things we can do that are smaller and more focused, which would solve some of the problems, I am eager to explore that. If we were starting from scratch RIGHT NOW, with no baggage, what would we be trying to achieve in 30? |
Yeah. I concede: nothing in beta in 1.30. That was my original position anyway, but I thought perhaps if we scoped something way down but kept continuity, we could do it. But it's clearly a no-go.
Great question; that is what I am looking for, along the lines of an MVP. What we really need to do is go back to the use cases for that, which is pretty hard on this tight timeline. The better option may be to ask "what would we be trying to achieve", cutting out the "in 30", and defer that question to 31. In the meantime, maybe we make some incremental steps in the direction we need in 30, based on what we know so far - something like @pohly is saying here plus numerical models. |
Can you expand on this? What becomes core DRA? |
I think it's clear at this point that nothing is going beta in 1.30. We will make sure to avoid the issue you are describing. I think we should err on the side of "failing to schedule" rather than "scheduling and failing". |
/milestone clear |
The Kubernetes project currently lacks enough contributors to adequately respond to all issues. This bot triages un-triaged issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:
- Mark this issue as fresh with /remove-lifecycle stale
- Close this issue with /close
- Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community. /lifecycle stale |
/remove-lifecycle stale |
/label lead-opted-in
This will be worked on, correct? |
Yes, though what we will be doing is moving the "classic DRA" under a separate feature gate, and the "structured parameters DRA" will be under the primary feature gate. |
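In practice that split would look something like the following when testing (a sketch using a kind cluster config; `DynamicResourceAllocation` is the existing gate, while the name of the separate gate for "classic DRA" is an assumption here - later releases used `DRAControlPlaneController`):

```yaml
# Sketch: enabling both gates in a kind test cluster.
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
featureGates:
  DynamicResourceAllocation: true    # primary gate (structured parameters)
  DRAControlPlaneController: true    # assumed name for the "classic DRA" gate
```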
Enhancement Description
One-line enhancement description: dynamic resource allocation
Kubernetes Enhancement Proposal: https://github.com/kubernetes/enhancements/tree/master/keps/sig-node/3063-dynamic-resource-allocation
Discussion Link: CNCF TAG Runtime Container Device Interface (CDI) Working Group meeting(s)
Primary contact (assignee): @pohly
Responsible SIGs: SIG Node
Enhancement target (which target equals to which milestone):

Alpha (1.26)
- k/enhancements update PR(s):
- k/k update PR(s): dynamic resource allocation kubernetes#111023
- k/website update PR(s):

Alpha (1.27)
- k/enhancements update PR(s):
- k/k update PR(s):
- k/website update PR(s):

Alpha (1.28)
- k/enhancements update PR(s):
- k/k update PR(s):
- k/website update PR(s): dra: update for Kubernetes 1.28 website#41856

Alpha (1.29)
- k/enhancements update PR(s):
- k/k update PR(s):
- k/website update PR(s):

Alpha (1.30)
- k/enhancements update PR(s):
- k/k update PR(s):
- k/website update PR(s):