DRA: integration with cluster autoscaler #118612
Comments
/sig node
/kind feature
/assign
@pohly Can you please link the doc that captures the latest design decision for this integration?
I'll share PRs soon. That'll explain the changes that will be needed in the scheduler framework and CA.
/sig autoscaling
cc @devigned
/cc
cc @alexeldeib
My full branch with all in-tree changes is master...pohly:kubernetes:dra-cluster-autoscaler. The corresponding WIP PR for CA is kubernetes/autoscaler#6163, but this is just for reference. In the Batch WG meeting on 2023-12-07 we agreed to focus on "numeric parameters" as a compromise between flexibility and ease of use with regards to autoscaling. I'm currently doing some prototyping and will publish a KEP soon.
See kubernetes/enhancements#4381 for the "structured parameters" KEP.
/assign @towca
See #123516 (comment) for a discussion of how the current code for "structured parameters" could fit into Cluster Autoscaler.
/priority important-soon
I think we want this to work together with Kubernetes 1.31.
Hey, I started looking into the scheduler side of the required changes and could really use some scheduler expertise and discussion here. @kubernetes/sig-scheduling-leads could you take a look when you get a chance? Quick summary of the current state:
Proposed solution
This seems like the cleanest solution to me, but I'm not very familiar with the scheduler codebase. It also has the benefit of making the DRA object changes visible to other plugins, which could be useful if we, for example, have multiple DRA-related plugins in the future. Details:
Alternatives considered
For a much quicker solution, I think we could just inject fake informers for the DRA objects from CA through the framework's informer factory. From CA's perspective, both solutions should work the same, but this one feels like a hack on the scheduler side. It feels weird that the scheduler would operate on a snapshot for Pods and Nodes, but on a live version of the DRA objects associated with Pods and Nodes. Doesn't that make us subject to similar race conditions that the snapshot is meant to solve?
Overall, the proposal SGTM. Of course, the devil lives in the details, and some details we will only know once you start prototyping :) My only immediate question is regarding:
Can you remind me which event handlers? Ideally, there shouldn't be any event handlers in the plugins.
@pohly do you recall the reasoning? Other than that, I also don't like the alternative.
Storing a locally-modified object in the assume cache (= deep-copy, change some status field, assume) is affected by a race: if that same object gets modified in the API server, the informer receives a more recent copy and the local modification gets dropped. I know that the volume binding plugin does that and pointed it out here. Perhaps it's okay there because the plugin recovers gracefully (not sure!), but for DRA it wouldn't be okay because it would make scheduling decisions based on incorrect data. Therefore the DRA plugin only stores objects in the assume cache which it received back from the API server after the update. That is safe. |
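The race can be illustrated with a minimal, self-contained sketch. The types and the resourceVersion handling below are simplified stand-ins for the real assume cache, not the actual k/k code:

```go
package main

import (
	"fmt"
	"strconv"
)

// claim is a simplified stand-in for a *ResourceClaim; only the fields
// needed to show the race are modeled.
type claim struct {
	resourceVersion string
	allocated       bool
}

// assumeCache keeps the most recent copy of each object. Like the real
// assume cache, it replaces the stored copy whenever the informer
// delivers an object with a newer resourceVersion.
type assumeCache struct {
	objs map[string]claim
}

func newAssumeCache() *assumeCache {
	return &assumeCache{objs: map[string]claim{}}
}

// Assume stores a (possibly locally modified) copy of the object.
func (c *assumeCache) Assume(key string, obj claim) {
	c.objs[key] = obj
}

// OnUpdate plays the role of the informer event handler: a newer copy
// from the apiserver unconditionally wins, silently dropping any
// assumed state that was never written to the apiserver.
func (c *assumeCache) OnUpdate(key string, obj claim) {
	cur, ok := c.objs[key]
	if !ok || newerRV(obj.resourceVersion, cur.resourceVersion) {
		c.objs[key] = obj
	}
}

func (c *assumeCache) Get(key string) claim { return c.objs[key] }

func newerRV(a, b string) bool {
	ai, _ := strconv.Atoi(a)
	bi, _ := strconv.Atoi(b)
	return ai > bi
}

func main() {
	c := newAssumeCache()
	c.OnUpdate("claim-1", claim{resourceVersion: "1"})

	// Locally mark the claim as allocated without writing to the apiserver.
	local := c.Get("claim-1")
	local.allocated = true
	c.Assume("claim-1", local)

	// An unrelated update of the same claim reaches the apiserver and the
	// informer delivers it: the local modification is lost.
	c.OnUpdate("claim-1", claim{resourceVersion: "2"})
	fmt.Println("allocated after informer update:", c.Get("claim-1").allocated)
}
```

This is why the DRA plugin only assumes objects that already came back from the API server: for those, a newer informer copy replacing the stored one is correct rather than data loss.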
@towca: have you seen #124102 and the follow-up in master...pohly:kubernetes:dra-scheduler-assume-cache-eventhandlers? I think I mentioned those already, but I am no longer sure where 😅 That changes how the assume cache gets managed. My expectation is that this is a prerequisite of your work.
@alculquicondor The event handlers registered by the assume cache.
@pohly I haven't before, just skimmed through the changes now. It seems that my plan would make your proposed changes obsolete: the claims would be integrated into the current scheduler cache, instead of using the assume cache. If I understand the race correctly, I think my proposal would still take care of it, while also making integration with CA straightforward. WDYT?
Ah gotcha. But that code doesn't depend on the listers. CA doesn't use this function, so it shouldn't be a problem. |
We can wrap the assume cache API as you suggested. But we still need to move the assume cache implementation (#124102) and then make it support driving cluster events (pkg/scheduler/eventhandlers.go in master...pohly:kubernetes:dra-scheduler-assume-cache-eventhandlers), right? |
@alculquicondor IIUC the event handlers are only used for moving the pod between scheduler queues, so definitely shouldn't impact CA. However, they do actually use listers, see e.g. here: this method fires for a change in
I think we might be understanding my plan differently 😅 The main scheduler cache exposes methods for assuming and forgetting pods. My plan was to use more or less the same logic for claims: expose analogous methods for assuming claims. But you're right that to solve the race we'd then also need to add event handling (an event being either Add or Assume) functionality to the cache, which probably doesn't make sense. Ok, so how about this:
@alculquicondor does the above sound ok as well? BTW @pohly, do I understand correctly that after your changes are merged, the claim Assume call can be moved to Reserve?
No, it has to stay in PreBind. PreBind is where the plugin gets the updated ResourceClaim back from the API server and that object then gets stored in the assume cache. |
I see. Yes, I think so.
The cache is already updated by event handlers, so I don't see why this wouldn't be fine.
That sounds slightly more error-prone.
@alculquicondor IIUC the cache is updated from the informers by event handlers, or by the assume/forget calls. My understanding was that in order to solve the race, we need event handlers that get notifications for assumed objects as well. If we were to implement assuming claims directly in the cache, some of the notifications would be triggered by the informer event handlers, and some by the cache itself. We'd probably need some third object to coordinate this, at which point just using the AssumeCache seems simpler.
@pohly Ok, now I'm confused. Is it not enough that the plugin's claim event handler gets the Assume notification? Also, isn't the scheduler free to move on to the scheduling cycle of the next pod before the previous pod's binding cycle finishes?
Both the event handlers in the plugin and in pkg/scheduler/eventhandlers.go get the Assume notification. But that is unrelated to what I said. What I pointed out is that the AssumeCache should never be used to store a locally modified copy of a claim, because that local modification can get lost.
Yes. That is why the plugin tracks in-flight allocations separately.
Suppose Reserve were to use AssumeCache.Assume with a ResourceClaim where the status has been updated to document an allocation before actually writing that status to the apiserver. The following then can happen:
1. Reserve assumes the locally modified claim.
2. Some other actor updates the same claim in the apiserver, so the informer delivers a copy with a newer resourceVersion.
3. The assume cache replaces the assumed copy with the informer's copy, silently dropping the locally recorded allocation.
4. The plugin then makes scheduling decisions based on a claim that appears to be unallocated, i.e. on incorrect data.
@pohly Thanks for the detailed answers! Sorry for more questions, but I'm still not clear on many things:
In general, the responsibilities and relationship between inFlightAllocations and the AssumeCache are quite convoluted IMO. It feels like they should be one thing that gets called in Reserve and internally solves the race you explained in the comment above. In particular, I don't think we can have Cluster Autoscaler call PreBind.
The race in #118612 (comment) does not affect the dynamicresource plugin because I am avoiding the root cause (storing a locally modified copy of the object in the assume cache). I believe the volumebinding plugin is affected, but perhaps it doesn't matter there. In the dynamicresource plugin, I am handling the local modification with the "in-flight claim" map instead of the assume cache. The other race, which does affect the dynamicresource plugin, is between the event handlers and the plugin.
Once the allocation has been written to the API server, it will get removed from the status by kube-controller-manager when it observes that the allocation is no longer needed. The assume cache handles such updates properly. If we were instead to keep the local modification in the in-flight map, the plugin would have to handle such updates itself. One could argue that deallocation should also be handled by the scheduler, but I think doing it in the kube-controller-manager is better: a single instance of that can handle deallocation for all schedulers, and adding a controller to the scheduler felt wrong.
Not necessarily. You could give the plugin a fake client with a store managed by CA. Then after calling PreBind, you have the new state without touching anything in the apiserver. I thought that was how you intended to reuse a plugin without having to modify what it does when called by CA? If you think that this is too complicated, then I'd like to bring up my original proposal once more: introduce an interface for managing arbitrary state inside plugins. What CA then needs to do is call that interface if implemented by a plugin, but it would do that the same way for all plugins. Only the dynamicresources plugin would implement this interface initially, but I could also see it being useful to simulate local storage in the volumebinding plugin. To me, it feels wrong to have an abstract plugin interface and then do special handling of the dynamicresource plugin in CA. The downside is that we need to do some work in k/k which can only be tested end-to-end once the code lands in CA.
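The "interface for managing arbitrary state inside plugins" idea could look roughly like the following. All names here are illustrative assumptions, not the actual scheduler framework API:

```go
package main

import "fmt"

// StateSnapshotter is a hypothetical interface that a plugin could
// implement so that a simulator like Cluster Autoscaler can save and
// restore the plugin's internal state without knowing what that state is.
type StateSnapshotter interface {
	// SnapshotState returns an opaque copy of the plugin's current state.
	SnapshotState() any
	// RestoreState resets the plugin to a previously returned snapshot.
	RestoreState(state any)
}

// draPlugin is a toy plugin whose only internal state is the set of
// claims it considers allocated.
type draPlugin struct {
	allocated map[string]bool
}

func (p *draPlugin) SnapshotState() any {
	cp := make(map[string]bool, len(p.allocated))
	for k, v := range p.allocated {
		cp[k] = v
	}
	return cp
}

func (p *draPlugin) RestoreState(state any) {
	p.allocated = state.(map[string]bool)
}

func main() {
	plugin := &draPlugin{allocated: map[string]bool{}}
	var s StateSnapshotter = plugin

	// CA saves the state, simulates an allocation, then rolls back.
	saved := s.SnapshotState()
	plugin.allocated["claim-1"] = true
	s.RestoreState(saved)
	fmt.Println("claim-1 allocated after rollback:", plugin.allocated["claim-1"])
}
```

The appeal of this shape is that CA would call SnapshotState/RestoreState uniformly for every plugin that implements it, instead of special-casing the dynamicresources plugin.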
@pohly Yup, that's what I meant. Ok, so I think I have the full picture now:
Currently, the CA/scheduler integration is designed to go entirely through this snapshot interface. In other words: at the beginning of a scheduling cycle of a pod, the scheduler takes a snapshot of Pods and Nodes in the cluster that it then uses instead of the real Pod/Node informers; Cluster Autoscaler injects its own snapshot implementation that it fully controls. In particular, the only information that CA needs to pass through the interface are aggregated Pod/Node lists at one point in time, which CA operates on anyway. My idea this whole time has been extending this pluggable snapshot mechanism to cover DRA resources in addition to Pods and Nodes, and integrating the DRA plugin with the snapshot mechanism: the snapshot would expose the DRA objects as well, and the plugin would read them through it instead of through live informers.
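That extension could be sketched roughly like this; all names are illustrative assumptions, not the actual scheduler snapshot interface:

```go
package main

import "fmt"

// clusterSnapshot sketches how the scheduler's pluggable snapshot could
// be extended to also cover DRA objects. Only object names are modeled
// to keep the example self-contained.
type clusterSnapshot interface {
	NodeNames() []string
	PodNames() []string
	ResourceClaimNames() []string // the hypothetical DRA extension
}

// caSnapshot is a toy implementation fully controlled by the simulator,
// the way Cluster Autoscaler already injects its own snapshot for Pods
// and Nodes.
type caSnapshot struct {
	nodes, pods, claims []string
}

func (s *caSnapshot) NodeNames() []string          { return s.nodes }
func (s *caSnapshot) PodNames() []string           { return s.pods }
func (s *caSnapshot) ResourceClaimNames() []string { return s.claims }

func main() {
	var snap clusterSnapshot = &caSnapshot{
		nodes:  []string{"node-1", "simulated-node-2"},
		pods:   []string{"pod-a"},
		claims: []string{"claim-1"},
	}
	// A DRA-aware plugin would read claims through the snapshot instead
	// of through live informers, so CA's simulated state is visible.
	fmt.Println(snap.ResourceClaimNames())
}
```
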
CA injecting partially-fake clientsets/informer factories into the framework would IMO be a hack going around the CA/scheduler integration:
Your proposal was CA calling some framework methods at specific points in the simulation, right? That would IMO unfortunately be worse than hacking the clients/informers. It'd require a lot of re-work in CA, and re-implementing a lot of current CA logic in scheduler (the "forkable, modifiable snapshot" logic would have to be reimplemented on the scheduler side, including behavior like nested forks). I see your point about the framework being technically plugin-agnostic but then CA needing to do plugin-specific things like tracking the DRA objects, but I don't think there's a way around it. To me, CA tracking DRA objects is less "doing something DRA-plugin-specific", and more "actually tracking Nodes correctly in a world where now essentially a part of the Node lives in a separate object".
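The "forkable, modifiable snapshot" behavior mentioned above can be sketched as a stack of delta layers. This is a simplified illustration under my own assumptions, not CA's actual implementation:

```go
package main

import "fmt"

// snapshot sketches a forkable view of cluster objects (here just node
// names): Fork pushes an empty delta layer, Revert pops it, and Commit
// merges it into the layer below. Nested forks are just deeper stacks.
type snapshot struct {
	layers []map[string]bool // each layer maps object name -> exists
}

func newSnapshot(base ...string) *snapshot {
	l := map[string]bool{}
	for _, n := range base {
		l[n] = true
	}
	return &snapshot{layers: []map[string]bool{l}}
}

func (s *snapshot) Fork()   { s.layers = append(s.layers, map[string]bool{}) }
func (s *snapshot) Revert() { s.layers = s.layers[:len(s.layers)-1] }

// Commit merges the newest layer into the one below it.
func (s *snapshot) Commit() {
	top := s.layers[len(s.layers)-1]
	s.layers = s.layers[:len(s.layers)-1]
	below := s.layers[len(s.layers)-1]
	for k, v := range top {
		below[k] = v
	}
}

// Add records an object in the newest layer only.
func (s *snapshot) Add(name string) {
	s.layers[len(s.layers)-1][name] = true
}

// Has checks the layers from newest to oldest.
func (s *snapshot) Has(name string) bool {
	for i := len(s.layers) - 1; i >= 0; i-- {
		if v, ok := s.layers[i][name]; ok {
			return v
		}
	}
	return false
}

func main() {
	s := newSnapshot("node-1")
	s.Fork()
	s.Add("node-2") // simulate a scale-up
	s.Fork()
	s.Add("node-3") // nested what-if
	s.Revert()      // drop node-3 only
	fmt.Println(s.Has("node-2"), s.Has("node-3"))
	s.Commit() // keep node-2
	fmt.Println(s.Has("node-2"))
}
```

Reimplementing this layering on the scheduler side, including nested forks, is the re-work the comment above argues against.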
I don't have much experience writing controller code, but could something like "remove if we see the change reflected in the claim OR if we see a different resourceVersion" work? |
So you want to introduce a custom API for modifying ResourceClaim and ResourceClaimStatus? I'm fine with that, if you feel that it makes your life easier. Just beware that #122148 (comment) uses a JSON patch for a reason. Whatever API you provide instead will have to do the same under the hood in kube-scheduler. |
"remove if we see the change reflected in the claim" doesn't work because there is no guarantee that all intermediate states are seen. There's a theoretic race where the assumed state gets applied on the apiserver, immediately removed, and then the client only sees the new object without the change. "see a different resourceVersion" is what the assume cache currently does (more precisely, a "newer than" comparison). But that drops the assumed state for any update coming from the apiserver, which is the first race for locally modified objects.
I want to extend the current CA/scheduler integration "API" with that functionality, but essentially yes.
You mean kubernetes/pkg/scheduler/framework/plugins/dynamicresources/dynamicresources.go line 1635 at 0f06328, right? I noticed the comments there, I'll definitely take this into account.
Yeah, I had a feeling it wouldn't be this easy 😅 Thanks for the discussion @pohly! I think I have a pretty good picture of what behavior needs to be preserved in the plugin now, and a couple of ideas for implementing the integration. I need to do some prototyping now, should have the bandwidth for that in 1-2 weeks. |
The problem that we need to solve for cluster autoscaling when the cluster uses dynamic resource allocation is to simulate how adding new nodes will affect pod scheduling. This matters when some DRA driver depends on resources on such new nodes.
One possible solution is outlined in https://github.com/kubernetes/enhancements/tree/master/keps/sig-node/3063-dynamic-resource-allocation#cluster-autoscaler
Another idea was to enable loading and calling WASM plugins. This would allow vendors to publish portable plugins and avoid inter-process communication. https://github.com/kubernetes-sigs/kube-scheduler-wasm-extension is exploring that as an alternative for normal scheduler plugins. However, WASM has several challenges, primarily that all data exchanged between the host and a plugin guest needs to be encoded and decoded, which makes one of the goals (similar performance and overhead as built-in plugins) impossible.