
DRA: integration with cluster autoscaler #118612

Open · pohly opened this issue Jun 12, 2023 · 33 comments
Labels: kind/feature, priority/important-longterm, priority/important-soon, sig/autoscaling, sig/node, sig/scheduling, triage/accepted

Comments

@pohly
Contributor

pohly commented Jun 12, 2023

The problem that we need to solve for cluster autoscaling when the cluster uses dynamic resource allocation is to simulate how adding new nodes will affect pod scheduling. This matters when some DRA driver depends on resources on such new nodes.

One possible solution is outlined in https://github.com/kubernetes/enhancements/tree/master/keps/sig-node/3063-dynamic-resource-allocation#cluster-autoscaler

Another idea was to enable loading and calling WASM plugins. This would allow vendors to publish portable plugins and avoid inter-process communication. https://github.com/kubernetes-sigs/kube-scheduler-wasm-extension is exploring that as alternative for normal scheduler plugins. However, WASM has several challenges, primarily that all data exchanged between host and plugin guest needs to be encoded and decoded, which makes one of the goals (similar performance and overhead as built-in plugins) impossible.

@k8s-ci-robot added the needs-sig and needs-triage labels Jun 12, 2023
@pohly
Contributor Author

pohly commented Jun 12, 2023

/sig node
/sig scheduling
/triage accepted
/priority important-longterm

@k8s-ci-robot added the sig/node, sig/scheduling, triage/accepted, and priority/important-longterm labels and removed the needs-sig and needs-triage labels Jun 12, 2023
@pohly
Contributor Author

pohly commented Jun 12, 2023

/kind feature

@k8s-ci-robot added the kind/feature label Jun 12, 2023
@pohly changed the title from "integration with cluster autoscaler" to "DRA: integration with cluster autoscaler" Jun 12, 2023
@pohly
Contributor Author

pohly commented Jun 12, 2023

/assign

@geetasg

geetasg commented Sep 28, 2023

@pohly Can you please link the doc that captures the latest design decision for this integration?

@pohly
Contributor Author

pohly commented Sep 28, 2023

I'll share PRs soon. That'll explain the changes that will be needed in the scheduler framework and CA.

@dims
Member

dims commented Nov 12, 2023

cc @jonathan-innis

@sftim
Contributor

sftim commented Nov 13, 2023

/sig autoscaling

@k8s-ci-robot added the sig/autoscaling label Nov 13, 2023
@jackfrancis
Contributor

cc @devigned

@tzneal
Contributor

tzneal commented Nov 15, 2023

/cc

@marwanad
Member

cc @alexeldeib

@pohly
Contributor Author

pohly commented Dec 12, 2023

My full branch with all in-tree changes is master...pohly:kubernetes:dra-cluster-autoscaler

The corresponding WIP PR for CA is kubernetes/autoscaler#6163

But this is just for reference. In the Batch WG meeting from 2023-12-07 we agreed to focus on "numeric parameters" as a compromise between flexibility and ease-of-use with regards to autoscaling. I'm currently doing some prototyping and will publish a KEP soon.

@pohly
Contributor Author

pohly commented Mar 11, 2024

See kubernetes/enhancements#4381 for the "structured parameters" KEP.

/assign @towca

See #123516 (comment) for a discussion of how the current code for "structured parameters" could fit into Cluster Autoscaler.

@pohly
Contributor Author

pohly commented Mar 11, 2024

/priority important-soon

I think we want this to work together with Kubernetes 1.31.

@k8s-ci-robot added the priority/important-soon label Mar 11, 2024
@towca
Contributor

towca commented Apr 12, 2024

Hey, I started looking into the scheduler side of the required changes and could really use some scheduler expertise and discussion here. @kubernetes/sig-scheduling-leads could you take a look when you get a chance?

Quick summary of the current state:

  • The DRA plugin currently uses real listers for 6 DRA-related objects.
  • Some of the real listers are necessary for the eventHandler logic.
  • Some of the objects (ResourceClaim, ResourceClaimParameters, ResourceSlice) are associated with Pods/Nodes, and are used in the filtering phase of the plugin. Cluster Autoscaler effectively needs to be able to control them.
  • The DRA plugin currently uses a local volumebinding.AssumeCache for persisting allocation information in ResourceClaims between scheduling cycles for different Pods. The cache is currently backed by a real ResourceClaims informer. Cluster Autoscaler effectively needs to be able to control this cache.

Proposed solution

ResourceSlices are semantically associated with Nodes. Similarly, ResourceClaims are associated with Pods, and ResourceClaimParameters are associated with ResourceClaims. We could track the new objects in the scheduler cache alongside Nodes and Pods. The DRA plugin would use the aggregated Pod/Node cluster snapshot dumped from the cache at the beginning of a scheduling cycle instead of the real listers. It would persist the claim changes in the scheduler cache similarly to how Pods are already persisted. Cluster Autoscaler is already integrated with the snapshot; it would just need to start simulating the new objects, with minimal changes to the CA/scheduler integration itself.

This seems like the cleanest solution to me, but I'm not very familiar with the scheduler codebase. It also has the benefit of making the DRA object changes visible to other plugins, which could be useful if we e.g. have multiple DRA-related plugins in the future.

Details:

  • Add a []ResourceSlice field to framework.NodeInfo and []ResourceClaim, []ResourceClaimParameters fields to framework.PodInfo.
  • Extend the internal.Cache interface, its implementation, and the global scheduler eventHandlers to track ResourceClaims, ResourceClaimParameters, and ResourceSlices in addition to Pods and Nodes. Aggregate the objects per-Node and per-Pod.
  • In particular, extend the cache to support assuming ResourceClaims similarly to how it supports assuming Pods. Expose methods for assuming/forgetting the claims from framework.Handle.
  • In UpdateSnapshot(), dump ResourceClaims and ResourceClaimParameters into PodInfos, and ResourceSlices into NodeInfos.
  • In the DRA plugin PreFilter and Filter, use fh.SnapshotSharedLister().NodeInfos() instead of the real listers for getting/listing ResourceClaims, ResourceClaimParameters, and ResourceSlices.
  • In the DRA plugin Reserve/Unreserve/PreBind: use the newly exposed framework handle methods instead of the local AssumeCache for assuming and forgetting claims.
    • I thought the claims were supposed to be assumed in Reserve, but it seems that it actually happens in PreBind.
  • In the DRA plugin keep the real listers for eventHandler logic, warn about using them in the filtering/reservation logic.
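
To make the data flow above more concrete, here is a rough sketch of the kind of fields and cache methods this list describes. All names are illustrative assumptions rather than the actual scheduler framework API, and the resource.k8s.io version may differ:

```go
// Hypothetical sketch only; field and method names are assumptions, not the
// real framework.NodeInfo/PodInfo or internal cache definitions.
package sketch

import (
	v1 "k8s.io/api/core/v1"
	resourceapi "k8s.io/api/resource/v1alpha2"
)

// NodeInfo would carry the node-scoped DRA objects next to the node and its pods.
type NodeInfo struct {
	Node           *v1.Node
	Pods           []*PodInfo
	ResourceSlices []*resourceapi.ResourceSlice // proposed new field
}

// PodInfo would carry the pod-scoped DRA objects next to the pod.
type PodInfo struct {
	Pod                     *v1.Pod
	ResourceClaims          []*resourceapi.ResourceClaim           // proposed new field
	ResourceClaimParameters []*resourceapi.ResourceClaimParameters // proposed new field
}

// The internal cache would gain claim counterparts of AssumePod/ForgetPod,
// exposed through framework.Handle so the DRA plugin can call them.
type Cache interface {
	AssumeClaim(claim *resourceapi.ResourceClaim) error
	ForgetClaim(claim *resourceapi.ResourceClaim) error
}
```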

Alternatives considered

For a much quicker solution, I think we could just inject fake informers for the DRA objects from CA through the framework's SharedInformerFactory. This would require no changes to the DRA plugin, or the scheduler cache/snapshot/eventHandlers/framework. It would require more work on the CA side to craft the partly-fake factory, but that feels relatively easy in comparison to the scheduler changes proposed above.

From CA's perspective, both solutions should work the same, but this one feels like a hack on the scheduler side. It feels weird that the scheduler operates on a snapshot for Pods and Nodes, but on a live view of the DRA objects associated with them. Doesn't that make us subject to the same kind of race conditions the snapshot is meant to solve?

@alculquicondor
Member

cc @johnbelamaric

@alculquicondor
Member

Overall, the proposal SGTM. Of course, the devil lives in the details, and some details we will only know once you start prototyping :)

My only immediate question is regarding:

In the DRA plugin keep the real listers for eventHandler logic, warn about using them in the filtering/reservation logic

Can you remind me which event handlers? Ideally, there shouldn't be any event handlers in the plugins.

I thought the claims were supposed to be assumed in Reserve, but it seems that it actually happens in PreBind.

@pohly do you recall the reasoning?

Other than that, I also don't like the alternative.

@pohly
Contributor Author

pohly commented Apr 13, 2024

I thought the claims were supposed to be assumed in Reserve, but it seems that it actually happens in PreBind.

@pohly do you recall the reasoning?

Storing a locally-modified object in the assume cache (= deep-copy, change some status field, assume) is affected by a race: if that same object gets modified in the API server, the informer receives a more recent copy and the local modification gets dropped.

I know that the volume binding plugin does that and pointed it out here. Perhaps it's okay there because the plugin recovers gracefully (not sure!), but for DRA it wouldn't be okay because it would make scheduling decisions based on incorrect data.

Therefore the DRA plugin only stores objects in the assume cache which it received back from the API server after the update. That is safe.

@pohly
Contributor Author

pohly commented Apr 13, 2024

@towca: have you seen #124102 and the follow-up in master...pohly:kubernetes:dra-scheduler-assume-cache-eventhandlers? I think I mentioned those already, but I am no longer sure where 😅

That changes how the assume cache gets managed. My expectation is that this is a prerequisite of your work.

@towca
Contributor

towca commented Apr 15, 2024

Can you remind me which event handlers? Ideally, there shouldn't be any event handlers in the plugins.

@alculquicondor The event handlers registered by EnqueueExtensions.EventsToRegister: https://github.com/kubernetes/kubernetes/blob/cae35dba5a3060711a2a3f958537003bc74a59c0/pkg/scheduler/framework/plugins/dynamicresources/dynamicresources.go#L381C29-L381C45. This seems like a valid thing to do in a plugin, at least judging by the framework docs.

@towca: have you seen #124102 and the follow-up in master...pohly:kubernetes:dra-scheduler-assume-cache-eventhandlers? I think I mentioned those already, but I am no longer sure where 😅

@pohly I haven't before, just skimmed through the changes now. It seems that my plan would make your proposed changes obsolete - the claims would be integrated into the current scheduler cache, instead of using the AssumeCache.

If I understand the race correctly, I think my proposal would still take care of it, while also making integration with CA straightforward. WDYT?

@alculquicondor
Member

The event handlers registered by EnqueueExtensions.EventsToRegister

Ah gotcha. But that code doesn't depend on the listers. CA doesn't use this function, so it shouldn't be a problem.

@pohly
Contributor Author

pohly commented Apr 15, 2024

If I understand the race correctly, I think my proposal would still take care of it, while also making integration with CA straightforward. WDYT?

We can wrap the assume cache API as you suggested.

But we still need to move the assume cache implementation (#124102) and then make it support driving cluster events (pkg/scheduler/eventhandlers.go in master...pohly:kubernetes:dra-scheduler-assume-cache-eventhandlers), right?

@towca
Contributor

towca commented Apr 16, 2024

The event handlers registered by EnqueueExtensions.EventsToRegister

Ah gotcha. But that code doesn't depend on the listers. CA doesn't use this function, so it shouldn't be a problem.

@alculquicondor IIUC the event handlers are only used for moving the pod between scheduler queues, so definitely shouldn't impact CA. However, they do actually use listers, see e.g. here: this method fires for a change in ResourceClaimParameters, and uses a ResourceClaim lister to iterate over all claims for a pod. So it seems that the plugin will need a lister for this regardless of the changes I proposed, right?

If I understand the race correctly, I think my proposal would still take care of it, while also making integration with CA straightforward. WDYT?

We can wrap the assume cache API as you suggested.

But we still need to move the assume cache implementation (#124102) and then make it support driving cluster events (pkg/scheduler/eventhandlers.go in master...pohly:kubernetes:dra-scheduler-assume-cache-eventhandlers), right?

I think we might be understanding my plan differently 😅

The main scheduler cache exposes AddPod and AssumePod methods. AddPod is called from a global informer event handler. Assuming is implemented directly in the cache, without delegating to other objects.

My plan was to use more or less the same logic for claims: expose AddClaim and AssumeClaim methods from the cache, call AddClaim from a global event handler for claims, and mirror/extract the pod logic for assuming since it's pretty minimal.

But you're right that to solve the race we'd then also need to add event handling (event being either Add or Assume) functionality to the cache, which probably doesn't make sense.

Ok, so how about this:

  • Don't change the main scheduler cache.
  • Add a global AssumeCache for claims to the framework - covered by Patrick's changes.
  • Instead of updating the NodeInfo/PodInfo snapshot (at the beginning of a new scheduling cycle) from just the main cache, update it from 3 sources:
    • The main cache: Pods, Nodes
    • The global ResourceClaim AssumeCache: ResourceClaims
    • The regular listers: ResourceClaimParameters, ResourceSlices
  • The rest stays the same as my original plan: DRA plugin uses the snapshot in PreFilter and Filter, calls the framework to assume a claim (which passes through to the global claim AssumeCache), etc.
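
For illustration, a minimal sketch of that three-source update. Every type and method name here is an assumption made up for the sketch, not real scheduler code:

```go
// Hypothetical sketch of assembling the snapshot from three sources; all
// types and method names are assumptions, not the actual scheduler code.
package sketch

import resourceapi "k8s.io/api/resource/v1alpha2"

// Minimal stand-ins; the real NodeInfo/PodInfo would carry much more.
type NodeInfo struct{ ResourceSlices []*resourceapi.ResourceSlice }
type PodInfo struct {
	ResourceClaims          []*resourceapi.ResourceClaim
	ResourceClaimParameters []*resourceapi.ResourceClaimParameters
}

type Snapshot struct {
	nodeInfos map[string]*NodeInfo // keyed by node name
	podInfos  map[string]*PodInfo  // keyed by namespace/name
}

// Stand-ins for the three sources.
type MainCache interface{ UpdateSnapshot(*Snapshot) error }
type ClaimAssumeCache interface{ ListClaims() []*resourceapi.ResourceClaim }
type DRAListers interface {
	ResourceSlices() ([]*resourceapi.ResourceSlice, error)
}

func (s *Snapshot) Update(cache MainCache, claims ClaimAssumeCache, listers DRAListers) error {
	// Source 1: Pods and Nodes from the main scheduler cache, as today.
	if err := cache.UpdateSnapshot(s); err != nil {
		return err
	}
	// Source 2: ResourceClaims from the global AssumeCache, so that claims
	// assumed during earlier scheduling cycles are visible to the DRA plugin.
	for _, claim := range claims.ListClaims() {
		s.attachClaimToPods(claim) // lookup via pod.Spec.ResourceClaims, omitted
	}
	// Source 3: ResourceSlices (and, analogously, ResourceClaimParameters)
	// straight from the regular listers.
	slices, err := listers.ResourceSlices()
	if err != nil {
		return err
	}
	for _, slice := range slices {
		if ni, ok := s.nodeInfos[slice.NodeName]; ok {
			ni.ResourceSlices = append(ni.ResourceSlices, slice)
		}
	}
	return nil
}

func (s *Snapshot) attachClaimToPods(claim *resourceapi.ResourceClaim) { /* omitted */ }
```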

@alculquicondor does the above sound ok as well?

BTW @pohly Do I understand correctly that after your changes are merged, the claim Assume call can be moved to Reserve?

@pohly
Contributor Author

pohly commented Apr 16, 2024

Do I understand correctly that after your changes are merged, the claim Assume call can be moved to Reserve?

No, it has to stay in PreBind. PreBind is where the plugin gets the updated ResourceClaim back from the API server and that object then gets stored in the assume cache.
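
For illustration, a minimal sketch of that ordering, assuming a plain UpdateStatus call (the real plugin applies a patch) and made-up handles for the client and the assume cache:

```go
package sketch

import (
	"context"

	resourceapi "k8s.io/api/resource/v1alpha2"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// assumer stands in for the plugin's assume cache (assumption, not the real type).
type assumer interface{ Assume(obj interface{}) error }

// preBindClaim illustrates the ordering described above: write the allocation
// to the API server first, then assume only the object the server returned.
// The real plugin applies a patch rather than UpdateStatus; this is a sketch.
func preBindClaim(ctx context.Context, cs kubernetes.Interface, cache assumer,
	claim *resourceapi.ResourceClaim, allocation *resourceapi.AllocationResult) error {
	claim = claim.DeepCopy()
	claim.Status.Allocation = allocation // local modification

	updated, err := cs.ResourceV1alpha2().ResourceClaims(claim.Namespace).
		UpdateStatus(ctx, claim, metav1.UpdateOptions{})
	if err != nil {
		return err
	}

	// Only the server's copy, carrying its new resourceVersion, goes into the
	// assume cache. Assuming the locally modified copy before the API call
	// risks it being silently dropped when the informer delivers another update.
	return cache.Assume(updated)
}
```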

@alculquicondor
Member

So it seems that the plugin will need a lister for this regardless of the changes I proposed, right?

I see. Yes, I think so.

But you're right that to solve the race we'd then also need to add event handling (event being either Add or Assume) functionality to the cache, which probably doesn't make sense.

The cache is already updated by event handlers, so I don't see why this wouldn't be fine.

Instead of updating the NodeInfo/PodInfo snapshot (at the beginning of a new scheduling cycle) from just the main cache, update it from 3 sources.

That sounds slightly more error prone.

@towca
Contributor

towca commented Apr 17, 2024

But you're right that to solve the race we'd then also need to add event handling (event being either Add or Assume) functionality to the cache, which probably doesn't make sense.

The cache is already updated by event handlers, so I don't see why this wouldn't be fine.

@alculquicondor IIUC the cache is updated from the informers by event handlers, or by the schedule_one.go logic calling Assume() without going through the API/informers.

My understanding was that in order to solve the race, we need to have event handlers that get notifications for ResourceClaim updates both from the informers, and from the plugin calling Assume(). Not sure if this understanding is correct though (see the reply to Patrick below).

If we were to implement assuming claims directly in the cache, some of the notifications would be triggered by the informer event handlers, and some by the cache itself. We'd probably need some third object to coordinate this, at which point just using the AssumeCache seems better. Or am I not seeing something here?

Do I understand correctly that after your changes are merged, the claim Assume call can be moved to Reserve?

No, it has to stay in PreBind. PreBind is where the plugin gets the updated ResourceClaim back from the API server and that object then gets stored in the assume cache.

@pohly Ok, now I'm confused. Is it not enough that the plugin's claim event handler gets the Assume notification?

Also, isn't the scheduler free to move on to the scheduling cycle of a next pod before the PreBind logic for the initial pod is exercised (since the bind phase is async)? Wouldn't this result in the plugin potentially not seeing the reserved claim while evaluating the second pod? Could you maybe explain the race in a bit more detail so that we're on the same page?

@pohly
Contributor Author

pohly commented Apr 17, 2024

No, it has to stay in PreBind. PreBind is where the plugin gets the updated ResourceClaim back from the API server and that object then gets stored in the assume cache.

Is it not enough that the plugin's claim event handler gets the Assume notification?

Both the event handlers in pkg/scheduler/eventhandlers.go and in pkg/scheduler/framework/plugins/dynamicresources/dynamicresources.go must be driven by real changes in the API server. They should also both use the same source. In master...pohly:kubernetes:dra-scheduler-assume-cache-eventhandlers, that is the extended AssumeCache implementation.

But that is unrelated to what I said. What I pointed out is that the AssumeCache should never be used to store a locally modified copy of a claim, because that local modification can get lost.

Also, isn't the scheduler free to move on to the scheduling cycle of a next pod before the PreBind logic for the initial pod is exercised (since the bind phase is async)?

Yes. That is why dynamicresources/dynamicresources.go has inFlightAllocations.

@pohly
Contributor Author

pohly commented Apr 17, 2024

Suppose Reserve were to use AssumeCache.Assume with a ResourceClaim where the status has been updated to document an allocation before actually writing that status to the apiserver. The following then can happen:

  • pod A causes resource X to be used for claim A, which gets recorded in the assume cache during Reserve
  • something modifies claim A in the apiserver
  • the apiserver sends an update of claim A with no allocation
  • the assume cache compares revisions and drops the locally modified copy of claim A because it is older than the update
  • pod B gets scheduled and also gets to use resource X for its claim B because the current state of the claims shows X as unused
  • PreBind of pod A and B updates the status of claims A and B such that both use the same underlying resource -> BOOM!
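
To spell out the mechanism behind the fourth step, here is a simplified sketch of how an assume cache keyed on resourceVersion ends up dropping the locally modified copy. Names and details are illustrative, not the actual implementation:

```go
// Simplified stand-in for the assume cache's informer update handling, only
// to illustrate the race above; the real implementation differs.
package sketch

import (
	"strconv"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

type assumeCache struct {
	objects map[string]metav1.Object // latest known object per namespace/name
}

func rv(o metav1.Object) int64 {
	v, _ := strconv.ParseInt(o.GetResourceVersion(), 10, 64)
	return v
}

// onInformerUpdate runs for every add/update delivered by the informer.
func (c *assumeCache) onInformerUpdate(obj metav1.Object) {
	key := obj.GetNamespace() + "/" + obj.GetName()
	stored, ok := c.objects[key]
	if !ok || rv(obj) > rv(stored) {
		// A newer copy from the API server replaces whatever is stored, even a
		// locally modified copy that was only assumed and never written to the
		// API server. Its allocation silently disappears, which is what lets
		// pod B reuse resource X in the scenario above.
		c.objects[key] = obj
	}
}
```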

@towca
Contributor

towca commented Apr 19, 2024

@pohly Thanks for the detailed answers! Sorry for more questions, but I'm still not clear on many things:

  1. Does the race you just described have anything to do with making the AssumeCache global? Or is that to fix a different race (the one mentioned here)? I think we might've been talking about 2 different things.
  2. If we have inFlightAllocations already, why do we need the AssumeCache? IIUC, the only utility it gives us is seeing the updated version of the claim in the next scheduling cycle, without having to wait for the informer to process the update notification. But don't we have the exact same information in inFlightAllocations already, and could use that instead?

In general, the responsibilities and relationship between inFlightAllocations and the AssumeCache are quite convoluted IMO. It feels like they should be one thing that gets called in Reserve, and internally solves the race you explained in the comment above.

In particular, I don't think we can have Cluster Autoscaler call PreBind during its simulations, since it actually modifies the real claims in the apiserver. And just calling Reserve doesn't seem to be correct - IIUC it would mark claims as in-flight forever? Or am I not seeing something here again?

@pohly
Contributor Author

pohly commented Apr 20, 2024

The race in #118612 (comment) does not affect the dynamicresource plugin because I am avoiding the root cause (storing a locally modified copy of the object in the assume cache). I believe the volumebinding plugin is affected, but perhaps it doesn't matter there. In the dynamicresource plugin, I am handling the local modification with the "in-flight claim" map instead of the assume cache.

The other race, which does affect the dynamicresource plugin, is between event handlers and the plugin. Moving the assume cache (#124102) doesn't fix that (it's just moving code around, without functional changes), but my follow-up fixes it.

If we have inFlightAllocations already, why do we need the AssumeCache?

Once the allocation has been written to the API server, it will get removed from the status by kube-controller-manager when it observes that the allocation is no longer needed. The assume cache handles such updates properly. If we were instead to keep the local modification in inFlightAllocations, how would it get removed from there?

One could argue that deallocation should also be handled by the scheduler, but I think doing it in the kube-controller-manager is better: a single instance of that can handle deallocation for all schedulers, and adding a controller to the scheduler felt wrong.

I don't think we can have Cluster Autoscaler call PreBind during its simulations, since it actually modifies the real claims in the apiserver.

Not necessarily. You could give the plugin a fake client with a store managed by CA. Then after calling PreBind, you have the new state without touching anything in the apiserver. I thought that was how you intended to reuse a plugin without having to modify what it does when called by CA?

If you think that this is too complicated, then I'd like to bring up my original proposal once more: introduce an interface for managing arbitrary state inside plugins. What CA then needs to do is call that interface if implemented by a plugin, but it would do that the same way for all plugins. Only the dynamicresources plugin would implement this interface initially, but I could also see it being useful to simulate local storage in the volumebinding plugin. To me, it feels wrong to have an abstract plugin interface and then do special handling of the dynamicresource plugin in CA.

The downside is that we need to do some work in k/k which can only be tested end-to-end once the code lands in CA.

@towca
Contributor

towca commented Apr 22, 2024

#124102 doesn't fix that (it's just moving code around, without functional changes) but then my follow-up fixes it.

@pohly Yup, that's what I meant. Ok, so I think I have the full picture now:

  • There is a race condition affecting AssumeCache if a locally-modified object is assumed before doing the API call with the actual change. Because of it, the DRA plugin only assumes in PreBind, after the API call.
  • There is another race condition affecting AssumeCache. The DRA plugin is currently affected; your in-progress changes address that.

Not necessarily. You could give the plugin a fake client with a store managed by CA. Then after calling PreBind, you have the new state without touching anything in the apiserver. I thought that was how you intended to reuse a plugin without having to modify what it does when called by CA?

Currently, the CA/scheduler integration is designed to go entirely through this SharedListers interface. In scheduler, this interface is implemented by the snapshot created at the beginning of a scheduling cycle from the main scheduler cache tracking Pods and Nodes. In CA, the SharedListers interface is extended to ClusterSnapshot, and implemented e.g. by DeltaClusterSnapshot.

In other words: at the beginning of a scheduling cycle of a pod, scheduler takes a snapshot of Pods and Nodes in the cluster that it then uses instead of real Pod/Node informers; Cluster Autoscaler injects its own snapshot implementation that it fully controls.

In particular, the only information that CA needs to pass through the interface are aggregated Pod/Node lists at one point in time - which CA operates on anyway.

My idea this whole time has been extending this pluggable snapshot mechanism to cover DRA resources in addition to Pods and Nodes, and integrate the DRA plugin with the snapshot mechanism:

  • Attach DRA resources to the same structure that already holds Pods and Nodes and that CA already knows.
  • Make the DRA plugin only read and modify DRA resources by interacting with the snapshot (or some other high-level object that CA can plug in) instead of real informers.
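
If it helps, here is one hypothetical shape such an extended integration surface could take. None of these interfaces exist today; the names are made up purely to illustrate keeping both the read and the write path behind one pluggable interface:

```go
// Hypothetical interfaces only; nothing here exists in the scheduler or CA,
// the names just illustrate one possible shape of the extended integration.
package sketch

import resourceapi "k8s.io/api/resource/v1alpha2"

// Read path for PreFilter/Filter, next to the NodeInfo lister the snapshot
// already exposes.
type ResourceClaimLister interface {
	// ClaimsForPod returns the claims referenced by the given pod.
	ClaimsForPod(namespace, podName string) ([]*resourceapi.ResourceClaim, error)
}

type ResourceSliceLister interface {
	// SlicesForNode returns the slices published for the given node.
	SlicesForNode(nodeName string) ([]*resourceapi.ResourceSlice, error)
}

// Write path the DRA plugin would use instead of touching real objects: the
// scheduler could back it with the assume cache plus API calls, while Cluster
// Autoscaler backs it with its in-memory snapshot.
type ClaimModifier interface {
	ReserveClaim(claim *resourceapi.ResourceClaim) error
	UnreserveClaim(claim *resourceapi.ResourceClaim) error
}
```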

CA injecting partially-fake clientsets/informerfactories into the framework would IMO be a hack going around the CA/scheduler integration:

  • It'd make the whole CA/scheduler integration surface less clear/discoverable/understandable. Instead of a single well-defined interface that injects information into a single point of the scheduler logic, we'd have that interface plus the required hacking of random informers/clients, with an unclear effect on the scheduler logic.
  • It'd require Cluster Autoscaler to go much lower level with its integration than it currently does. Instead of just dumping the Pod/Node/DRA information from an object that it already needs without much processing, we'd have to introduce a new object that'd track that information in a way compatible with informers/stores.
  • This issue of a plugin modifying node/pod-scoped k8s resources between filter/reserve seems like a pattern that we've already seen (e.g. volumes), and that's likely to repeat in the future. If we solve it through the "proper" integration mechanism we have right now, there's at least some chance that future plugins will follow this pattern, making CA integration straightforward.

Your proposal was CA calling some framework methods at specific points in the simulation, right? That would IMO unfortunately be worse than hacking the clients/informers. It'd require a lot of re-work in CA, and re-implementing a lot of current CA logic in scheduler (the "forkable, modifiable snapshot" logic would have to be reimplemented on the scheduler side, including behavior like nested forks).

I see your point about the framework being technically plugin-agnostic but then CA needing to do plugin-specific things like tracking the DRA objects, but I don't think there's a way around it. To me, CA tracking DRA objects is less "doing something DRA-plugin-specific", and more "actually tracking Nodes correctly in a world where now essentially a part of the Node lives in a separate object".

Once the allocation has been written to the API server, it will get removed from the status by kube-controller-manager when it observes that the allocation is no longer needed. The assume cache handles such updates properly. If we were instead to keep the local modification in inFlightAllocations, how would it get removed from there?

I don't have much experience writing controller code, but could something like "remove if we see the change reflected in the claim OR if we see a different resourceVersion" work?

@pohly
Contributor Author

pohly commented Apr 22, 2024

My idea this whole time has been extending this pluggable snapshot mechanism to cover DRA resources in addition to Pods and Nodes, and integrate the DRA plugin with the snapshot mechanism:

  • Attach DRA resources to the same structure that already holds Pods and Nodes and that CA already knows.
  • Make the DRA plugin only read and modify DRA resources by interacting with the snapshot (or some other high-level object that CA can plug in) instead of real informers.

So you want to introduce a custom API for modifying ResourceClaim and ResourceClaimStatus? I'm fine with that, if you feel that it makes your life easier.

Just beware that #122148 (comment) uses a JSON patch for a reason. Whatever API you provide instead will have to do the same under the hood in kube-scheduler.

@pohly
Contributor Author

pohly commented Apr 22, 2024

I don't have much experience writing controller code, but could something like "remove if we see the change reflected in the claim OR if we see a different resourceVersion" work?

"remove if we see the change reflected in the claim" doesn't work because there is no guarantee that all intermediate states are seen. There's a theoretic race where the assumed state gets applied on the apiserver, immediately removed, and then the client only sees the new object without the change.

"see a different resourceVersion" is what what the assume cache currently does (more precisely, a "newer than"). But that drops the assumed state for any update coming from the apiserver, which is the first race for locally-modified objects.

@towca
Contributor

towca commented Apr 24, 2024

So you want to introduce a custom API for modifying ResourceClaim and ResourceClaimStatus? I'm fine with that, if you feel that it makes your life easier.

I want to extend the current CA/scheduler integration "API" with that functionality, but essentially yes.

Just beware that #122148 (comment) uses a JSON patch for a reason. Whatever API you provide instead will have to do the same under the hood in kube-scheduler.

You mean

patch := fmt.Sprintf(`{"metadata": {"uid": %q}, "status": {%s "reservedFor": [ {"resource": "pods", "name": %q, "uid": %q} ] }}`,

right? I noticed the comments there, I'll definitely take this into account.

"remove if we see the change reflected in the claim" doesn't work because there is no guarantee that all intermediate states are seen. There's a theoretic race where the assumed state gets applied on the apiserver, immediately removed, and then the client only sees the new object without the change.

"see a different resourceVersion" is what what the assume cache currently does (more precisely, a "newer than"). But that drops the assumed state for any update coming from the apiserver, which is the first race for locally-modified objects.

Yeah, I had a feeling it wouldn't be this easy 😅 Thanks for the discussion @pohly! I think I have a pretty good picture of what behavior needs to be preserved in the plugin now, and a couple of ideas for implementing the integration. I need to do some prototyping now, should have the bandwidth for that in 1-2 weeks.
