New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Distinguish between preemption within ClusterQueue, reclaim within Cohort and preemption with borrowing #1874
Comments
cc @tenzen-y @astefanutti @kerthcet wdyt of the breaking change? |
+1 not maintaining backward compatibility is OK on our side. I think that relates to #1741 for the metrics part. |
Actually, the current state could be considered buggy, because the message in the Evicted condition always says: "Preempted to accommodate a higher priority Workload". So at the very least we can start by distinguishing the message. |
@alculquicondor, I agree with making UX better. But, I think we need to announce this change as breaking in RELEASE NOTE and release announcement in google groups. |
Yes, we can put |
+1. The current message is misleading because when I'm not sure using I'm thinking about enriching the message, similarly as we build the events in the k8s scheduler. We could have a message like: `Preempted to accommodate the "UID-x" workload submitted to the "team-y" ClusterQueue." |
The thing is that messages are not machine-readable. For direct users of Kueue, messages are acceptable. But if you use kueue as a low-level component, then you might need a reason to distinguish between the different kinds of preemption. The reasons having a common prefix could be good enough to group them as one thing with different variants. Btw, in fair sharing, the preemption reason would still be PreemptDueToReclaim (or something similar). It's just that the "reclamation thresholds" are dynamic. |
I see, so maybe instead we have an extra condition type - type: Evicted
reason: Preempted
message: preempted to accomodate workload UID
- type: Preempted
reason: ReclaimWithinCohort / PreemptionWithinClusterQueue / FairSharingWithinCohort ...
message: preempted to accomodate workload UID # possibly some more data here
Yes, but then when a new reason is added by Kueue, then the external system also needs to be extended for monitoring, when a new version of Kueue introduces a new reason (if the system does not need fine grained information). I guess we could mitigate this by some convention, say all reasons start with "PreemptedDueTo". |
I like the idea of a dedicated Preempted condition. It's fully backwards compatible. We just need to make sure that the documentation makes a distinction between the two conditions (or that one is a subset of the other). |
A separate Preempted condition looks like a good solution, bringing better structure maintaining backward compatibility. |
+1 on having the |
/assign |
I'm wondering if we should distinguish I think preemption while borrowing can happen it two scenarios:
Then, there are variants of (2), because the threshold does not need to be specified. Also, My suggestion would be to just go with two reasons: Any views? |
If we can make the distinction between the two, that would be better IMO. End users might have access to the preemption condition, but they might not have access to the names or priorities or other workloads. But they still would want to know roughly why they were preempted. A nit on |
From the offline discussions we want to balance two aspects: satisfy the MVP which distinguishes "In ClusterQueue" and "In cohort", and being as granular as possible. I think there will always be a possibility, that a new config option or threshold is introduced, making the reason to split into two or more detailed reasons. In that case, I would suggest to start with something simple for now, two reasons: |
What would you like to be added:
Preemption in Kueue can happen due to the following reasons:
We need to enhance the Condition reason and messages to incorporate this information.
Why is this needed:
For users to better understand why Kueue preempted their Jobs.
Administrators might want to aggregate information for multiple preemptions, so ideally the Reason should be different, with a common prefix.
A potential problem is to break users that already depend on the
Preempted
reason to identify preemptions. Perhaps highlighting the breaking change should be enough in the release notes.We can also incorporate the information in a metric label.
Completion requirements:
This enhancement requires the following artifacts:
The artifacts should be linked in subsequent comments.
The text was updated successfully, but these errors were encountered: