A Job might finish with ready!=0, terminating!=0 #123775
Comments
This issue is currently awaiting triage. If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.
Another alternative would be to do something like 3, but 1 when using
/cc
/sig apps
Reading this: "When a Job is declared Failed, the running Pods still count as ready." This seems to be WAI. Ready and running are not related states.
The issue also extends to
/cc
Since this affects both
This is not achieved if the Job controller stops tracking "terminating" pods before the pods complete. In particular, once Kueue
If we want to protect this behind the beta-level feature, then I would suggest
This is a fully non-breaking change, just "fixing" a beta-level feature. Having said that, I would also be for the breaking change and doing the waiting regardless of the feature gate.
This was discovered during recent work on kubernetes/enhancements#4368 and affects the recent changes to the API field comments. I propose to revert the inaccurate comments in #123792, and to track this in the graduation criteria for the feature; I opened the KEP update: kubernetes/enhancements#4542.
We should probably stress the fact that this is true only for failed Jobs, not succeeded ones. The latter would be a bug; the former is a separate discussion. I'm looking at this situation like this: what needs to happen so that this Job won't fail? Why do you think we need to wait for the remaining 3 pods to fail? Is it possible the Job will fix itself and fulfill the contract of a success? The only place where I can think this kind of situation might be problematic is when you have the SuccessPolicy specified; that's the only case where I could see a Job with a failed pod that should not fail. In all other cases, I think the extra wait doesn't give us anything, does it? @tenzen-y this particular case described above might be important for you when you'll be working on your feature. Probably worth adding an explicit test case covering it.
Well, I don't think this is needed for basic Job functionality. Nonetheless, it is expected by job schedulers like Kueue. The 3 remaining pods still consume resources for 30s by default. Then Kueue, seeing the Job is "Failed", schedules a new one, and we run overcommitted with unschedulable pods; Cluster Autoscaler intervenes by creating new nodes, which is not desired by the user who pays for the extra nodes. We aimed to solve this as part of PodReplacementPolicy by introducing the
The Job is already meant to fail at this point. The only motivation for waiting is to clearly signal that there are no more pods running for this Job and no more status updates are necessary. Adding the Failed condition only at the end is an easy way for users or controllers to know when the Job has reached its final state. An alternate solution would be to continue syncing a Job even after the Failed condition is added, until the ready/terminating fields are zero.
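To make the client-side contract concrete, here is a minimal Go sketch (using the batchv1 types; the helper names are mine) of the two checks a controller has to distinguish today: the terminal condition alone, versus the condition plus the ready/terminating counters quiescing:

```go
package jobstatus

import (
	batchv1 "k8s.io/api/batch/v1"
	corev1 "k8s.io/api/core/v1"
)

// jobFinished reports whether the Job carries a terminal condition
// (Failed or Complete). As this issue describes, today this can become
// true while status.ready and status.terminating are still non-zero.
func jobFinished(job *batchv1.Job) bool {
	for _, c := range job.Status.Conditions {
		if (c.Type == batchv1.JobFailed || c.Type == batchv1.JobComplete) &&
			c.Status == corev1.ConditionTrue {
			return true
		}
	}
	return false
}

// jobQuiesced is the stronger check a resource-accounting controller
// (such as Kueue) effectively needs before releasing quota: a terminal
// condition is set and no pods are still ready or terminating.
// Note: status.terminating is only populated when the
// JobPodReplacementPolicy feature gate is enabled.
func jobQuiesced(job *batchv1.Job) bool {
	if !jobFinished(job) {
		return false
	}
	ready, terminating := job.Status.Ready, job.Status.Terminating
	return (ready == nil || *ready == 0) &&
		(terminating == nil || *terminating == 0)
}
```

Under option 1, jobFinished alone would imply jobQuiesced; under option 4, clients would keep watching the Job until jobQuiesced holds.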
/assign @dejanzele
@soltysh Regarding the JobSuccessPolicy test case, I added a dedicated unit test here: kubernetes/pkg/controller/job/job_controller_test.go, lines 3837 to 3900 (commit e216742).
I put this topic on the agenda for the next WG meeting, but I hope we can reach consensus before then, given that test-freeze is close.
@rphillips and I were discussing this. One concern I have is (for example, see #122824 (comment)). By design, readiness probes run during termination to help with graceful shutdown of services. So it may be a bit confusing, but it is possible for a terminating pod to be considered ready. Ready just means that the readiness probe was successful. I think this change could be fine, as it should be self-contained to Job/CronJob, but we should mention that I don't think it is a bug for a pod that is terminating to also be considered ready.
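To illustrate that point, a sketch of a pod spec using the corev1 Go types (image and port are placeholders) whose readiness probe keeps running, and can keep succeeding, throughout its graceful-shutdown window:

```go
package readyexample

import (
	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/util/intstr"
)

func ptr[T any](v T) *T { return &v }

// terminatingButReady sketches a pod that can legitimately stay Ready
// while terminating: the readiness probe keeps being executed during
// the 60s grace period, and nothing forces it to start failing.
var terminatingButReady = corev1.PodSpec{
	TerminationGracePeriodSeconds: ptr[int64](60),
	Containers: []corev1.Container{{
		Name:  "server",
		Image: "nginx", // placeholder image
		ReadinessProbe: &corev1.Probe{
			ProbeHandler: corev1.ProbeHandler{
				HTTPGet: &corev1.HTTPGetAction{Port: intstr.FromInt32(80)},
			},
			PeriodSeconds: 5,
		},
	}},
}
```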
From the WG Batch meeting, the sentiment was that a solution close to option 1 would be the best path forward. However, the implications are too big to consider it in 1.30. It should be revisited for 1.31, either as a standalone KEP or as part of failure policies, replacement policies, or success policies.
I also updated the issue description to incorporate motivation and extra thoughts on each option.
Since the SuccessPolicy is still in the alpha stage, it might be a good starting point.
Shouldn't Kueue then not only wait for the terminal condition, but also for the remaining pods to finish as well?
I don't think that's a good approach. I'm personally leaning towards option 4, which is backwards compatible, with eventual doc updates to strengthen the contract of the API. I will say that I'm relieved to hear this is not blocking 1.30, and we can continue the discussions for 1.31 😄
It could, but with separation of concerns in mind, Kueue only reacts to the Job status. I still prefer option 1 for its simplicity: users and controllers only need to look at one place to determine whether everything is done for the Job. But option 4 is acceptable, IMO.
For 4, I think one would need to check whether kubelet actually sets ready to 0 once the pod is completed (failed or succeeded). We run readiness probes during termination, so they could be true or false while terminating. I have not looked into what happens once a pod is complete (whether ready gets set to false). I personally like scoping this to 1 (or the job controller), as there is more control over this behavior. If we go the route where kubelet sets ready to false when pods have failed, it could impact other workloads.
I also prefer, for the simplicity of API clients, to only add terminal conditions (like Failed or Complete) when there will be no more updates to the status. Otherwise clients need to monitor until ready=0 and terminating=0, and maybe in the future for more things, which could become problematic. Also, I'm not sure why delaying setting the Failed condition is a "breaking change". It delays setting the Failed condition for deleted pods by the graceful termination time, which is typically 30s; however, from the API client POV the behavior is consistent. I initially introduced the validation rule for terminating=0 when Failed=true when working on
Having said that, I'm also ok with 4, but maybe then we should have an extra condition, 'Terminal', to indicate no more updates, to provide a future-proof check for API clients?
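A hypothetical sketch of that idea, for illustration only — the "Terminal" condition type below is not part of batchv1; it is just what the suggested client check could look like:

```go
package terminalexample

import (
	batchv1 "k8s.io/api/batch/v1"
	corev1 "k8s.io/api/core/v1"
)

// JobTerminal is hypothetical: it is NOT part of batchv1. It stands in
// for the suggested condition meaning "the Job controller will make no
// further updates to this Job's status".
const JobTerminal batchv1.JobConditionType = "Terminal"

// jobStatusSettled is what the future-proof client check could look
// like if such a condition existed: one condition to watch, instead of
// a terminal condition plus ready=0 plus terminating=0.
func jobStatusSettled(job *batchv1.Job) bool {
	for _, c := range job.Status.Conditions {
		if c.Type == JobTerminal && c.Status == corev1.ConditionTrue {
			return true
		}
	}
	return false
}
```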
Hi all, I just wanted to kick-start the discussion again; where did we leave it? I see the initial sentiment was for option 1, then it started to shift to option 4. Personally, I am more inclined towards option 4, if we have
I volunteered to work on the fix implementation, and before I start I wanted to double-check: are we in agreement, or are there still topics for discussion?
I think the two outstanding options are 1 and 4. Addressing this issue is part of the Beta graduation criteria for managed-by, so I'm also interested in the conclusion.
In option 1 we will wait until
One nice property of option 1 is that we could verify that
If I understand correctly, options 1 and 4 are similar to some degree. The difference is that in option 1 we fail the Job when all Pods complete and all status fields converge to the correct values, meaning the Job reaches the Failed status later, while in option 4 the Job reaches the Failed state earlier but we continue syncing it until all the status fields converge to their correct values?
Correct. I think ideally (if this was a new API) option 1 would be enough: a user is only interested in two states:
To reflect these two states
The "only" issue is that historically we didn't have
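For illustration, a sketch of how a client could distinguish an early "outcome decided" signal from the terminal condition using existing batchv1 condition types (SuccessCriteriaMet is alpha under JobSuccessPolicy at the time of this discussion); the helper name is mine:

```go
package feedbackexample

import (
	batchv1 "k8s.io/api/batch/v1"
	corev1 "k8s.io/api/core/v1"
)

// jobOutcomeDecided reports the early signal: the Job controller has
// already decided the outcome (no new pods will be created), even
// though pods may still be running or terminating. FailureTarget is
// set on Jobs that will fail; SuccessCriteriaMet on Jobs that will
// succeed. The terminal Failed/Complete conditions would then be the
// "fully finished" signal.
func jobOutcomeDecided(job *batchv1.Job) bool {
	for _, c := range job.Status.Conditions {
		if (c.Type == batchv1.JobFailureTarget || c.Type == batchv1.JobSuccessCriteriaMet) &&
			c.Status == corev1.ConditionTrue {
			return true
		}
	}
	return false
}
```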
Can you put an item on the agenda for the next SIG Apps meeting?
Sure, I'm happy to, but for May 13, as I'm on holiday next week (as are probably a few other folks).
What happened?
When a Job is declared Failed, the running Pods still count as ready.
This causes problems for higher-level controllers that use the Failed/Complete conditions to do usage accounting. If the Job is marked as finished before all of its Pods finish, the accounting is inaccurate.
What did you expect to happen?
A few options (not necessarily exclusive), in order of my preference:

1. Wait until all Pods are finished (ready=0, terminating=0) before adding the Failed condition. Controllers can use the FailureTarget or SuccessCriteriaMet conditions to provide early feedback that a Job is marked for failure.
2. Do not count Pods with a deletionTimestamp as ready. (Possibly a breaking change)
3. We just consider ready=0 if setting a finished condition.
4. Add the Failed condition when we do today, but continue syncing the Job until the ready and terminating fields converge to zero.

Options 2 and 3 don't satisfy the requirement of proper accounting.
How can we reproduce it (as minimally and precisely as possible)?
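No reproduction was attached; below is a minimal sketch of one, built with the batchv1 Go types (names and image are illustrative). It is an Indexed Job in which index 0 fails immediately while the other indexes keep running; with backoffLimit=0 the Job should be declared Failed while the remaining pods are still ready, and then terminating:

```go
package reproexample

import (
	batchv1 "k8s.io/api/batch/v1"
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

func ptr[T any](v T) *T { return &v }

// reproJob builds an Indexed Job in which index 0 exits non-zero right
// away while the other indexes keep running. With backoffLimit=0 the
// Job controller adds the Failed condition as soon as index 0 fails,
// while the remaining pods are still ready and, once deleted, still
// terminating for the duration of their grace period.
func reproJob() *batchv1.Job {
	return &batchv1.Job{
		ObjectMeta: metav1.ObjectMeta{Name: "ready-after-failed"},
		Spec: batchv1.JobSpec{
			CompletionMode: ptr(batchv1.IndexedCompletion),
			Completions:    ptr[int32](4),
			Parallelism:    ptr[int32](4),
			BackoffLimit:   ptr[int32](0),
			Template: corev1.PodTemplateSpec{
				Spec: corev1.PodSpec{
					RestartPolicy: corev1.RestartPolicyNever,
					Containers: []corev1.Container{{
						Name:  "main",
						Image: "busybox", // placeholder image
						Command: []string{"sh", "-c",
							`[ "$JOB_COMPLETION_INDEX" = "0" ] && exit 1; sleep 300`},
					}},
				},
			},
		},
	}
}
```

After creating this Job, inspecting its status (e.g. `kubectl get job ready-after-failed -o yaml`) should show the Failed condition alongside a non-zero status.ready (and status.terminating, if the JobPodReplacementPolicy gate is on) for roughly the pods' termination grace period.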
Anything else we need to know?
No response
Kubernetes version
ANY