
A Job might finish with ready!=0, terminating!=0 #123775

Open
alculquicondor opened this issue Mar 6, 2024 · 30 comments
Labels
kind/bug Categorizes issue or PR as related to a bug. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. sig/apps Categorizes an issue or PR as relevant to SIG Apps. wg/batch Categorizes an issue or PR as relevant to WG Batch.

Comments

@alculquicondor
Member

alculquicondor commented Mar 6, 2024

What happened?

When a Job is declared Failed, the running Pods still count as ready.
This causes problems for higher-level controllers that use the Failed/Complete conditions to do usage accounting. If the Job is marked as finished before all the Pods finish, the accounting is inaccurate.

What did you expect to happen?

A few options (not necessarily mutually exclusive), in order of my preference:

  1. The job not to be declared Completed/Failed until all the Pods have finished. This is possibly a breaking change. As a mitigation, we can use the FailureTarget or SuccessCriteriaMet conditions to provide early feedback to controllers that a Job is marked for failure.
  2. Do not count Pods with a deletionTimestamp as ready. (Possibly a breaking change)
  3. Just report ready=0 when setting a finished condition.
  4. Continue syncing a Job even after the Failed condition is added, until ready/terminating fields are zero.

Options 2 and 3 don't satisfy the requirement of proper accounting.
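
For illustration, a minimal sketch of the "fully drained" check that options 1 and 4 both hinge on (the helper name is hypothetical, not existing Job controller code):

```go
// Sketch only: a hypothetical helper illustrating when a Job would be
// considered fully drained under options 1 and 4; this is not code from
// kube-controller-manager.
package example

import (
	batchv1 "k8s.io/api/batch/v1"
	"k8s.io/utils/ptr"
)

// jobFullyDrained returns true once no pods for the Job are active, ready,
// or terminating, i.e. no further pod-driven status updates are expected.
func jobFullyDrained(job *batchv1.Job) bool {
	return job.Status.Active == 0 &&
		ptr.Deref(job.Status.Ready, 0) == 0 &&
		ptr.Deref(job.Status.Terminating, 0) == 0
}
```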

How can we reproduce it (as minimally and precisely as possible)?

  1. Use this Job (note backoffLimit: 0):
```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: indexed-job
  labels:
    jobgroup: indexedjob
spec:
  completions: 4
  parallelism: 4
  backoffLimit: 0
  completionMode: Indexed
  podReplacementPolicy: Failed
  template:
    metadata:
      labels:
        jobgroup: indexedjob
    spec:
      restartPolicy: Never
      containers:
      - name: 'worker'
        image: 'centos:7'
        command:
        - "sh"
        - "-c"
        - "echo 'hello' && sleep 120 && echo 'bye'"
```
  2. Delete one of the pods with kubectl (see the observation sketch below).
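
Not part of the original reproduction steps, but a rough client-go sketch for observing the counters once the Failed condition appears; the namespace (default) and kubeconfig location are assumptions:

```go
// Sketch only: polls the Job status and prints ready/terminating once the
// Failed condition is set, to observe the inconsistency described above.
package main

import (
	"context"
	"fmt"
	"time"

	batchv1 "k8s.io/api/batch/v1"
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
	"k8s.io/utils/ptr"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)
	for {
		job, err := client.BatchV1().Jobs("default").Get(context.TODO(), "indexed-job", metav1.GetOptions{})
		if err != nil {
			panic(err)
		}
		for _, c := range job.Status.Conditions {
			if c.Type == batchv1.JobFailed && c.Status == corev1.ConditionTrue {
				// With the current behavior, ready and terminating can still
				// be non-zero here even though the Job is already Failed.
				fmt.Printf("Failed=True ready=%d terminating=%d active=%d\n",
					ptr.Deref(job.Status.Ready, 0),
					ptr.Deref(job.Status.Terminating, 0),
					job.Status.Active)
				return
			}
		}
		time.Sleep(time.Second)
	}
}
```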

Anything else we need to know?

No response

Kubernetes version

$ kubectl version
# paste output here

ANY

Cloud provider

ANY

OS version

# On Linux:
$ cat /etc/os-release
# paste output here
$ uname -a
# paste output here

# On Windows:
C:\> wmic os get Caption, Version, BuildNumber, OSArchitecture
# paste output here

Install tools

Container runtime (CRI) and version (if applicable)

Related plugins (CNI, CSI, ...) and versions (if applicable)

@alculquicondor alculquicondor added the kind/bug Categorizes issue or PR as related to a bug. label Mar 6, 2024
@k8s-ci-robot k8s-ci-robot added needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Mar 6, 2024
@k8s-ci-robot
Contributor

This issue is currently awaiting triage.

If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@alculquicondor
Member Author

cc @soltysh @deads2k for thoughts.

I listed the options according to my preference towards a consistent API, even if it's somewhat of a breaking change.

@alculquicondor
Member Author

Another alternative would be to do something like option 3 by default, but option 1 when using podReplacementPolicy: Failed or a podFailurePolicy, which are beta features.

@tenzen-y
Member

tenzen-y commented Mar 6, 2024

/cc

@alculquicondor
Member Author

/sig apps
/wg batch

@k8s-ci-robot k8s-ci-robot added sig/apps Categorizes an issue or PR as relevant to SIG Apps. wg/batch Categorizes an issue or PR as relevant to WG Batch. and removed needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Mar 6, 2024
@kannon92
Contributor

kannon92 commented Mar 6, 2024

Reading this:

When a Job is declared Failed, the running Pods still count as ready.

This seems to be working as intended (WAI).

Ready and running are not related states.

@mimowo
Contributor

mimowo commented Mar 7, 2024

The issue also extends to terminating (status from the same Job):

```yaml
  status:
    conditions:
    - lastProbeTime: "2024-03-07T08:38:21Z"
      lastTransitionTime: "2024-03-07T08:38:21Z"
      message: Job has reached the specified backoff limit
      reason: BackoffLimitExceeded
      status: "True"
      type: Failed
    failed: 4
    ready: 2
    startTime: "2024-03-07T08:37:08Z"
    terminating: 1
    uncountedTerminatedPods: {}
```

@AxeZhan
Member

AxeZhan commented Mar 7, 2024

/cc

@mimowo
Contributor

mimowo commented Mar 7, 2024

Since this affects both ready and terminating, I think we could fold the fix under PodReplacementPolicy, because its Motivation section explicitly says: "This new field can also be used by queueing controllers, such as Kueue, to track the number of terminating pods to calculate quotas.". The KEP also states a similar goal ("Job controller will have a new status field where we include the number of terminating pods.") and covers it in Story 3.

This is not achieved if the Job controller stops tracking "terminating" pods before the pods complete. In particular, once Kueue (or another Job scheduling controller) sees that a Job is "Failed", it is free to admit more Jobs to run, but the resources might still be occupied by the terminating pods (Kueue has no way to know the number of terminating pods after "Failed" or "Complete" is added). This may lead to undesired scale-up by the Cluster Autoscaler, because for a short while there are more pods than node resources.

If we want to protect this behind a beta-level feature, then I would suggest PodReplacementPolicy, and to wait in enactJobFinished for ready=0 and terminating=0, probably here, where we already wait for uncountedTerminatedPods to be empty.

This is a fully non-breaking change, just "fixing" a beta-level feature.

Having said that, I would also be fine with a breaking change, doing the waiting regardless of the feature gate.
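
As a sketch of the consumer side (hypothetical logic, not actual Kueue code), a quota controller would only be safe to release a Job's resources once both the conditions and the counters agree:

```go
// Sketch only: hypothetical quota-controller logic, not Kueue code. Resources
// are treated as released only when the Job is terminal AND its pod counters
// have drained, which is the guarantee this issue asks the Job status to give.
package example

import (
	batchv1 "k8s.io/api/batch/v1"
	corev1 "k8s.io/api/core/v1"
	"k8s.io/utils/ptr"
)

func hasTrueCondition(job *batchv1.Job, t batchv1.JobConditionType) bool {
	for _, c := range job.Status.Conditions {
		if c.Type == t && c.Status == corev1.ConditionTrue {
			return true
		}
	}
	return false
}

// resourcesReleased reports whether admitting new workloads against this
// Job's quota can no longer cause overcommit from still-terminating pods.
func resourcesReleased(job *batchv1.Job) bool {
	finished := hasTrueCondition(job, batchv1.JobFailed) || hasTrueCondition(job, batchv1.JobComplete)
	return finished &&
		job.Status.Active == 0 &&
		ptr.Deref(job.Status.Ready, 0) == 0 &&
		ptr.Deref(job.Status.Terminating, 0) == 0
}
```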

@mimowo
Contributor

mimowo commented Mar 7, 2024

This was discovered during recent work on kubernetes/enhancements#4368 and affects the recent changes to the API field comments. I propose to revert the inaccurate comments in #123792, and to track this in the graduation criteria for the feature; I opened the KEP update: kubernetes/enhancements#4542.

@soltysh
Contributor

soltysh commented Mar 7, 2024

We should probably stress the fact that this is true only for failed Jobs, not succeeded ones. The latter would be a bug; the former is a separate discussion. I'm looking at this situation like this: what needs to happen so that this Job won't fail? Why do you think we need to wait for the remaining 3 pods to fail? Is it possible the Job will fix itself and fulfill the contract of a success? The only place I can think of where this kind of situation might be problematic is when you have a SuccessPolicy specified; that's the only place where I could see a Job with a failed pod that should not fail. In all other cases, I think the extra wait doesn't give us anything, does it?

@tenzen-y this particular case described above might be important for you when you're working on your feature. It is probably worth adding an explicit test case covering that.

@mimowo
Contributor

mimowo commented Mar 7, 2024

Why do you think we need to wait for the remaining 3 pods to fail?

Well, I don't think this is needed for the basic job functionality.

Nonetheless, this is expected by Job schedulers like Kueue. The 3 remaining pods still consume resources for 30s by default. Then Kueue, seeing the Job is "Failed", schedules a new one, and we are running overcommitted with unschedulable pods; the Cluster Autoscaler intervenes by creating new nodes, which is not desired by the user, who pays for the extra nodes.

We aimed to solve this as part of PodReplacementPolicy by introducing the terminating field that Kueue could watch. It works fine for terminating pods as long as the Job itself is not Failed.

@alculquicondor
Member Author

Why do you think we need to wait for the remaining 3 pods to fail? Is it possible the job will fix itself and fulfill the contract of a success?

The Job is already meant to fail at this point. The only motivation for waiting is to clearly signal that there are no more pods running for this Job and no more status updates are necessary. Adding a Failed condition only at the end is an easy way for users or controllers to know when the Job has reached its final state.

An alternate solution would be to continue syncing a Job even after the Failed condition is added, until ready/terminating fields are zero.

@alculquicondor
Member Author

/assign @dejanzele
Since you volunteered to work on this once we reach an agreement.

@tenzen-y
Member

tenzen-y commented Mar 7, 2024

@tenzen-y this particular case described above might be important for you, when you'll be working on your feature. Probably worth adding an explicit test case covering that.

@soltysh Regarding the JobSuccessPolicy test case, I added a dedicated unit test here:

```go
// In the current mechanism, the job controller adds Complete condition to Job
// even if some running pods still remain.
// So, we need to revisit here before we graduate the JobSuccessPolicy to beta.
// TODO(#123775): A Job might finish with ready!=0
// REF: https://github.com/kubernetes/kubernetes/issues/123775
"job with successPolicy; job has SuccessCriteriaMet and Complete condition when job meets to successPolicy and some pods still are running": {
	enableJobSuccessPolicy: true,
	job: batch.Job{
		TypeMeta:   validTypeMeta,
		ObjectMeta: validObjectMeta,
		Spec: batch.JobSpec{
			Selector:             validSelector,
			Template:             validTemplate,
			CompletionMode:       completionModePtr(batch.IndexedCompletion),
			Parallelism:          ptr.To[int32](3),
			Completions:          ptr.To[int32](3),
			BackoffLimit:         ptr.To[int32](math.MaxInt32),
			BackoffLimitPerIndex: ptr.To[int32](3),
			SuccessPolicy: &batch.SuccessPolicy{
				Rules: []batch.SuccessPolicyRule{{
					SucceededIndexes: ptr.To("0,1"),
					SucceededCount:   ptr.To[int32](1),
				}},
			},
		},
		Status: batch.JobStatus{
			Conditions: []batch.JobCondition{
				{
					Type:    batch.JobSuccessCriteriaMet,
					Status:  v1.ConditionTrue,
					Reason:  batch.JobReasonSuccessPolicy,
					Message: "Matched rules at index 0",
				},
			},
		},
	},
	pods: []v1.Pod{
		*buildPod().uid("a1").index("0").phase(v1.PodFailed).trackingFinalizer().Pod,
		*buildPod().uid("a2").index("1").phase(v1.PodRunning).trackingFinalizer().Pod,
		*buildPod().uid("b").index("1").phase(v1.PodSucceeded).trackingFinalizer().Pod,
		*buildPod().uid("c").index("2").phase(v1.PodRunning).trackingFinalizer().Pod,
	},
	wantStatus: batch.JobStatus{
		Failed:                  1,
		Succeeded:               1,
		Terminating:             ptr.To[int32](0),
		CompletedIndexes:        "1",
		UncountedTerminatedPods: &batch.UncountedTerminatedPods{},
		Conditions: []batch.JobCondition{
			{
				Type:    batch.JobSuccessCriteriaMet,
				Status:  v1.ConditionTrue,
				Reason:  batch.JobReasonSuccessPolicy,
				Message: "Matched rules at index 0",
			},
			{
				Type:    batch.JobComplete,
				Status:  v1.ConditionTrue,
				Reason:  batch.JobReasonSuccessPolicy,
				Message: "Matched rules at index 0",
			},
		},
	},
},
```

@alculquicondor
Member Author

I put this topic on the agenda for the next WG meeting, but I hope we can reach consensus before that, given that the test freeze is close.

@kannon92
Contributor

kannon92 commented Mar 11, 2024

@rphillips and I were discussing this. One concern I have is the following (for example, see #122824 (comment)).

By design, readiness probes keep running while a pod is terminating, to help with graceful shutdown of services. So it may be a bit confusing, but it is possible for a terminating pod to be considered ready. Ready just means that the readiness probe was successful.

I think this change could be fine, as it should be self-contained to Job/CronJob, but we should mention that I don't think it is a bug for a pod that is terminating to also be considered ready.
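
To illustrate (hypothetical helpers, not kubelet or Job controller code), the two signals live in different places on the Pod object and are not mutually exclusive:

```go
// Sketch only: "terminating" (deletionTimestamp set) and "ready" (PodReady
// condition) are independent Pod signals, so a terminating pod whose
// readiness probe keeps passing is still reported as ready.
package example

import corev1 "k8s.io/api/core/v1"

func isTerminating(pod *corev1.Pod) bool {
	return pod.DeletionTimestamp != nil
}

func isReady(pod *corev1.Pod) bool {
	for _, c := range pod.Status.Conditions {
		if c.Type == corev1.PodReady {
			return c.Status == corev1.ConditionTrue
		}
	}
	return false
}

// Both isTerminating(p) and isReady(p) can be true for the same pod, which is
// why a Job can report ready!=0 and terminating!=0 at the same time.
```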

@alculquicondor alculquicondor changed the title A Job might finish with ready!=0 A Job might finish with ready!=0, terminating!=0 Mar 14, 2024
@alculquicondor
Member Author

From the WG Batch meeting, the sentiment was that a solution close to option 1 would be the best path forward. However, the implications are too big to consider in 1.30. It should be revisited for 1.31, either as a standalone KEP or as part of failure policies, replacement policies or success policies.

@alculquicondor
Member Author

I also updated the issue description to incorporate motivation and extra thoughts on each option.

@tenzen-y
Member

From the WG Batch meeting, the sentiment was that a solution close to option 1 would be the best path forward. However, the implications are too big to consider in 1.30. It should be revisited for 1.31, either as a standalone KEP or as part of failure policies, replacement policies or success policies.

Since the SuccessPolicy is still in the alpha stage, it might be a good starting point.

@soltysh
Contributor

soltysh commented Mar 14, 2024

Nonetheless, this is expected for job schedulers like Kueue. The 3 remaining pods still consume resources for 30s by default. Then, Kueue seeing the Job is "Failed" schedules a new one and we are running overcommitted with unschedulable pods, and Cluster Autoscaler intervenes creating new nodes, which is not desired by user who pays for the extra nodes.

Shouldn't Kueue then not only wait for the terminal condition, but also for the remaining pods to finish as well?

From the WG Batch meeting, the sentiment was that a solution close to option 1 would be the best path forward. However, the implications are too big to consider in 1.30. It should be revisited for 1.31, either as a standalone KEP or as part of failure policies, replacement policies or success policies.

I don't think that's a good approach. I'm personally leaning towards option 4, which is backwards compatible, with eventual doc updates to strengthen the contract of the API.

I will say that I'm relieved to hear this is not blocking 1.30, and we can continue the discussions for 1.31 😄

@alculquicondor
Member Author

Shouldn't Kueue then not only wait for the terminal condition, but also for the remaining pods to finish as well?

It could, but with separation of concerns in mind, Kueue only reacts to the Job status.

I still prefer option 1 for its simplicity: users and controllers only need to look at one place to determine whether everything is done for the job.

But option 4 is acceptable, IMO.

@kannon92
Contributor

For option 4, I think one would need to check whether the kubelet actually sets ready to 0 once the pod is completed (failed or succeeded).

We run readiness probes during termination, so they could be true or false while terminating. I have not looked into what happens once a pod is complete (whether ready gets set to false).

I personally like scoping this to option 1 (or the Job controller), as there is more control over this behavior. If we go the route where the kubelet sets ready to false when pods have failed, it could impact other workloads.

@mimowo
Contributor

mimowo commented Mar 14, 2024

I also prefer, for simplicity of API clients, to only add terminal conditions (like Failed or Complete) when there will be no more updates to the status. Otherwise clients need to monitor until ready=0 and terminating=0, and maybe more things in the future, which could become problematic.

Also, I'm not sure why delaying setting the Failed condition is a "breaking change". It delays setting the Failed condition for deleted pods by the graceful termination period, which is typically 30s. However, from the API client's POV the behavior is consistent.

When I initially introduced the validation rule for terminating=0 when Failed=true, while working on managedBy, I didn't realize there was any change, because the API server kept rejecting the setting of the Failed condition by the Job controller until terminating=0. All conformance tests for Job were passing. So this is not a change that would "break a client", but it would make clients wait longer in some cases.

Having said that, I'm also OK with option 4, but maybe then we should have an extra condition, 'Terminal', to indicate no more updates, to provide a future-proof check for API clients?

@dejanzele
Contributor

Hi all,

I just wanted to kick-start the discussion again, where did we leave it?

I see the initial sentiment was for option 1, and then it started to shift to option 4. Personally, I think I am more inclined towards option 4: if we have a ready status field, I'd expect it to be in the correct state no matter what, as having something like failed: 4, active: 3 and 0 pods is a bit misleading.

I volunteered to work on the fix implementation, and before I start I wanted to double-check: are we in agreement, or are there still topics for discussion?

@mimowo
Contributor

mimowo commented Apr 24, 2024

I just wanted to kick-start the discussion again, where did we leave it?

I think the two outstanding options are 1 and 4. Addressing this issue is part of the Beta graduation criteria for managedBy, so I'm also interested in the conclusion.

I'd expect it to be in the correct state no matter what, as having something like failed: 4, active: 3 and 0 pods is a bit misleading.

In option 1 we will wait until active=0, ready=0, terminating=0 before we set Failed=True or Complete=True.
A Job which is doomed to fail (and awaits pod terminations) will have the FailureTarget condition added early.

One nice property of option 1 is that we could verify that active=0, ready=0, terminating=0 in the status validation introduced by the managedBy KEP. Maybe we could do an analogous validation in option 4 if we had a condition indicating that all pods are terminated and there are no more status updates to the Job.

@dejanzele
Contributor

If I understand correctly, options 1 and 4 are similar to some degree. The difference is that in option 1 we fail the Job only when all Pods complete and all status fields converge to their correct values, meaning the Job reaches the Failed state later; in option 4 the Job reaches the Failed state earlier, but we continue syncing it until all the status fields converge to their correct values.

@mimowo
Contributor

mimowo commented Apr 24, 2024

Correct. I think that ideally (if this were a new API), option 1 is enough: a user is only interested in two states:
a) the job is doomed to fail or succeed
b) the job is terminal (no pods are running)

To reflect these two states, FailureTarget and Failed are enough.

The "only" issue is that historically we didn't have FailureTarget, and Failed played more the role of "doomed to fail". Also, FailureTarget is currently used only for the pod failure policy, so we would need to extend its scope. Finally, some clients are only interested in knowing whether a Job is doomed to fail, and they watch for the "Failed" condition; for that purpose they would now need to watch for "FailureTarget" or "Failed".
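
A small sketch of how a client might consume that two-stage signal under option 1 semantics (hypothetical handler; as noted above, FailureTarget is set today only by the pod failure policy path):

```go
// Sketch only: hypothetical client logic assuming option 1 semantics, where
// FailureTarget appears early ("doomed to fail") and Failed appears only once
// no pods remain.
package example

import (
	batchv1 "k8s.io/api/batch/v1"
	corev1 "k8s.io/api/core/v1"
)

func condTrue(job *batchv1.Job, t batchv1.JobConditionType) bool {
	for _, c := range job.Status.Conditions {
		if c.Type == t && c.Status == corev1.ConditionTrue {
			return true
		}
	}
	return false
}

func handle(job *batchv1.Job) {
	switch {
	case condTrue(job, batchv1.JobFailed):
		// Terminal: no pods left; safe to release resources and stop watching.
	case condTrue(job, batchv1.JobFailureTarget):
		// Doomed to fail, but pods may still be running or terminating.
	}
}
```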

@alculquicondor
Member Author

Can you put an item in the agenda for the next SIG Apps meeting?

@mimowo
Contributor

mimowo commented Apr 24, 2024

Can you put an item in the agenda for the next SIG Apps meeting?

Sure, I'm happy to, but for May 13, as I'm on holiday next week (as are probably a few other folks).
