
A Job might finish with ready!=0, terminating!=0 #123775

Open
alculquicondor opened this issue Mar 6, 2024 · 30 comments
Labels
kind/bug Categorizes issue or PR as related to a bug. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. sig/apps Categorizes an issue or PR as relevant to SIG Apps. wg/batch Categorizes an issue or PR as relevant to WG Batch.

Comments

@alculquicondor
Member

alculquicondor commented Mar 6, 2024

What happened?

When a Job is declared Failed, the running Pods still count as ready.
This causes problems for higher-level controllers that use the Failed/Complete conditions to do usage accounting. If the Job is marked as finished before all the Pods finish, the accounting is inaccurate.

What did you expect to happen?

A few options (not necessarily mutually exclusive), in order of my preference:

  1. The job not to be declared Completed/Failed until all the Pods have finished. This is possibly a breaking change. As a mitigation, we can use the FailureTarget or SuccessCriteriaMet conditions to provide early feedback to controllers that a Job is marked for failure.
  2. Do not count Pods with a deletionTimestamp as ready. (Possibly a breaking change)
  3. Just report ready=0 when setting a finished condition.
  4. Continue syncing a Job even after the Failed condition is added, until ready/terminating fields are zero.

Options 2 and 3 don't satisfy the requirement of proper accounting.
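
For illustration, a minimal sketch of the "fully drained" check that options 1 and 4 both hinge on (the helper name is hypothetical, not existing Job controller code):

```go
// Sketch only: a hypothetical helper illustrating when a Job would be
// considered fully drained under options 1 and 4; this is not code from
// kube-controller-manager.
package example

import (
	batchv1 "k8s.io/api/batch/v1"
	"k8s.io/utils/ptr"
)

// jobFullyDrained returns true once no pods for the Job are active, ready,
// or terminating, i.e. no further pod-driven status updates are expected.
func jobFullyDrained(job *batchv1.Job) bool {
	return job.Status.Active == 0 &&
		ptr.Deref(job.Status.Ready, 0) == 0 &&
		ptr.Deref(job.Status.Terminating, 0) == 0
}
```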

How can we reproduce it (as minimally and precisely as possible)?

  1. Use this Job (note backoffLimit: 0):
```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: indexed-job
  labels:
    jobgroup: indexedjob
spec:
  completions: 4
  parallelism: 4
  backoffLimit: 0
  completionMode: Indexed
  podReplacementPolicy: Failed
  template:
    metadata:
      labels:
        jobgroup: indexedjob
    spec:
      restartPolicy: Never
      containers:
      - name: 'worker'
        image: 'centos:7'
        command:
        - "sh"
        - "-c"
        - "echo 'hello' && sleep 120 && echo 'bye'"
```
  2. Delete one of the pods with kubectl (see the observation sketch below).
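
Not part of the original reproduction steps, but a rough client-go sketch for observing the counters once the Failed condition appears; the namespace (default) and kubeconfig location are assumptions:

```go
// Sketch only: polls the Job status and prints ready/terminating once the
// Failed condition is set, to observe the inconsistency described above.
package main

import (
	"context"
	"fmt"
	"time"

	batchv1 "k8s.io/api/batch/v1"
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
	"k8s.io/utils/ptr"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)
	for {
		job, err := client.BatchV1().Jobs("default").Get(context.TODO(), "indexed-job", metav1.GetOptions{})
		if err != nil {
			panic(err)
		}
		for _, c := range job.Status.Conditions {
			if c.Type == batchv1.JobFailed && c.Status == corev1.ConditionTrue {
				// With the current behavior, ready and terminating can still
				// be non-zero here even though the Job is already Failed.
				fmt.Printf("Failed=True ready=%d terminating=%d active=%d\n",
					ptr.Deref(job.Status.Ready, 0),
					ptr.Deref(job.Status.Terminating, 0),
					job.Status.Active)
				return
			}
		}
		time.Sleep(time.Second)
	}
}
```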

Anything else we need to know?

No response

Kubernetes version

$ kubectl version
# paste output here

ANY

Cloud provider

ANY

OS version

# On Linux:
$ cat /etc/os-release
# paste output here
$ uname -a
# paste output here

# On Windows:
C:\> wmic os get Caption, Version, BuildNumber, OSArchitecture
# paste output here

Install tools

Container runtime (CRI) and version (if applicable)

Related plugins (CNI, CSI, ...) and versions (if applicable)

@alculquicondor alculquicondor added the kind/bug Categorizes issue or PR as related to a bug. label Mar 6, 2024
@k8s-ci-robot k8s-ci-robot added needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Mar 6, 2024
@k8s-ci-robot
Contributor

This issue is currently awaiting triage.

If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@alculquicondor
Member Author

cc @soltysh @deads2k for thoughts.

I listed the options according to my preference towards a consistent API, even if it's somewhat of a breaking change.

@alculquicondor
Member Author

Another alternative would be to do something like option 3 by default, but option 1 when using podReplacementPolicy: Failed or a podFailurePolicy, which are beta features.

@tenzen-y
Member

tenzen-y commented Mar 6, 2024

/cc

@alculquicondor
Member Author

/sig apps
/wg batch

@k8s-ci-robot k8s-ci-robot added sig/apps Categorizes an issue or PR as relevant to SIG Apps. wg/batch Categorizes an issue or PR as relevant to WG Batch. and removed needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Mar 6, 2024
@kannon92
Contributor

kannon92 commented Mar 6, 2024

Reading this:

When a Job is declared Failed, the running Pods still count as ready.

This seems to be working as intended (WAI).

Ready and running are not related states.

@mimowo
Contributor

mimowo commented Mar 7, 2024

The issue also extends to terminating (status from the same Job):

```yaml
  status:
    conditions:
    - lastProbeTime: "2024-03-07T08:38:21Z"
      lastTransitionTime: "2024-03-07T08:38:21Z"
      message: Job has reached the specified backoff limit
      reason: BackoffLimitExceeded
      status: "True"
      type: Failed
    failed: 4
    ready: 2
    startTime: "2024-03-07T08:37:08Z"
    terminating: 1
    uncountedTerminatedPods: {}
```

@AxeZhan
Member

AxeZhan commented Mar 7, 2024

/cc

@mimowo
Contributor

mimowo commented Mar 7, 2024

Since this affects both ready and terminating, I think we could fold the fix under PodReplacementPolicy, because its Motivation section explicitly says: "This new field can also be used by queueing controllers, such as Kueue, to track the number of terminating pods to calculate quotas.". The KEP also states a similar goal ("Job controller will have a new status field where we include the number of terminating pods.") and covers it in Story 3.

This is not achieved if the Job controller stops tracking "terminating" pods before the pods complete. In particular, once Kueue (or another Job scheduling controller) sees that a Job is "Failed", it is free to admit more Jobs to run, but the resources might still be occupied by the terminating pods (Kueue has no way to know the number of terminating pods after "Failed" or "Complete" is added). This may lead to undesired scale-up by the Cluster Autoscaler, because for a short while there are more pods than node resources.

If we want to protect this behind a beta-level feature, then I would suggest PodReplacementPolicy, and to wait in enactJobFinished for ready=0 and terminating=0, probably here, where we already wait for uncountedTerminatedPods to be empty.

This is a fully non-breaking change, just "fixing" a beta-level feature.

Having said that, I would also be fine with a breaking change, doing the waiting regardless of the feature gate.
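
As a sketch of the consumer side (hypothetical logic, not actual Kueue code), a quota controller would only be safe to release a Job's resources once both the conditions and the counters agree:

```go
// Sketch only: hypothetical quota-controller logic, not Kueue code. Resources
// are treated as released only when the Job is terminal AND its pod counters
// have drained, which is the guarantee this issue asks the Job status to give.
package example

import (
	batchv1 "k8s.io/api/batch/v1"
	corev1 "k8s.io/api/core/v1"
	"k8s.io/utils/ptr"
)

func hasTrueCondition(job *batchv1.Job, t batchv1.JobConditionType) bool {
	for _, c := range job.Status.Conditions {
		if c.Type == t && c.Status == corev1.ConditionTrue {
			return true
		}
	}
	return false
}

// resourcesReleased reports whether admitting new workloads against this
// Job's quota can no longer cause overcommit from still-terminating pods.
func resourcesReleased(job *batchv1.Job) bool {
	finished := hasTrueCondition(job, batchv1.JobFailed) || hasTrueCondition(job, batchv1.JobComplete)
	return finished &&
		job.Status.Active == 0 &&
		ptr.Deref(job.Status.Ready, 0) == 0 &&
		ptr.Deref(job.Status.Terminating, 0) == 0
}
```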

@mimowo
Contributor

mimowo commented Mar 7, 2024

This was discovered during recent work on kubernetes/enhancements#4368 and affects the recent changes to the API field comments. I propose to revert the inaccurate comments in #123792, and to track this in the graduation criteria for the feature; I opened the KEP update: kubernetes/enhancements#4542.

@soltysh
Contributor

soltysh commented Mar 7, 2024

We should probably stress the fact that this is true only for failed Jobs, not succeeded ones. The latter would be a bug; the former is a separate discussion. I'm looking at this situation like this: what needs to happen so that this Job won't fail? Why do you think we need to wait for the remaining 3 pods to fail? Is it possible the Job will fix itself and fulfill the contract of a success? The only place I can think of where this kind of situation might be problematic is when you have a SuccessPolicy specified; that's the only place where I could see a Job with a failed pod that should not fail. In all other cases, I think the extra wait doesn't give us anything, does it?

@tenzen-y this particular case described above might be important for you when you're working on your feature. It is probably worth adding an explicit test case covering that.

@mimowo
Contributor

mimowo commented Mar 7, 2024

Why do you think we need to wait for the remaining 3 pods to fail?

Well, I don't think this is needed for the basic job functionality.

Nonetheless, this is expected by Job schedulers like Kueue. The 3 remaining pods still consume resources for 30s by default. Then Kueue, seeing the Job is "Failed", schedules a new one, and we are running overcommitted with unschedulable pods; the Cluster Autoscaler intervenes by creating new nodes, which is not desired by the user, who pays for the extra nodes.

We aimed to solve this as part of PodReplacementPolicy by introducing the terminating field that Kueue could watch. It works fine for terminating pods as long as the Job itself is not Failed.

@alculquicondor
Member Author

Why do you think we need to wait for the remaining 3 pods to fail? Is it possible the job will fix itself and fulfill the contract of a success?

The Job is already meant to fail at this point. The only motivation for waiting is to clearly signal that there are no more pods running for this Job and no more status updates are necessary. Adding a Failed condition only at the end is an easy way for users or controllers to know when the Job has reached its final state.

An alternate solution would be to continue syncing a Job even after the Failed condition is added, until ready/terminating fields are zero.

@alculquicondor
Member Author

/assign @dejanzele
Since you volunteered to work on this once we reach an agreement.

@tenzen-y
Member

tenzen-y commented Mar 7, 2024

@tenzen-y this particular case described above might be important for you, when you'll be working on your feature. Probably worth adding an explicit test case covering that.

@soltysh Regarding the JobSuccessPolicy test case, I added a dedicated unit test here:

```go
// In the current mechanism, the job controller adds Complete condition to Job
// even if some running pods still remain.
// So, we need to revisit here before we graduate the JobSuccessPolicy to beta.
// TODO(#123775): A Job might finish with ready!=0
// REF: https://github.com/kubernetes/kubernetes/issues/123775
"job with successPolicy; job has SuccessCriteriaMet and Complete condition when job meets to successPolicy and some pods still are running": {
	enableJobSuccessPolicy: true,
	job: batch.Job{
		TypeMeta:   validTypeMeta,
		ObjectMeta: validObjectMeta,
		Spec: batch.JobSpec{
			Selector:             validSelector,
			Template:             validTemplate,
			CompletionMode:       completionModePtr(batch.IndexedCompletion),
			Parallelism:          ptr.To[int32](3),
			Completions:          ptr.To[int32](3),
			BackoffLimit:         ptr.To[int32](math.MaxInt32),
			BackoffLimitPerIndex: ptr.To[int32](3),
			SuccessPolicy: &batch.SuccessPolicy{
				Rules: []batch.SuccessPolicyRule{{
					SucceededIndexes: ptr.To("0,1"),
					SucceededCount:   ptr.To[int32](1),
				}},
			},
		},
		Status: batch.JobStatus{
			Conditions: []batch.JobCondition{
				{
					Type:    batch.JobSuccessCriteriaMet,
					Status:  v1.ConditionTrue,
					Reason:  batch.JobReasonSuccessPolicy,
					Message: "Matched rules at index 0",
				},
			},
		},
	},
	pods: []v1.Pod{
		*buildPod().uid("a1").index("0").phase(v1.PodFailed).trackingFinalizer().Pod,
		*buildPod().uid("a2").index("1").phase(v1.PodRunning).trackingFinalizer().Pod,
		*buildPod().uid("b").index("1").phase(v1.PodSucceeded).trackingFinalizer().Pod,
		*buildPod().uid("c").index("2").phase(v1.PodRunning).trackingFinalizer().Pod,
	},
	wantStatus: batch.JobStatus{
		Failed:                  1,
		Succeeded:               1,
		Terminating:             ptr.To[int32](0),
		CompletedIndexes:        "1",
		UncountedTerminatedPods: &batch.UncountedTerminatedPods{},
		Conditions: []batch.JobCondition{
			{
				Type:    batch.JobSuccessCriteriaMet,
				Status:  v1.ConditionTrue,
				Reason:  batch.JobReasonSuccessPolicy,
				Message: "Matched rules at index 0",
			},
			{
				Type:    batch.JobComplete,
				Status:  v1.ConditionTrue,
				Reason:  batch.JobReasonSuccessPolicy,
				Message: "Matched rules at index 0",
			},
		},
	},
},
```

@alculquicondor
Member Author

I put this topic on the agenda for the next WG meeting, but I hope we can reach consensus before that, given that the test freeze is close.

@kannon92
Contributor

kannon92 commented Mar 11, 2024

@rphillips and I were discussing this. One concern I have is the following (for example, see #122824 (comment)).

By design, readiness probes keep running while a pod is terminating, to help with graceful shutdown of services. So it may be a bit confusing, but it is possible for a terminating pod to be considered ready. Ready just means that the readiness probe was successful.

I think this change could be fine, as it should be self-contained to Job/CronJob, but we should mention that I don't think it is a bug for a pod that is terminating to also be considered ready.
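
To illustrate (hypothetical helpers, not kubelet or Job controller code), the two signals live in different places on the Pod object and are not mutually exclusive:

```go
// Sketch only: "terminating" (deletionTimestamp set) and "ready" (PodReady
// condition) are independent Pod signals, so a terminating pod whose
// readiness probe keeps passing is still reported as ready.
package example

import corev1 "k8s.io/api/core/v1"

func isTerminating(pod *corev1.Pod) bool {
	return pod.DeletionTimestamp != nil
}

func isReady(pod *corev1.Pod) bool {
	for _, c := range pod.Status.Conditions {
		if c.Type == corev1.PodReady {
			return c.Status == corev1.ConditionTrue
		}
	}
	return false
}

// Both isTerminating(p) and isReady(p) can be true for the same pod, which is
// why a Job can report ready!=0 and terminating!=0 at the same time.
```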

@alculquicondor alculquicondor changed the title A Job might finish with ready!=0 A Job might finish with ready!=0, terminating!=0 Mar 14, 2024
@alculquicondor
Member Author

From the WG Batch meeting, the sentiment was that a solution close to option 1 would be the best path forward. However, the implications are too big to consider in 1.30. It should be revisited for 1.31, either as a standalone KEP or as part of failure policies, replacement policies or success policies.

@alculquicondor
Member Author

I also updated the issue description to incorporate motivation and extra thoughts on each option.

@tenzen-y
Member

From the WG Batch meeting, the sentiment was that a solution close to option 1 would be the best path forward. However, the implications are too big to consider in 1.30. It should be revisited for 1.31, either as a standalone KEP or as part of failure policies, replacement policies or success policies.

Since the SuccessPolicy is still in the alpha stage, it might be a good starting point.

@soltysh
Contributor

soltysh commented Mar 14, 2024

Nonetheless, this is expected for job schedulers like Kueue. The 3 remaining pods still consume resources for 30s by default. Then, Kueue seeing the Job is "Failed" schedules a new one and we are running overcommitted with unschedulable pods, and Cluster Autoscaler intervenes creating new nodes, which is not desired by user who pays for the extra nodes.

Shouldn't Kueue then not only wait for the terminal condition, but also for the remaining pods to finish as well?

From the WG Batch meeting, the sentiment was that a solution close to option 1 would be the best path forward. However, the implications are too big to consider in 1.30. It should be revisited for 1.31, either as a standalone KEP or as part of failure policies, replacement policies or success policies.

I don't think that's a good approach. I'm personally leaning towards option 4, which is backwards compatible, with eventual doc updates to strengthen the contract of the API.

I will say that I'm relieved to hear this is not blocking 1.30, and we can continue the discussions for 1.31 😄

@alculquicondor
Member Author

Shouldn't Kueue then not only wait for the terminal condition, but also for the remaining pods to finish as well?

It could, but with separation of concerns in mind, Kueue only reacts to the Job status.

I still prefer option 1 for its simplicity: users and controllers only need to look at one place to determine whether everything is done for the job.

But option 4 is acceptable, IMO.

@kannon92
Contributor

For option 4, I think one would need to check whether the kubelet actually sets ready to 0 once the pod is completed (failed or succeeded).

We run readiness probes during termination, so they could be true or false while terminating. I have not looked into what happens once a pod is complete (whether ready gets set to false).

I personally like scoping this to option 1 (or the Job controller), as there is more control over this behavior. If we go the route where the kubelet sets ready to false when pods have failed, it could impact other workloads.

@mimowo
Contributor

mimowo commented Mar 14, 2024

I also prefer, for simplicity of API clients, to only add terminal conditions (like Failed or Complete) when there will be no more updates to the status. Otherwise clients need to monitor until ready=0 and terminating=0, and maybe more things in the future, which could become problematic.

Also, I'm not sure why delaying setting the Failed condition is a "breaking change". It delays setting the Failed condition for deleted pods by the graceful termination period, which is typically 30s. However, from the API client's POV the behavior is consistent.

When I initially introduced the validation rule for terminating=0 when Failed=true, while working on managedBy, I didn't realize there was any change, because the API server kept rejecting the setting of the Failed condition by the Job controller until terminating=0. All conformance tests for Job were passing. So this is not a change that would "break a client", but it would make clients wait longer in some cases.

Having said that, I'm also OK with option 4, but maybe then we should have an extra condition, 'Terminal', to indicate no more updates, to provide a future-proof check for API clients?

@dejanzele
Contributor

Hi all,

I just wanted to kick-start the discussion again, where did we leave it?

I see the initial sentiment was for option 1, and then it started to shift to option 4. Personally, I think I am more inclined towards option 4: if we have a ready status field, I'd expect it to be in the correct state no matter what, as having something like failed: 4, active: 3 and 0 pods is a bit misleading.

I volunteered to work on the fix implementation, and before I start I wanted to double-check: are we in agreement, or are there still topics for discussion?

@mimowo
Contributor

mimowo commented Apr 24, 2024

I just wanted to kick-start the discussion again, where did we leave it?

I think the two outstanding options are 1 and 4. Addressing this issue is part of the Beta graduation criteria for managedBy, so I'm also interested in the conclusion.

I'd expect it to be in the correct state no matter what, as having something like failed: 4, active: 3 and 0 pods is a bit misleading.

In option 1 we will wait until active=0, ready=0, terminating=0 before we set Failed=True or Complete=True.
A Job which is doomed to fail (and awaits pod terminations) will have the FailureTarget condition added early.

One nice property of option 1 is that we could verify that active=0, ready=0, terminating=0 in the status validation introduced by the managedBy KEP. Maybe we could do an analogous validation in option 4 if we had a condition indicating that all pods are terminated and there are no more status updates to the Job.

@dejanzele
Contributor

If I understand correctly, options 1 and 4 are similar to some degree. The difference is that in option 1 we fail the Job only when all Pods complete and all status fields converge to their correct values, meaning the Job reaches the Failed state later; in option 4 the Job reaches the Failed state earlier, but we continue syncing it until all the status fields converge to their correct values.

@mimowo
Contributor

mimowo commented Apr 24, 2024

Correct. I think that ideally (if this were a new API), option 1 is enough: a user is only interested in two states:
a) the job is doomed to fail or succeed
b) the job is terminal (no pods are running)

To reflect these two states, FailureTarget and Failed are enough.

The "only" issue is that historically we didn't have FailureTarget, and Failed played more the role of "doomed to fail". Also, FailureTarget is currently used only for the pod failure policy, so we would need to extend its scope. Finally, some clients are only interested in knowing whether a Job is doomed to fail, and they watch for the "Failed" condition; for that purpose they would now need to watch for "FailureTarget" or "Failed".
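
A small sketch of how a client might consume that two-stage signal under option 1 semantics (hypothetical handler; as noted above, FailureTarget is set today only by the pod failure policy path):

```go
// Sketch only: hypothetical client logic assuming option 1 semantics, where
// FailureTarget appears early ("doomed to fail") and Failed appears only once
// no pods remain.
package example

import (
	batchv1 "k8s.io/api/batch/v1"
	corev1 "k8s.io/api/core/v1"
)

func condTrue(job *batchv1.Job, t batchv1.JobConditionType) bool {
	for _, c := range job.Status.Conditions {
		if c.Type == t && c.Status == corev1.ConditionTrue {
			return true
		}
	}
	return false
}

func handle(job *batchv1.Job) {
	switch {
	case condTrue(job, batchv1.JobFailed):
		// Terminal: no pods left; safe to release resources and stop watching.
	case condTrue(job, batchv1.JobFailureTarget):
		// Doomed to fail, but pods may still be running or terminating.
	}
}
```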

@alculquicondor
Member Author

Can you put an item in the agenda for the next SIG Apps meeting?

@mimowo
Contributor

mimowo commented Apr 24, 2024

Can you put an item in the agenda for the next SIG Apps meeting?

Sure, I'm happy to, but for May 13, as I'm on holiday next week (as are probably a few other folks).
