Add JobSuccessPolicy Doc #45135
Changes from all commits
92a0032
105d90a
32fd60c
cec3c3f
fcdb477
d79de02
7465256
Original file line number | Diff line number | Diff line change | ||||
---|---|---|---|---|---|---|
|
@@ -550,6 +550,62 @@ terminating Pods only once these Pods reach the terminal `Failed` phase. This be | |||||
to `podReplacementPolicy: Failed`. For more information, see [Pod replacement policy](#pod-replacement-policy). | ||||||
{{< /note >}} | ||||||
|
||||||
## Success policy {#success-policy} | ||||||
|
||||||
{{< feature-state feature_gate_name="JobSuccessPolicy" >}} | ||||||
|
||||||
{{< note >}} | ||||||
You can only configure a success policy for an Indexed Job if you have the | ||||||
`JobSuccessPolicy` [feature gate](/docs/reference/command-line-tools-reference/feature-gates/) | ||||||
enabled in your cluster. | ||||||
{{< /note >}} | ||||||
|
||||||
When creating an Indexed Job, you can use `.spec.successPolicy` to define when the Job can be declared succeeded, | ||||||
based on the Pods that succeeded. | ||||||
|
||||||
By default, a Job succeeds when the number of succeeded Pods equals `.spec.completions`. | ||||||
These are some situations where you might want additional control for declaring a Job succeeded: | ||||||
|
||||||
* When running simulations with different parameters, | ||||||
you might not need all the simulations to succeed for the overall Job to be successful. | ||||||
* When following a leader-worker pattern, only the success of the leader determines the success or | ||||||
failure of a Job. Examples of this are frameworks like MPI and PyTorch. | ||||||
|
||||||
You can configure a success policy, in the `.spec.successPolicy` field, | ||||||
to meet the above use cases. This policy determines Job success based on the | ||||||
succeeded Pods. After the Job meets the success policy, the job controller terminates the lingering Pods. | ||||||
A success policy is defined by rules. Each rule can take one of the following forms: | ||||||
|
||||||
* When you specify the `succeededIndexes` only, | ||||||
once all indexes specified in the `succeededIndexes` succeed, the job controller marks the Job as succeeded. | ||||||
The `succeededIndexes` must be a list of intervals between 0 and `.spec.completions-1`. | ||||||
* When you specify the `succeededCount` only, | ||||||
once the number of succeeded indexes reaches the `succeededCount`, the job controller marks the Job as succeeded. | ||||||
* When you specify both `succeededIndexes` and `succeededCount`, | ||||||
once the number of succeeded indexes from the subset of indexes specified in the `succeededIndexes` reaches the `succeededCount`, | ||||||
the job controller marks the Job as succeeded. | ||||||
|
||||||
Note that when you specify multiple rules in the `.spec.successPolicy.rules`, | ||||||
the job controller evaluates the rules in order. Once the Job meets a rule, the job controller ignores remaining rules. | ||||||
|
||||||
Here is a manifest for a Job with `successPolicy`: | ||||||
|
||||||
{{% code_sample file="/controllers/job-success-policy.yaml" %}} | ||||||
|
||||||
In the example above, the rule of the success policy specifies both `succeededIndexes` and `succeededCount`. | ||||||
The job controller therefore marks the Job as succeeded and terminates the lingering Pods | ||||||
once any one of the indexes 0, 2, or 3 succeeds. | ||||||
nit: We might want to rephrase this question for better understanding. Ideally, we should be saying something like "In the example above, both
P.S. We can do this in a follow-up PR.
Thank you for this great suggestion!
Done. |
||||||
A Job that meets the success policy gets the `SuccessCriteriaMet` condition. | ||||||
Grammatical nit: Can be amended in a follow-up PR.
Suggested change
Done. |
||||||
After the removal of the lingering Pods is issued, the Job gets the `Complete` condition. | ||||||
Could we clarify here? The Job is considered complete after the Job Controller removes the lingering pods.
No, the job controller doesn't care if the lingering pods are actually removed. |
||||||
|
||||||
Note that `succeededIndexes` is a comma-separated list of indexes and intervals, such as `0,2-3`. | ||||||
Each interval is represented by the first and last elements of the series, separated by a hyphen. | ||||||
|
||||||
{{< note >}} | ||||||
When you specify both a success policy and some terminating policies such as `.spec.backoffLimit` and `.spec.podFailurePolicy`, | ||||||
once the Job meets either policy, the job controller respects the terminating policy and ignores the success policy. | ||||||
{{< /note >}} | ||||||
|
||||||
## Job termination and cleanup | ||||||
|
||||||
When a Job completes, no more Pods are created, but the Pods are [usually](#pod-backoff-failure-policy) not deleted either. | ||||||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,14 @@ | ||
--- | ||
title: JobSuccessPolicy | ||
content_type: feature_gate | ||
|
||
_build: | ||
list: never | ||
render: false | ||
|
||
stages: | ||
- stage: alpha | ||
defaultValue: false | ||
fromVersion: "1.30" | ||
--- | ||
Allow users to specify when a Job can be declared as succeeded based on the set of succeeded pods. |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,25 @@ | ||
apiVersion: batch/v1 | ||
kind: Job | ||
spec: | ||
  parallelism: 10 | ||
  completions: 10 | ||
  completionMode: Indexed # Required for the success policy | ||
  successPolicy: | ||
    rules: | ||
      - succeededIndexes: 0,2-3 | ||
        succeededCount: 1 | ||
  template: | ||
    spec: | ||
      containers: | ||
      - name: main | ||
        image: python | ||
        command: # Provided that at least one of the Pods with 0, 2, and 3 indexes has succeeded, | ||
                 # the overall Job is a success. | ||
          - python3 | ||
          - -c | ||
          - | | ||
            import os, sys | ||
            if os.environ.get("JOB_COMPLETION_INDEX") == "2": | ||
              sys.exit(0) | ||
            else: | ||
              sys.exit(1) |
nit (can be addressed later).
Done.
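
For readers following the rule semantics described in the documentation change, here is a minimal Python sketch of how the three rule forms and the in-order evaluation could be modeled. It is an illustration only, not the job controller's implementation: the helper names `parse_indexes`, `rule_met`, and `first_matching_rule` are hypothetical, and rules are assumed to be plain dictionaries shaped like the entries under `.spec.successPolicy.rules`.

```python
from __future__ import annotations

from typing import Optional


def parse_indexes(spec: str) -> set[int]:
    """Expand an interval string such as "0,2-3" into the set {0, 2, 3}."""
    indexes: set[int] = set()
    for part in spec.split(","):
        first, sep, last = part.strip().partition("-")
        # A bare number is treated as an interval of length one.
        end = int(last) if sep else int(first)
        indexes.update(range(int(first), end + 1))
    return indexes


def rule_met(rule: dict, succeeded: set[int], completions: int) -> bool:
    """Check a single rule against the set of succeeded indexes."""
    if "succeededIndexes" in rule:
        candidates = parse_indexes(rule["succeededIndexes"])
    else:
        # succeededCount only: every index of the Job is a candidate.
        candidates = set(range(completions))
    matched = succeeded & candidates
    # Without succeededCount, all candidate indexes must have succeeded.
    required = rule.get("succeededCount", len(candidates))
    return len(matched) >= required


def first_matching_rule(
    rules: list[dict], succeeded: set[int], completions: int
) -> Optional[int]:
    """Return the position of the first rule that is met, or None.

    Rules are checked in order and later rules are ignored after a match.
    """
    for position, rule in enumerate(rules):
        if rule_met(rule, succeeded, completions):
            return position
    return None


# The rule from the example manifest, with only index 2 succeeded.
rules = [{"succeededIndexes": "0,2-3", "succeededCount": 1}]
print(first_matching_rule(rules, {2}, completions=10))  # prints 0
```

With the manifest's rule, a single success among indexes 0, 2, and 3 is enough. Dropping `succeededCount` from the same rule would require all three of those indexes to succeed.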