diff --git a/content/en/docs/concepts/workloads/controllers/job.md b/content/en/docs/concepts/workloads/controllers/job.md
index f4462477f15d7..1701af7f79550 100644
--- a/content/en/docs/concepts/workloads/controllers/job.md
+++ b/content/en/docs/concepts/workloads/controllers/job.md
@@ -550,6 +550,62 @@ terminating Pods only once these Pods reach the terminal `Failed` phase. This be
 to `podReplacementPolicy: Failed`. For more information, see [Pod replacement policy](#pod-replacement-policy).
 {{< /note >}}
 
+## Success policy {#success-policy}
+
+{{< feature-state feature_gate_name="JobSuccessPolicy" >}}
+
+{{< note >}}
+You can only configure a success policy for an Indexed Job if you have the
+`JobSuccessPolicy` [feature gate](/docs/reference/command-line-tools-reference/feature-gates/)
+enabled in your cluster.
+{{< /note >}}
+
+When creating an Indexed Job, you can use `.spec.successPolicy` to define when the Job is declared succeeded,
+based on the Pods that succeeded.
+
+By default, a Job succeeds when the number of succeeded Pods equals `.spec.completions`.
+These are some situations where you might want additional control for declaring a Job succeeded:
+
+* When running simulations with different parameters,
+  you might not need all the simulations to succeed for the overall Job to be successful.
+* When following a leader-worker pattern, only the success of the leader determines the success or
+  failure of a Job. Examples of this are frameworks like MPI and PyTorch.
+
+You can configure a success policy, in the `.spec.successPolicy` field,
+to meet the above use cases. This policy can handle Job success based on the
+succeeded Pods. After the Job meets the success policy, the job controller terminates the lingering Pods.
+A success policy is defined by rules. Each rule can take one of the following forms:
+
+* When you specify the `succeededIndexes` only,
+  once all indexes specified in the `succeededIndexes` succeed, the job controller marks the Job as succeeded.
+  The `succeededIndexes` must be a list of intervals between 0 and `.spec.completions-1`.
+* When you specify the `succeededCount` only,
+  once the number of succeeded indexes reaches the `succeededCount`, the job controller marks the Job as succeeded.
+* When you specify both `succeededIndexes` and `succeededCount`,
+  once the number of succeeded indexes from the subset of indexes specified in the `succeededIndexes` reaches the `succeededCount`,
+  the job controller marks the Job as succeeded.
+
+Note that when you specify multiple rules in `.spec.successPolicy.rules`,
+the job controller evaluates the rules in order. Once the Job meets a rule, the job controller ignores the remaining rules.
+
+Here is a manifest for a Job with `successPolicy`:
+
+{{% code_sample file="/controllers/job-success-policy.yaml" %}}
+
+In the example above, the rule of the success policy specifies that
+the job controller should mark the Job as succeeded and terminate the lingering Pods
+once any one of the indexes 0, 2, and 3 succeeds.
+A Job that meets the success policy gets the `SuccessCriteriaMet` condition.
+After the removal of the lingering Pods is issued, the Job gets the `Complete` condition.
+
+Note that `succeededIndexes` is represented as a list of intervals separated by commas.
+Each interval is either a single index or a range given by the first and last index of the series, separated by a hyphen.
+
+{{< note >}}
+When you specify both a success policy and some terminating policies such as `.spec.backoffLimit` and `.spec.podFailurePolicy`,
+once the Job meets either policy, the job controller respects the terminating policy and ignores the success policy.
+{{< /note >}}
+
 ## Job termination and cleanup
 
 When a Job completes, no more Pods are created, but the Pods are [usually](#pod-backoff-failure-policy) not deleted either.
diff --git a/content/en/docs/reference/command-line-tools-reference/feature-gates/job-success-policy.md b/content/en/docs/reference/command-line-tools-reference/feature-gates/job-success-policy.md
new file mode 100644
index 0000000000000..601680357ccc9
--- /dev/null
+++ b/content/en/docs/reference/command-line-tools-reference/feature-gates/job-success-policy.md
@@ -0,0 +1,14 @@
+---
+title: JobSuccessPolicy
+content_type: feature_gate
+
+_build:
+  list: never
+  render: false
+
+stages:
+  - stage: alpha
+    defaultValue: false
+    fromVersion: "1.30"
+---
+Allow users to specify when a Job can be declared as succeeded based on the set of succeeded pods.
diff --git a/content/en/examples/controllers/job-success-policy.yaml b/content/en/examples/controllers/job-success-policy.yaml
new file mode 100644
index 0000000000000..1f7927b2f34fc
--- /dev/null
+++ b/content/en/examples/controllers/job-success-policy.yaml
@@ -0,0 +1,25 @@
+apiVersion: batch/v1
+kind: Job
+spec:
+  parallelism: 10
+  completions: 10
+  completionMode: Indexed # Required for the success policy
+  successPolicy:
+    rules:
+      - succeededIndexes: 0,2-3
+        succeededCount: 1
+  template:
+    spec:
+      containers:
+      - name: main
+        image: python
+        command:          # Provided that at least one of the Pods with 0, 2, and 3 indexes has succeeded,
+                          # the overall Job is a success.
+          - python3
+          - -c
+          - |
+            import os, sys
+            if os.environ.get("JOB_COMPLETION_INDEX") == "2":
+                sys.exit(0)
+            else:
+                sys.exit(1)
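
Review note: the rule semantics described in the `job.md` hunk (indexes only, count only, both, and in-order evaluation) can be hard to picture from prose alone. The following is a minimal Python sketch of those semantics, written for illustration only. It is not the job controller's implementation, and the helper names `parse_indexes` and `job_succeeded` are hypothetical. It assumes each rule is a plain dict with the optional keys `succeededIndexes` and `succeededCount`, and that the set of succeeded indexes is already known.

```python
def parse_indexes(spec):
    """Expand an interval string such as "0,2-3" into the set {0, 2, 3}."""
    indexes = set()
    for part in spec.split(","):
        # "2-3" -> ("2", "-", "3"); a single index "0" -> ("0", "", "")
        first, _, last = part.partition("-")
        indexes.update(range(int(first), int(last or first) + 1))
    return indexes


def job_succeeded(rules, succeeded, completions):
    """Return True as soon as one rule is met; rules are checked in order.

    rules       -- list of dicts (assumed shape, mirrors .spec.successPolicy.rules)
    succeeded   -- set of completion indexes whose Pods succeeded
    completions -- value of .spec.completions
    """
    for rule in rules:
        if "succeededIndexes" in rule:
            candidates = parse_indexes(rule["succeededIndexes"])
        else:
            # succeededCount only: every index of the Job is a candidate
            candidates = set(range(completions))
        matched = succeeded & candidates
        count = rule.get("succeededCount")
        if count is None:
            # succeededIndexes only: every listed index must have succeeded
            if matched == candidates:
                return True
        elif len(matched) >= count:
            # count given: enough candidate indexes have succeeded
            return True
    return False


# The rule from job-success-policy.yaml: any one of indexes 0, 2, 3.
example_rules = [{"succeededIndexes": "0,2-3", "succeededCount": 1}]
```

With `example_rules`, a Job in which only index 2 has succeeded satisfies the policy, which matches the script in the example manifest where only the Pod with `JOB_COMPLETION_INDEX` equal to `2` exits with status 0.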