Merge pull request #45135 from tenzen-y/job-success-policy-doc
Add JobSuccessPolicy Doc
k8s-ci-robot committed Mar 26, 2024
2 parents 3d33323 + 7465256 commit deb1be8
Showing 3 changed files with 95 additions and 0 deletions.
56 changes: 56 additions & 0 deletions content/en/docs/concepts/workloads/controllers/job.md
Expand Up @@ -550,6 +550,62 @@ terminating Pods only once these Pods reach the terminal `Failed` phase. This be
to `podReplacementPolicy: Failed`. For more information, see [Pod replacement policy](#pod-replacement-policy).
{{< /note >}}

## Success policy {#success-policy}

{{< feature-state feature_gate_name="JobSuccessPolicy" >}}

{{< note >}}
You can only configure a success policy for an Indexed Job if you have the
`JobSuccessPolicy` [feature gate](/docs/reference/command-line-tools-reference/feature-gates/)
enabled in your cluster.
{{< /note >}}

When you create an Indexed Job, you can use `.spec.successPolicy` to define when the Job
is declared succeeded, based on which Pods have succeeded.

By default, a Job succeeds when the number of succeeded Pods equals `.spec.completions`.
These are some situations where you might want additional control for declaring a Job succeeded:

* When running simulations with different parameters,
you might not need all the simulations to succeed for the overall Job to be successful.
* When following a leader-worker pattern, only the success of the leader determines the success or
failure of a Job. Examples include frameworks such as MPI and PyTorch.

You can configure a success policy, in the `.spec.successPolicy` field,
to meet the above use cases. This policy determines Job success based on the
succeeded Pods. After the Job meets the success policy, the job controller terminates the lingering Pods.
A success policy is defined by rules. Each rule can take one of the following forms:

* When you specify the `succeededIndexes` only,
once all indexes specified in the `succeededIndexes` succeed, the job controller marks the Job as succeeded.
The `succeededIndexes` must be a list of intervals between 0 and `.spec.completions-1`.
* When you specify the `succeededCount` only,
once the number of succeeded indexes reaches the `succeededCount`, the job controller marks the Job as succeeded.
* When you specify both `succeededIndexes` and `succeededCount`,
once the number of succeeded indexes from the subset of indexes specified in the `succeededIndexes` reaches the `succeededCount`,
the job controller marks the Job as succeeded.

Note that when you specify multiple rules in `.spec.successPolicy.rules`,
the job controller evaluates the rules in order. Once the Job meets a rule, the job controller ignores the remaining rules.
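
The three rule forms and the in-order evaluation can be sketched in a few lines of Python. This is only an illustration of the semantics described above, not the job controller's implementation; the function names `rule_met` and `first_met_rule`, and the representation of a rule as a dict holding an already-expanded set of indexes, are assumptions made for the example.

```python
from typing import Optional


def rule_met(succeeded: set[int],
             succeeded_indexes: Optional[set[int]] = None,
             succeeded_count: Optional[int] = None) -> bool:
    """Illustrative check of a single success policy rule (hypothetical helper)."""
    if succeeded_indexes is None and succeeded_count is None:
        raise ValueError("a rule needs succeededIndexes, succeededCount, or both")
    # Only indexes named by the rule are considered; without
    # succeededIndexes, every succeeded index is considered.
    relevant = succeeded if succeeded_indexes is None else succeeded & succeeded_indexes
    if succeeded_count is None:
        # succeededIndexes only: every listed index must have succeeded.
        return relevant == succeeded_indexes
    # succeededCount, alone or combined with succeededIndexes:
    # enough of the considered indexes must have succeeded.
    return len(relevant) >= succeeded_count


def first_met_rule(succeeded: set[int], rules: list[dict]) -> Optional[int]:
    """Return the position of the first rule that is met, or None if none is."""
    for position, rule in enumerate(rules):
        if rule_met(succeeded, rule.get("succeededIndexes"), rule.get("succeededCount")):
            return position  # the remaining rules are not evaluated
    return None
```

For instance, with a single rule `{"succeededIndexes": {0, 2, 3}, "succeededCount": 1}`, calling `first_met_rule({2}, rules)` returns `0`, while `first_met_rule({1}, rules)` returns `None` because index 1 is outside the subset named by the rule.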

Here is a manifest for a Job with `successPolicy`:

{{% code_sample file="/controllers/job-success-policy.yaml" %}}

In the example above, the rule of the success policy specifies that
the job controller should mark the Job as succeeded and terminate the lingering Pods
as soon as any one of the indexes 0, 2, and 3 succeeds.
A Job that meets the success policy gets the `SuccessCriteriaMet` condition.
After the removal of the lingering Pods is issued, the Job gets the `Complete` condition.

Note that `succeededIndexes` is written as a comma-separated list of indexes and intervals.
An interval is represented by the first and last element of the series, separated by a hyphen.
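
To make the notation concrete, here is a minimal Python sketch that expands a `succeededIndexes` value such as `0,2-3` from the manifest above into individual indexes. The helper name `parse_succeeded_indexes` is hypothetical, and the range check only mirrors the requirement that indexes lie between 0 and `.spec.completions-1`; the API server's actual validation may differ.

```python
def parse_succeeded_indexes(value: str, completions: int) -> list[int]:
    """Expand a value such as "0,2-3" into a sorted list of indexes (illustrative)."""
    indexes: set[int] = set()
    for part in value.split(","):
        # "2-3" splits into ("2", "-", "3"); a single index "0" into ("0", "", "").
        first, hyphen, last = part.strip().partition("-")
        start = int(first)
        end = int(last) if hyphen else start
        if not 0 <= start <= end < completions:
            raise ValueError(f"interval {part!r} is outside 0..{completions - 1}")
        indexes.update(range(start, end + 1))
    return sorted(indexes)


print(parse_succeeded_indexes("0,2-3", completions=10))
```

With `completions=10`, the value `0,2-3` expands to indexes 0, 2, and 3, which are the indexes the example rule refers to.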

{{< note >}}
When you specify both a success policy and some terminating policies such as `.spec.backoffLimit` and `.spec.podFailurePolicy`,
once the Job meets either policy, the job controller respects the terminating policy and ignores the success policy.
{{< /note >}}

## Job termination and cleanup

When a Job completes, no more Pods are created, but the Pods are [usually](#pod-backoff-failure-policy) not deleted either.
@@ -0,0 +1,14 @@
---
title: JobSuccessPolicy
content_type: feature_gate

_build:
  list: never
  render: false

stages:
  - stage: alpha
    defaultValue: false
    fromVersion: "1.30"
---
Allow users to specify when a Job can be declared as succeeded based on the set of succeeded pods.
25 changes: 25 additions & 0 deletions content/en/examples/controllers/job-success-policy.yaml
@@ -0,0 +1,25 @@
apiVersion: batch/v1
kind: Job
spec:
  parallelism: 10
  completions: 10
  completionMode: Indexed # Required for the success policy
  successPolicy:
    rules:
      - succeededIndexes: 0,2-3
        succeededCount: 1
  template:
    spec:
      containers:
      - name: main
        image: python
        command:          # Provided that at least one of the Pods with 0, 2, and 3 indexes has succeeded,
                          # the overall Job is a success.
          - python3
          - -c
          - |
            import os, sys
            if os.environ.get("JOB_COMPLETION_INDEX") == "2":
              sys.exit(0)
            else:
              sys.exit(1)