KEP-3998: Add JobSuccessPolicy Documentation
Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com>
tenzen-y committed Mar 8, 2024
1 parent e260aaa commit c6d4db6
Showing 3 changed files with 96 additions and 0 deletions.
57 changes: 57 additions & 0 deletions content/en/docs/concepts/workloads/controllers/job.md
@@ -1006,6 +1006,63 @@ status:
terminating: 3 # three Pods are terminating and have not yet reached the Failed phase
```

### Success policy {#success-policy}

{{< feature-state for_k8s_version="v1.30" state="alpha" >}}

{{< note >}}
You can only configure a success policy for an Indexed Job if you have the
`JobSuccessPolicy` [feature gate](/docs/reference/command-line-tools-reference/feature-gates/)
enabled in your cluster.
{{< /note >}}

When you run an Indexed Job, a success policy, defined in the `.spec.successPolicy` field,
lets you define when the Job can be declared as succeeded based on the Pods that have succeeded.

In some situations, you may want finer control over how Pod successes are
handled than `.spec.completions` provides.
Here are some example use cases:

* To reduce the cost of running workloads by avoiding unnecessary Pod executions,
you can terminate a Job as soon as one of its Pods succeeds.
* In batch workloads such as MPI and PyTorch, you may care only about a leader index
when determining the success or failure of a Job.

You can configure a success policy in the `.spec.successPolicy` field
to meet the above use cases. This policy determines Job success based on the
Pods that have succeeded. After the Job meets the success policy, the Job controller
terminates the lingering Pods.

When you specify only `.spec.successPolicy.rules[*].succeededIndexes`,
the Job is marked as succeeded once all indexes listed in `succeededIndexes` have succeeded.
The `succeededIndexes` value must list indexes between 0 and `.spec.completions - 1` and
must not contain duplicates. It is written as comma-separated numbers and intervals,
where an interval gives the first and last element of a series, separated by a hyphen.
For example, to specify 1, 3, 4, 5 and 7, set `succeededIndexes` to `1,3-5,7`.
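
The interval notation can be illustrated with a small parser. This is a minimal sketch, not the Job controller's implementation: the function name `parse_succeeded_indexes` and its error messages are hypothetical, and it checks only the constraints stated above (indexes within range, no duplicates).

```python
from __future__ import annotations


def parse_succeeded_indexes(spec: str, completions: int) -> list[int]:
    """Expand a string such as "1,3-5,7" into [1, 3, 4, 5, 7].

    Hypothetical helper for illustration only.
    """
    indexes: list[int] = []
    for part in spec.split(","):
        # "3-5" -> ("3", "-", "5"); "7" -> ("7", "", "")
        first, hyphen, last = part.partition("-")
        start = int(first)
        end = int(last) if hyphen else start
        if start > end:
            raise ValueError(f"invalid interval: {part!r}")
        indexes.extend(range(start, end + 1))
    if len(set(indexes)) != len(indexes):
        raise ValueError("succeededIndexes must not contain duplicates")
    if any(i < 0 or i >= completions for i in indexes):
        raise ValueError("indexes must be within 0 and completions - 1")
    return indexes
```

For instance, `parse_succeeded_indexes("1,3-5,7", completions=10)` expands the example above into the five individual indexes.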

When you specify only `.spec.successPolicy.rules[*].succeededCount`,
the Job is marked as succeeded once the number of succeeded indexes reaches `succeededCount`.

When you specify both `succeededIndexes` and `succeededCount`,
the Job is marked as succeeded once the number of succeeded indexes
among those listed in `succeededIndexes` reaches `succeededCount`.
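
The three cases can be summarized as a single predicate. The sketch below is an illustration under the semantics described here, not the controller's code; `rule_is_met` and its parameter names are hypothetical.

```python
from __future__ import annotations

from typing import Optional


def rule_is_met(
    succeeded: set[int],
    succeeded_indexes: Optional[set[int]] = None,
    succeeded_count: Optional[int] = None,
) -> bool:
    """Return True if one success policy rule is satisfied.

    `succeeded` is the set of completion indexes whose Pods succeeded.
    """
    if succeeded_indexes is None and succeeded_count is None:
        raise ValueError("a rule needs succeededIndexes, succeededCount, or both")
    if succeeded_count is None:
        # Only succeededIndexes: every listed index must have succeeded.
        return succeeded_indexes <= succeeded
    if succeeded_indexes is None:
        # Only succeededCount: any indexes count toward the threshold.
        return len(succeeded) >= succeeded_count
    # Both: only the listed indexes count toward the threshold.
    return len(succeeded & succeeded_indexes) >= succeeded_count
```

With indexes 0, 1, and 2 listed and a count of 1, a single success at index 1 satisfies the rule, while a success at index 3 alone does not.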

Note that when you specify multiple rules in `.spec.successPolicy.rules`,
the rules are evaluated in order. Once the Job meets a rule, the remaining rules are ignored.
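
Ordered evaluation amounts to a first-match search over the rule list. The following sketch assumes each rule is a plain dictionary with at least one of the two fields set; the function `first_met_rule` and this dictionary layout are hypothetical and chosen only for illustration.

```python
from __future__ import annotations


def first_met_rule(rules: list[dict], succeeded: set[int]) -> int | None:
    """Return the position of the first satisfied rule, or None.

    Each rule may hold "succeededIndexes" (a set of ints) and/or
    "succeededCount" (an int); at least one is assumed to be present.
    """
    for position, rule in enumerate(rules):
        candidates = rule.get("succeededIndexes")
        required = rule.get("succeededCount")
        # Without succeededIndexes, every succeeded index counts.
        matched = succeeded if candidates is None else succeeded & candidates
        # Without succeededCount, all listed indexes must succeed.
        needed = len(candidates) if required is None else required
        if len(matched) >= needed:
            return position  # later rules are not evaluated
    return None
```

If the first rule lists only index 0 and the second requires any two successes, successes at indexes 3 and 4 match the second rule, whereas a success at index 0 matches the first rule and the second is never consulted.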

Here is a manifest for a Job with `successPolicy`:

{{% code_sample file="/controllers/job-success-policy-example.yaml" %}}

In the example above, the success policy rule specifies that
the Job is marked as succeeded, and its lingering Pods are terminated,
as soon as any one of indexes 0, 1, and 2 succeeds.
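
To see why the rule is met, the manifest can be walked through by hand. The sketch below mirrors the container command from the example and deliberately ignores retries, timing, and `.spec.backoffLimit`; the names `exit_code` and `policy_met` are hypothetical.

```python
def exit_code(job_completion_index: str) -> int:
    # Mirrors the container command: only index "1" exits successfully.
    return 0 if job_completion_index == "1" else 1


completions = 10
# Indexes whose Pod would exit with status 0.
succeeded = {i for i in range(completions) if exit_code(str(i)) == 0}

rule_indexes = set(range(0, 3))  # succeededIndexes: 0-2
rule_count = 1                   # succeededCount: 1

# The rule counts only successes among the listed indexes.
policy_met = len(succeeded & rule_indexes) >= rule_count
```

Only index 1 succeeds, and because it is one of the listed indexes, the count of 1 is reached.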

{{< note >}}
When you specify both a success policy and terminating policies such as `.spec.backoffLimit` and `.spec.podFailurePolicy`,
and the Job meets both, the terminating policies take precedence and the success policy is ignored.
{{< /note >}}

## Alternatives

### Bare Pods
@@ -0,0 +1,14 @@
---
title: JobSuccessPolicy
content_type: feature_gate

_build:
  list: never
  render: false

stages:
  - stage: alpha
    defaultValue: false
    fromVersion: "1.30"
---
Allow users to specify when a Job can be declared as succeeded based on the set of succeeded Pods.
25 changes: 25 additions & 0 deletions content/en/examples/controllers/job-success-policy-example.yaml
@@ -0,0 +1,25 @@
apiVersion: batch/v1
kind: Job
spec:
  parallelism: 10
  completions: 10
  completionMode: Indexed # Required for the feature
  successPolicy:
    rules:
      - succeededIndexes: 0-2
        succeededCount: 1
  template:
    spec:
      containers:
      - name: main
        image: python
        command:          # The Job succeeds because one index among
                          # 0, 1, and 2 (index 1) succeeds.
          - python3
          - -c
          - |
            import os, sys
            if os.environ.get("JOB_COMPLETION_INDEX") == "1":
              sys.exit(0)
            else:
              sys.exit(1)
