From c6d4db6f2ce3553c652f67f992d3a0e10cfe2944 Mon Sep 17 00:00:00 2001
From: Yuki Iwai
Date: Wed, 14 Feb 2024 23:38:47 +0900
Subject: [PATCH] KEP-3998: Add JobSuccessPolicy Documentation

Signed-off-by: Yuki Iwai
---
 .../concepts/workloads/controllers/job.md     | 57 +++++++++++++++++++
 .../feature-gates/job-success-policy.md       | 14 +++++
 .../job-success-policy-example.yaml           | 25 ++++++++
 3 files changed, 96 insertions(+)
 create mode 100644 content/en/docs/reference/command-line-tools-reference/feature-gates/job-success-policy.md
 create mode 100644 content/en/examples/controllers/job-success-policy-example.yaml

diff --git a/content/en/docs/concepts/workloads/controllers/job.md b/content/en/docs/concepts/workloads/controllers/job.md
index be5775973b83d..dcf46f42152a7 100644
--- a/content/en/docs/concepts/workloads/controllers/job.md
+++ b/content/en/docs/concepts/workloads/controllers/job.md
@@ -1006,6 +1006,63 @@ status:
   terminating: 3 # three Pods are terminating and have not yet reached the Failed phase
 ```
 
+### Success policy {#success-policy}
+
+{{< feature-state for_k8s_version="v1.30" state="alpha" >}}
+
+{{< note >}}
+You can only configure a success policy for an Indexed Job if you have the
+`JobSuccessPolicy` [feature gate](/docs/reference/command-line-tools-reference/feature-gates/)
+enabled in your cluster.
+{{< /note >}}
+
+When you run an Indexed Job, a success policy, defined in the `.spec.successPolicy` field,
+lets you define when the Job can be declared as succeeded based on the set of succeeded Pods.
+
+In some situations, you may want finer control over how Pod successes are handled
+than `.spec.completions` provides.
+Here are some example use cases:
+
+* To reduce the cost of running workloads by not keeping unnecessary Pods running,
+  you can terminate a Job as soon as one of its Pods succeeds.
+* In batch workloads such as MPI and PyTorch, you can let a leader index alone
+  determine the success or failure of the Job.
+
+You can configure a success policy, in the `.spec.successPolicy` field,
+to meet the above use cases. This policy determines Job success based on the
+Pods that have succeeded. After the Job meets the success policy, the lingering Pods
+are terminated by the Job controller.
+
+When you specify only `.spec.successPolicy.rules[*].succeededIndexes`,
+the Job is marked as succeeded once all indexes listed in `succeededIndexes` have succeeded.
+The `succeededIndexes` value must be a list of indexes between 0 and `.spec.completions-1` and
+must not contain duplicate indexes. The list is written as comma-separated numbers and intervals.
+An interval is represented by its first and last elements, separated by a hyphen.
+For example, to specify 1, 3, 4, 5 and 7, set `succeededIndexes` to `1,3-5,7`.
+
+When you specify only `.spec.successPolicy.rules[*].succeededCount`,
+the Job is marked as succeeded once the number of succeeded indexes reaches `succeededCount`.
+
+When you specify both `succeededIndexes` and `succeededCount`,
+the Job is marked as succeeded once the number of succeeded indexes
+from the set listed in `succeededIndexes` reaches `succeededCount`.
+
+Note that when you specify multiple rules in `.spec.successPolicy.rules`,
+the rules are evaluated in order. Once the Job meets a rule, the remaining rules are ignored.
+
+Here is a manifest for a Job with `successPolicy`:
+
+{{% code_sample file="/controllers/job-success-policy-example.yaml" %}}
+
+In the example above, the rule of the success policy specifies that
+the Job should be marked as succeeded, and the lingering Pods terminated,
+as soon as any one of indexes 0, 1, and 2 has succeeded.
+
+{{< note >}}
+When you specify both a success policy and terminating policies such as `.spec.backoffLimit` and `.spec.podFailurePolicy`,
+and the Job meets both, the terminating policies take precedence and the success policy is ignored.
+{{< /note >}}
+
 ## Alternatives
 
 ### Bare Pods
diff --git a/content/en/docs/reference/command-line-tools-reference/feature-gates/job-success-policy.md b/content/en/docs/reference/command-line-tools-reference/feature-gates/job-success-policy.md
new file mode 100644
index 0000000000000..601680357ccc9
--- /dev/null
+++ b/content/en/docs/reference/command-line-tools-reference/feature-gates/job-success-policy.md
@@ -0,0 +1,14 @@
+---
+title: JobSuccessPolicy
+content_type: feature_gate
+
+_build:
+  list: never
+  render: false
+
+stages:
+  - stage: alpha
+    defaultValue: false
+    fromVersion: "1.30"
+---
+Allow users to specify when a Job can be declared as succeeded based on the set of succeeded Pods.
diff --git a/content/en/examples/controllers/job-success-policy-example.yaml b/content/en/examples/controllers/job-success-policy-example.yaml
new file mode 100644
index 0000000000000..5fdb6a274bd07
--- /dev/null
+++ b/content/en/examples/controllers/job-success-policy-example.yaml
@@ -0,0 +1,25 @@
+apiVersion: batch/v1
+kind: Job
+spec:
+  parallelism: 10
+  completions: 10
+  completionMode: Indexed # Required for the success policy
+  successPolicy:
+    rules:
+      - succeededIndexes: 0-2
+        succeededCount: 1
+  template:
+    spec:
+      containers:
+      - name: main
+        image: python
+        command:          # The Job succeeds as soon as one of
+                          # indexes 0, 1, and 2 succeeds.
+          - python3
+          - -c
+          - |
+            import os, sys
+            if os.environ.get("JOB_COMPLETION_INDEX") == "1":
+              sys.exit(0)
+            else:
+              sys.exit(1)