Add JobSuccessPolicy Doc #45135

Merged
merged 7 commits into from Mar 26, 2024
Changes from 4 commits
54 changes: 54 additions & 0 deletions content/en/docs/concepts/workloads/controllers/job.md
Expand Up @@ -550,6 +550,60 @@ terminating Pods only once these Pods reach the terminal `Failed` phase. This be
to `podReplacementPolicy: Failed`. For more information, see [Pod replacement policy](#pod-replacement-policy).
{{< /note >}}

## Success policy {#success-policy}

{{< feature-state feature_gate_name="JobSuccessPolicy" >}}

{{< note >}}
You can only configure a success policy for an Indexed Job if you have the
`JobSuccessPolicy` [feature gate](/docs/reference/command-line-tools-reference/feature-gates/)
enabled in your cluster.
{{< /note >}}

When creating an Indexed Job, you can define when a Job can be declared as succeeded using a `.spec.successPolicy`,
based on the pods that succeeded.

By default, a Job succeeds when the number of succeeded Pods equals `.spec.completions`.
These are some situations where you might want additional control for declaring a Job succeeded:

* When running simulations with different parameters,
you might not need all the simulations to succeed for the overall Job to be successful.
* When following a leader-worker pattern, only the success of the leader determines the success or
failure of a Job. Examples of this are frameworks such as MPI and PyTorch.

You can configure a success policy, in the `.spec.successPolicy` field,
to meet the above use cases. This policy can handle Job successes based on the
succeeded pods. After the Job meet success policy, the job controller terminates the lingering Pods.
Contributor

nit (can be addressed later).

Suggested change
- succeeded pods. After the Job meet success policy, the job controller terminates the lingering Pods.
+ succeeded pods. After the Job meets the success policy, the job controller terminates the lingering Pods.

Member Author

Done.

A success policy is defined by rules. Each rule can take one of the following forms:

* When you specify the `succeededIndexes` only,
once all indexes specified in the `succeededIndexes` succeeded, the Job is marked as succeeded.
The `succeededIndexes` must be a list of intervals between 0 and `.spec.completions-1`.
* When you specify the `succeededCount` only,
once the number of succeeded indexes reaches the `succeededCount`, the Job is marked as succeeded.
* When you specify both `succeededIndexes` and `succeededCount`,
once the number of succeeded indexes specified in the `succeededIndexes` reaches the `succeededCount`,
the Job is marked as succeeded.

Note that when you specify multiple rules in the `.spec.successPolicy.rules`,
the rules are evaluated in order. Once the Job meets a rule, the remaining rules are ignored.
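
The three rule forms and the in-order evaluation described above can be modelled in a few lines of Python. This is only an illustrative sketch, not the job controller's implementation: `Rule`, `rule_met`, and `first_matching_rule` are hypothetical names, the indexes are assumed to be already expanded into a set, and treating a rule with neither field as never matching is an assumption.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Rule:
    # Hypothetical in-memory form of one entry in .spec.successPolicy.rules.
    succeeded_indexes: Optional[set[int]] = None  # assumed already expanded
    succeeded_count: Optional[int] = None


def rule_met(rule: Rule, succeeded: set[int]) -> bool:
    """Return True if the set of succeeded completion indexes satisfies the rule."""
    if rule.succeeded_indexes is not None and rule.succeeded_count is not None:
        # Both fields: count only successes among the listed indexes.
        return len(succeeded & rule.succeeded_indexes) >= rule.succeeded_count
    if rule.succeeded_indexes is not None:
        # Indexes only: every listed index must have succeeded.
        return rule.succeeded_indexes <= succeeded
    if rule.succeeded_count is not None:
        # Count only: any indexes count toward the total.
        return len(succeeded) >= rule.succeeded_count
    return False  # Assumption: a rule with neither field never matches.


def first_matching_rule(rules: list[Rule], succeeded: set[int]) -> Optional[int]:
    """Evaluate rules in order and return the position of the first one met."""
    for position, rule in enumerate(rules):
        if rule_met(rule, succeeded):
            return position  # Remaining rules are ignored.
    return None
```

For instance, with `Rule(succeeded_indexes={0, 2, 3}, succeeded_count=1)` as the only rule, `first_matching_rule` returns `0` as soon as index 2 has succeeded.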

Here is a manifest for a Job with `successPolicy`:

{{% code_sample file="/controllers/job-success-policy-example.yaml" %}}

In the example above, the rule of the success policy specifies that
the Job should be marked succeeded and terminate the lingering Pods
if one of the 0, 2, and 3 indexes succeeded.
Contributor

nit: We might want to rephrase this question for better understanding. Ideally, we should be saying something like

"In the example above, both succeededIndexes and succeededCount have been specified. Therefore, the job controller will mark the Job as succeeded and terminate the lingering Pods when either of the specified indexes, 0, 2, or 3, succeed."

P.S. We can do this in a follow-up PR.

Member Author

Thank you for this great suggestion!

Member Author

Done.


Note that `succeededIndexes` is written as a list of intervals separated by commas.
Each interval is either a single index or a range given by the first and last element of the series, separated by a hyphen (for example, `2-3`).
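
To make the notation concrete, here is a minimal Python sketch that expands such a string into individual indexes. `expand_indexes` is a hypothetical helper written for illustration, not part of Kubernetes or any client library. The bounds check follows the constraint stated earlier that indexes lie between 0 and `.spec.completions-1`.

```python
def expand_indexes(spec: str, completions: int) -> list[int]:
    """Expand an interval string such as "0,2-3" into a list of indexes."""
    indexes: list[int] = []
    for part in spec.split(","):
        first, sep, last = part.strip().partition("-")
        start = int(first)
        end = int(last) if sep else start  # a single index is a one-element interval
        if start < 0 or start > end or end >= completions:
            raise ValueError(f"interval {part!r} is outside 0..{completions - 1}")
        indexes.extend(range(start, end + 1))
    return indexes


print(expand_indexes("0,2-3", 10))  # [0, 2, 3]
```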

{{< note >}}
When you specify both a success policy and some terminating policies such as `.spec.backoffLimit` and `.spec.podFailurePolicy`,
once the Job meets either policy, the job controller respects the terminating policy and ignores the success policy.
{{< /note >}}

## Job termination and cleanup

When a Job completes, no more Pods are created, but the Pods are [usually](#pod-backoff-failure-policy) not deleted either.
@@ -0,0 +1,14 @@
---
title: JobSuccessPolicy
content_type: feature_gate

_build:
  list: never
  render: false

stages:
  - stage: alpha
    defaultValue: false
    fromVersion: "1.30"
---
Allow users to specify when a Job can be declared as succeeded based on the set of succeeded pods.
25 changes: 25 additions & 0 deletions content/en/examples/controllers/job-success-policy-example.yaml
Contributor

(no need to put example in the filename; the path already makes it clear)

Member Author

It makes sense.

@@ -0,0 +1,25 @@
apiVersion: batch/v1
kind: Job
spec:
  parallelism: 10
  completions: 10
  completionMode: Indexed # Required for the feature
Contributor

Suggested change
- completionMode: Indexed # Required for the feature
+ completionMode: Indexed # Required for the success policy

(because if we write it like this, the example works even after graduation to GA)

Member Author

That makes sense.
Thank you!

  successPolicy:
    rules:
    - succeededIndexes: 0,2-3
      succeededCount: 1
  template:
    spec:
      containers:
      - name: main
        image: python
        command:          # The jobs succeed as there is one succeeded index
                          #   among indexes 0, 2, and 3.
          - python3
          - -c
          - |
            import os, sys
            if os.environ.get("JOB_COMPLETION_INDEX") == "2":
              sys.exit(0)
            else:
              sys.exit(1)