KEP-4443: More granular Job failure reason added by PodFailurePolicy #4479

danielvegamyhre · 2024-02-05T21:43:50Z

One-line PR description: Configurable Job failure reason for PodFailurePolicy

Issue link: Add more granular failure reason for Job PodFailurePolicy #4443

Other comments:

danielvegamyhre · 2024-02-05T21:52:58Z

@soltysh for sig-apps lead review
tagging @ahg-g @alculquicondor @kannon92 for review as well

danielvegamyhre · 2024-02-05T21:55:57Z

@deads2k would you be able to do an API review for this? Here we are proposing adding an optional Reason field to the PodFailurePolicyRule for the Job API, similar to the Reason field I proposed here in the Job success policy KEP.

ahg-g · 2024-02-05T22:08:04Z

keps/sig-apps/4443-configurable-pod-failure-policy-reasons/README.md

+
+When a Job fails due to a pod failing with exit code 3, I want my job management software to to restart the Job.
+
+**Example JobSet with a Pod Failure Policy configuration for this use case**:


The simplest example is to allow a JobSet user to decide whether or not to fail it based on the exact exit code.

Add the above in the doc.

Also highlight how the Reason is used both in the .spec.failurePolicy and spec.replicatedJobs.template.spec.podFailurePolicy

Added a user story for the simplest use case, and updated the example specs to highlight that the reason fields are set to matching values in the JobSet failure policy and the Job pod failure policy rules.

ahg-g · 2024-02-05T22:11:50Z

keps/sig-apps/4443-configurable-pod-failure-policy-reasons/README.md

+
+When a `PodFailurePolicyRule` matches a pod failure and the `Action` is `FailJob`, the Job
+controller will add the reason defined in the `Reason` field to the JobFailed [condition](https://sourcegraph.com/github.com/kubernetes/kubernetes@6a4e93e776a35d14a61244185c848c3b5832621c/-/blob/pkg/controller/job/job_controller.go?L816) added
+to the Job.


we need to validate that reason follows the expected pattern as discussed in https://github.com/kubernetes/kubernetes/blob/dd301d0f23a63acc2501a13049c74b38d7ebc04d/staging/src/k8s.io/apimachinery/pkg/apis/meta/v1/types.go#L1555

We should also validate that the reason doesn't match existing reasons such as BackoffLimitExceeded.

Added a "validation" section which includes both of these validation steps.
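The two validation steps requested above can be sketched roughly as follows. This is a minimal illustration, not the KEP's implementation: `validateReason` and `reservedReasons` are hypothetical names, the reserved set is illustrative rather than exhaustive, and the pattern and length limit are assumed to match what the linked `types.go` documents for a condition reason.

```go
package main

import (
	"fmt"
	"regexp"
)

// reasonPattern is assumed to mirror the pattern documented for
// metav1.Condition.Reason; verify it against the linked types.go.
var reasonPattern = regexp.MustCompile(`^[A-Za-z]([A-Za-z0-9_,:]*[A-Za-z0-9_])?$`)

// maxReasonLength is the assumed upper bound for a condition reason.
const maxReasonLength = 1024

// reservedReasons is an illustrative, non-exhaustive set of reasons
// that the Job controller already sets on its own.
var reservedReasons = map[string]bool{
	"BackoffLimitExceeded": true,
	"DeadlineExceeded":     true,
	"PodFailurePolicy":     true,
}

// validateReason is a hypothetical helper, not part of Kubernetes.
// It rejects malformed reasons and reasons that collide with built-in ones.
func validateReason(reason string) error {
	if reason == "" {
		return nil // the field is optional, so empty is allowed
	}
	if len(reason) > maxReasonLength {
		return fmt.Errorf("reason is longer than %d characters", maxReasonLength)
	}
	if !reasonPattern.MatchString(reason) {
		return fmt.Errorf("reason %q does not match %s", reason, reasonPattern)
	}
	if reservedReasons[reason] {
		return fmt.Errorf("reason %q is reserved by the Job controller", reason)
	}
	return nil
}

func main() {
	for _, r := range []string{"ExitCode3", "BackoffLimitExceeded", "bad reason"} {
		fmt.Printf("%q -> %v\n", r, validateReason(r))
	}
}
```

Keeping the reserved set in one place would make it easier to extend if the controller gains further built-in reasons.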

mimowo · 2024-02-06T09:37:53Z

keps/sig-apps/4443-configurable-pod-failure-policy-reasons/README.md

+controller uses when a PodFailurePolicy triggers a Job failure.
+
+When a `PodFailurePolicyRule` matches a pod failure and the `Action` is `FailJob`, the Job
+controller will add the reason defined in the `Reason` field to the JobFailed [condition](https://sourcegraph.com/github.com/kubernetes/kubernetes@6a4e93e776a35d14a61244185c848c3b5832621c/-/blob/pkg/controller/job/job_controller.go?L816) added


No strong view, but I think it could be preferable to let the users only specify the suffix for the reason. This would still support the use cases, but having the PodFailurePolicy prefix might be handy when someone other than the user debugs the job.

Regardless, I would like to align the approach with #4062 (thread), because this is pretty much the same use case.

-1 to a suffix

This would complicate the usage of the feature. In the example in the story, the user would have to put different strings in the spec of the jobset and the job.

I documented "the entire reason" vs "the suffix of the reason" here: https://github.com/kubernetes/enhancements/blob/65416b0d03f666024779a495f253169659c42389/keps/sig-apps/3998-job-success-completion-policy/README.md#possibility-for-the-configurable-reason-for-the-successcriteriamet-condition

This may be helpful.

In #4062 it sounds like we aren't going to add a Reason field to the Success Policy, so we don't need to align the approach here anymore.

For our case, I do think "the entire reason" is preferable to "suffix of the reason" for the reason @alculquicondor mentioned: I can see the argument for having a consistent prefix, but I think a k8s-defined prefix being prepended to the user-defined reason in an opaque manner would cause bugs and result in a more confusing user experience.

As a user, I would intuitively expect to be able to "link" the JobSet FailurePolicyRule "Reason" and the child Job's PodFailurePolicyRule "Reason" by setting them to the same value. To me, it would be non-obvious and unexpected if the Reason on the JobFailed condition was something else besides what I had defined in my PodFailurePolicyRule.

I don't like the suffix either; it is easier for the user to simply use the exact same string in two places.

I see, I'm also not thrilled with the suffix idea, feel free to ignore. Just proposed as an attempt to give the users flexibility for machine-readable reasons, at the same time still conveying the information that the reason originates from podFailurePolicy.

alculquicondor · 2024-02-06T16:20:08Z

keps/sig-apps/4443-configurable-pod-failure-policy-reasons/README.md

+
+When a Job fails due to a pod failing with exit code 3, I want my job management software to to restart the Job.
+
+**Example JobSet with a Pod Failure Policy configuration for this use case**:


Add the above in the doc.

Also highlight how the Reason is used both in the .spec.failurePolicy and spec.replicatedJobs.template.spec.podFailurePolicy

alculquicondor · 2024-02-06T16:21:30Z

keps/sig-apps/4443-configurable-pod-failure-policy-reasons/README.md

+controller uses when a PodFailurePolicy triggers a Job failure.
+
+When a `PodFailurePolicyRule` matches a pod failure and the `Action` is `FailJob`, the Job
+controller will add the reason defined in the `Reason` field to the JobFailed [condition](https://sourcegraph.com/github.com/kubernetes/kubernetes@6a4e93e776a35d14a61244185c848c3b5832621c/-/blob/pkg/controller/job/job_controller.go?L816) added


-1 to a suffix

This would complicate the usage of the feature. In the example in the story, the user would have to put different strings in the spec of the jobset and the job.

alculquicondor · 2024-02-06T16:22:47Z

keps/sig-apps/4443-configurable-pod-failure-policy-reasons/README.md

+
+When a `PodFailurePolicyRule` matches a pod failure and the `Action` is `FailJob`, the Job
+controller will add the reason defined in the `Reason` field to the JobFailed [condition](https://sourcegraph.com/github.com/kubernetes/kubernetes@6a4e93e776a35d14a61244185c848c3b5832621c/-/blob/pkg/controller/job/job_controller.go?L816) added
+to the Job.


We should also validate that the reason doesn't match existing reasons such as BackoffLimitExceeded.

alculquicondor · 2024-02-06T16:24:21Z

keps/sig-apps/4443-configurable-pod-failure-policy-reasons/README.md

+
+## Summary
+
+This KEP proposes to extend the Job API by adding an optional `Reason` field to `PodFailurePolicyRule`, which if specified, would be included as the reason in the `JobFailed` condition upon Job failure triggered by a `PodFailurePolicy`.


alternative names:

ConditionReason

SetConditionReason (clarifies that this is about output, not about matching the Pod condition).

I like SetConditionReason - it is more explicit, and as you mentioned, clarifies that this is about defining the output, not about matching the Pod condition. Updated the doc.

keps/sig-apps/4443-configurable-pod-failure-policy-reasons/README.md

kannon92

You need a prod-readiness file for alpha I believe.
keps/sig-apps/prod-readiness/4443.yaml

keps/sig-apps/4443-configurable-pod-failure-policy-reasons/kep.yaml

danielvegamyhre · 2024-02-06T18:48:18Z

@wojtek-t would you mind doing a PRR review for this KEP?

keps/sig-apps/4443-configurable-pod-failure-policy-reasons/kep.yaml

atiratree · 2024-02-08T01:58:09Z

I ran into a similar problem while working on the Declarative Node Maintenance KEP. So far I am using a similar approach as described in this PR, but I would like to find an alternative as it is not really an elegant solution.

The real problem comes when you have multiple controllers setting the same condition and multiple clients observing it. Then there is no real consensus on what the API/value of reason really is and how it should be processed programmatically.

When you only have a user defined value as the reason, how should other actors/higher abstraction tools react to it? What is the contract?
If there is a prefix/separator, what should the prefix or separator be? And how does this get communicated to all the actors?
User defined values can also bring chaos into this. What if they encode some value in the reason for another party to process? This adds additional complexity to reasoning about the reason.
Although unlikely, we are opening a possibility of external controllers managing jobs with Job API managed-by mechanism #4368 which could also fragment the reason format.

We are constrained by the Condition API and I think the best solution would be to introduce an additional field (string/slice/map?). This would allow us to encode additional metadata about the condition and ensure that the controllers and clients can easily communicate their intentions.

In my scenario I would like to have the following condition Type="EvacuationRequest", Reason="NodeMaintenance", Message="Upgrade to 1.29" and an actuator (or a list of actuators) that have triggered this condition.

What about a new field in the Job status that is dedicated to the user-specified string, instead of putting it in the Condition?

I think the metadata ultimately should be in the condition. A new field would add another layer of indirection. And it is not really scalable (different conditions, multiple reasons). A safer way would be to put that information into annotations.

I understand that enhancing the Condition API might be a big change and problematic/cause issues. But I would like to hear sig-api-machinery opinion on this (@deads2k).

ahg-g · 2024-02-08T02:39:46Z

Thanks @atiratree, Job has its own Condition API, it is not shared with Pod.

When you only have a user defined value as the reason, how should other actors/higher abstraction tools react to it? What is the contract?

The contract is defined in the PodFailurePolicy spec.

User defined values can also bring chaos into this. What if they encode some value in the reason for another party to process? This adds additional complexity to reasoning about the reason.

Can you be more concrete? what would that complexity be in the case of Job and failure reasons?

We are constrained by the Condition API and I think the best solution would be to introduce an additional field (string/slice/map?). This would allow us to encode additional metadata about the condition and ensure that the controllers and clients can easily communicate their intentions.

A string is what is being proposed already, but the other option is to have a "DetailedReason" map that in our case we use to set an entry with the key "UserDefinedFailureReason".

atiratree · 2024-02-08T13:01:04Z

Job has its own Condition API, it is not shared with Pod.

It is not, but we try to adhere to the same standard/conventions: https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/api-conventions.md#typical-status-properties

We are constrained by the Condition API and I think the best solution would be to introduce an additional field (string/slice/map?). This would allow us to encode additional metadata about the condition and ensure that the controllers and clients can easily communicate their intentions.

A string is what is being proposed already, but the other option is to have a "DetailedReason" map that in our case we use to set an entry with the key "UserDefinedFailureReason".

I meant it would be best to add it to the JobCondition, but this ties into the point above ^

The contract is defined in the PodFailurePolicy spec.

Yes, it is still a contract, but a looser one. As an entity observing just the conditions, you cannot predict what values will be there and how to react to them.

Can you be more concrete? what would that complexity be in the case of Job and failure reasons?

I cannot predict what kind of values users will encode there. A client has to know how to react to PodFailurePolicy, PodFailurePolicyBecauseXFailed, XFailed, XFailedButDoYInstead, XFailedButDoZInstead. One can imagine even longer examples..

^ What if there is a mistake? You can't easily validate this because it's looser. So far we have offered it as a tighter programmatic API.

soltysh · 2024-02-08T13:38:18Z

I think there are a lot of ideas floating around and disagreements, which I'd like to clarify first before pushing this topic forward. I understand that the sooner it gets out the better, but I'd prefer we discuss this more in depth maybe next Thursday (2/15) during wg-batch call, and work out the solution. Just to summarize, currently we have four potential solutions:

1. human provided Reason
2. machine provided Reason
3. new status field
4. additional condition with human provided information

ahg-g · 2024-02-08T14:48:29Z

I cannot predict what kind of values users will encode there. A client has to know how to react to PodFailurePolicy, PodFailurePolicyBecauseXFailed, XFailed, XFailedButDoYInstead, XFailedButDoZInstead. One can imagine even longer examples..

^ What if there is a mistake? You can't easily validate this because it's looser. So far we have offered it as a tighter programmatic API.

Mistakes are also possible with labels and selectors, but the general point is that in this case the user sets the failure policy (via pod failure policy) and so by definition the failure reason will not be predictable.

But if the argument is that reason has to be assigned from a previously-defined and finite set of values (because it must be predictable without needing to look at the podFailurePolicy spec), then I agree we should just focus our discussion on adding a new field because we can't satisfy the predictability requirement.

, but I'd prefer we discuss this more in depth maybe next Thursday (2/15) during wg-batch call

I am not sure we can get to a conclusion at wg-batch since this is also an API question and so we need an api approver to be engaged in the discussion.

human provided Reason

What this option is really about is giving a name or identifier to the podFailurePolicy rule, but it wouldn't satisfy the predictability "requirement".

machine provided Reason

To me the main issue with the machine provided reason is how to encode a set or range of exit codes. In the above option we are practically giving the rule an identifier, so if we want the machine to identify the rule in the reason automatically, the only option I see is providing the rule number in the list (like PodFailurePolicy-Rule0 or PodFailurePolicy-Rule1 to indicate the first and the second rule respectively).

@soltysh since you prefer this option, how can we address @atiratree's concern about the reason being predictable?

new status field

This seems the closest to addressing all the concerns I heard above, since we are not changing the Reason semantics. So I am supportive of this option.

additional condition with human provided information

I assume you are referring to adding a new field to the JobCondition type. If so, this needs wider consensus across the project, since it needs to apply to all APIs as per [1], so we can't resolve this at the wg-batch meeting. I am wondering how we want to approach this option?

[1] https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/api-conventions.md#typical-status-properties

danielvegamyhre · 2024-04-02T17:49:21Z

@soltysh I'm getting back to work on this shortly, what is the deadline for this? We didn't make the 1.30 deadline but I don't see any information on the 1.31 deadlines

tenzen-y · 2024-04-02T17:53:10Z

@soltysh I'm getting back to work on this shortly, what is the deadline for this? We didn't make the 1.30 deadline but I don't see any information on the 1.31 deadlines

IIRC, we are still within the v1.30 cycle. So, after the v1.30 cycle, we should find the next timeline.

ahg-g · 2024-04-02T19:56:02Z

Let's work on this and get it merged early; we don't need to wait for the 1.31 timelines to be posted.

danielvegamyhre · 2024-04-09T00:45:55Z

@soltysh @ahg-g I revised the KEP based on our discussion in the WG Batch meeting.

Summary:

This KEP proposes to extend the Job API by adding an optional Name field to PodFailurePolicyRule. If unset, it would default to the index of the rule in the podFailurePolicy.rules slice.

When a pod failure policy rule triggers a Job failure, the rule name would be appended as a suffix to the JobFailed condition reason, in the format: PodFailurePolicy_{ruleName}.
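As a rough sketch of the revised behaviour described in this summary, the reason could be derived from the matched rule as shown below. The `rule` type and `failureReason` helper are hypothetical stand-ins, not the Job controller's code, and the delimiter is a parameter rather than a fixed choice (the summary above writes it as `_`).

```go
package main

import (
	"fmt"
	"strconv"
)

// reasonPrefix is the existing reason that the rule name gets appended to.
const reasonPrefix = "PodFailurePolicy"

// rule is a stand-in for PodFailurePolicyRule that carries only the
// proposed optional Name field.
type rule struct {
	Name string
}

// failureReason is a hypothetical helper. It returns the JobFailed
// condition reason for the rule at index matched, falling back to the
// rule's index when no name is set.
func failureReason(rules []rule, matched int, delimiter string) string {
	name := rules[matched].Name
	if name == "" {
		name = strconv.Itoa(matched)
	}
	return reasonPrefix + delimiter + name
}

func main() {
	rules := []rule{{Name: "ExitCode3"}, {}}
	fmt.Println(failureReason(rules, 0, "_")) // PodFailurePolicy_ExitCode3
	fmt.Println(failureReason(rules, 1, "_")) // PodFailurePolicy_1
}
```

A consumer such as JobSet would then compare the condition reason against the same prefix, delimiter, and rule name.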

kannon92 · 2024-04-10T11:35:44Z

/wg batch

kannon92 · 2024-04-10T11:37:40Z

keps/sig-apps/4443-configurable-pod-failure-policy-reasons/README.md

+
+## Proposal
+
+The proposal is to add an optional `Name` field to the `PodFailurePolicyRule`, allowing the user


The field is optional but it has a default?

I read on. I think this is fine. You may run into some API test issues, as optional fields tend not to like defaults.

It doesn't have defaults, when empty it's meant to use the rule index.

kannon92 · 2024-04-10T11:41:20Z

keps/sig-apps/4443-configurable-pod-failure-policy-reasons/README.md

+              image: python:3.10
+              command: ["..."]
+```
+


Can we get another story on PodFailurePolicy conditions? Exit code is one way for failures but we also allow for failing based on conditions. It would be worth stating how that would work also.

The pattern is the same, what would we learn from another story?

Mostly calling out that we want to support both and making sure we have integration tests for that also.

I guess that it comes from name so it is probably not necessary as a user story but we should make sure to have integration/unit tests for conditions as well as exitcodes.

kannon92 · 2024-04-10T11:52:39Z

keps/sig-apps/4443-configurable-pod-failure-policy-reasons/README.md

+which already had the `JobFailed` condition reason set by the new `SetConditionReason` field, that the
+Job controller does not overwrite the reason to `PodFailurePolicy`, and that it remains set to the
+user-defined `SetConditionReason`.
+


Maybe we should cover the case of PodFailurePolicy feature gate also? @mimowo when are we planning to GA PodFailurePolicy?

Yes please!

kannon92 · 2024-04-10T11:53:16Z

keps/sig-apps/4443-configurable-pod-failure-policy-reasons/README.md

+<!--
+This section must be completed when targeting alpha to a release.
+-->
+- Upgrade to k8s version 1.30+


Suggested change

- Upgrade to k8s version 1.30+

- Upgrade to k8s version 1.31+

kannon92 · 2024-04-10T11:54:28Z

keps/sig-apps/4443-configurable-pod-failure-policy-reasons/README.md

+  - Estimated amount of new objects: (e.g., new Object X for every existing Pod)
+-->
+If the optional `name` field is specified, the podFailurePolicy object size will increase by 1 byte per
+character in the `name` string.


You put an upper limit on the name also, so I'd mention what the size can be at most.

Yes, we usually put here max size.

kannon92 · 2024-04-10T11:56:12Z

keps/sig-apps/4443-configurable-pod-failure-policy-reasons/README.md

+to now being configurable by the user. However, as described in the validation section below,
+we are validating against malformed/invalid inputs.
+
+


PodFailurePolicy is a beta feature so there could be a risk that this feature is blocked by PodFailurePolicy graduation.

It's a minor risk, but any kind of coupling with other KEPs should be called out.

This is covered below in the business logic section.

kannon92

Looks very nice. Please make sure to fill out the rest of the PRR if you can. I have some minor additions/comments but otherwise I think it's ready for review.

ahg-g

/lgtm

@soltysh I think this now reflects what we discussed in the community meeting.

ahg-g · 2024-04-12T05:42:57Z

keps/sig-apps/4443-configurable-pod-failure-policy-reasons/README.md

+When a `PodFailurePolicyRule` matches a pod failure and the `Action` is `FailJob`, the Job
+controller will append the name of the pod failure policy rule which triggered the failure
+to the JobFailed [condition](https://github.com/kubernetes/kubernetes/blob/6a4e93e776a35d14a61244185c848c3b5832621c/pkg/controller/job/job_controller.go#L816)
+reason. The exact format of the JobFailed condition reason will be `PodFailurePolicy-{ruleName}`.


Is the delimiter underscore or dash?

This argument holds, we should make it explicit to use one or the other across entire document.

ahg-g · 2024-04-12T05:44:06Z

keps/sig-apps/4443-configurable-pod-failure-policy-reasons/README.md

+              command: ["..."]
+```
+
+#### Story 2


I don't think we need another story from JobSet using exit codes, one is enough. Those detailed stories will be listed in the JobSet KEP for failurePolicy

This argument holds, either drop another JobSet story, or try to come up with a different story, which will cover more use cases.

ahg-g · 2024-04-12T05:46:43Z

keps/sig-apps/4443-configurable-pod-failure-policy-reasons/README.md

+              image: python:3.10
+              command: ["..."]
+```
+


The pattern is the same, what would we learn from another story?

soltysh · 2024-04-26T15:38:41Z

keps/sig-apps/4443-configurable-pod-failure-policy-reasons/kep.yaml

+
+see-also:
+  - "https://github.com/kubernetes-sigs/jobset/pull/381"
+  - "https://github.com/kubernetes/enhancements/pull/3374"


Any particular reason to use the PR number rather than https://github.com/kubernetes/enhancements/blob/master/keps/sig-apps/3329-retriable-and-non-retriable-failures/README.md ?

soltysh · 2024-04-26T15:39:43Z

keps/sig-apps/4443-configurable-pod-failure-policy-reasons/README.md

+- [ ] (R) Production readiness review approved
+- [X] "Implementation History" section is up-to-date for milestone
+- [ ] User-facing documentation has been created in [kubernetes/website], for publication to [kubernetes.io]
+- [ ] Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes


Make sure to check appropriate boxes.

soltysh · 2024-04-26T15:47:46Z

keps/sig-apps/4443-configurable-pod-failure-policy-reasons/README.md

+the rule in the `podFailurePolicy.rules` slice. 
+
+When a pod failure policy rule triggers a Job failure, the rule name would be appended as a suffix to the `JobFailed` condition reason, in the
+format: `PodFailurePolicy_{ruleName}`.


Can we make the summary a more descriptive one? Something like:

Expose more detailed pod failure information inside the JobFailed condition of a Job. This will be achieved using an optional Name field in PodFailurePolicyRule, which if unset, would default to the index of the podFailurePolicy.rules slice.

Ref: https://youtu.be/JZ9LQR_j0Rk?t=1482

soltysh · 2024-04-26T15:51:06Z

keps/sig-apps/4443-configurable-pod-failure-policy-reasons/README.md

+
+## Proposal
+
+The proposal is to add an optional `Name` field to the `PodFailurePolicyRule`, allowing the user


It doesn't have defaults, when empty it's meant to use the rule index.

soltysh · 2024-04-26T15:51:54Z

keps/sig-apps/4443-configurable-pod-failure-policy-reasons/README.md

+When a `PodFailurePolicyRule` matches a pod failure and the `Action` is `FailJob`, the Job
+controller will append the name of the pod failure policy rule which triggered the failure
+to the JobFailed [condition](https://github.com/kubernetes/kubernetes/blob/6a4e93e776a35d14a61244185c848c3b5832621c/pkg/controller/job/job_controller.go#L816)
+reason. The exact format of the JobFailed condition reason will be `PodFailurePolicy-{ruleName}`.


This argument holds, we should make it explicit to use one or the other across entire document.

soltysh · 2024-04-26T16:04:17Z

keps/sig-apps/4443-configurable-pod-failure-policy-reasons/README.md

+You can take a look at one potential example of such test in:
+https://github.com/kubernetes/kubernetes/pull/97058/files#diff-7826f7adbc1996a05ab52e3f5f02429e94b68ce6bce0dc534d1be636154fded3R246-R282
+-->
+We can add unit tests for:


This holds.

soltysh · 2024-04-26T16:04:31Z

keps/sig-apps/4443-configurable-pod-failure-policy-reasons/README.md

+- feature enabled and field set
+- feature disabled and field set
+
+### Rollout, Upgrade and Rollback Planning


Yes please!

soltysh · 2024-04-26T16:13:30Z

keps/sig-apps/4443-configurable-pod-failure-policy-reasons/README.md

+
+###### What specific metrics should inform a rollback?
+
+<!--


I'd suggest filling this one out, piggybacking on the metrics introduced in https://github.com/kubernetes/enhancements/blob/master/keps/sig-apps/3329-retriable-and-non-retriable-failures/README.md

soltysh · 2024-04-26T16:14:11Z

keps/sig-apps/4443-configurable-pod-failure-policy-reasons/README.md

+Pick one more of these and delete the rest.
+-->
+
+- [ ] Metrics


Same here, about re-using metrics from the other KEP.

soltysh · 2024-04-26T16:14:41Z

keps/sig-apps/4443-configurable-pod-failure-policy-reasons/README.md

+  - Estimated amount of new objects: (e.g., new Object X for every existing Pod)
+-->
+If the optional `name` field is specified, the podFailurePolicy object size will increase by 1 byte per
+character in the `name` string.


Yes, we usually put here max size.

atiratree · 2024-05-23T20:28:17Z

keps/sig-apps/4443-configurable-pod-failure-policy-reasons/README.md

+There is a risk to making a field that was previously exclusively managed by the controller,
+to now being configurable by the user. However, as described in the validation section below,
+we are validating against malformed/invalid inputs.
+


The users should be always aware of the PodFailurePolicy evaluation order. If two pods terminate at the same time with exit codes 2 and 3, only one of them will be picked up by a rule and exposed in the condition reason.

It might be surprising/racy if an external actor is not aware of this fact and only consumes the condition to trigger additional behaviour.
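The ordering concern can be illustrated with a toy model. This is a deliberately simplified sketch and not how the Job controller is implemented: it assumes rules are checked in list order and that the first failed pod to match any rule decides the reason. All names are hypothetical.

```go
package main

import "fmt"

// rule is a simplified stand-in for a pod failure policy rule that
// matches on exit codes only.
type rule struct {
	Name      string
	ExitCodes []int
}

// firstMatch returns the name of the first rule, in list order, that
// matches the given exit code.
func firstMatch(rules []rule, exitCode int) (string, bool) {
	for _, r := range rules {
		for _, c := range r.ExitCodes {
			if c == exitCode {
				return r.Name, true
			}
		}
	}
	return "", false
}

// reasonFor walks failed pods in the order they were observed and stops
// at the first one that matches a rule, so later failures are never
// reflected in the result.
func reasonFor(rules []rule, failedExitCodes []int) (string, bool) {
	for _, code := range failedExitCodes {
		if name, ok := firstMatch(rules, code); ok {
			return name, true
		}
	}
	return "", false
}

func main() {
	rules := []rule{{"ExitCode2", []int{2}}, {"ExitCode3", []int{3}}}
	a, _ := reasonFor(rules, []int{2, 3})
	b, _ := reasonFor(rules, []int{3, 2})
	fmt.Println(a, b) // ExitCode2 ExitCode3
}
```

Under these assumptions the same pair of failures yields a different reason depending only on observation order, which is the surprise an external consumer of the condition would need to account for.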

atiratree · 2024-05-23T20:49:28Z

keps/sig-apps/4443-configurable-pod-failure-policy-reasons/README.md

+## Summary
+
+This KEP proposes to extend the Job API by adding an optional `Name` field to `PodFailurePolicyRule`. If unset, it would default to the index of
+the rule in the `podFailurePolicy.rules` slice. 


Do we need the defaulting and the index?

I think it might be more convenient to consume the PodFailurePolicy reason than PodFailurePolicy-{index} by default. A user might not be interested in the index and this would complicate the parsing/consuming of the condition.

If the user has a need, then:

- Name field can be added to the rule
- the fail job message already mentions the index among other data

As a bonus, this would make it compatible with today's PodFailurePolicy implementation.

Also, it might be confusing if a user creates a rule with name: "7" at the first index.

atiratree · 2024-05-23T21:00:48Z

keps/sig-apps/4443-configurable-pod-failure-policy-reasons/README.md

+#### Defaulting
+If unset, the `Name` field will default to the index of the pod failure policy rule in the `Rules` slice.
+
+#### Validation


We should also check for duplicate names.

k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Feb 5, 2024

k8s-ci-robot requested review from kow3ns and soltysh February 5, 2024 21:43

k8s-ci-robot added kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory sig/apps Categorizes an issue or PR as relevant to SIG Apps. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Feb 5, 2024

danielvegamyhre force-pushed the kep-4443 branch from 56d4b81 to 0e39c11 Compare February 5, 2024 21:48

initial commit of kep 4443

02bf4dc

danielvegamyhre force-pushed the kep-4443 branch from 0e39c11 to 02bf4dc Compare February 5, 2024 22:00

danielvegamyhre mentioned this pull request Feb 5, 2024

Publish finer-grained failure reason for podFailurePolicy kubernetes/kubernetes#122972

Open

ahg-g reviewed Feb 5, 2024

mimowo mentioned this pull request Feb 6, 2024

KEP-3998: Job success/completion policy #4062

Merged

mimowo reviewed Feb 6, 2024

alculquicondor reviewed Feb 6, 2024

kannon92 reviewed Feb 6, 2024

alculquicondor reviewed Feb 6, 2024

address comments, production readiness file

9671531

danielvegamyhre force-pushed the kep-4443 branch 3 times, most recently from de03300 to ac89cc7 Compare February 6, 2024 18:39

add another user story for the simplest use case

5cdd49e

danielvegamyhre force-pushed the kep-4443 branch from ac89cc7 to 5cdd49e Compare February 6, 2024 18:41

k8s-ci-robot added size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. and removed size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Feb 6, 2024

alculquicondor reviewed Feb 6, 2024

danielvegamyhre added 2 commits February 6, 2024 19:10

update approvers

277e58f

fix creation date

ba06c5f

salehsedghpour mentioned this pull request Feb 8, 2024

Add more granular failure reason for Job PodFailurePolicy #4443

Open

4 tasks

ahg-g mentioned this pull request Feb 9, 2024

☂️ Requirements for v0.4.0 release kubernetes-sigs/jobset#350

Closed

7 tasks

atiratree mentioned this pull request Feb 9, 2024

KEP-4212: Declarative Node Maintenance #4213

Open

6 tasks

revise kep based on wg batch discussion

aed6ded

danielvegamyhre force-pushed the kep-4443 branch from 2ca02d9 to aed6ded Compare April 9, 2024 00:53

k8s-ci-robot added the wg/batch Categorizes an issue or PR as relevant to WG Batch. label Apr 10, 2024

kannon92 reviewed Apr 10, 2024

ahg-g reviewed Apr 12, 2024

k8s-ci-robot assigned ahg-g Apr 12, 2024

k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Apr 12, 2024

soltysh reviewed Apr 26, 2024

danielvegamyhre mentioned this pull request Apr 26, 2024

Implement configurable failure policy. kubernetes-sigs/jobset#537

Merged

atiratree reviewed May 23, 2024

danielvegamyhre changed the title ~~KEP-4443: Configurable Job failure reason for PodFailurePolicyRule~~ KEP-4443: More granular Job failure reason added by PodFailurePolicy May 24, 2024

