
KEP-4443: More granular Job failure reason added by PodFailurePolicy #4479

Open. Wants to merge 13 commits into base: master.

Conversation

@danielvegamyhre (Member):

  • One-line PR description: Configurable Job failure reason for PodFailurePolicy
  • Other comments:

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Feb 5, 2024
@k8s-ci-robot k8s-ci-robot added kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory sig/apps Categorizes an issue or PR as relevant to SIG Apps. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Feb 5, 2024
@danielvegamyhre (Member Author):

@soltysh for sig-apps lead review
tagging @ahg-g @alculquicondor @kannon92 for review as well

@danielvegamyhre (Member Author):

@deads2k would you be able to do an API review for this? Here we are proposing adding an optional Reason field to the PodFailurePolicyRule for the Job API, similar to the Reason field I proposed here in the Job success policy KEP.


When a Job fails due to a pod failing with exit code 3, I want my job management software to restart the Job.

**Example JobSet with a Pod Failure Policy configuration for this use case**:
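As a rough illustration only, a minimal sketch of such a configuration, assuming the `reason` field proposed in this KEP plus a JobSet-side failure policy along the lines of kubernetes-sigs/jobset#381 (the JobSet-side rule fields are not settled API):

```yaml
apiVersion: jobset.x-k8s.io/v1alpha2
kind: JobSet
metadata:
  name: training-run
spec:
  failurePolicy:
    rules:
    - action: RestartJobSet          # hypothetical JobSet-side rule (jobset#381)
      onJobFailureReasons:
      - CudaError                    # matches the user-defined reason below
  replicatedJobs:
  - name: workers
    replicas: 1
    template:
      spec:
        podFailurePolicy:
          rules:
          - action: FailJob
            reason: CudaError        # proposed optional field in this KEP
            onExitCodes:
              containerName: main
              operator: In
              values: [3]
        template:
          spec:
            restartPolicy: Never
            containers:
            - name: main
              image: python:3.10
              command: ["..."]
```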
Member:

The simplest example is to allow a JobSet user to decide whether or not to fail it based on the exact exit code.

Member:

Add the above in the doc.

Also highlight how the Reason is used in both `.spec.failurePolicy` and `.spec.replicatedJobs.template.spec.podFailurePolicy`.

Member Author:

Added a user story for the simplest use case, and updated the example specs to highlight that the reason fields are set to matching values in the JobSet failure policy and the Job pod failure policy rules.


When a `PodFailurePolicyRule` matches a pod failure and the `Action` is `FailJob`, the Job
controller will add the reason defined in the `Reason` field to the JobFailed [condition](https://sourcegraph.com/github.com/kubernetes/kubernetes@6a4e93e776a35d14a61244185c848c3b5832621c/-/blob/pkg/controller/job/job_controller.go?L816) added
to the Job.
Member:

We should also validate that the reason doesn't match existing reasons such as BackoffLimitExceeded.

Member Author:

Added a "validation" section which includes both of these validation steps.

controller uses when a PodFailurePolicy triggers a Job failure.

When a `PodFailurePolicyRule` matches a pod failure and the `Action` is `FailJob`, the Job
controller will add the reason defined in the `Reason` field to the JobFailed [condition](https://sourcegraph.com/github.com/kubernetes/kubernetes@6a4e93e776a35d14a61244185c848c3b5832621c/-/blob/pkg/controller/job/job_controller.go?L816) added
@mimowo (Contributor, Feb 6, 2024):

No strong view, but I think it could be preferable to let the users specify only the suffix of the reason. This would still support the use cases, but having the PodFailurePolicy prefix might be handy when someone other than the user is debugging the job.

Regardless, I would like to align the approach with #4062 (thread), because this is pretty much the same use case.

Member:

-1 to a suffix

This would complicate the usage of the feature. In the example in the story, the user would have to put different strings in the spec of the jobset and the job.


Member Author:

In #4062 it sounds like we aren't going to add a Reason field to the Success Policy, so we don't need to align the approach here anymore.

For our case, I do think "the entire reason" is preferable to "suffix of the reason" for the reason @alculquicondor mentioned: I can see the argument for having a consistent prefix, but I think a k8s-defined prefix being prepended to the user-defined reason in an opaque manner would cause bugs and result in a more confusing user experience.

As a user, I would intuitively expect to be able to "link" the JobSet FailurePolicyRule "Reason" and the child Job's PodFailurePolicyRule "Reason" by setting them to the same value. To me, it would be non-obvious and unexpected if the Reason on the JobFailed condition was something else besides what I had defined in my PodFailurePolicyRule.

Member:

I don't like the suffix either; it is easier for the user to simply use the exact same string in two places.

Contributor:

I see; I'm also not thrilled with the suffix idea, so feel free to ignore it. I proposed it as an attempt to give users flexibility for machine-readable reasons while still conveying that the reason originates from the podFailurePolicy.




## Summary

This KEP proposes to extend the Job API by adding an optional `Reason` field to `PodFailurePolicyRule`, which if specified, would be included as the reason in the `JobFailed` condition upon Job failure triggered by a `PodFailurePolicy`.
Member:

alternative names:

  • ConditionReason
  • SetConditionReason (clarifies that this is about output, not about matching the Pod condition).

Member Author:

I like SetConditionReason - it is more explicit, and as you mentioned, clarifies that this is about defining the output, not about matching the Pod condition. Updated the doc.

@kannon92 (Contributor) left a comment:

You need a prod-readiness file for alpha I believe.
keps/sig-apps/prod-readiness/4443.yaml

@k8s-ci-robot k8s-ci-robot added size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. and removed size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Feb 6, 2024
@danielvegamyhre (Member Author):

@wojtek-t would you mind doing a PRR review for this KEP?

@atiratree (Member) commented Feb 8, 2024:

I ran into a similar problem while working on the Declarative Node Maintenance KEP. So far I am using a similar approach to the one described in this PR, but I would like to find an alternative, as it is not really an elegant solution.

The real problem comes when you have multiple controllers setting the same condition and multiple clients observing it. Then there is no real consensus on what the API/value of reason really is and how it should be processed programmatically.

  • When you only have a user defined value as the reason, how should other actors/higher abstraction tools react to it? What is the contract?
  • If there is a prefix/separator, what should the prefix or separator be? And how does this get communicated to all the actors?
  • User defined values can also bring chaos into this. What if they encode some value in the reason for another party to process? This adds additional complexity to reasoning about the reason.
  • Although unlikely, we are opening up the possibility of external controllers managing Jobs via the Job API managed-by mechanism (#4368), which could also fragment the reason format.

We are constrained by the Condition API and I think the best solution would be to introduce an additional field (string/slice/map?). This would allow us to encode additional metadata about the condition and ensure that the controllers and clients can easily communicate their intentions.

In my scenario I would like to have the following condition: Type="EvacuationRequest", Reason="NodeMaintenance", Message="Upgrade to 1.29", plus an actuator (or a list of actuators) that triggered this condition.

What about a new field in the Job status that is dedicated to the user-specified string, instead of putting it in the Condition?

I think the metadata ultimately should be in the condition. A new field would add another layer of indirection. And it is not really scalable (different conditions, multiple reasons). A safer way would be to put that information into annotations.

I understand that enhancing the Condition API might be a big change and could be problematic/cause issues. But I would like to hear sig-api-machinery's opinion on this (@deads2k).
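Rendered as a sketch, the condition described above might look like the following; the `actuators` field does not exist in the Condition API and only illustrates the kind of extension being discussed:

```yaml
conditions:
- type: EvacuationRequest
  status: "True"
  reason: NodeMaintenance
  message: Upgrade to 1.29
  actuators:                      # hypothetical additional field
  - node-maintenance-controller
```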

@ahg-g (Member) commented Feb 8, 2024:

Thanks @atiratree. Job has its own Condition API; it is not shared with Pod.

When you only have a user defined value as the reason, how should other actors/higher abstraction tools react to it? What is the contract?

The contract is defined in the PodFailurePolicy spec.

User defined values can also bring chaos into this. What if they encode some value in the reason for another party to process? This adds additional complexity to reasoning about the reason.

Can you be more concrete? What would that complexity be in the case of Job and failure reasons?

We are constrained by the Condition API and I think the best solution would be to introduce an additional field (string/slice/map?). This would allow us to encode additional metadata about the condition and ensure that the controllers and clients can easily communicate their intentions.

A string is what is being proposed already, but the other option is to have a "DetailedReason" map; in our case we would use it to set an entry with the key "UserDefinedFailureReason".
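A rough sketch of that alternative; `detailedReason` and the key name are just the strawman floated in this thread, not a settled API:

```yaml
conditions:
- type: Failed
  status: "True"
  reason: PodFailurePolicy                # unchanged, machine-predictable
  detailedReason:                         # hypothetical additional field
    UserDefinedFailureReason: CudaError   # user-defined entry
```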

@atiratree (Member) commented Feb 8, 2024:

Job has its own Condition API; it is not shared with Pod.

It is not, but we try to adhere to the same standard/conventions: https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/api-conventions.md#typical-status-properties

We are constrained by the Condition API and I think the best solution would be to introduce an additional field (string/slice/map?). This would allow us to encode additional metadata about the condition and ensure that the controllers and clients can easily communicate their intentions.

A string is what is being proposed already, but the other option is to have a "DetailedReason" map; in our case we would use it to set an entry with the key "UserDefinedFailureReason".

I meant it would be best to add it to the JobCondition, but this ties into the point above.

The contract is defined in the PodFailurePolicy spec.

Yes, it is still a contract, but a looser one. As an entity observing just the conditions, you cannot predict what values will be there and how to react to them.

Can you be more concrete? What would that complexity be in the case of Job and failure reasons?

I cannot predict what kind of values users will encode there. A client has to know how to react to PodFailurePolicy, PodFailurePolicyBecauseXFailed, XFailed, XFailedButDoYInstead, XFailedButDoZInstead. One can imagine even longer examples.

What if there is a mistake? You can't easily validate this because it's looser. So far we have offered this as a tighter programmatic API.

@soltysh (Contributor) commented Feb 8, 2024:

I think there are a lot of ideas and disagreements floating around, which I'd like to clarify before pushing this topic forward. I understand that the sooner it gets out the better, but I'd prefer we discuss this more in depth, maybe next Thursday (2/15) during the wg-batch call, and work out the solution. Just to summarize, we currently have four potential solutions:

  1. human provided Reason
  2. machine provided Reason
  3. new status field
  4. additional condition with human provided information

@ahg-g (Member) commented Feb 8, 2024:

I cannot predict what kind of values users will encode there. A client has to know how to react to PodFailurePolicy, PodFailurePolicyBecauseXFailed, XFailed, XFailedButDoYInstead, XFailedButDoZInstead. One can imagine even longer examples.

What if there is a mistake? You can't easily validate this because it's looser. So far we have offered this as a tighter programmatic API.

Mistakes are also possible with labels and selectors, but the general point is that in this case the user sets the failure policy (via pod failure policy) and so by definition the failure reason will not be predictable.

But if the argument is that reason has to be assigned from a previously-defined and finite set of values (because it must be predictable without needing to look at the podFailurePolicy spec), then I agree we should just focus our discussion on adding a new field because we can't satisfy the predictability requirement.

I'd prefer we discuss this more in depth, maybe next Thursday (2/15) during the wg-batch call.

I am not sure we can get to a conclusion at wg-batch since this is also an API question and so we need an api approver to be engaged in the discussion.

human provided Reason

What this option is really about is giving a name or identifier to the podFailurePolicy rule, but it wouldn't satisfy the predictability "requirement".

machine provided Reason

To me the main issue with the machine-provided reason is how to encode a set or range of exit codes. In the above option we are practically giving the rule an identifier, so if we want the machine to identify the rule in the reason automatically, the only option I see is providing the rule number in the list (like PodFailurePolicy-Rule0 or PodFailurePolicy-Rule1 to indicate the first and the second rule respectively).

@soltysh since you prefer this option, how can we address @atiratree's concern about the reason being predictable?

new status field

This seems the closest to addressing all the concerns I heard above, since we are not changing the Reason semantics. So I am supportive of this option.

additional condition with human provided information

I assume you are referring to adding a new field to the JobCondition type; if so, this needs wider consensus across the project, since it would need to apply to all APIs as per [1], so we can't resolve it at the wg-batch meeting. I am wondering how we want to approach this option?

[1] https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/api-conventions.md#typical-status-properties

@danielvegamyhre (Member Author):

@soltysh I'm getting back to work on this shortly; what is the deadline for this? We didn't make the 1.30 deadline, but I don't see any information on the 1.31 deadlines.

@tenzen-y (Member) commented Apr 2, 2024:

@soltysh I'm getting back to work on this shortly; what is the deadline for this? We didn't make the 1.30 deadline, but I don't see any information on the 1.31 deadlines.

IIRC, we are still within the v1.30 cycle, so we should find out the next timeline after the v1.30 cycle ends.

@ahg-g (Member) commented Apr 2, 2024:

Let's work on this and get it merged early; we don't need to wait for the 1.31 timelines to be posted.

@danielvegamyhre (Member Author):

@soltysh @ahg-g I revised the KEP based on our discussion in the WG Batch meeting.

Summary:

This KEP proposes to extend the Job API by adding an optional Name field to PodFailurePolicyRule. If unset, it would default to the index of the rule in the podFailurePolicy.rules slice.

When a pod failure policy rule triggers a Job failure, the rule name would be appended as a suffix to the JobFailed condition reason, in the format: PodFailurePolicy_{ruleName}.
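A sketch of the revised proposal; the rule names are arbitrary, and the resulting reasons follow the format described above:

```yaml
spec:
  podFailurePolicy:
    rules:
    - name: CudaError            # proposed optional field
      action: FailJob
      onExitCodes:
        containerName: main
        operator: In
        values: [3]
    - action: FailJob            # unnamed: the name defaults to the rule index
      onExitCodes:
        containerName: main
        operator: In
        values: [42]
# If the first rule triggers the failure:  reason: PodFailurePolicy_CudaError
# If the second rule triggers the failure: reason: PodFailurePolicy_1
```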

@kannon92 (Contributor):

/wg batch

@k8s-ci-robot k8s-ci-robot added the wg/batch Categorizes an issue or PR as relevant to WG Batch. label Apr 10, 2024

## Proposal

The proposal is to add an optional `Name` field to the `PodFailurePolicyRule`, allowing the user
Contributor:

The field is optional but it has a default?

Contributor:

I read on; I think this is fine. You may run into some API test issues, as optional fields tend not to play well with defaults.

Contributor:

It doesn't have a default; when empty, it's meant to use the rule index.

image: python:3.10
command: ["..."]
```

Contributor:

Can we get another story on PodFailurePolicy conditions? Exit codes are one way to match failures, but we also allow failing based on pod conditions. It would be worth stating how that would work as well.

Member:

The pattern is the same, what would we learn from another story?

Contributor:

Mostly calling out that we want to support both and making sure we have integration tests for that also.

Contributor:

I guess it follows from the name field, so it is probably not necessary as a user story, but we should make sure to have integration/unit tests for conditions as well as exit codes.
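For the conditions case, a hedged sketch of how a named rule matching a pod condition might look under this proposal (the rule name is arbitrary):

```yaml
podFailurePolicy:
  rules:
  - name: Disruption             # proposed optional field
    action: FailJob
    onPodConditions:
    - type: DisruptionTarget     # e.g. pods terminated by eviction or preemption
      status: "True"
# On a match, the JobFailed condition reason would be PodFailurePolicy_Disruption.
```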

which already had the `JobFailed` condition reason set by the new `SetConditionReason` field, that the
Job controller does not overwrite the reason to `PodFailurePolicy`, and that it remains set to the
user-defined `SetConditionReason`.

Contributor:

Maybe we should also cover the case of the PodFailurePolicy feature gate? @mimowo when are we planning to GA PodFailurePolicy?

Contributor:

Yes please!

<!--
This section must be completed when targeting alpha to a release.
-->
- Upgrade to k8s version 1.30+
Contributor:

Suggested change: "Upgrade to k8s version 1.30+" → "Upgrade to k8s version 1.31+"

- Estimated amount of new objects: (e.g., new Object X for every existing Pod)
-->
If the optional `name` field is specified, the podFailurePolicy object size will increase by 1 byte per
character in the `name` string.
Contributor:

You put an upper limit on the name too, so I'd mention what it can be at most.

Contributor:

Yes, we usually put the max size here.

to now being configurable by the user. However, as described in the validation section below,
we are validating against malformed/invalid inputs.


Contributor:

PodFailurePolicy is a beta feature, so there could be a risk that this feature is blocked by PodFailurePolicy graduation.

It's a minor risk, but any kind of coupling with other KEPs should be called out.

Contributor:

This is covered below in the business logic section.

@kannon92 (Contributor) left a comment:

Looks very nice. Please make sure to fill out the rest of the PRR if you can. I have some minor additions/comments, but otherwise I think it's ready for review.

@ahg-g (Member) left a comment:

/lgtm

@soltysh I think this now reflects what we discussed in the community meeting.

When a `PodFailurePolicyRule` matches a pod failure and the `Action` is `FailJob`, the Job
controller will append the name of the pod failure policy rule which triggered the failure
to the JobFailed [condition](https://github.com/kubernetes/kubernetes/blob/6a4e93e776a35d14a61244185c848c3b5832621c/pkg/controller/job/job_controller.go#L816)
reason. The exact format of the JobFailed condition reason will be `PodFailurePolicy-{ruleName}`.
Member:

Is the delimiter underscore or dash?

Contributor:

This argument holds; we should make it explicit and use one or the other across the entire document.

command: ["..."]
```

#### Story 2
Member:

I don't think we need another story from JobSet using exit codes; one is enough. Those detailed stories will be listed in the JobSet KEP for failurePolicy.

Contributor:

This argument holds; either drop the second JobSet story, or come up with a different story that covers more use cases.


@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Apr 12, 2024

see-also:
- "https://github.com/kubernetes-sigs/jobset/pull/381"
- "https://github.com/kubernetes/enhancements/pull/3374"

- [ ] (R) Production readiness review approved
- [X] "Implementation History" section is up-to-date for milestone
- [ ] User-facing documentation has been created in [kubernetes/website], for publication to [kubernetes.io]
- [ ] Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes
Contributor:

Make sure to check appropriate boxes.

the rule in the `podFailurePolicy.rules` slice.

When a pod failure policy rule triggers a Job failure, the rule name would be appended as a suffix to the `JobFailed` condition reason, in the
format: `PodFailurePolicy_{ruleName}`.
Contributor:

Can we make the summary a more descriptive one? Something like:

Expose more detailed pod failure information inside the JobFailed condition of a Job. This will be achieved using an optional Name field in PodFailurePolicyRule which, if unset, would default to the rule's index in the podFailurePolicy.rules slice.

Ref: https://youtu.be/JZ9LQR_j0Rk?t=1482



You can take a look at one potential example of such test in:
https://github.com/kubernetes/kubernetes/pull/97058/files#diff-7826f7adbc1996a05ab52e3f5f02429e94b68ce6bce0dc534d1be636154fded3R246-R282
-->
We can add unit tests for:
Contributor:

This holds.

- feature enabled and field set
- feature disabled and field set

### Rollout, Upgrade and Rollback Planning
Contributor:

Yes please!


###### What specific metrics should inform a rollback?

<!--

Pick one more of these and delete the rest.
-->

- [ ] Metrics
Contributor:

Same here, about re-using metrics from the other KEP.


There is a risk to making a field that was previously exclusively managed by the controller,
to now being configurable by the user. However, as described in the validation section below,
we are validating against malformed/invalid inputs.

Member:

Users should always be aware of the PodFailurePolicy evaluation order. If two pods terminate at the same time with exit codes 2 and 3, only one of them will be picked up by a rule and exposed in the condition reason.

It might be surprising/racy if an external actor is not aware of this fact and only consumes the condition to trigger additional behaviour.
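A sketch of the race described above, assuming two named rules: if pods exit with codes 2 and 3 at roughly the same time, only the first matched rule is surfaced in the condition reason:

```yaml
podFailurePolicy:
  rules:
  - name: ConfigError
    action: FailJob
    onExitCodes: {containerName: main, operator: In, values: [2]}
  - name: CudaError
    action: FailJob
    onExitCodes: {containerName: main, operator: In, values: [3]}
# The Job gets a single JobFailed condition, e.g. reason: PodFailurePolicy_ConfigError;
# the concurrent CudaError match is not reflected in the reason.
```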

## Summary

This KEP proposes to extend the Job API by adding an optional `Name` field to `PodFailurePolicyRule`. If unset, it would default to the index of
the rule in the `podFailurePolicy.rules` slice.
Member:

Do we need the defaulting and the index?

I think it might be more convenient to consume the PodFailurePolicy reason than PodFailurePolicy-{index} by default. A user might not be interested in the index, and this would complicate parsing/consuming the condition.

If the user has a need, then:

  • Name field can be added to the rule
  • the fail job message already mentions the index among other data

As a bonus, this would make it compatible with today's PodFailurePolicy implementation.

Also, it might be confusing if a user creates a rule with name: "7" as the first rule.

#### Defaulting
If unset, the `Name` field will default to the index of the pod failure policy rule in the `Rules` slice.

#### Validation
Member:

We should also check for duplicate names.
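For example, validation should reject a spec like this sketch, where two rules share a name (the names and matchers are arbitrary):

```yaml
podFailurePolicy:
  rules:
  - name: CudaError
    action: FailJob
    onExitCodes: {containerName: main, operator: In, values: [3]}
  - name: CudaError              # duplicate name: should fail validation
    action: Ignore
    onExitCodes: {containerName: main, operator: In, values: [4]}
```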

@danielvegamyhre danielvegamyhre changed the title KEP-4443: Configurable Job failure reason for PodFailurePolicyRule KEP-4443: More granular Job failure reason added by PodFailurePolicy May 24, 2024