New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
KEP-4443: More granular Job failure reason added by PodFailurePolicy #4479
base: master
Are you sure you want to change the base?
Conversation
danielvegamyhre
commented
Feb 5, 2024
- One-line PR description: Configurable Job failure reason for PodFailurePolicy
- Issue link: Add more granular failure reason for Job PodFailurePolicy #4443
- Other comments:
56d4b81
to
0e39c11
Compare
@soltysh for sig-apps lead review |
0e39c11
to
02bf4dc
Compare
keps/sig-apps/4443-configurable-pod-failure-policy-reasons/README.md
Outdated
Show resolved
Hide resolved
|
||
When a Job fails due to a pod failing with exit code 3, I want my job management software to to restart the Job. | ||
|
||
**Example JobSet with a Pod Failure Policy configuration for this use case**: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The simplest example is to allow a JobSet user to decide whether or not to fail it based on the exact exit code.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add the above in the doc.
Also highlight how the Reason is used both in the .spec.failurePolicy
and spec.replicatedJobs.template.spec.podFailurePolicy
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added a user story for the simplest use case, and updated the example specs to highlight that the reason fields are set to matching values in the JobSet failure policy and the Job pod failure policy rules.
|
||
When a `PodFailurePolicyRule` matches a pod failure and the `Action` is `FailJob`, the Job | ||
controller will add the reason defined in the `Reason` field to the JobFailed [condition](https://sourcegraph.com/github.com/kubernetes/kubernetes@6a4e93e776a35d14a61244185c848c3b5832621c/-/blob/pkg/controller/job/job_controller.go?L816) added | ||
to the Job. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we need to validate that reason follows the expected pattern as discussed in https://github.com/kubernetes/kubernetes/blob/dd301d0f23a63acc2501a13049c74b38d7ebc04d/staging/src/k8s.io/apimachinery/pkg/apis/meta/v1/types.go#L1555
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should also validate that the reason don't match existing reasons such as BackoffLimitExceeded
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added a "validation" section which includes both of these validation steps.
controller uses when a PodFailurePolicy triggers a Job failure. | ||
|
||
When a `PodFailurePolicyRule` matches a pod failure and the `Action` is `FailJob`, the Job | ||
controller will add the reason defined in the `Reason` field to the JobFailed [condition](https://sourcegraph.com/github.com/kubernetes/kubernetes@6a4e93e776a35d14a61244185c848c3b5832621c/-/blob/pkg/controller/job/job_controller.go?L816) added |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No strong view, but I think it could be preferable to let the users only specify the suffix for the reason. This would still support the use cases, but having the PodFailurePolicy prefix might be handy when debugging the job by another person than the user.
Regardless, I would like to align the approach with #4062 (thread), because this is pretty much the same use case.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
-1 to a suffix
This would complicate the usage of the feature. In the example in the story, the user would have to put different strings in the spec of the jobset and the job.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I documented "the entire reason" vs "the suffix of the reason" here: https://github.com/kubernetes/enhancements/blob/65416b0d03f666024779a495f253169659c42389/keps/sig-apps/3998-job-success-completion-policy/README.md#possibility-for-the-configurable-reason-for-the-successcriteriamet-condition
This may be helpful.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In #4062 it sounds like we aren't going to add a Reason field to the Success Policy, so we don't need to align the approach here anymore.
For our case, I do think "the entire reason" is preferable to "suffix of the reason" for the reason @alculquicondor mentioned: I can see the argument for having a consistent prefix, but I think a k8s-defined prefix being prepended to the user-defined reason in an opaque manner would cause bugs and result in a more confusing user experience.
As a user, I would intuitively expect to be able to "link" the JobSet FailurePolicyRule "Reason" and the child Job's PodFailurePolicyRule "Reason" by setting them to the same value. To me, it would be non-obvious and unexpected if the Reason on the JobFailed
condition was something else besides what I had defined in my PodFailurePolicyRule.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't like the suffix as well, it is easier to the user to simply use the exact same string in two places
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see, I'm also not thrilled with the suffix idea, feel free to ignore. Just proposed as an attempt to give the users flexibility for machine-readable reasons, at the same time still conveying the information that the reason originates from podFailurePolicy.
|
||
When a Job fails due to a pod failing with exit code 3, I want my job management software to to restart the Job. | ||
|
||
**Example JobSet with a Pod Failure Policy configuration for this use case**: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add the above in the doc.
Also highlight how the Reason is used both in the .spec.failurePolicy
and spec.replicatedJobs.template.spec.podFailurePolicy
keps/sig-apps/4443-configurable-pod-failure-policy-reasons/README.md
Outdated
Show resolved
Hide resolved
controller uses when a PodFailurePolicy triggers a Job failure. | ||
|
||
When a `PodFailurePolicyRule` matches a pod failure and the `Action` is `FailJob`, the Job | ||
controller will add the reason defined in the `Reason` field to the JobFailed [condition](https://sourcegraph.com/github.com/kubernetes/kubernetes@6a4e93e776a35d14a61244185c848c3b5832621c/-/blob/pkg/controller/job/job_controller.go?L816) added |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
-1 to a suffix
This would complicate the usage of the feature. In the example in the story, the user would have to put different strings in the spec of the jobset and the job.
|
||
When a `PodFailurePolicyRule` matches a pod failure and the `Action` is `FailJob`, the Job | ||
controller will add the reason defined in the `Reason` field to the JobFailed [condition](https://sourcegraph.com/github.com/kubernetes/kubernetes@6a4e93e776a35d14a61244185c848c3b5832621c/-/blob/pkg/controller/job/job_controller.go?L816) added | ||
to the Job. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should also validate that the reason don't match existing reasons such as BackoffLimitExceeded
|
||
## Summary | ||
|
||
This KEP proposes to extend the Job API by adding an optional `Reason` field to `PodFailurePolicyRule`, which if specified, would be included as the reason in the `JobFailed` condition upon Job failure triggered by a `PodFailurePolicy`. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
alternative names:
ConditionReason
SetConditionReason
(clarifies that this is about output, not about matching the Pod condition).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like SetConditionReason
- it is more explicit, and as you mentioned, clarifies that this is about defining the output, not about matching the Pod condition. Updated the doc.
keps/sig-apps/4443-configurable-pod-failure-policy-reasons/README.md
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You need a prod-readiness file for alpha I believe.
keps/sig-apps/prod-readiness/4443.yaml
keps/sig-apps/4443-configurable-pod-failure-policy-reasons/kep.yaml
Outdated
Show resolved
Hide resolved
de03300
to
ac89cc7
Compare
ac89cc7
to
5cdd49e
Compare
@wojtek-t would you mind doing a PRR review for this KEP? |
I ran into a similar problem while working on the Declarative Node Maintenance KEP. So far I am using a similar approach as described in this PR, but I would like to find an alternative as it is not really an elegant solution. The real problem comes when you have multiple controllers setting the same condition and multiple clients observing it. Then there is no real consensus on what the API/value of reason really is and how it should be processed programmatically.
We are constrained by the Condition API and I think the best solution would be to introduce an additional field (string/slice/map?). This would allow to encode additional metadata about the condition and ensure that the controllers and clients can easily communicate their intentions. In my scenario I would like to have the following condition
I think the metadata ultimately should be in the condition. A new field would add another layer of indirection. And it is not really scalable (different conditions, multiple reasons). A safer way would be to put that information into annotations. I understand that enhancing the Condition API might be a big change and problematic/cause issues. But I would like to hear sig-api-machinery opinion on this (@deads2k). |
Thanks @atiratree, Job has its own Condition API, it is not shared with Pod.
The contract is defined in the PodFailurePolicy spec.
Can you be more concrete? what would that complexity be in the case of Job and failure reasons?
A string is what is being proposed already, but the other option is to have a "DetailedReason" map that in our case we use to set an entry with the key "UserDefinedFailureReason". |
It is not, but we try to adhere to the same standard/conventions: https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/api-conventions.md#typical-status-properties
I meant it would be best to add it to the JobCondition, but this ties into the point above ^
Yes, it is still a contract, but a looser one. As an entity observing just the conditions, you cannot predict what values will be there and how to react to them.
I cannot predict what kind of values users will encode there. A client has to know how to react to ^ What if there is a mistake. You can't easily validate this because it's looser. So far we have offered it as more tighter programmatic API. |
I think there's a a lot of ideas floating around and disagreements, which I'd like to clarify first before pushing this topic forward. I understand that the sooner it gets out the better, but I'd prefer we discuss this more in depth maybe next Thursday (2/15) during wg-batch call, and work out the solution. Just to summarize, currently we have four potential solutions:
|
Mistakes are also possible with labels and selectors, but the general point is that in this case the user sets the failure policy (via pod failure policy) and so by definition the failure reason will not be predictable. But if the argument is that reason has to be assigned from a previously-defined and finite set of values (because it must be predictable without needing to look at the podFailurePolicy spec), then I agree we should just focus our discussion on adding a new field because we can't satisfy the predictability requirement.
I am not sure we can get to a conclusion at wg-batch since this is also an API question and so we need an api approver to be engaged in the discussion.
What this option is really about is giving a name or identifier to the podFailurePolicy rule, but it wouldn't satisfy the predictability "requirement".
To me the main issue with the machine provided reason is how to encode a set or range exit codes. In the above option we are practically giving the rule an identifier, so if we want the machine to identify the rule in the reason automatically, the only option I see is providing the rule number in the list (like @soltysh since you are preferring to this option, how can we address @atiratree concern about the reason being predictable?
This seems the closest to address all concerns I heard above since we are not changing the
I assume you are referring to adding new field to JobCondition type, if so then this needs wider consensus across the project since it needs to apply to all APIs as per [1], so we can't resolve this at wg-batch meeting. I am wondering how we want to approach this option? |
@soltysh I'm getting back to work on this shortly, what is the deadline for this? We didn't make the 1.30 deadline but I don't see any information on the 1.31 deadlines |
IIRC, the current is still within the v1.30 cycle. So, after the v1.30 cycle, we should find the next timeline. |
Lets work on this and get it merged early, we don't need to wait for the 1.31 timelines to be posted. |
@soltysh @ahg-g I revised the KEP based on our discussion in the WG Batch meeting. Summary: This KEP proposes to extend the Job API by adding an optional When a pod failure policy rule triggers a Job failure, the rule name would be appended as a suffix to the |
2ca02d9
to
aed6ded
Compare
/wg batch |
|
||
## Proposal | ||
|
||
The proposal is to add an optional `Name` field to the `PodFailurePolicyRule`, allowing the user |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The field is optional but it has a default?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I read on. I think this is fine. You may run into some api tests issues as optional fields tend to not like defaults.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It doesn't have defaults, when empty it's meant to use the rule index.
image: python:3.10 | ||
command: ["..."] | ||
``` | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we get another story on PodFailurePolicy conditions? Exit code is one way for failures but we also allow for failing based on conditions. It would be worth stating how that would work also.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The pattern is the same, what would we learn from another story?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Mostly calling out that we want to support both and making sure we have integration tests for that also.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess that it comes from name so it is probably not necessary as a user story but we should make sure to have integration/unit tests for conditions as well as exitcodes.
which already had the `JobFailed` condition reason set by the new `SetConditionReason` field, that the | ||
Job controller does not overwrite the reason to `PodFailurePolicy`, and that it remains set to the | ||
user-defined `SetConditionReason`. | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe we should cover the case of PodFailurePolicy feature gate also? @mimowo when are we planning to GA PodFailurePolicy?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes please!
<!-- | ||
This section must be completed when targeting alpha to a release. | ||
--> | ||
- Upgrade to k8s version 1.30+ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- Upgrade to k8s version 1.30+ | |
- Upgrade to k8s version 1.31+ |
- Estimated amount of new objects: (e.g., new Object X for every existing Pod) | ||
--> | ||
If the optional `name` field is specified, the podFailurePolicy object size will increase by 1 byte per | ||
character in the `name` string. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You put an upper limit on the name also so I'd mention what it at most be.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, we usually put here max size.
to now being configurable by the user. However, as described in the validation section below, | ||
we are validating against malformed/invalid inputs. | ||
|
||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
PodFailurePolicy is a beta feature so there could be a risk that this feature is blocked by PodFailurePolicy graduation.
Its a minor risk but any kind of coupling with other KEPs should be called out.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is covered below in the business logic section.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks very nice. Please make sure to fill out the rest of the PRR if you can. I have some minor additions/comments but otherwise I think its ready for review.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/lgtm
@soltysh I think this now reflects what we discussed in the community meeting.
When a `PodFailurePolicyRule` matches a pod failure and the `Action` is `FailJob`, the Job | ||
controller will append the name of the pod failure policy rule which triggered the failure | ||
to the JobFailed [condition](https://github.com/kubernetes/kubernetes/blob/6a4e93e776a35d14a61244185c848c3b5832621c/pkg/controller/job/job_controller.go#L816) | ||
reason. The exact format of the JobFailed condition reason will be `PodFailurePolicy-{ruleName}`. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is the delimiter underscore or dash?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This argument holds, we should make it explicit to use one or the other across entire document.
command: ["..."] | ||
``` | ||
|
||
#### Story 2 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think we need another story from JobSet using exit codes, one is enough. Those detailed stories will be listed in the JobSet KEP for failurePolicy
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This argument holds, either drop another JobSet story, or try to come up with a different story, which will cover more use cases.
image: python:3.10 | ||
command: ["..."] | ||
``` | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The pattern is the same, what would we learn from another story?
|
||
see-also: | ||
- "https://github.com/kubernetes-sigs/jobset/pull/381" | ||
- "https://github.com/kubernetes/enhancements/pull/3374" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Any particular reason why to use PR number rather https://github.com/kubernetes/enhancements/blob/master/keps/sig-apps/3329-retriable-and-non-retriable-failures/README.md ?
- [ ] (R) Production readiness review approved | ||
- [X] "Implementation History" section is up-to-date for milestone | ||
- [ ] User-facing documentation has been created in [kubernetes/website], for publication to [kubernetes.io] | ||
- [ ] Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Make sure to check appropriate boxes.
the rule in the `podFailurePolicy.rules` slice. | ||
|
||
When a pod failure policy rule triggers a Job failure, the rule name would be appended as a suffix to the `JobFailed` condition reason, in the | ||
format: `PodFailurePolicy_{ruleName}`. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we make the summary a more descriptive one? Something like:
Expose more detailed pod failure information inside the JobFailed
condition of a Job. This will be achieved using an optional Name
field in PodFailurePolicyRule
, which if unset, would default to the index of the podFailurePolicy.rules
slice.
|
||
## Proposal | ||
|
||
The proposal is to add an optional `Name` field to the `PodFailurePolicyRule`, allowing the user |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It doesn't have defaults, when empty it's meant to use the rule index.
When a `PodFailurePolicyRule` matches a pod failure and the `Action` is `FailJob`, the Job | ||
controller will append the name of the pod failure policy rule which triggered the failure | ||
to the JobFailed [condition](https://github.com/kubernetes/kubernetes/blob/6a4e93e776a35d14a61244185c848c3b5832621c/pkg/controller/job/job_controller.go#L816) | ||
reason. The exact format of the JobFailed condition reason will be `PodFailurePolicy-{ruleName}`. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This argument holds, we should make it explicit to use one or the other across entire document.
You can take a look at one potential example of such test in: | ||
https://github.com/kubernetes/kubernetes/pull/97058/files#diff-7826f7adbc1996a05ab52e3f5f02429e94b68ce6bce0dc534d1be636154fded3R246-R282 | ||
--> | ||
We can add unit tests for: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This holds.
- feature enabled and field set | ||
- feature disabled and field set | ||
|
||
### Rollout, Upgrade and Rollback Planning |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes please!
|
||
###### What specific metrics should inform a rollback? | ||
|
||
<!-- |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd suggest filing this one out, piggy backing on the metrics introduced in https://github.com/kubernetes/enhancements/blob/master/keps/sig-apps/3329-retriable-and-non-retriable-failures/README.md
Pick one more of these and delete the rest. | ||
--> | ||
|
||
- [ ] Metrics |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same here, about re-using metrics from the other KEP.
- Estimated amount of new objects: (e.g., new Object X for every existing Pod) | ||
--> | ||
If the optional `name` field is specified, the podFailurePolicy object size will increase by 1 byte per | ||
character in the `name` string. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, we usually put here max size.
There is a risk to making a field that was previously exclusively managed by the controller, | ||
to now being configurable by the user. However, as described in the validation section below, | ||
we are validating against malformed/invalid inputs. | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The users should be always aware of the PodFailurePolicy evaluation order. If two pods terminate at the same time with exit codes 2 and 3, only one of them will be picked up by a rule and exposed in the condition reason.
It might be surprising/racy if an external actor is not aware of this fact and only consumes the condition to trigger additional behaviour.
## Summary | ||
|
||
This KEP proposes to extend the Job API by adding an optional `Name` field to `PodFailurePolicyRule`. If unset, it would default to the index of | ||
the rule in the `podFailurePolicy.rules` slice. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need the defaulting and the index?
I think it might be more convenient to consume the PodFailurePolicy
reason than PodFailurePolicy-{index}
by default . A user might not be interested in the index and this would complicate the parsing/consuming of the condition.
If the user has a need, then:
Name
field can be added to the rule- the fail job message already mentions the index among other data
As a bonus, this would make it compatible with today's PodFailurePolicy implementation.
Also it might be confusing, if a user creates a rule with name: "7"
on a first index.
#### Defaulting | ||
If unset, the `Name` field will default to the index of the pod failure policy rule in the `Rules` slice. | ||
|
||
#### Validation |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should also check for the duplicates of the name.