Job: Support for the SuccessPolicy #123412

Merged

Member

@tenzen-y tenzen-y commented Feb 21, 2024

What type of PR is this?

/kind feature
/kind api-change

What this PR does / why we need it:

I implemented the following items to support the JobSuccessPolicy:

  • Extended the Job Controller and its Unit Tests
  • Extended the Job Validation Webhooks and their Unit Tests
  • Added an Integration Test for this feature.

Please see more details in the KEP.
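
For orientation, here is a minimal sketch of how a success policy rule could be evaluated, based on my reading of the KEP: a rule may name a set of indexes, a count, or both. The `SuccessPolicyRule` type below is a local stand-in (the indexes are pre-parsed into a slice to keep it short), and `ruleMatches` is a hypothetical helper, not the controller code in this PR.

```go
package main

import "fmt"

// SuccessPolicyRule is a local stand-in for the rule type discussed in this PR.
// Assumption: indexes are already parsed; the real API field is a string.
type SuccessPolicyRule struct {
	SucceededIndexes []int32 // empty means "any index"
	SucceededCount   *int32  // nil means "all listed indexes"
}

// ruleMatches (hypothetical) reports whether the succeeded indexes satisfy the rule.
func ruleMatches(rule SuccessPolicyRule, succeeded map[int32]bool) bool {
	if len(rule.SucceededIndexes) == 0 && rule.SucceededCount == nil {
		return false // an empty rule never matches in this sketch
	}
	// Count the succeeded indexes that the rule cares about.
	var matched int32
	if len(rule.SucceededIndexes) == 0 {
		for _, ok := range succeeded {
			if ok {
				matched++
			}
		}
	} else {
		for _, ix := range rule.SucceededIndexes {
			if succeeded[ix] {
				matched++
			}
		}
	}
	if rule.SucceededCount != nil {
		return matched >= *rule.SucceededCount
	}
	// No count given: every listed index must have succeeded.
	return matched == int32(len(rule.SucceededIndexes))
}

func main() {
	succeeded := map[int32]bool{0: true, 2: true, 5: true}
	two := int32(2)
	rules := []SuccessPolicyRule{
		{SucceededIndexes: []int32{0, 1}},                          // needs both 0 and 1
		{SucceededIndexes: []int32{0, 1, 2}, SucceededCount: &two}, // any 2 of 0, 1, 2
	}
	for i, r := range rules {
		fmt.Printf("rule %d matches: %v\n", i, ruleMatches(r, succeeded))
	}
}
```

In this sketch a Job would meet its success criteria as soon as any one rule matches.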

Which issue(s) this PR fixes:

Tracking issue kubernetes/enhancements#3998

Special notes for your reviewer:

New Condition Name

ref: #123412 (comment)

Votes:

Neutral: @mimowo

Difference from KEP

The implementation follows the specification agreed on in the KEP, with three differences:

  1. The type of .spec.successPolicy.criteria[*].succeededCount. I replaced int with int32 because int32 is what we already use for index-related counts such as .spec.completions.
  2. Validations on Job condition updates. I added the following validations, based on Support for the Job managedBy field (alpha) #123273 (comment):
    • RejectSuccessCriteriaMetWithFailedCondition: Reject transitions of conditions Failed <-> SuccessCriteriaMet
    • RejectSuccessCriteriaMetWithFailureTargetCondition: Reject transitions of conditions FailureTarget <-> SuccessCriteriaMet
    • RejectSuccessCriteriaMetForAlreadyCompleteJob: Reject transitions of conditions Complete -> SuccessCriteriaMet
    • RejectSuccessCriteriaMetForNonIndexedJob: Reject the addition of SuccessCriteriaMet for a NonIndexed Job.
  3. I replaced the API name, SuccessPolicyCriteria (JSON=criteria) with SuccessPolicyRule (JSON=rules) based on Job: Support for the SuccessPolicy #123412 (comment).
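
The condition validations in item 2 can be sketched roughly as follows. This is an illustration only: `jobState` and `validateSuccessCriteriaMet` are hypothetical names, the Job is flattened into a map of condition types, and the sketch reads the "transition" rules as "these conditions must not end up together", which may be narrower or broader than the actual validation in this PR.

```go
package main

import "fmt"

// jobState is a hypothetical, flattened view of a Job used only in this sketch.
type jobState struct {
	Indexed    bool
	Conditions map[string]bool // condition type -> status is True
}

// validateSuccessCriteriaMet (hypothetical) returns one message per violated rule.
func validateSuccessCriteriaMet(oldJob, newJob jobState) []string {
	var errs []string
	if !newJob.Conditions["SuccessCriteriaMet"] {
		return errs // nothing to check unless the condition is being set
	}
	if !newJob.Indexed {
		errs = append(errs, "SuccessCriteriaMet is not allowed for a NonIndexed Job")
	}
	for _, c := range []string{"Failed", "FailureTarget"} {
		if newJob.Conditions[c] {
			errs = append(errs, fmt.Sprintf("SuccessCriteriaMet cannot be combined with %s", c))
		}
	}
	// A Job that was already Complete must not gain SuccessCriteriaMet afterwards.
	if oldJob.Conditions["Complete"] && !oldJob.Conditions["SuccessCriteriaMet"] {
		errs = append(errs, "SuccessCriteriaMet cannot be added to an already Complete Job")
	}
	return errs
}

func main() {
	oldJob := jobState{Indexed: true, Conditions: map[string]bool{"Failed": true}}
	newJob := jobState{Indexed: true, Conditions: map[string]bool{
		"Failed":             true,
		"SuccessCriteriaMet": true,
	}}
	for _, e := range validateSuccessCriteriaMet(oldJob, newJob) {
		fmt.Println(e)
	}
}
```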

Additionally, this PR depends on the following pieces of #123273.

validateIndexesFormat:

// Extended the [validateIndexesFormat](https://github.com/kubernetes/kubernetes/pull/123273)
// TODO: Once #123273 is merged into the master, we should rebase these commits.
func validateIndexesFormat(indexesStr string, completions int) (*int, error) {

IsJobComplete, IsJobFailed, and IsConditionTrue:

func IsJobComplete(job *batch.Job) bool {
	return IsConditionTrue(job.Status.Conditions, batch.JobComplete)
}
func IsJobFailed(job *batch.Job) bool {
	return IsConditionTrue(job.Status.Conditions, batch.JobFailed)
}
func IsConditionTrue(list []batch.JobCondition, cType batch.JobConditionType) bool {
	for _, c := range list {
		if c.Type == cType && c.Status == api.ConditionTrue {
			return true
		}
	}
	return false
}

JobStatusValidationOptions:

type JobStatusValidationOptions struct {

So we need to merge #123273 first.
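
As a rough illustration of the kind of check an index-format validator performs, the sketch below parses a string such as "0-2,4" (comma-separated indexes or inclusive ranges) and verifies it against the number of completions. `parseIndexes` is a hypothetical helper written for this example; its return value (a count of the named indexes) and its exact rules are assumptions, not the behaviour of `validateIndexesFormat`.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseIndexes (hypothetical) validates an index list like "0-2,4" and
// returns how many indexes it names. Entries must be increasing and
// every index must be smaller than completions.
func parseIndexes(s string, completions int) (int, error) {
	if s == "" {
		return 0, nil
	}
	total := 0
	last := -1 // highest index seen so far
	for _, part := range strings.Split(s, ",") {
		lo, hi, isRange := strings.Cut(part, "-")
		first, err := strconv.Atoi(lo)
		if err != nil {
			return 0, fmt.Errorf("invalid index %q: %w", lo, err)
		}
		end := first
		if isRange {
			end, err = strconv.Atoi(hi)
			if err != nil {
				return 0, fmt.Errorf("invalid index %q: %w", hi, err)
			}
		}
		if first <= last || end < first {
			return 0, fmt.Errorf("indexes in %q are not in increasing order", part)
		}
		if end >= completions {
			return 0, fmt.Errorf("index %d is not below completions %d", end, completions)
		}
		total += end - first + 1
		last = end
	}
	return total, nil
}

func main() {
	fmt.Println(parseIndexes("0-2,4", 5))
	fmt.Println(parseIndexes("0-2,4", 4))
}
```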

Does this PR introduce a user-facing change?

Add alpha-level support for the SuccessPolicy in Jobs

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:

https://github.com/kubernetes/enhancements/tree/master/keps/sig-apps/3998-job-success-completion-policy

@k8s-ci-robot k8s-ci-robot added do-not-merge/work-in-progress, kind/feature, size/XXL, do-not-merge/release-note-label-needed, kind/api-change, cncf-cla: yes, do-not-merge/needs-sig, needs-triage, needs-priority, area/code-generation, sig/api-machinery, sig/apps and removed do-not-merge/needs-sig labels Feb 21, 2024
@tenzen-y tenzen-y force-pushed the add-new-jobsuccesspolicy-api branch 4 times, most recently from 78c556b to d6e5e67 Compare February 21, 2024 13:16
@kannon92
Contributor

/triage accepted
/priority important-soon

@k8s-ci-robot k8s-ci-robot added triage/accepted, priority/important-soon and removed needs-triage, needs-priority labels Feb 21, 2024
@kannon92
Contributor

/cc

@tenzen-y tenzen-y force-pushed the add-new-jobsuccesspolicy-api branch 2 times, most recently from 454be08 to b6f1b25 Compare February 21, 2024 14:41
@k8s-ci-robot k8s-ci-robot added size/XL and removed size/XXL labels Feb 21, 2024
staging/src/k8s.io/api/batch/v1/types.go (outdated, resolved)
staging/src/k8s.io/api/batch/v1/types.go (outdated, resolved)
Contributor

@mimowo mimowo left a comment

The Job controller changes generally LGTM; I left mostly readability comments.

The one change I don't yet understand is discussed in this comment: https://github.com/kubernetes/kubernetes/pull/123412/files#r1516065458. A code comment would do, or drop the check if it isn't needed.

@tenzen-y
Member Author

tenzen-y commented Mar 7, 2024

@atiratree @mimowo I addressed all your comments. PTAL, thanks!

@tenzen-y
Member Author

tenzen-y commented Mar 7, 2024

I'm still working on addressing Aldo's comments.

@tenzen-y
Member Author

tenzen-y commented Mar 7, 2024

/test pull-kubernetes-node-e2e-containerd

@tenzen-y tenzen-y force-pushed the add-new-jobsuccesspolicy-api branch from 803aa3f to 33155da Compare March 7, 2024 19:34
@mimowo
Contributor

mimowo commented Mar 7, 2024

The changes in pkg/controller/job LGTM.

@tenzen-y
Member Author

tenzen-y commented Mar 7, 2024

@alculquicondor I addressed all comments! PTAL, thanks!

test/integration/job/job_test.go (outdated, resolved)
@@ -875,6 +878,11 @@ func (jm *Controller) syncJob(ctx context.Context, key string) (rErr error) {
}
jobCtx.podsWithDelayedDeletionPerIndex = getPodsWithDelayedDeletionPerIndex(logger, jobCtx)
}
if jobCtx.finishedCondition == nil && hasSuccessCriteriaMetCondition(jobCtx.job) == nil {
Member

I'm not sure if this is too different from my last review or I just missed it.

We had agreed in the KEP that if, in a given reconcile loop, a Job qualifies for both SuccessCriteriaMet and FailureTarget (or Backoff or any other error), the success would win. It doesn't look like it, because we are checking matchSuccessPolicy last.

This is somewhat important because the job controller might be busy at the time when a Pod finishes successfully, so by the time it has a chance to run, multiple other Pods might have failed. But again, the success policy should win.

But I'll consider the above as part of #117303 and we can fix it before the test freeze.
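
The precedence described above (success wins when a Job qualifies for both outcomes in the same reconcile pass) can be sketched as a single decision that checks the success policy first. The names below are hypothetical and the sketch is deliberately reduced to two booleans; it is not the structure of `syncJob`.

```go
package main

import "fmt"

// outcome is a hypothetical label for what one reconcile pass decides.
type outcome string

const (
	outcomeNone               outcome = ""
	outcomeSuccessCriteriaMet outcome = "SuccessCriteriaMet"
	outcomeFailureTarget      outcome = "FailureTarget"
)

// decide (hypothetical) evaluates success before failure, so that a Job
// qualifying for both in the same pass is treated as successful.
func decide(successPolicyMatched, failureDetected bool) outcome {
	switch {
	case successPolicyMatched:
		return outcomeSuccessCriteriaMet
	case failureDetected:
		return outcomeFailureTarget
	default:
		return outcomeNone
	}
}

func main() {
	// Both signals observed in the same pass: the success check comes first.
	fmt.Println(decide(true, true))
	fmt.Println(decide(false, true))
}
```

Ordering the cases this way matters when the controller is backlogged: by the time it runs, it may observe both an earlier success and later Pod failures at once.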

Member Author

> I'm not sure if this is too different from my last review or I just missed it.
>
> We had agreed in the KEP that if, in a given reconcile loop, a Job qualifies for both SuccessCriteriaMet and FailureTarget (or Backoff or any other error), the success would win. It doesn't look like it, because we are checking matchSuccessPolicy last.

I changed this based on #123412 (comment)

Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com>
@tenzen-y tenzen-y force-pushed the add-new-jobsuccesspolicy-api branch from aa0ef26 to e216742 Compare March 7, 2024 20:49
@alculquicondor
Member

/lgtm
/hold cancel
Please work on #117303 before the test freeze.

@k8s-ci-robot k8s-ci-robot added lgtm and removed do-not-merge/hold labels Mar 7, 2024
@k8s-ci-robot
Contributor

LGTM label has been added.

Git tree hash: fb204ff04b33a16575ea7f27ca616e95d1942860

@tenzen-y
Member Author

tenzen-y commented Mar 7, 2024

> /lgtm
> /hold cancel
> Please work on #117303 before the test freeze.

Sure.

@tenzen-y
Member Author

tenzen-y commented Mar 7, 2024

Kubernetes e2e suite: [It] [sig-network] Services should complete a service status lifecycle [Conformance]
{ failed [FAILED] failed to locate Service test-service-sq55n in namespace services-2473: timed out waiting for the condition
In [It] at: k8s.io/kubernetes/test/e2e/network/service.go:3530 @ 03/07/24 21:09:02.702
}

This failure is unrelated to this PR.
/test pull-kubernetes-conformance-kind-ipv6-parallel

@alculquicondor
Member

As per Exception approval

/milestone v1.30

@k8s-ci-robot k8s-ci-robot added this to the v1.30 milestone Mar 7, 2024
@tenzen-y
Member Author

tenzen-y commented Mar 7, 2024

Kubernetes e2e suite: [It] [sig-node] Mount propagation should propagate mounts within defined scopes
{ failed [FAILED] Told to stop trying after 2.030s. The phase of Pod slave is Failed which is unexpected. In [It] at: k8s.io/kubernetes/test/e2e/framework/pod/pod_client.go:106 @ 03/07/24 21:09:00.22 }

This failure is unrelated to this PR.
/test pull-kubernetes-e2e-kind-ipv6

@tenzen-y
Member Author

tenzen-y commented Mar 7, 2024

Kubernetes e2e suite: [It] [sig-network] Services should complete a service status lifecycle [Conformance] (1m2s)
{ failed [FAILED] failed to locate Service test-service-sq55n in namespace services-2473: timed out waiting for the condition In [It] at: k8s.io/kubernetes/test/e2e/network/service.go:3530 @ 03/07/24 21:09:02.702 }

/test pull-kubernetes-conformance-kind-ipv6-parallel

@tenzen-y
Member Author

tenzen-y commented Mar 7, 2024

Kubernetes e2e suite: [It] [sig-apps] Job should apply changes to a job status [Conformance] (12s)
{ failed [FAILED] patched object should have the applied condition with LastTransitionTime time.Date(2024, time.March, 7, 21, 31, 16, 0, time.UTC), got time.Date(2024, time.March, 7, 21, 31, 15, 0, time.Local) instead In [It] at: k8s.io/kubernetes/test/e2e/apps/job.go:901 @ 03/07/24 21:31:16.371 }

Let me check it.

@tenzen-y
Member Author

tenzen-y commented Mar 7, 2024

> Kubernetes e2e suite: [It] [sig-apps] Job should apply changes to a job status [Conformance] (12s)
> { failed [FAILED] patched object should have the applied condition with LastTransitionTime time.Date(2024, time.March, 7, 21, 31, 16, 0, time.UTC), got time.Date(2024, time.March, 7, 21, 31, 15, 0, time.Local) instead In [It] at: k8s.io/kubernetes/test/e2e/apps/job.go:901 @ 03/07/24 21:31:16.371 }
>
> Let me check it.

This is already tracked in #123799.

/test pull-kubernetes-e2e-gce

@k8s-ci-robot k8s-ci-robot merged commit 364ef33 into kubernetes:master Mar 7, 2024
16 checks passed
@tenzen-y tenzen-y deleted the add-new-jobsuccesspolicy-api branch March 7, 2024 22:50
Labels
api-review, approved, area/code-generation, area/test, cncf-cla: yes, kind/api-change, kind/feature, lgtm, priority/important-soon, release-note, sig/api-machinery, sig/apps, sig/testing, size/XXL, triage/accepted
Projects
Status: API review completed, 1.30
Archived in project