Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Sidecar Containers] Sidecar containers finish time needs to be accounted for in job controller #124942

Open
wants to merge 3 commits into
base: master
Choose a base branch
from

Conversation

AxeZhan
Copy link
Member

@AxeZhan AxeZhan commented May 18, 2024

What type of PR is this?

/kind feature

What this PR does / why we need it:

Today the function getFinishTimeFromContainers (pkg/controller/job/backoff_utils.go) only account for the regular containers finish time,
presumably assuming that init containers have completed before before.
However, with the sidecar (restartable init) containers, sidecar containers will always finish later than
regular containers. And those needs to be accounted for the calculation.

Which issue(s) this PR fixes:

Fixes #124937

Special notes for your reviewer:

Does this PR introduce a user-facing change?

The sidecar finish time will be accounted when calculating the job's finish time.

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:

[KEP]: <https://github.com/kubernetes/enhancements/issues/753>

@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. kind/feature Categorizes issue or PR as related to a new feature. labels May 18, 2024
@k8s-ci-robot k8s-ci-robot added sig/apps Categorizes an issue or PR as relevant to SIG Apps. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels May 18, 2024
@k8s-ci-robot
Copy link
Contributor

This issue is currently awaiting triage.

If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. labels May 18, 2024
Copy link
Contributor

@mimowo mimowo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM overall

pkg/controller/job/backoff_utils.go Outdated Show resolved Hide resolved
pkg/controller/job/backoff_utils_test.go Outdated Show resolved Hide resolved
@AxeZhan AxeZhan force-pushed the getFinishTimeFromContainers branch from e8cf4da to b7d9a05 Compare May 20, 2024 08:06
@k8s-ci-robot k8s-ci-robot added size/S Denotes a PR that changes 10-29 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels May 20, 2024
@AxeZhan AxeZhan force-pushed the getFinishTimeFromContainers branch from b7d9a05 to 020b059 Compare May 20, 2024 08:26
@k8s-ci-robot k8s-ci-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels May 20, 2024
@AxeZhan AxeZhan force-pushed the getFinishTimeFromContainers branch from 020b059 to 57e2681 Compare May 20, 2024 08:32
@AxeZhan AxeZhan force-pushed the getFinishTimeFromContainers branch from 57e2681 to e4348a1 Compare May 20, 2024 10:31
Copy link
Contributor

@mimowo mimowo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm
/assign @alculquicondor
/cc @atiratree

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label May 20, 2024
@k8s-ci-robot
Copy link
Contributor

LGTM label has been added.

Git tree hash: f084e0312fa47363218ca898c6a51c71a0c8ca03

Comment on lines 187 to 190
// We need to check InitContainerStatuses here also,
// because with the sidecar (restartable init) containers,
// sidecar containers will always finish later than regular containers.
return latestFinishTime(finishTime, p.Status.InitContainerStatuses)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about checking the feature gate here to not break the existing behavior?

Copy link
Member Author

@AxeZhan AxeZhan May 20, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think if a sidecar container is already created, we should always take it into consideration. Regardless of whether the featuregate is enabled.

And for other init containers(not sidecar). I think this won't break the existing behavior as they will always terminate before the regular container.

// We need to check InitContainerStatuses here also,
// because with the sidecar (restartable init) containers,
// sidecar containers will always finish later than regular containers.
return latestFinishTime(finishTime, p.Status.InitContainerStatuses)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would suggest to only check finish time of restartable init containers - otherwise we may change the current behavior for some Pod that behaves badly and reports incorrect finish time for the regular init container. I am mostly concerned of some badly written controllers that set this time to some random large value for the regular init containers.

Since sidecars are new, accounting for them is not supposed to break annything.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am mostly concerned of some badly written controllers that set this time to some random large value for the regular init containers.

Well, I'm not aware of this...
I added check in the second commit, let's wait for some reviews.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I liked the simplicity of the previous solution, but I'm not very familiar with the sidecar containers yet.

I am mostly concerned of some badly written controllers that set this time to some random large value for the regular init containers.

Would the container finish time, in case of sidecar containers, be set by a user-supplied controller? I think if the finish at time is set by kubelet then we could rely on this.

cc @alculquicondor

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is set by kubelet

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can simplify the code and check the InitContainers regardless of the restartPolicy. For safety, we just wrap with the SidecarContainers feature-gate check as suggested in #124942 (comment).

This way we complicate the code only until the feature-gate is around. WDYT @alculquicondor @SergeyKanzhelev ?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My only worry is to encounter some crazy controller like one that made me write this comment: #124918. Semantically I am all for not checking for restartPolicy. Breaking some barely supported scenario is what keeping me uneasy here

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In that case, I agree with the current implementation that restricts the change to initContainers with restartPolicy: Always.

@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label May 21, 2024
@k8s-ci-robot
Copy link
Contributor

New changes are detected. LGTM label has been removed.

@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: AxeZhan
Once this PR has been reviewed and has the lgtm label, please ask for approval from alculquicondor. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@AxeZhan AxeZhan force-pushed the getFinishTimeFromContainers branch from aa3d4b0 to 3a2a500 Compare May 21, 2024 03:58
if containerState.State.Terminated == nil {
return nil
finishTime := latestFinishTime(nil, p.Status.ContainerStatuses, nil)
// We need to check InitContainerStatuses here also,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the code doesn't look risky, but we should be consistent and check the feature gate to skip the additional calculations when the feature is disabled

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, most of cases it is. However there is almost no validation on consistency of statuses: #124918

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

are you saying that the code i this PR is more risky than it appears?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the only risk is for the crazy scenarios when some sort of virtual kubelet was setting some bad values for init containers and it used to be ignored before. This is why I suggest only add a logic for sidecars.

@AxeZhan
Copy link
Member Author

AxeZhan commented May 30, 2024

Added check of feature gate to skip the additional calculations.

@alculquicondor
Copy link
Member

Quick question: Is there any other component in Kubernetes that cares about the latest termination timestamp of a Pod?

@AxeZhan
Copy link
Member Author

AxeZhan commented May 31, 2024

Quick question: Is there any other component in Kubernetes that cares about the latest termination timestamp of a Pod?

Not sure about this. But I check the code, only find kubelet and job controller using this field State.Terminated.FinishedAt.
So only these two components should care about the latest termination.

@alculquicondor
Copy link
Member

Wondering if the code for getFinishTimeFromContainers(p *v1.Pod) should be in some library package. But if no one else needs it, we can leave it under pkg/controller/job

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/feature Categorizes issue or PR as related to a new feature. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. release-note Denotes a PR that will be considered when it comes time to generate release notes. sig/apps Categorizes an issue or PR as relevant to SIG Apps. size/M Denotes a PR that changes 30-99 lines, ignoring generated files.
Projects
Status: Needs Triage
Development

Successfully merging this pull request may close these issues.

[Sidecar Containers] Sidecar containers finish time needs to be accounted for in job controller
6 participants