The endpoint is lost when the APIServer is restored. #124547

Open
Black-max12138 opened this issue Apr 26, 2024 · 5 comments
Labels
area/controller-manager, kind/bug, needs-triage, sig/api-machinery

Comments

@Black-max12138

What happened?

After the apiserver goes down and comes back up a few minutes later, some endpoints are left with notReadyAddresses and never recover. The problem occurs intermittently.
The cause is that the endpoint obtained from the informer is not the latest version. In the syncService method of endpoints_controller.go:
currentEndpoints, err := e.endpointsLister.Endpoints(service.Namespace).Get(service.Name)
Because the endpoint returned here is stale, the controller concludes that the current and desired endpoints are equal and skips the update.
I added logging to print the endpoint and confirmed this behavior.
The log shows:
I0425 08:25:57.715142 11 endpoints_controller.go:423] "About to update endpoints for service" service="manager/service-mchiroer"
I0425 08:25:57.715216 11 endpoints_controller.go:516] "endpoints are equal, skipping update" service="manager/service-mchiroer"
I0425 08:25:57.715225 11 endpoints_controller.go:389] "Finished syncing service endpoints" service="manager/service-mchiroer" startTime="83.332µs"
So I think the cache in informer is not caching the latest data, which is a bug.
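
To make the suspected mechanism concrete, here is a minimal, self-contained sketch. It is not the controller's actual code: the `Endpoints` struct, the `needsUpdate` helper, the IP address, and the resource versions are all hypothetical simplifications. It only illustrates how comparing the desired state against a lagging local copy can report "equal" even though the object stored in the apiserver still differs.

```go
package main

import (
	"fmt"
	"reflect"
)

// Endpoints is a simplified, hypothetical stand-in for the real Endpoints object.
type Endpoints struct {
	ResourceVersion   string
	Addresses         []string
	NotReadyAddresses []string
}

// needsUpdate (hypothetical helper) reports whether the desired address lists
// differ from the copy the controller is looking at.
func needsUpdate(current, desired Endpoints) bool {
	return !reflect.DeepEqual(current.Addresses, desired.Addresses) ||
		!reflect.DeepEqual(current.NotReadyAddresses, desired.NotReadyAddresses)
}

func main() {
	// Object actually stored in the apiserver: the pod is listed as not ready.
	stored := Endpoints{ResourceVersion: "101", NotReadyAddresses: []string{"10.0.0.5"}}
	// Copy returned by the lister: it lags behind and still lists the pod as ready.
	cached := Endpoints{ResourceVersion: "100", Addresses: []string{"10.0.0.5"}}
	// State computed from the pod, which is ready again.
	desired := Endpoints{Addresses: []string{"10.0.0.5"}}

	fmt.Println("update needed vs cached copy:", needsUpdate(cached, desired))
	fmt.Println("update needed vs stored object:", needsUpdate(stored, desired))
}
```

In this model the comparison against the cached copy returns false, which corresponds to the "endpoints are equal, skipping update" log line, while a comparison against the stored object would return true.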

What did you expect to happen?

When the pod status is updated to ready, the pod's entry under notReadyAddresses should move to addresses.
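
As a rough illustration of the expected result, the sketch below sorts pod IPs into the two lists based on readiness. The `Pod` and `Subset` types and the `buildSubset` function are hypothetical simplifications, not the controller's real types or logic.

```go
package main

import "fmt"

// Pod is a hypothetical, reduced view of a pod: its IP and readiness.
type Pod struct {
	IP    string
	Ready bool
}

// Subset is a hypothetical, reduced view of an endpoint subset.
type Subset struct {
	Addresses         []string
	NotReadyAddresses []string
}

// buildSubset places each pod IP in Addresses when the pod is ready
// and in NotReadyAddresses otherwise.
func buildSubset(pods []Pod) Subset {
	var s Subset
	for _, p := range pods {
		if p.Ready {
			s.Addresses = append(s.Addresses, p.IP)
		} else {
			s.NotReadyAddresses = append(s.NotReadyAddresses, p.IP)
		}
	}
	return s
}

func main() {
	before := buildSubset([]Pod{{IP: "10.0.0.5", Ready: false}})
	after := buildSubset([]Pod{{IP: "10.0.0.5", Ready: true}})
	fmt.Printf("before: %+v\n", before)
	fmt.Printf("after:  %+v\n", after)
}
```

Once the pod reports ready, the same IP should appear under `Addresses` and no longer under `NotReadyAddresses`; in the failing case described here, the stored endpoint stays in the "before" shape.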

How can we reproduce it (as minimally and precisely as possible)?

1. Stop the apiserver service of the cluster.
2. Restore the apiserver service after a few minutes.
Repeat these steps until the problem appears; it does not occur on every attempt.

Anything else we need to know?

No response

Kubernetes version

1.28

Cloud provider

OS version

# On Linux:
$ cat /etc/os-release
# paste output here
$ uname -a
# paste output here

# On Windows:
C:\> wmic os get Caption, Version, BuildNumber, OSArchitecture
# paste output here

Install tools

Container runtime (CRI) and version (if applicable)

Related plugins (CNI, CSI, ...) and versions (if applicable)

@Black-max12138 added the kind/bug label on Apr 26, 2024
@k8s-ci-robot added the needs-sig and needs-triage labels on Apr 26, 2024
@k8s-ci-robot
Contributor

This issue is currently awaiting triage.

If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot
Contributor

@Black-max12138: The label(s) sig/area/controller-manager cannot be applied, because the repository doesn't have them.

In response to this:

/sig area/controller-manager


@Black-max12138
Author

/area controller-manager

@neolit123
Member

So I think the cache in informer is not caching the latest data, which is a bug.

/sig api-machinery

@k8s-ci-robot added the sig/api-machinery label and removed the needs-sig label on Apr 26, 2024
@Black-max12138
Author

Our latest finding is that this is not a caching issue. It is a delay in the delivery of update events.
As a workaround, we added an UpdateFunc to the endpointsInformer, so that the endpoint is refreshed when an update event is received.
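
The exact change is not shown in this thread, so the following is only a sketch of what such a handler could look like, written without client-go so that it runs on its own. `EventHandlerFuncs` is a hypothetical miniature of client-go's `cache.ResourceEventHandlerFuncs`, and `Queue`, `Endpoints`, and `newEndpointsHandler` are likewise invented for illustration. The sketch also assumes that "refreshing" means re-queuing the service that shares the endpoint's namespace and name.

```go
package main

import "fmt"

// Endpoints is a simplified, hypothetical stand-in for the real Endpoints object.
type Endpoints struct {
	Namespace       string
	Name            string
	ResourceVersion string
}

// EventHandlerFuncs is a hypothetical miniature of an informer event handler;
// only the update callback is modeled.
type EventHandlerFuncs struct {
	UpdateFunc func(oldObj, newObj interface{})
}

// Queue is a hypothetical de-duplicating work queue keyed by "namespace/name".
type Queue struct {
	pending map[string]bool
	order   []string
}

func NewQueue() *Queue { return &Queue{pending: map[string]bool{}} }

// Add enqueues key unless it is already pending.
func (q *Queue) Add(key string) {
	if q.pending[key] {
		return
	}
	q.pending[key] = true
	q.order = append(q.order, key)
}

// Len returns the number of pending keys.
func (q *Queue) Len() int { return len(q.order) }

// newEndpointsHandler returns a handler that re-queues the matching service
// when an Endpoints update event arrives, even if it arrives late.
func newEndpointsHandler(q *Queue) EventHandlerFuncs {
	return EventHandlerFuncs{
		UpdateFunc: func(oldObj, newObj interface{}) {
			oldEP, okOld := oldObj.(*Endpoints)
			newEP, okNew := newObj.(*Endpoints)
			if !okOld || !okNew {
				return
			}
			// An unchanged resource version means nothing new was observed.
			if oldEP.ResourceVersion == newEP.ResourceVersion {
				return
			}
			q.Add(newEP.Namespace + "/" + newEP.Name)
		},
	}
}

func main() {
	q := NewQueue()
	h := newEndpointsHandler(q)
	old := &Endpoints{Namespace: "manager", Name: "service-mchiroer", ResourceVersion: "100"}
	cur := &Endpoints{Namespace: "manager", Name: "service-mchiroer", ResourceVersion: "101"}
	h.UpdateFunc(old, cur) // late update event: the service is queued for another sync
	h.UpdateFunc(cur, cur) // no change in resource version: ignored
	fmt.Println("queued keys:", q.order)
}
```

The idea is that a sync which ran before the update event arrived gets a second chance: once the event is delivered, the service is synced again against the newer endpoint.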
