Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Sync Catalog: Ignore non-ready endpoints #3874

Open
wants to merge 13 commits into
base: main
Choose a base branch
from

Conversation

jukie
Copy link
Contributor

@jukie jukie commented Apr 4, 2024

Changes proposed in this PR

I recently contributed a change here from using Endpoints to EndpointSlices.
EndpointSlices offer additional state information such as readiness information but in the current approach we aren't taking advantage of this and instead only add/remove endpoints based on creation or deletions.
We can utilize the Endpoint conditions info to ignore non-ready endpoints which will trigger a cleanup of terminating endpoints or pods that have started failing readiness checks.

How I've tested this PR

  • Added tests
  • Manual build/test in live environment

How I expect reviewers to test this PR

A good way to test would be to launch a pod with a long terminationGracePeriod that stays up and initiate a deployment rollout or delete the pod. Once the pod enters terminating state you'll see the same state in the relevant EndpointSlice.
This reduces the time for cleaning up an outgoing pod or failed pod from Consul.

Checklist

fixes #3898

@jukie jukie force-pushed the fix/sync-catalog-ignore-terminating-endpoints branch from 2d02bf3 to e360996 Compare April 4, 2024 05:29
@jukie jukie changed the title Sync Catalog: Ignore endpoints in terminating state Sync Catalog: Ignore non-ready endpoints Apr 4, 2024
@jukie
Copy link
Contributor Author

jukie commented Apr 20, 2024

@david-yu could you possibly help get a review on this please?

@jukie
Copy link
Contributor Author

jukie commented Apr 30, 2024

bumpety bump in case anyone's watching

@ndhanushkodi
Copy link
Contributor

@jukie taking a look at this now

@ndhanushkodi ndhanushkodi mentioned this pull request May 6, 2024
2 tasks
Copy link
Contributor

@ndhanushkodi ndhanushkodi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Confirmed that the relevant acceptance tests are passing, and the changes look good to me. Just one small comment to address!

Tomorrow I want to make sure the failing peering acceptance tests are unrelated, and if you're able to make that one small test case change then I think we can merge after.

@ndhanushkodi ndhanushkodi added pr/no-changelog PR does not need a corresponding .changelog entry pr/no-backport signals that a PR will not contain a backport label labels May 7, 2024
@ndhanushkodi ndhanushkodi removed the pr/no-changelog PR does not need a corresponding .changelog entry label May 8, 2024
@ndhanushkodi
Copy link
Contributor

Hey @jukie I was about to merge yesterday but then I was thinking more about the use cases you listed in the issue, and since Consul supports health checks, should we instead use Consul healthchecks to reflect the readiness of an endpoint, rather than ignoring it entirely if its non-ready? That way the user can configure Consul as they wish based on the health checks. I talked to the team and the consensus was that it makes more sense to use the health checks to reflect the readiness status.

If you're willing to make that change, I'll keep an eye on this PR (or a new PR if you make one, feel free to tag me) for those changes, or I can bring up the changes for prioritization within the team and work on it myself (but I'm not sure when I'll be able to commit to having this done). Thanks so much for your effort on this.

@jukie
Copy link
Contributor Author

jukie commented May 9, 2024

@ndhanushkodi could you explain how that would work? I think that may be in addition to or an extension of this change vs instead of.
Since consul-k8s is already watching for updates to the EndpointSlice objects that seems like the most efficient place to react to changes in readiness in real time. If a user wants to always publish endpoints they can also set spec.publishNotReadyAddresses and that would already be supported.

@jukie
Copy link
Contributor Author

jukie commented May 9, 2024

I'd be willing to make a new PR to support consul health checks as well but I'd like to understand the opinion that it wouldn't be a separate feature contribution. The existing behavior reacts to Service Endpoint changes and this extends that to make use of the additional state provided by EndpointSlices.

@ndhanushkodi
Copy link
Contributor

I'd like to understand the opinion that it wouldn't be a separate feature contribution. The existing behavior reacts to Service Endpoint changes and this extends that to make use of the additional state provided by EndpointSlices.

Thanks for your willingness to contribute and so sorry for the late response!

The existing behavior seems to register all endpoints in the endpoint slice, regardless of the condition of that endpoint, so if I created a service where one of the endpoints went into a non-ready state, that would still get registered as a service instance in Consul. This PR would change the behavior to no longer register service instances in Consul for endpoints that are in a non-ready state. My thought with the suggestion above was we should continue to register non-ready endpoints into Consul as we do today, but instead mark the health check unhealthy. Rather than changing the behavior to not register the non-ready endpoints, I'm proposing the health check change as the way to deal with non-ready endpoints.

For the cases that use the function registerServiceInstance, I'm imagining making a change here where we change the health based on the endpoint condition rather than always marking it as passing. For Nodeport services which doesn't use registerServiceInstance, I'm thinking we'd have to manually add a check here and line 679.

Let me know if I'm misunderstanding anything you're saying, or if this makes sense.

If a user wants to always publish endpoints they can also set spec.publishNotReadyAddresses and that would already be supported.

This is for K8s Services right? Whether the endpoint addresses should exist in the K8s Endpoint itself? This change is about whether the K8s Endpoint addresses that exist should get synced to Consul if they are non-ready so I think that feature doesn't quite match the behavior I mentioned in my suggestion.

Again, let me know if I'm misunderstanding what you mean!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
pr/no-backport signals that a PR will not contain a backport label
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Sync Catalog: Utilize EndpointSlice conditions when endpoints are updated
2 participants