services: retry failed Nomad service deregistrations from client #20596

Merged (1 commit) on May 16, 2024

@tgross (Member) commented May 15, 2024

When an allocation is stopped, we deregister its services in the alloc runner's `PreKill` hook. This ensures we delete the service registration and wait for the shutdown delay before shutting down the tasks, so that workloads can drain their connections. However, the call to remove the workload only logs errors and never retries on failure.

Add a short retry loop to the `RemoveWorkload` method for Nomad services. Transient errors then get an extra opportunity to deregister the service before the tasks are stopped, and before we need to fall back on the data integrity improvements implemented in #20590.

Ref: #16616

@tgross tgross force-pushed the services-retry-deregister branch from 30ae71a to 35fa1a0 Compare May 15, 2024 18:06
@tgross tgross added this to the 1.8.0 milestone May 15, 2024
@tgross tgross force-pushed the services-retry-deregister branch from 35fa1a0 to 00288fd Compare May 15, 2024 18:09
@tgross tgross added the backport/1.5.x, backport/1.6.x, and backport/1.7.x labels and removed the backport/1.5.x label May 15, 2024
@tgross tgross marked this pull request as ready for review May 15, 2024 18:21

@jrasell jrasell left a comment

LGTM!
