storage controller: be more tolerant of a pageserver's startup time #7552

Open
jcsp opened this issue Apr 30, 2024 · 1 comment
Labels
c/storage/controller Component: Storage Controller t/bug Issue Type: Bug

Comments


jcsp commented Apr 30, 2024

We mark a node offline after MAX_UNAVAILABLE_INTERVAL_DEFAULT (30 seconds) of failures to respond to heartbeats.

We should be more generous during startup: when a pageserver sends us a re-attach request, we should tip off the heartbeater to extend its tolerance for that node. Currently the pageserver's processing of the re-attach response can be quite time-consuming.

This is similar to the k8s distinction between startup/readiness probes and liveness probes: we should be more tolerant when waiting for readiness during startup than when checking for responsiveness during normal runtime.

(The actual init_tenant_mgr slowness is addressed in #7553, but this ticket still stands: we should be more tolerant during startup than we are during normal operation.)
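
To make the proposal concrete, here is a minimal sketch of a phase-dependent offline threshold. Only MAX_UNAVAILABLE_INTERVAL_DEFAULT and its 30-second value come from this issue; NodePhase, STARTUP_GRACE_INTERVAL (and its 300-second value), unavailable_threshold and should_mark_offline are hypothetical names chosen for illustration, not the storage controller's actual code.

```rust
use std::time::Duration;

/// From the issue: heartbeat failures for this long mark a node offline.
const MAX_UNAVAILABLE_INTERVAL_DEFAULT: Duration = Duration::from_secs(30);

/// Hypothetical: a longer allowance while a pageserver is starting up.
/// The 300-second value is an assumption, not a number from the issue.
const STARTUP_GRACE_INTERVAL: Duration = Duration::from_secs(300);

/// Hypothetical lifecycle phase that the heartbeater would be told about.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum NodePhase {
    /// Normal runtime: apply the strict default.
    Running,
    /// The node has sent a re-attach request and is still initialising.
    StartingUp,
}

/// How long heartbeats may keep failing before the node is marked offline.
fn unavailable_threshold(phase: NodePhase) -> Duration {
    match phase {
        NodePhase::Running => MAX_UNAVAILABLE_INTERVAL_DEFAULT,
        NodePhase::StartingUp => STARTUP_GRACE_INTERVAL,
    }
}

/// Decide whether a node whose heartbeats have been failing for
/// `failing_for` should be marked offline in the given phase.
fn should_mark_offline(phase: NodePhase, failing_for: Duration) -> bool {
    failing_for >= unavailable_threshold(phase)
}
```

With these assumed values, a node that has failed heartbeats for 31 seconds is marked offline in the Running phase but not in the StartingUp phase.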

@jcsp jcsp added t/bug Issue Type: Bug c/storage/controller Component: Storage Controller labels Apr 30, 2024

jcsp commented Apr 30, 2024

@VladLazar let's pick this up as part of the rolling restart work: we should flip the node into its more tolerant mode on:

  • Notification that it is draining (or a separate hook for "I'm about to shut this down"?)
  • Starting to handle re-attach
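
The two triggers above could be modelled as a small state machine. This is a sketch under stated assumptions: HeartbeatMode, NodeEvent and next_mode are hypothetical names, and the rule for returning to strict mode (the first successful heartbeat after re-attach has started) is an assumption, since the issue does not say when the tolerant mode should end.

```rust
/// Hypothetical heartbeat tolerance modes for a single node.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum HeartbeatMode {
    /// Normal runtime: the default unavailability interval applies.
    Strict,
    /// The node announced it is draining; a restart is expected.
    TolerantDraining,
    /// The node has started handling re-attach and may respond slowly.
    TolerantStarting,
}

/// Hypothetical events the heartbeater would be notified about.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum NodeEvent {
    DrainNotified,
    ReAttachStarted,
    HeartbeatSucceeded,
}

/// Compute the next mode from the current mode and an incoming event.
fn next_mode(current: HeartbeatMode, event: NodeEvent) -> HeartbeatMode {
    match (current, event) {
        // Both triggers from the list switch the node to a tolerant mode.
        (_, NodeEvent::DrainNotified) => HeartbeatMode::TolerantDraining,
        (_, NodeEvent::ReAttachStarted) => HeartbeatMode::TolerantStarting,
        // Assumption: a successful heartbeat after re-attach ends the grace period.
        (HeartbeatMode::TolerantStarting, NodeEvent::HeartbeatSucceeded) => HeartbeatMode::Strict,
        // A draining node still answers heartbeats, so success alone changes nothing.
        (other, NodeEvent::HeartbeatSucceeded) => other,
    }
}
```

In this sketch a draining node stays tolerant across successful heartbeats, because it is still expected to go down, and only returns to strict mode once it has restarted, begun re-attach, and answered a heartbeat again.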
