
docs: Graceful storage controller cluster restarts RFC #7704

Open
wants to merge 5 commits into main

Conversation

VladLazar
Contributor

@VladLazar VladLazar commented May 10, 2024

RFC for "Graceful Restarts of Storage Controller Managed Clusters".

POC which implements nearly everything mentioned here apart from the optimizations
and some of the failure handling: #7682

Related #7387


github-actions bot commented May 10, 2024

3060 tests run: 2933 passed, 0 failed, 127 skipped (full report)


Flaky tests (3)

Postgres 16

  • test_download_remote_layers_api: release
  • test_vm_bit_clear_on_heap_lock: debug

Postgres 14

  • test_statvfs_pressure_usage: debug

Code coverage* (full report)

  • functions: 31.4% (6338 of 20191 functions)
  • lines: 47.3% (47948 of 101267 lines)

* collected from Rust tests only


The comment gets automatically updated with the latest test results
bb87ebf at 2024-05-15T10:07:34.895Z :recycle:

@VladLazar VladLazar requested review from jcsp and arpad-m May 13, 2024 09:28
@VladLazar VladLazar marked this pull request as ready for review May 13, 2024 09:29

Before accepting a drain request, the following validations are applied:
* Ensure that the node is known to the storage controller
* Ensure that the scheduling policy is `NodeSchedulingPolicy::Active` or `NodeSchedulingPolicy::Pause`
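For illustration, a minimal Rust sketch of the validation described above; the type and function names are hypothetical and not taken from the storage controller code:

```rust
use std::collections::HashMap;

#[derive(Clone, Copy, PartialEq, Debug)]
enum NodeSchedulingPolicy {
    Active,
    Pause,
    PauseForRestart,
    Draining,
    Filling,
}

#[derive(Debug)]
enum DrainError {
    NodeNotFound(u64),
    InvalidPolicy(NodeSchedulingPolicy),
}

/// Validate a drain request: the node must be known to the storage
/// controller and its scheduling policy must be Active or Pause.
fn validate_drain_request(
    nodes: &HashMap<u64, NodeSchedulingPolicy>,
    node_id: u64,
) -> Result<(), DrainError> {
    let policy = nodes
        .get(&node_id)
        .ok_or(DrainError::NodeNotFound(node_id))?;
    match policy {
        NodeSchedulingPolicy::Active | NodeSchedulingPolicy::Pause => Ok(()),
        other => Err(DrainError::InvalidPolicy(*other)),
    }
}

fn main() {
    let mut nodes = HashMap::new();
    nodes.insert(1, NodeSchedulingPolicy::Active);
    assert!(validate_drain_request(&nodes, 1).is_ok());
    assert!(validate_drain_request(&nodes, 2).is_err()); // unknown node is rejected
}
```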
Contributor

Hmm, wouldn't we also accept the request (idempotency) if the state was Draining?

Contributor Author

The endpoint is not meant to be idempotent since the polling happens on the node status endpoint. We can accept the request in Draining state, but we will refuse it further down the stack when we realise a drain is already in progress (next bullet point).
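A small sketch of that behaviour, with hypothetical names: the endpoint accepts the request after validation, and a lower layer refuses to register a second drain while one is already in flight:

```rust
use std::sync::Mutex;

#[derive(Debug, PartialEq)]
enum OperationError {
    AlreadyInProgress,
}

#[derive(Default)]
struct BackgroundOperations {
    // Node id of an in-flight drain, if any.
    ongoing_drain: Mutex<Option<u64>>,
}

impl BackgroundOperations {
    /// Called after request validation: registers the drain unless one
    /// is already running, in which case it is refused at this layer.
    fn start_drain(&self, node_id: u64) -> Result<(), OperationError> {
        let mut guard = self.ongoing_drain.lock().unwrap();
        if guard.is_some() {
            return Err(OperationError::AlreadyInProgress);
        }
        *guard = Some(node_id);
        Ok(())
    }
}

fn main() {
    let ops = BackgroundOperations::default();
    assert!(ops.start_drain(1).is_ok());
    assert_eq!(ops.start_drain(1), Err(OperationError::AlreadyInProgress));
}
```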


#### Storage Controller Crash

When the storage controller starts up, reset the node scheduling policy
Contributor

Can you expand a bit on why this is the right behavior: if the node just reached PauseForRestart and ansible is about to restart a pageserver, why is it okay for the controller to revert the state to Active?

Same for the sections below: e.g. it's not obvious why we should revert a node to Active state rather than noticing it is Draining and starting the drain task again. Maybe this is intentional to get simplicity (and rely on an external poller to re-enter the draining state), if so let's make that explicit

Contributor Author

Maybe this is intentional to get simplicity (and rely on an external poller to re-enter the draining state), if so let's make that explicit

Right. It keeps things simpler.

e.g. it's not obvious why we should revert a node to Active state rather than noticing it is Draining and starting the drain task again

At that point the storage controller doesn't know if it's still meant to drain the node. We could assume that drains are always cancelled nicely from ansible, but I'd rather not.
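A hedged sketch of the startup behaviour being discussed, with assumed names and an assumed rule for policies other than Draining/Filling/PauseForRestart:

```rust
#[derive(Clone, Copy, PartialEq, Debug)]
enum NodeSchedulingPolicy {
    Active,
    Pause,
    PauseForRestart,
    Draining,
    Filling,
}

/// On startup, policies left over from an interrupted drain or fill are
/// reverted to Active; the controller no longer knows whether the
/// operation is still wanted, so an external poller must re-request it.
fn reset_on_startup(policy: NodeSchedulingPolicy) -> NodeSchedulingPolicy {
    use NodeSchedulingPolicy::*;
    match policy {
        Draining | Filling | PauseForRestart => Active,
        // Other policies (e.g. an operator-requested Pause) are assumed
        // here to be left untouched.
        other => other,
    }
}

fn main() {
    assert_eq!(reset_on_startup(NodeSchedulingPolicy::Draining), NodeSchedulingPolicy::Active);
    assert_eq!(reset_on_startup(NodeSchedulingPolicy::Pause), NodeSchedulingPolicy::Pause);
}
```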

Member

The other problem is: what if a pageserver was in the Filling state and then, after a restart, it gets turned into Active, the intended success state? The fill operation wouldn't be retried and therefore we'd lose attachment of tenants here, or am I wrong?

Contributor Author

the fill operation wouldn't be retried and therefore we'd lose attachment of tenants here, or am I wrong?

Correct. However, the restart can happen at any time. It might be that our deployment script has moved on and filling that node is no longer the correct thing to do. The storage controller implements background optimizations which will move attached tenants to achieve a reasonable distribution across the cluster.

Also, just to clarify, we don't lose the attachment. That tenant will still be attached "somewhere" and the node that we failed to fill should be a warm secondary (which facilitates the optimization mentioned earlier).

in the `Filling` scheduling policy. This hinders the scheduler, but is
otherwise harmless. A human operator can handle this by setting the scheduling
policy to `Active`, or we can bake a fill timeout into the storage controller.
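The fill timeout could look roughly like the following sketch; the timeout value and names are assumptions, not part of the RFC:

```rust
use std::time::{Duration, Instant};

// Assumed value for illustration only.
const FILL_TIMEOUT: Duration = Duration::from_secs(10 * 60);

struct FillOperation {
    started_at: Instant,
}

impl FillOperation {
    /// If the fill has run past the deadline, give up and return the
    /// node to Active instead of leaving it stuck in Filling.
    fn timed_out(&self, now: Instant) -> bool {
        now.duration_since(self.started_at) >= FILL_TIMEOUT
    }
}

fn main() {
    let fill = FillOperation { started_at: Instant::now() };
    // A freshly started fill has not timed out yet.
    assert!(!fill.timed_out(Instant::now()));
}
```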

Contributor

One more failure case: what happens when another pageserver fails while we're Draining? Should we proceed with draining, or should we prioritize recovering from the other pageserver's failure even if that means giving up on the drain?

Contributor Author

Interesting case. Thanks for pointing it out. Ideally we would like to do both. In the current state of the POC, that's what we'd do. Since the ongoing reconcile limit on the drains is quite strict, it shouldn't hinder the failover much.
I think this type of behaviour would be fine in the short term, but we can also hook into the heartbeats and cancel ongoing background operations. I'll include these thoughts in the RFC.
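A rough sketch of that idea: the heartbeat handler sets a cancellation flag that the drain/fill loop checks between reconcile batches. Names and structure are hypothetical:

```rust
use std::sync::atomic::{AtomicBool, Ordering};

/// In a real controller this flag would be shared (e.g. behind an Arc)
/// between the heartbeat task and the drain/fill task.
#[derive(Default)]
struct BackgroundOp {
    cancelled: AtomicBool,
}

impl BackgroundOp {
    /// Called from the heartbeat handler when some other pageserver
    /// becomes unavailable.
    fn cancel(&self) {
        self.cancelled.store(true, Ordering::Relaxed);
    }

    /// Checked by the drain/fill loop before scheduling the next batch
    /// of reconciles.
    fn is_cancelled(&self) -> bool {
        self.cancelled.load(Ordering::Relaxed)
    }
}

fn main() {
    let op = BackgroundOp::default();
    assert!(!op.is_cancelled());
    op.cancel(); // heartbeats noticed an unavailable pageserver
    assert!(op.is_cancelled());
}
```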

@VladLazar VladLazar requested a review from jcsp May 13, 2024 17:48
Member

@arpad-m arpad-m left a comment

Great and thoughtful RFC.

It might also make sense to mention that draining is a no-op for tenants that lack secondaries, i.e. non-HA ones. It makes sense to keep this as a TODO in the implementation but ideally the RFC would mention it as well.

Also, I wonder what to do if the secondary of a to-be-drained tenant is on an unresponsive pageserver. Do we want to drain that secondary? Maybe the answer is still a yes, generations will take care of it, but we should be robust wrt that in the implementation, to not stall the draining process indefinitely while waiting for an okay from a pageserver that is unresponsive.



@VladLazar
Contributor Author

It might also make sense to mention that draining is a no-op for tenants that lack secondaries, i.e. non-HA ones. It makes sense to keep this as a TODO in the implementation but ideally the RFC would mention it as well.

Right, in theory we can cater to non-HA tenants as well by changing their attached location. Depending on tenant size, this might be more disruptive than the restart, since the pageserver we've moved to will need to on-demand download the entire working set for the tenant. I can add this as a non-goal to the RFC.

Also, I wonder what to do if the secondary of a to-be-drained tenant is on an unresponsive pageserver. Do we want to drain that secondary? Maybe the answer is still a yes, generations will take care of it, but we should be robust wrt that in the implementation, to not stall the draining process indefinitely while waiting for an okay from a pageserver that is unresponsive.

This is basically the same scenario as before. If the node we are moving to is unresponsive, the reconciliation will fail, leaving the tenant on the original node. It's a good point though. I think it's a good idea to make sure the node is online before explicitly setting the attached location. Saves us a reconcile.
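A small sketch combining both points: shards without a secondary are skipped (non-HA drains are a no-op), and a secondary on an offline node is also skipped so the drain does not stall on it. Types and the selection rule are assumptions for illustration:

```rust
use std::collections::HashSet;

struct TenantShard {
    attached: u64,
    secondary: Option<u64>,
}

/// Returns the node to move the attachment to, or None if this shard
/// should be skipped by the drain.
fn drain_destination(shard: &TenantShard, online_nodes: &HashSet<u64>) -> Option<u64> {
    let secondary = shard.secondary?; // non-HA tenant: nothing to do
    if !online_nodes.contains(&secondary) {
        return None; // secondary's pageserver is unresponsive; skip rather than stall
    }
    Some(secondary)
}

fn main() {
    let online: HashSet<u64> = [2u64].into_iter().collect();
    let ha = TenantShard { attached: 1, secondary: Some(2) };
    let non_ha = TenantShard { attached: 1, secondary: None };
    assert_eq!(drain_destination(&ha, &online), Some(2));
    assert_eq!(drain_destination(&non_ha, &online), None);
}
```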

@arpad-m
Member

arpad-m commented May 15, 2024

If the node we are moving to is unresponsive, the reconciliation will fail, leaving the tenant on the original node.

Yeah, my point was mainly about that: there is a difference between doing retries indefinitely and doing retries but giving up at some point. We need to implement the latter because of the scenario I mentioned above.
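A minimal sketch of bounded retries (give up after a fixed number of attempts rather than retrying indefinitely); the helper name and retry count are hypothetical:

```rust
fn with_retries<T, E>(max_attempts: u32, mut op: impl FnMut() -> Result<T, E>) -> Result<T, E> {
    let mut last_err = None;
    for _ in 0..max_attempts {
        match op() {
            Ok(v) => return Ok(v),
            Err(e) => last_err = Some(e),
        }
    }
    // Give up and surface the last error so the drain can move on.
    Err(last_err.expect("max_attempts must be > 0"))
}

fn main() {
    let mut calls = 0;
    let result: Result<(), &str> = with_retries(3, || {
        calls += 1;
        Err("pageserver unresponsive")
    });
    assert!(result.is_err());
    assert_eq!(calls, 3); // stopped retrying after the configured limit
}
```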
