docs: Graceful storage controller cluster restarts RFC #7704
Conversation
3060 tests run: 2933 passed, 0 failed, 127 skipped (full report)

Code coverage* (full report)

* Collected from Rust tests only. This comment gets automatically updated with the latest test results.

bb87ebf at 2024-05-15T10:07:34.895Z
Before accepting a drain request, the following validations are applied:
* Ensure that the node is known to the storage controller
* Ensure that the scheduling policy is `NodeSchedulingPolicy::Active` or `NodeSchedulingPolicy::Pause`
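The two validations above could be sketched as follows. This is illustrative only: `DrainError`, `Node`, and `validate_drain_request` are hypothetical stand-ins, not the storage controller's actual types.

```rust
// Hypothetical sketch of the drain-request validation described above.
// The enum variants follow the RFC; everything else is an assumption.
#[derive(Debug, Clone, Copy, PartialEq)]
enum NodeSchedulingPolicy {
    Active,
    Pause,
    PauseForRestart,
    Draining,
    Filling,
}

#[derive(Debug, PartialEq)]
enum DrainError {
    NodeNotKnown,
    InvalidPolicy(NodeSchedulingPolicy),
}

struct Node {
    id: u64,
    policy: NodeSchedulingPolicy,
}

fn validate_drain_request(nodes: &[Node], node_id: u64) -> Result<(), DrainError> {
    // Ensure the node is known to the storage controller.
    let node = nodes
        .iter()
        .find(|n| n.id == node_id)
        .ok_or(DrainError::NodeNotKnown)?;

    // Ensure the scheduling policy is Active or Pause.
    match node.policy {
        NodeSchedulingPolicy::Active | NodeSchedulingPolicy::Pause => Ok(()),
        other => Err(DrainError::InvalidPolicy(other)),
    }
}
```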
Hmm, wouldn't we also accept the request (idempotency) if the state was Draining?
The endpoint is not meant to be idempotent since the polling happens on the node status endpoint. We can accept the request in Draining state, but we will refuse it further down the stack when we realise a drain is already in progress (next bullet point).
#### Storage Controller Crash
When the storage controller starts up, reset the node scheduling policy
Can you expand a bit on why this is the right behavior: if the node just reached PauseForRestart and ansible is about to restart a pageserver, why is it okay for the controller to revert the state to Active?
Same for below sections: e.g. it's not obvious why we should revert a node to Active state rather than noticing it is Draining and starting the drain task again. Maybe this is intentional to get simplicity (and rely on an external poller to re-enter the draining state), if so let's make that explicit
> Maybe this is intentional to get simplicity (and rely on an external poller to re-enter the draining state), if so let's make that explicit
Right. It keeps things simpler.
> e.g. it's not obvious why we should revert a node to Active state rather than noticing it is Draining and starting the drain task again
At that point the storage controller doesn't know if it's still meant to drain the node. We could assume that drains are always cancelled nicely from ansible, but I'd rather not.
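The simple "revert everything to Active on startup" behavior discussed here could be sketched as below. The function and field names are assumptions for illustration; the policy variants follow the RFC.

```rust
// Hypothetical sketch of the startup reset under discussion: any node
// caught mid-drain, mid-fill, or parked in PauseForRestart is reverted
// to Active. An external poller is expected to re-issue the drain if
// it is still wanted.
#[derive(Debug, Clone, Copy, PartialEq)]
enum NodeSchedulingPolicy {
    Active,
    Pause,
    PauseForRestart,
    Draining,
    Filling,
}

struct Node {
    policy: NodeSchedulingPolicy,
}

fn reset_policies_on_startup(nodes: &mut [Node]) {
    for node in nodes.iter_mut() {
        match node.policy {
            NodeSchedulingPolicy::Draining
            | NodeSchedulingPolicy::Filling
            | NodeSchedulingPolicy::PauseForRestart => {
                node.policy = NodeSchedulingPolicy::Active;
            }
            // Active and operator-set Pause are left untouched.
            _ => {}
        }
    }
}
```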
The other problem is, what if a pageserver was in Filling state and then after a restart it gets turned into Active, the intended success state. The fill operation wouldn't be retried and therefore we'd lose attachment of tenants here, or am I wrong?
> The fill operation wouldn't be retried and therefore we'd lose attachment of tenants here, or am I wrong?
Correct. However the restart can happen at any time. It might be that our deployment script has moved on and filling that node is no longer the correct thing to do. The storage controller implements background optimizations which will move attached tenants to achieve a reasonable distribution of the cluster.
Also, just to clarify, we don't lose the attachment. That tenant will still be attached "somewhere" and the node that we failed to fill should be a warm secondary (which facilitates the optimization mentioned earlier).
in the `Filling` scheduling policy. This hinders the scheduler, but is
otherwise harmless. A human operator can handle this by setting the scheduling
policy to `Active`, or we can bake a fill timeout into the storage controller.
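A fill timeout of this kind could be sketched as below; the timeout value, struct, and method names are hypothetical, not the storage controller's actual configuration.

```rust
use std::time::{Duration, Instant};

// Illustrative fill-timeout check. The 10-minute value is an
// assumption chosen for the sketch.
const FILL_TIMEOUT: Duration = Duration::from_secs(10 * 60);

struct FillOperation {
    started_at: Instant,
}

impl FillOperation {
    /// Returns true when the fill has run longer than the timeout,
    /// meaning the node should be returned to the Active policy.
    fn timed_out(&self, now: Instant) -> bool {
        now.duration_since(self.started_at) >= FILL_TIMEOUT
    }
}
```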
One more failure case: what happens when another pageserver fails while we're Draining? Should we proceed with draining, or should we prioritize recovering from the other pageserver's failure even if that means giving up on the drain?
Interesting case. Thanks for pointing it out. Ideally we would like to do both. In the current state of the POC that's what we'd do. Since the ongoing reconcile limit on the drains is quite strict, it shouldn't hinder the failover much.
I think this type of behaviour would be fine in the short term, but we can also hook into the heartbeats and cancel ongoing background operations. I'll include these thoughts in the RFC.
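Hooking heartbeats into cancellation could be as simple as a shared flag that the drain loop checks between reconciles. A minimal sketch, assuming a hypothetical `CancelToken` type (the real implementation would likely use an async cancellation primitive):

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;

// Sketch: the heartbeat handler cancels an in-flight background
// operation when another pageserver fails, so failover work can take
// priority. All names here are illustrative assumptions.
#[derive(Clone)]
struct CancelToken(Arc<AtomicBool>);

impl CancelToken {
    fn new() -> Self {
        CancelToken(Arc::new(AtomicBool::new(false)))
    }

    // Called from the heartbeat handler on detecting a node failure.
    fn cancel(&self) {
        self.0.store(true, Ordering::SeqCst);
    }

    fn is_cancelled(&self) -> bool {
        self.0.load(Ordering::SeqCst)
    }
}

/// The drain loop checks the token between reconciles and bails out
/// early. Returns false when the drain should stop.
fn drain_step(token: &CancelToken) -> bool {
    !token.is_cancelled()
}
```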
Great and thoughtful RFC.
It might also make sense to mention that draining is a no-op for tenants that lack secondaries, i.e. non-HA ones. It makes sense to keep this as a TODO in the implementation but ideally the RFC would mention it as well.
Also, I wonder what to do if the secondary of a to-be-drained tenant is on an unresponsive pageserver. Do we want to drain that secondary? Maybe the answer is still a yes, generations will take care of it, but we should be robust wrt that in the implementation, to not stall the draining process indefinitely while waiting for an okay from a pageserver that is unresponsive.
> The other problem is, what if a pageserver was in Filling state and then after a restart it gets turned into Active, the intended success state. The fill operation wouldn't be retried and therefore we'd lose attachment of tenants here, or am I wrong?
Right, in theory we can cater to non-HA tenants as well by changing their attached location. Depending on tenant size,
This is basically the same scenario as previously. If the node we are moving to is unresponsive, the reconciliation will fail, leaving the tenant on the original node. It's a good point though. I think it's a good idea to make sure the node is online.
Yeah, my point was mainly about that: there is a difference between doing retries indefinitely, and doing retries but giving up at a point. We need to implement the latter because of the scenario I mentioned above.
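The "retry, but give up at a point" behavior could be sketched as a bounded retry loop. The function name, attempt count, and error type are assumptions for illustration:

```rust
// Sketch of bounded retries: attempt the reconcile up to
// `max_attempts` times, then give up and surface the last error
// rather than stalling the drain/fill indefinitely on an
// unresponsive pageserver. Assumes max_attempts >= 1.
fn reconcile_with_retries<F>(mut attempt: F, max_attempts: u32) -> Result<(), String>
where
    F: FnMut() -> Result<(), String>,
{
    let mut last_err = String::new();
    for _ in 0..max_attempts {
        match attempt() {
            Ok(()) => return Ok(()),
            Err(e) => last_err = e,
        }
    }
    // Give up: leave the tenant attached where it is.
    Err(last_err)
}
```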
RFC for "Graceful Restarts of Storage Controller Managed Clusters".
POC which implements nearly everything mentioned here apart from the optimizations
and some of the failure handling: #7682
Related #7387