docs: Graceful storage controller cluster restarts RFC #7704
Conversation
3060 tests run: 2933 passed, 0 failed, 127 skipped (full report)

Code coverage* (full report)

* Collected from Rust tests only. This comment gets automatically updated with the latest test results.

bb87ebf at 2024-05-15T10:07:34.895Z
Before accepting a drain request, the following validations are applied:
* Ensure that the node is known to the storage controller
* Ensure that the scheduling policy is `NodeSchedulingPolicy::Active` or `NodeSchedulingPolicy::Pause`
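The two validations above could be sketched as follows. This is illustrative only: `DrainError`, `Node`, and `validate_drain_request` are hypothetical stand-ins, not the storage controller's actual types.

```rust
// Hypothetical sketch of the drain-request validation described above.
// The enum variants follow the RFC; everything else is an assumption.
#[derive(Debug, Clone, Copy, PartialEq)]
enum NodeSchedulingPolicy {
    Active,
    Pause,
    PauseForRestart,
    Draining,
    Filling,
}

#[derive(Debug, PartialEq)]
enum DrainError {
    NodeNotKnown,
    InvalidPolicy(NodeSchedulingPolicy),
}

struct Node {
    id: u64,
    policy: NodeSchedulingPolicy,
}

fn validate_drain_request(nodes: &[Node], node_id: u64) -> Result<(), DrainError> {
    // Ensure the node is known to the storage controller.
    let node = nodes
        .iter()
        .find(|n| n.id == node_id)
        .ok_or(DrainError::NodeNotKnown)?;

    // Ensure the scheduling policy is Active or Pause.
    match node.policy {
        NodeSchedulingPolicy::Active | NodeSchedulingPolicy::Pause => Ok(()),
        other => Err(DrainError::InvalidPolicy(other)),
    }
}
```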
Hmm, wouldn't we also accept the request (idempotency) if the state was Draining?
The endpoint is not meant to be idempotent since the polling happens on the node status endpoint. We can accept the request in Draining state, but we will refuse it further down the stack when we realise a drain is already in progress (next bullet point).
#### Storage Controller Crash
When the storage controller starts up, reset the node scheduling policy
Can you expand a bit on why this is the right behavior: if the node just reached PauseForRestart and ansible is about to restart a pageserver, why is it okay for the controller to revert the state to Active?
Same for below sections: e.g. it's not obvious why we should revert a node to Active state rather than noticing it is Draining and starting the drain task again. Maybe this is intentional to get simplicity (and rely on an external poller to re-enter the draining state), if so let's make that explicit
> Maybe this is intentional to get simplicity (and rely on an external poller to re-enter the draining state), if so let's make that explicit
Right. It keeps things simpler.
> e.g. it's not obvious why we should revert a node to Active state rather than noticing it is Draining and starting the drain task again
At that point the storage controller doesn't know if it's still meant to drain the node. We could assume that drains are always cancelled nicely from ansible, but I'd rather not.
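The simple "revert everything to Active on startup" behavior discussed here could be sketched as below. The function and field names are assumptions for illustration; the policy variants follow the RFC.

```rust
// Hypothetical sketch of the startup reset under discussion: any node
// caught mid-drain, mid-fill, or parked in PauseForRestart is reverted
// to Active. An external poller is expected to re-issue the drain if
// it is still wanted.
#[derive(Debug, Clone, Copy, PartialEq)]
enum NodeSchedulingPolicy {
    Active,
    Pause,
    PauseForRestart,
    Draining,
    Filling,
}

struct Node {
    policy: NodeSchedulingPolicy,
}

fn reset_policies_on_startup(nodes: &mut [Node]) {
    for node in nodes.iter_mut() {
        match node.policy {
            NodeSchedulingPolicy::Draining
            | NodeSchedulingPolicy::Filling
            | NodeSchedulingPolicy::PauseForRestart => {
                node.policy = NodeSchedulingPolicy::Active;
            }
            // Active and operator-set Pause are left untouched.
            _ => {}
        }
    }
}
```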
The other problem is, what if a pageserver was in Filling state and then after a restart it gets turned into Active, the intended success state. The fill operation wouldn't be retried and therefore we'd lose attachment of tenants here, or am I wrong?
> The fill operation wouldn't be retried and therefore we'd lose attachment of tenants here, or am I wrong?
Correct. However the restart can happen at any time. It might be that our deployment script has moved on and filling that node is no longer the correct thing to do. The storage controller implements background optimizations which will move attached tenants to achieve a reasonable distribution of the cluster.
Also, just to clarify, we don't lose the attachment. That tenant will still be attached "somewhere" and the node that we failed to fill should be a warm secondary (which facilitates the optimization mentioned earlier).
in the `Filling` scheduling policy. This hinders the scheduler, but is
otherwise harmless. A human operator can handle this by setting the scheduling
policy to `Active`, or we can bake a fill timeout into the storage controller.
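A fill timeout of this kind could be sketched as below; the timeout value, struct, and method names are hypothetical, not the storage controller's actual configuration.

```rust
use std::time::{Duration, Instant};

// Illustrative fill-timeout check. The 10-minute value is an
// assumption chosen for the sketch.
const FILL_TIMEOUT: Duration = Duration::from_secs(10 * 60);

struct FillOperation {
    started_at: Instant,
}

impl FillOperation {
    /// Returns true when the fill has run longer than the timeout,
    /// meaning the node should be returned to the Active policy.
    fn timed_out(&self, now: Instant) -> bool {
        now.duration_since(self.started_at) >= FILL_TIMEOUT
    }
}
```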
One more failure case: what happens when another pageserver fails while we're Draining? Should we proceed with draining, or should we prioritize recovering from the other pageserver's failure even if that means giving up on the drain?
Interesting case. Thanks for pointing it out. Ideally we would like to do both. In the current state of the POC that's what we'd do. Since the ongoing reconcile limit on the drains is quite strict, it shouldn't hinder the failover much.
I think this type of behaviour would be fine in the short term, but we can also hook into the heartbeats and cancel ongoing background operations. I'll include these thoughts in the RFC.
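Hooking heartbeats into cancellation could be as simple as a shared flag that the drain loop checks between reconciles. A minimal sketch, assuming a hypothetical `CancelToken` type (the real implementation would likely use an async cancellation primitive):

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;

// Sketch: the heartbeat handler cancels an in-flight background
// operation when another pageserver fails, so failover work can take
// priority. All names here are illustrative assumptions.
#[derive(Clone)]
struct CancelToken(Arc<AtomicBool>);

impl CancelToken {
    fn new() -> Self {
        CancelToken(Arc::new(AtomicBool::new(false)))
    }

    // Called from the heartbeat handler on detecting a node failure.
    fn cancel(&self) {
        self.0.store(true, Ordering::SeqCst);
    }

    fn is_cancelled(&self) -> bool {
        self.0.load(Ordering::SeqCst)
    }
}

/// The drain loop checks the token between reconciles and bails out
/// early. Returns false when the drain should stop.
fn drain_step(token: &CancelToken) -> bool {
    !token.is_cancelled()
}
```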
Great and thoughtful RFC.
It might also make sense to mention that draining is a no-op for tenants that lack secondaries, i.e. non-HA ones. It makes sense to keep this as a TODO in the implementation but ideally the RFC would mention it as well.
Also, I wonder what to do if the secondary of a to-be-drained tenant is on an unresponsive pageserver. Do we want to drain that secondary? Maybe the answer is still a yes, generations will take care of it, but we should be robust wrt that in the implementation, to not stall the draining process indefinitely while waiting for an okay from a pageserver that is unresponsive.
> The other problem is, what if a pageserver was in Filling state and then after a restart it gets turned into Active, the intended success state. The fill operation wouldn't be retried and therefore we'd lose attachment of tenants here, or am I wrong?
Right, in theory we can cater to non-HA tenants as well by changing their attached location. Depending on tenant size,
This is basically the same scenario as previously. If the node we are moving to is unresponsive, the reconciliation will fail, leaving the tenant on the original node. It's a good point though. I think it's a good idea to make sure the node is online.
Yeah, my point was mainly about that: there is a difference between doing retries indefinitely, and doing retries but giving up at a point. We need to implement the latter because of the scenario I mentioned above.
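The "retry, but give up at a point" behavior could be sketched as a bounded retry loop. The function name, attempt count, and error type are assumptions for illustration:

```rust
// Sketch of bounded retries: attempt the reconcile up to
// `max_attempts` times, then give up and surface the last error
// rather than stalling the drain/fill indefinitely on an
// unresponsive pageserver. Assumes max_attempts >= 1.
fn reconcile_with_retries<F>(mut attempt: F, max_attempts: u32) -> Result<(), String>
where
    F: FnMut() -> Result<(), String>,
{
    let mut last_err = String::new();
    for _ in 0..max_attempts {
        match attempt() {
            Ok(()) => return Ok(()),
            Err(e) => last_err = e,
        }
    }
    // Give up: leave the tenant attached where it is.
    Err(last_err)
}
```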
RFC for "Graceful Restarts of Storage Controller Managed Clusters".
POC which implements nearly everything mentioned here apart from the optimizations
and some of the failure handling: #7682
Related #7387