receive + receive controller: Eliminate downtime when scaling up/down hashring replicas. #69
Thanks @bwplotka. What we ended up doing was something a little cruder (since this was just a test environment). We stopped all receivers, `rm -rf`'d the receive PVs (yes, we lost 2 hours of data that had not yet been persisted to object storage), and then restarted the receivers with a higher replica count. It seemed to work more efficiently (less memory + CPU in aggregate) given the same workload.
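For reference, a minimal sketch of that crude procedure on Kubernetes. The StatefulSet name, namespace, and PVC label below are assumptions, not taken from this thread; adjust to your deployment:

```sh
# Assumption: receivers run as a StatefulSet named "thanos-receive" in
# namespace "thanos"; the PVC label selector is illustrative only.

# 1. Stop all receivers.
kubectl -n thanos scale statefulset thanos-receive --replicas=0

# 2. Delete the receive PVCs (any data not yet uploaded to object storage is lost).
kubectl -n thanos delete pvc -l app=thanos-receive

# 3. Restart with a higher replica count.
kubectl -n thanos scale statefulset thanos-receive --replicas=5
```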
#70 might help :)
We hit the same kind of issue when terminating a k8s node that was hosting replicas, which ultimately lost the quorum. It took approximately 30 minutes for the quorum to be restored (no manual action). Logs
I see 2 things here:
(I'm missing knowledge of how it works internally)
If quorum is lost, does the Receiver stop ingesting samples altogether? Is there a metric which can be used to fire an alert when quorum is lost? I am struggling to understand best practices around scaling the hash ring. If …
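As far as I know there is no single "quorum lost" metric; one sketch is to alert on the receive forwarding error ratio instead. The metric name below follows what the thanos-mixin uses, but treat it as an assumption and verify it against your Thanos version:

```sh
# Sketch: query the forward-request error ratio via the Prometheus HTTP API.
# "thanos_receive_forward_requests_total" and its "result" label are assumed
# from the thanos-mixin; the Prometheus URL is a placeholder.
curl -sG 'http://prometheus:9090/api/v1/query' \
  --data-urlencode 'query=
    sum(rate(thanos_receive_forward_requests_total{result="error"}[5m]))
    /
    sum(rate(thanos_receive_forward_requests_total[5m]))'
```

A sustained ratio near the replication-factor threshold would roughly correspond to write quorum being at risk.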
We hit cases where, on introducing more replicas, the Thanos receive controller updates the hashring, which makes the receive ring unstable because one node is expected but still down. We need to find a way to improve this state; it's quite fragile at the moment.
Mitigation: turn off the thanos-receive-controller, increase the replicas, then turn the controller back on (sketched below).
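A rough sketch of that mitigation with kubectl. The controller and receiver resource names are assumptions based on the typical thanos-receive-controller setup:

```sh
# Assumption: the controller runs as Deployment "thanos-receive-controller"
# and receivers as StatefulSet "thanos-receive" in namespace "thanos".

# 1. Stop the controller so it does not rewrite the hashring mid-scale.
kubectl -n thanos scale deployment thanos-receive-controller --replicas=0

# 2. Scale the receivers to the desired count.
kubectl -n thanos scale statefulset thanos-receive --replicas=5

# 3. Once all receiver pods are ready, turn the controller back on.
kubectl -n thanos rollout status statefulset thanos-receive
kubectl -n thanos scale deployment thanos-receive-controller --replicas=1
```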