
receive + receive controller: Eliminate downtime when scaling up/down hashring replicas. #69

Open
bwplotka opened this issue Mar 18, 2021 · 4 comments
Labels
bug Something isn't working

Comments

@bwplotka
Member

We hit cases where, when introducing more replicas, the Thanos controller updates the hashring, which makes the receive ring unstable because one node is expected in the ring but still down. We need to find a way to improve this state; it's quite fragile at the moment.

Mitigation: Turn off thanos-receive-controller and increase replicas, then turn the controller back on.
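
For reference, a rough kubectl sketch of that mitigation, assuming the controller runs as a Deployment named thanos-receive-controller and the receivers as a StatefulSet named thanos-receive (names and replica counts will differ per setup):

  # pause the controller so it stops regenerating the hashring ConfigMap
  kubectl scale deployment thanos-receive-controller --replicas=0
  # add receive replicas and wait for the new pods to become Ready
  kubectl scale statefulset thanos-receive --replicas=5
  kubectl rollout status statefulset thanos-receive
  # re-enable the controller so it rebuilds the hashring with the new, now-ready endpoints
  kubectl scale deployment thanos-receive-controller --replicas=1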

@bwplotka bwplotka added the bug Something isn't working label Mar 18, 2021
@bjoydeep

Thanks @bwplotka. What we ended up doing was something a little cruder (since this was just a test env). We stopped all receivers, rm -rf'ed the recv PVs (yes, we lost 2 hours of data that was not yet persisted in the object store) and then restarted the receivers with a higher replica count (a rough kubectl equivalent is sketched below). It seemed to work more efficiently (less memory + CPU in aggregate) given the same workload.
Will try your suggestion and see how it works. But being able to increase the number of replicas dynamically on the fly is a real need, of course.
BTW @bwplotka, do we have any recommendations on running an odd vs. even number of replicas?
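
The kubectl equivalent of that cruder path might look roughly like this; the StatefulSet name and label selector are made up for illustration, and deleting the claims is exactly the step that drops any TSDB data not yet uploaded to object storage:

  # stop all receivers
  kubectl scale statefulset thanos-receive --replicas=0
  # delete the receiver volume claims -- this is the data-loss step
  kubectl delete pvc -l app=thanos-receive
  # restart with the higher replica count; the StatefulSet re-creates fresh PVCs from its volumeClaimTemplates
  kubectl scale statefulset thanos-receive --replicas=6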

@spaparaju

#70 might help :)

@r0mdau
Contributor

r0mdau commented Feb 2, 2022

We hit the same kind of issue when a k8s node hosting replicas is terminated, which eventually loses quorum.
We use a "Chaosmonkey" script that randomly terminates 1 EC2 instance per day in our EKS cluster.

It took approximately 30 minutes for the quorum to be restored (no manual actions).
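
For context on the quorum here (as far as I can tell, so treat this as an assumption rather than an authoritative description): a remote-write request is acknowledged once a majority of its replicated copies succeed, roughly

  quorum = floor(replication_factor / 2) + 1

so with a replication factor of 3 (which the "2 errors ... quorum not reached" messages below suggest), 2 of the 3 target receivers must accept each write; once 2 of the replicated sub-requests fail, the whole write is rejected.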

Logs

level=error ts=2022-01-03T15:48:28.468568897Z caller=handler.go:366 component=receive component=receive-handler err="2 errors: replicate write request for endpoint thanos-receive-default-receivers-17.thanos-receive-default-receivers.receivers.svc.cluster.local:10901: quorum not reached: forwarding request to endpoint thanos-receive-default-receivers-17.thanos-receive-default-receivers.receivers.svc.cluster.local:10901: rpc error: code = AlreadyExists desc = store locally for endpoint thanos-receive-default-receivers-17.thanos-receive-default-receivers.receivers.svc.cluster.local:10901: conflict; replicate write request for endpoint thanos-receive-default-receivers-16.thanos-receive-default-receivers.receivers.svc.cluster.local:10901: quorum not reached: forwarding request to endpoint thanos-receive-default-receivers-17.thanos-receive-default-receivers.receivers.svc.cluster.local:10901: rpc error: code = AlreadyExists desc = store locally for endpoint thanos-receive-default-receivers-17.thanos-receive-default-receivers.receivers.svc.cluster.local:10901: conflict" msg="internal server error"
 
level=error ts=2022-01-03T16:00:15.101622584Z caller=handler.go:366 component=receive component=receive-handler err="2 errors: replicate write request for endpoint thanos-receive-default-receivers-16.thanos-receive-default-receivers.receivers.svc.cluster.local:10901: quorum not reached: forwarding request to endpoint thanos-receive-default-receivers-17.thanos-receive-default-receivers.receivers.svc.cluster.local:10901: rpc error: code = AlreadyExists desc = store locally for endpoint thanos-receive-default-receivers-17.thanos-receive-default-receivers.receivers.svc.cluster.local:10901: conflict; replicate write request for endpoint thanos-receive-default-receivers-17.thanos-receive-default-receivers.receivers.svc.cluster.local:10901: quorum not reached: forwarding request to endpoint thanos-receive-default-receivers-17.thanos-receive-default-receivers.receivers.svc.cluster.local:10901: rpc error: code = AlreadyExists desc = store locally for endpoint thanos-receive-default-receivers-17.thanos-receive-default-receivers.receivers.svc.cluster.local:10901: conflict" msg="internal server error"
 
level=error ts=2022-01-03T16:03:05.711160692Z caller=handler.go:366 component=receive component=receive-handler err="2 errors: replicate write request for endpoint thanos-receive-default-receivers-16.thanos-receive-default-receivers.receivers.svc.cluster.local:10901: quorum not reached: forwarding request to endpoint thanos-receive-default-receivers-17.thanos-receive-default-receivers.receivers.svc.cluster.local:10901: rpc error: code = AlreadyExists desc = store locally for endpoint thanos-receive-default-receivers-17.thanos-receive-default-receivers.receivers.svc.cluster.local:10901: conflict; replicate write request for endpoint thanos-receive-default-receivers-17.thanos-receive-default-receivers.receivers.svc.cluster.local:10901: quorum not reached: forwarding request to endpoint thanos-receive-default-receivers-17.thanos-receive-default-receivers.receivers.svc.cluster.local:10901: rpc error: code = AlreadyExists desc = store locally for endpoint thanos-receive-default-receivers-17.thanos-receive-default-receivers.receivers.svc.cluster.local:10901: conflict" msg="internal server error"
 
level=error ts=2022-01-03T16:07:22.526307825Z caller=handler.go:366 component=receive component=receive-handler err="2 errors: replicate write request for endpoint thanos-receive-default-receivers-16.thanos-receive-default-receivers.receivers.svc.cluster.local:10901: quorum not reached: forwarding request to endpoint thanos-receive-default-receivers-17.thanos-receive-default-receivers.receivers.svc.cluster.local:10901: rpc error: code = AlreadyExists desc = store locally for endpoint thanos-receive-default-receivers-17.thanos-receive-default-receivers.receivers.svc.cluster.local:10901: conflict; replicate write request for endpoint thanos-receive-default-receivers-17.thanos-receive-default-receivers.receivers.svc.cluster.local:10901: quorum not reached: forwarding request to endpoint thanos-receive-default-receivers-17.thanos-receive-default-receivers.receivers.svc.cluster.local:10901: rpc error: code = AlreadyExists desc = store locally for endpoint thanos-receive-default-receivers-17.thanos-receive-default-receivers.receivers.svc.cluster.local:10901: conflict" msg="internal server error"

I see 2 things here:

  • eliminate or reduce downtime when there are movements of pods, like scaling (this issue)
  • identify the primary receivers of the quorum so they can be scheduled on different nodes, and also forward to a live primary (I can maybe create another issue)

(I lack knowledge of how this works internally.)

@michael-burt

If quorum is lost, does the Receiver stop ingesting samples altogether? Is there a metric which can be used to fire an alert when quorum is lost?

I am struggling to understand best practices around scaling the hash ring. If http_requests_total{code="200"} on the Receiver goes to 0, does this imply that no metrics are being ingested?
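
Not an authoritative answer, but if I remember correctly Receive exposes a thanos_receive_forward_requests_total counter with a result label; assuming that metric exists in your version (please verify against your /metrics output), a query along these lines should go non-zero whenever replication, and therefore quorum, is failing (the Prometheus URL is a placeholder):

  # error rate of forwarded (replicated) write requests; sustained non-zero values
  # line up with the "quorum not reached" errors in the logs above
  curl -s "$PROMETHEUS_URL/api/v1/query" --data-urlencode \
    'query=sum(rate(thanos_receive_forward_requests_total{result="error"}[5m]))'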
