
receive + receive controller: Eliminate downtime when scaling up/down hashring replicas. #69

Open
bwplotka opened this issue Mar 18, 2021 · 4 comments
Labels
bug Something isn't working

Comments

@bwplotka
Member

We hit cases where, when introducing more replicas, the Thanos controller updates the hashring, which makes the receive ring unstable because one node is expected in the ring but still down. We need to find a way to improve this state; it's quite fragile at the moment.

Mitigation: Turn off thanos-receive-controller and increase replicas, then turn the controller back on.
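
For reference, a rough kubectl sketch of that mitigation, assuming the controller runs as a Deployment named thanos-receive-controller and the receivers as a StatefulSet named thanos-receive (names and replica counts will differ per setup):

  # pause the controller so it stops regenerating the hashring ConfigMap
  kubectl scale deployment thanos-receive-controller --replicas=0
  # add receive replicas and wait for the new pods to become Ready
  kubectl scale statefulset thanos-receive --replicas=5
  kubectl rollout status statefulset thanos-receive
  # re-enable the controller so it rebuilds the hashring with the new, now-ready endpoints
  kubectl scale deployment thanos-receive-controller --replicas=1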

@bwplotka bwplotka added the bug Something isn't working label Mar 18, 2021
@bjoydeep

Thanks @bwplotka. What we ended up doing was something a little cruder (since this was just a test env). We stopped all receivers, rm -rf'ed the recv PVs (yes, we lost 2 hours of data that was not yet persisted in the object store) and then restarted the receivers with a higher replica count (a rough kubectl equivalent is sketched below). It seemed to work more efficiently (less memory + CPU in aggregate) given the same workload.
Will try your suggestion and see how it works. But being able to increase the number of replicas dynamically on the fly is a real need, of course.
BTW @bwplotka, do we have any recommendations on running an odd vs. even number of replicas?
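
The kubectl equivalent of that cruder path might look roughly like this; the StatefulSet name and label selector are made up for illustration, and deleting the claims is exactly the step that drops any TSDB data not yet uploaded to object storage:

  # stop all receivers
  kubectl scale statefulset thanos-receive --replicas=0
  # delete the receiver volume claims -- this is the data-loss step
  kubectl delete pvc -l app=thanos-receive
  # restart with the higher replica count; the StatefulSet re-creates fresh PVCs from its volumeClaimTemplates
  kubectl scale statefulset thanos-receive --replicas=6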

@spaparaju

#70 might help :)

@r0mdau
Contributor

r0mdau commented Feb 2, 2022

We hit the same kind of issue when a k8s node hosting replicas is terminated, which eventually loses quorum.
We use a "Chaosmonkey" script that randomly terminates 1 EC2 instance per day in our EKS cluster.

It took approximately 30 minutes for the quorum to be restored (no manual actions).
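
For context on the quorum here (as far as I can tell, so treat this as an assumption rather than an authoritative description): a remote-write request is acknowledged once a majority of its replicated copies succeed, roughly

  quorum = floor(replication_factor / 2) + 1

so with a replication factor of 3 (which the "2 errors ... quorum not reached" messages below suggest), 2 of the 3 target receivers must accept each write; once 2 of the replicated sub-requests fail, the whole write is rejected.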

Logs

level=error ts=2022-01-03T15:48:28.468568897Z caller=handler.go:366 component=receive component=receive-handler err="2 errors: replicate write request for endpoint thanos-receive-default-receivers-17.thanos-receive-default-receivers.receivers.svc.cluster.local:10901: quorum not reached: forwarding request to endpoint thanos-receive-default-receivers-17.thanos-receive-default-receivers.receivers.svc.cluster.local:10901: rpc error: code = AlreadyExists desc = store locally for endpoint thanos-receive-default-receivers-17.thanos-receive-default-receivers.receivers.svc.cluster.local:10901: conflict; replicate write request for endpoint thanos-receive-default-receivers-16.thanos-receive-default-receivers.receivers.svc.cluster.local:10901: quorum not reached: forwarding request to endpoint thanos-receive-default-receivers-17.thanos-receive-default-receivers.receivers.svc.cluster.local:10901: rpc error: code = AlreadyExists desc = store locally for endpoint thanos-receive-default-receivers-17.thanos-receive-default-receivers.receivers.svc.cluster.local:10901: conflict" msg="internal server error"
 
level=error ts=2022-01-03T16:00:15.101622584Z caller=handler.go:366 component=receive component=receive-handler err="2 errors: replicate write request for endpoint thanos-receive-default-receivers-16.thanos-receive-default-receivers.receivers.svc.cluster.local:10901: quorum not reached: forwarding request to endpoint thanos-receive-default-receivers-17.thanos-receive-default-receivers.receivers.svc.cluster.local:10901: rpc error: code = AlreadyExists desc = store locally for endpoint thanos-receive-default-receivers-17.thanos-receive-default-receivers.receivers.svc.cluster.local:10901: conflict; replicate write request for endpoint thanos-receive-default-receivers-17.thanos-receive-default-receivers.receivers.svc.cluster.local:10901: quorum not reached: forwarding request to endpoint thanos-receive-default-receivers-17.thanos-receive-default-receivers.receivers.svc.cluster.local:10901: rpc error: code = AlreadyExists desc = store locally for endpoint thanos-receive-default-receivers-17.thanos-receive-default-receivers.receivers.svc.cluster.local:10901: conflict" msg="internal server error"
 
level=error ts=2022-01-03T16:03:05.711160692Z caller=handler.go:366 component=receive component=receive-handler err="2 errors: replicate write request for endpoint thanos-receive-default-receivers-16.thanos-receive-default-receivers.receivers.svc.cluster.local:10901: quorum not reached: forwarding request to endpoint thanos-receive-default-receivers-17.thanos-receive-default-receivers.receivers.svc.cluster.local:10901: rpc error: code = AlreadyExists desc = store locally for endpoint thanos-receive-default-receivers-17.thanos-receive-default-receivers.receivers.svc.cluster.local:10901: conflict; replicate write request for endpoint thanos-receive-default-receivers-17.thanos-receive-default-receivers.receivers.svc.cluster.local:10901: quorum not reached: forwarding request to endpoint thanos-receive-default-receivers-17.thanos-receive-default-receivers.receivers.svc.cluster.local:10901: rpc error: code = AlreadyExists desc = store locally for endpoint thanos-receive-default-receivers-17.thanos-receive-default-receivers.receivers.svc.cluster.local:10901: conflict" msg="internal server error"
 
level=error ts=2022-01-03T16:07:22.526307825Z caller=handler.go:366 component=receive component=receive-handler err="2 errors: replicate write request for endpoint thanos-receive-default-receivers-16.thanos-receive-default-receivers.receivers.svc.cluster.local:10901: quorum not reached: forwarding request to endpoint thanos-receive-default-receivers-17.thanos-receive-default-receivers.receivers.svc.cluster.local:10901: rpc error: code = AlreadyExists desc = store locally for endpoint thanos-receive-default-receivers-17.thanos-receive-default-receivers.receivers.svc.cluster.local:10901: conflict; replicate write request for endpoint thanos-receive-default-receivers-17.thanos-receive-default-receivers.receivers.svc.cluster.local:10901: quorum not reached: forwarding request to endpoint thanos-receive-default-receivers-17.thanos-receive-default-receivers.receivers.svc.cluster.local:10901: rpc error: code = AlreadyExists desc = store locally for endpoint thanos-receive-default-receivers-17.thanos-receive-default-receivers.receivers.svc.cluster.local:10901: conflict" msg="internal server error"

I see 2 things here:

  • eliminate or reduce downtime when there are movements of pods, like scaling (this issue)
  • identify the primary receivers of the quorum so they can be scheduled on different nodes, and also forward to a live primary (I can maybe create another issue)

(I lack knowledge of how this works internally.)

@michael-burt

If quorum is lost, does the Receiver stop ingesting samples altogether? Is there a metric which can be used to fire an alert when quorum is lost?

I am struggling to understand best practices around scaling the hash ring. If http_requests_total{code="200"} on the Receiver goes to 0, does this imply that no metrics are being ingested?
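
Not an authoritative answer, but if I remember correctly Receive exposes a thanos_receive_forward_requests_total counter with a result label; assuming that metric exists in your version (please verify against your /metrics output), a query along these lines should go non-zero whenever replication, and therefore quorum, is failing (the Prometheus URL is a placeholder):

  # error rate of forwarded (replicated) write requests; sustained non-zero values
  # line up with the "quorum not reached" errors in the logs above
  curl -s "$PROMETHEUS_URL/api/v1/query" --data-urlencode \
    'query=sum(rate(thanos_receive_forward_requests_total{result="error"}[5m]))'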
