dbnode stops ticking when a node is removed from the placement #4027

Open
BertHartm opened this issue Dec 22, 2021 · 1 comment

Comments

@BertHartm
Contributor

Might be related to #3933.

On version 1.3.0, we're seeing that when we remove a node from a placement in a cluster, many or most of the nodes stop ticking and seem to freeze up. The processes are still running, but writes and reads stop and the data movement required by the placement change can't happen.

We do see that database_tick_duration{quantile="0.99"} becomes NaN on these hosts. Restarting the affected nodes seems to resolve the issue.
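For anyone who wants to check their own hosts for the same symptom, here is a minimal sketch that scans Prometheus text-format metrics for `database_tick_duration` samples at `quantile="0.99"` whose value is NaN. The function name, the `host` label, and the sample data are assumptions made for illustration; how you fetch the metrics text from each node is left out, and label values containing `}` are not handled.

```python
import math
import re

# One sample line of the Prometheus text exposition format:
#   metric_name{label="value",...} sample_value
SAMPLE_RE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(?P<labels>[^}]*)\})?\s+(?P<value>\S+)'
)
LABEL_RE = re.compile(r'(\w+)="([^"]*)"')


def stalled_tick_samples(exposition_text, metric="database_tick_duration", quantile="0.99"):
    """Return the label sets of `metric` samples at `quantile` whose value is NaN."""
    stalled = []
    for line in exposition_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = SAMPLE_RE.match(line)
        if match is None or match.group("name") != metric:
            continue
        labels = dict(LABEL_RE.findall(match.group("labels") or ""))
        if labels.get("quantile") != quantile:
            continue
        try:
            value = float(match.group("value"))  # float("NaN") parses to nan
        except ValueError:
            continue
        if math.isnan(value):
            stalled.append(labels)
    return stalled


# Hypothetical sample; the "host" label is an assumption, not taken from a real cluster.
sample = """# TYPE database_tick_duration summary
database_tick_duration{host="node-a",quantile="0.99"} NaN
database_tick_duration{host="node-b",quantile="0.99"} 0.42
database_tick_duration{host="node-a",quantile="0.5"} NaN
"""
print(stalled_tick_samples(sample))
```

A NaN here only says that no tick durations were recorded in the summary window, so it is a hint that ticking has stopped rather than proof of it.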

This cluster is fairly quiet at this point, which I believe is why we aren't seeing the large memory spike described in #3933.

I'm trying to come up with a smaller scale reproduction to share.

@BertHartm
Contributor Author

I'm still working on a test case. As an additional data point, I used the placement/set API endpoint to bulk-remove 24 nodes (of 32) in an isolation group (same cluster, same isolation group) and did not hit the issue, so something is different there.
