On version 1.3.0, we're seeing that when we remove a node from a placement in a cluster, many or most of the nodes stop ticking and appear to freeze. The processes are still running, but reads and writes stop and the data movement required by the placement change can't happen.
We do see that `database_tick_duration{quantile="0.99"}` becomes NaN on these hosts. Restarting the affected nodes seems to resolve the issue.
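For anyone trying to alert on this: assuming the metric is a summary quantile, it reports NaN when no samples have been observed in the window, i.e. when ticks have genuinely stopped. NaN is the only float value not equal to itself, which gives a cheap detection predicate. A minimal sketch (the `tick_stalled` helper and the sample value are illustrative, not part of M3):

```python
import math

# Simulated value of database_tick_duration{quantile="0.99"} on a
# frozen node: a summary quantile with no recent observations is NaN.
p99 = float("nan")

def tick_stalled(value: float) -> bool:
    """True only for NaN: NaN is the one float not equal to itself."""
    return value != value

print(tick_stalled(p99))       # True  -- node has stopped ticking
print(tick_stalled(0.0123))    # False -- healthy sample
print(math.isnan(p99))         # True  -- the idiomatic stdlib check
```

The same self-inequality trick works in PromQL, where `database_tick_duration{quantile="0.99"} != database_tick_duration{quantile="0.99"}` matches only the NaN series.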
This cluster is fairly quiet at this point, which I believe is why we aren't seeing the large memory spike from #3933.
I'm trying to come up with a smaller scale reproduction to share.
I'm still working on a test case, but as an additional data point: I used the placement/set API endpoint to bulk remove 24 of 32 nodes in a single isolation group (same cluster, same isolation group) and did not hit the issue, so something is different between the two removal paths.
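For context on what the bulk path does differently: the set-style call replaces the whole placement in one step rather than removing instances one at a time. A rough sketch of building such a request body, assuming a placement snapshot keyed by instance ID (the field names, `build_set_body` helper, and sample nodes here are assumptions, not the exact M3 schema; check your version's coordinator API docs for the real payload):

```python
import json

# Hypothetical current placement snapshot, keyed by instance ID.
placement = {
    "placement": {
        "instances": {
            "node-a": {"id": "node-a", "isolationGroup": "ig1"},
            "node-b": {"id": "node-b", "isolationGroup": "ig1"},
            "node-c": {"id": "node-c", "isolationGroup": "ig1"},
        }
    },
    "version": 7,
}

def build_set_body(placement, remove_ids):
    """Keep every instance except those being removed, echoing the
    placement version back so the server can reject stale updates."""
    kept = {
        node_id: spec
        for node_id, spec in placement["placement"]["instances"].items()
        if node_id not in remove_ids
    }
    return {
        "placement": {"instances": kept},
        "version": placement["version"],
        "confirm": True,
    }

body = build_set_body(placement, {"node-b", "node-c"})
print(json.dumps(sorted(body["placement"]["instances"])))  # -> ["node-a"]
# This body would then be POSTed to the coordinator's placement set
# endpoint (exact path and auth depend on the deployment).
```

The interesting difference for this bug is that the single-remove path drives an incremental rebalance per node, while a set replaces the placement wholesale, so the tick-freeze may be tied to the incremental data-movement machinery rather than placement changes in general.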
might be related to #3933