Add saturation metric for the runFSM and run (main) goroutines. #488

Open
dnephin opened this issue Jan 31, 2022 · 0 comments
dnephin commented Jan 31, 2022

The two primary goroutines used by Raft (runFSM and run (main)) are single-threaded: each processes its work serially. They can saturate (take more than 100% of the available time to handle the incoming workload) before the CPU of the system reaches 100% utilization.

When this happens, it may be possible to observe the problem using some existing metrics (e.g. fsm.apply time), but interpreting those metrics correctly requires deep knowledge of how Raft works. The data is also hard to present on a dashboard, because doing so requires summing the time and knowing the aggregation period of the metrics to interpret the summed result.

The existing metrics may also not fully capture the busy time, because they only measure specific operations performed by those goroutines, not the full split between work and idle time.

This issue proposes adding two new metrics (one for each goroutine) that measure the amount of time those goroutines spend doing work. Compared against wall-clock time, this gives a clear signal about the saturation of these operations, and about how much headroom remains before incoming work starts to cause a backlog.
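As a rough illustration of the idea (not the actual raft library code), the sketch below accumulates busy time inside a work loop and divides it by elapsed wall-clock time. All names here (saturationMeter, the simulated work channel) are made up for this example.

```go
// Hedged sketch: measuring the fraction of wall-clock time a work loop
// spends busy vs. idle. Only the time spent inside measure() counts as work.
package main

import (
	"fmt"
	"time"
)

// saturationMeter accumulates busy time and reports it as a fraction of
// the elapsed wall-clock time since the current window started.
type saturationMeter struct {
	windowStart time.Time     // start of the current measurement window
	busy        time.Duration // time spent doing work within the window
}

func newSaturationMeter() *saturationMeter {
	return &saturationMeter{windowStart: time.Now()}
}

// measure runs fn and adds its duration to the busy total.
func (m *saturationMeter) measure(fn func()) {
	start := time.Now()
	fn()
	m.busy += time.Since(start)
}

// saturation returns busy/elapsed for the current window and resets it.
func (m *saturationMeter) saturation() float64 {
	now := time.Now()
	elapsed := now.Sub(m.windowStart)
	var sat float64
	if elapsed > 0 {
		sat = float64(m.busy) / float64(elapsed)
	}
	m.windowStart, m.busy = now, 0
	return sat
}

func main() {
	meter := newSaturationMeter()
	work := make(chan time.Duration, 10)

	// Simulated producer: enqueue a few units of work, spaced out in time.
	go func() {
		for i := 0; i < 5; i++ {
			work <- 20 * time.Millisecond
			time.Sleep(30 * time.Millisecond)
		}
		close(work)
	}()

	// Simulated run loop: waiting on the channel is idle time, handling a
	// unit of work is busy time.
	for d := range work {
		meter.measure(func() { time.Sleep(d) })
	}
	fmt.Printf("saturation: %.0f%%\n", meter.saturation()*100)
}
```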

boxofrad added a commit that referenced this issue Feb 2, 2022
Adds metrics suggested in #488, to record the percentage of time the
main and FSM goroutines are busy with work vs available to accept new
work, to give operators an idea of how close they are to hitting
capacity limits.

We keep 256 samples in memory for each metric, and update gauges (at
most) once a second, possibly less if the goroutines are idle.
boxofrad added a commit that referenced this issue Feb 2, 2022
Adds metrics suggested in #488, to record the percentage of time the
main and FSM goroutines are busy with work vs available to accept new
work, to give operators an idea of how close they are to hitting
capacity limits.

We keep 256 samples in memory for each metric, and update gauges (at
most) once a second, possibly less if the goroutines are idle. This
should be ok because it's unlikely that a goroutine would go from very
high saturation to being completely idle (so at worst we'll leave the
gauge on the previous low value).
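For illustration, a minimal sketch of that reporting pattern is below, assuming a fixed buffer of 256 samples and a gauge update at most once per second. The saturationReporter type, its fields, and the metric key printed in main are assumptions for this example, not the library's actual implementation.

```go
// Hedged sketch: retain a fixed number of recent saturation samples and
// push their average to a gauge at most once per second.
package main

import (
	"fmt"
	"time"
)

const sampleCap = 256 // number of samples retained, per the commit message

type saturationReporter struct {
	samples    [sampleCap]float64
	count      int       // number of valid samples (<= sampleCap)
	next       int       // ring-buffer write position
	lastReport time.Time // when the gauge was last updated
	setGauge   func(float64)
}

// add records one saturation sample and, if at least a second has passed
// since the last report, updates the gauge with the mean of the retained
// samples. When the goroutine is idle, add is simply not called, so the
// gauge keeps its previous value.
func (r *saturationReporter) add(sample float64) {
	r.samples[r.next] = sample
	r.next = (r.next + 1) % sampleCap
	if r.count < sampleCap {
		r.count++
	}
	if now := time.Now(); now.Sub(r.lastReport) >= time.Second {
		var sum float64
		for i := 0; i < r.count; i++ {
			sum += r.samples[i]
		}
		r.setGauge(sum / float64(r.count))
		r.lastReport = now
	}
}

func main() {
	r := &saturationReporter{
		// Illustrative gauge sink and metric name; a real integration would
		// forward this to the metrics library in use.
		setGauge: func(v float64) { fmt.Printf("raft.thread.main.saturation = %.2f\n", v) },
	}
	for i := 0; i < 5; i++ {
		r.add(0.4 + float64(i)*0.1) // fake samples between 40% and 80%
		time.Sleep(300 * time.Millisecond)
	}
}
```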