
More io-queue metrics #82

Open · wants to merge 4 commits into base: v23.3.x
Conversation

StephanDollberg
Member

Adds metrics for:

  • io-sink queue length
  • io-queue: Number of times dispatch was throttled by per-tick dispatch limit
  • io-queue: Number of times dispatch was throttled by the tokenbucket

Most patches are intentionally kept rather simple (also with upstreaming in mind). The last commit (differentiating between disk-feedback-driven throttling and rate-based throttling) could be dropped, though I think it's useful.

Adds a metric for the io_sink queue length.

Large values can be a sign of backpressure from a lack of free iocbs.

Adds a metric counter that counts the times the io_queue was throttled
from dispatching requests because the maximum per-polling-tick capacity
was breached.

Adds a metric counter that counts how many times the io-queue was
throttled from dispatching more events because we failed to grab tokens
from the token bucket.

For cases where the number of requests in the queue and/or the average
time in the queue is high, this gives a clear signal that the cause is
the token-bucket throttling in the io-queue. Hence it allows
differentiating from other causes such as reduced/slow polling.

Enhances the token-bucket throttling metric to differentiate between the
two different reasons why the token bucket might be throttling.

When we get throttled while grabbing tokens from the token bucket, we
look at the ceil rover (tracking disk feedback) and the head rover
(tracking refill rate). If the difference between the ceil and head
rovers is large enough to accommodate the pending capacity, we count it
as rate throttled, and as disk-feedback throttled otherwise.

StephanDollberg commented Oct 25, 2023

Looks like half of this is already out of date because of https://github.com/scylladb/seastar/pull/1766/files

I think effectively we want to drop the last commit as otherwise we'd just have to remove it again later (24.1).

@@ -429,6 +446,10 @@ void fair_queue::dispatch_requests(std::function<void(fair_queue_entry&)> cb) {
        }
    }

    if (!_handles.empty() && (dispatched >= _group.per_tick_grab_threshold())) {
        _throttled_per_tick_threshold++;
Member
So the way we would use this metric is mostly expecting it to be zero, and if it's not zero we know we are hitting the per-tick threshold, right?

That is, it's hard to know how "bad" the situation is by the value alone, right? We don't know how many total ticks there have been?

Member Author

> So the way we would use this metric is mostly expecting it to be zero, and if it's not zero we know we are hitting the per-tick threshold, right?

Yes, I think any non-zero value is realistically bad.

> That is, it's hard to know how "bad" the situation is by the value alone, right? We don't know how many total ticks there have been?

One reactor poll is one tick, so we should be able to get a ratio of throttled ticks / poll count.

            sm::description("Number of times dispatch was throttled on the per tick threshold")),
    sm::make_counter("throttled_no_capacity_rate",
            [this] { return _throttled_no_capacity_rate; },
            sm::description("Number of times this class was throttled dispatching requests "
Member
question: this says "this class", but as far as I can tell these are all global metrics, not class-specific? It does seem like the last two could easily be made class-specific, though.

Member Author

Just looked through my reflog, and I originally had this be per-class.

Now the question is whether I changed it for a reason or whether I just screwed something up during rebase. 🤔

@@ -2685,6 +2685,8 @@ void reactor::register_metrics() {
            // total_operations value:DERIVE:0:U
            sm::make_counter("io_threaded_fallbacks", std::bind(&thread_pool::operation_count, _thread_pool.get()),
                    sm::description("Total number of io-threaded-fallbacks operations")),
            sm::make_queue_length("io_sink_queue_length", [this] { return _io_sink.queue_length(); },
Member
note: I never quite understood what these more specific metric helpers are for, like make_queue_length or make_total_bytes. They map to counter or gauge or whatever, and they store a bit of additional metadata based on the make function, but I didn't really understand how that metadata is used.

Member Author
Yeah, I don't think it's used anywhere right now.

@travisdowns (Member) left a comment:

comments
