
More io-queue metrics #82

Open · wants to merge 4 commits into base: v23.3.x
Conversation

StephanDollberg
Member

Adds metrics for:

  • io-sink queue length
  • io-queue: Number of times dispatch was throttled by per-tick dispatch limit
  • io-queue: Number of times dispatch was throttled by the tokenbucket

Most patches are intentionally kept rather simple (also with upstreaming in mind). The last commit (differentiating between disk-feedback-driven throttling and rate-based throttling) could be dropped, though I think it's useful.

Adds a metric for the io_sink queue length.

Large values can be a sign of backpressure from a lack of free iocbs.

Adds a metric counter that counts the times the io_queue was throttled
from dispatching requests because the maximum per-polling-tick capacity
was breached.

Adds a metric counter that counts how many times the io-queue was
throttled from dispatching more events because we failed to grab tokens
from the token bucket.

For cases where the number of requests in the queue and/or the average
time in the queue is high, this gives a clear signal that the cause is
the token-bucket throttling in the io-queue. Hence it allows
differentiating from other causes such as reduced/slow polling.

Enhances the token-bucket throttling metric to differentiate between the
two different reasons why the token bucket might be throttling.

When we get throttled while grabbing tokens from the token bucket, we
look at the ceil rover (tracking disk feedback) and the head rover
(tracking refill rate). If the difference between the ceil and head
rovers is large enough to accommodate the pending capacity, we count it
as rate throttled, and as disk-feedback throttled otherwise.

StephanDollberg commented Oct 25, 2023

Looks like half of this is already out of date because of https://github.com/scylladb/seastar/pull/1766/files

I think effectively we want to drop the last commit as otherwise we'd just have to remove it again later (24.1).

@@ -429,6 +446,10 @@ void fair_queue::dispatch_requests(std::function<void(fair_queue_entry&)> cb) {
        }
    }

    if (!_handles.empty() && (dispatched >= _group.per_tick_grab_threshold())) {
        _throttled_per_tick_threshold++;
Member
So the way we would use this metric is mostly expecting it to be zero, and if it's not zero we know we are hitting the per-tick threshold, right?

That is, it's hard to know how "bad" the situation is by the value alone, right? We don't know how many total ticks there have been?

Member Author

> So the way we would use this metric is mostly expecting it to be zero, and if it's not zero we know we are hitting the per-tick threshold, right?

Yes, I think any non-zero value is realistically bad.

> That is, it's hard to know how "bad" the situation is by the value alone, right? We don't know how many total ticks there have been?

One reactor poll is one tick, so we should be able to get a ratio of throttled ticks / poll count.

            sm::description("Number of times dispatch was throttled on the per tick threshold")),
    sm::make_counter("throttled_no_capacity_rate",
            [this] { return _throttled_no_capacity_rate; },
            sm::description("Number of times this class was throttled dispatching requests "
Member
question: this says "this class", but as far as I can tell these are all global metrics, not class-specific? It does seem like the last two could easily be made class-specific, though.

Member Author

Just looked through my reflog, and I originally had this be per-class.

Now the question is whether I changed it for a reason or whether I just screwed something up during rebase. 🤔

@@ -2685,6 +2685,8 @@ void reactor::register_metrics() {
            // total_operations value:DERIVE:0:U
            sm::make_counter("io_threaded_fallbacks", std::bind(&thread_pool::operation_count, _thread_pool.get()),
                    sm::description("Total number of io-threaded-fallbacks operations")),
            sm::make_queue_length("io_sink_queue_length", [this] { return _io_sink.queue_length(); },
Member
note: I never quite understood what these more specific metric helpers are for, like make_queue_length or make_total_bytes. They map to counter or gauge or whatever, and they store a bit of additional metadata based on the make function, but I didn't really understand how that metadata is used.

Member Author
Yeah, I don't think it's used anywhere right now.

@travisdowns (Member) left a comment:

comments
