dbconsole: overload page improvements #123522

aadityasondhi · 2024-05-02T21:48:40Z

This PR contains a series of improvements to the overload page of the DB console as part of #121574. It is separated into multiple commits for ease of review.

dbconsole: remove non useful charts on the overload page

In investigations, we have found that the following charts are not
useful and frequently cause confusion:

Admission work rate
Admission Delay rate
Requests Waiting For Flow Tokens

Informs #121572

Release note (ui change): This patch removes "Admission Delay Rate",
"Admission Work Rate", and "Requests Waiting For Flow Tokens". These
charts often cause confusion and are not useful for general overload
investigations.

dbconsole: reorder overload page metrics for better readability

This patch reorders the existing metrics in a more usable order:

Metrics to help determine which resource is constrained (IO, CPU)
Metrics to narrow down which AC queues are seeing requests waiting
More advanced metrics about the system health (goroutine scheduler,
L0 sublevels, etc.)

Informs #121572.

Release note (ui change): Reordering of metrics on the overload page to
help categorizing them better. They are roughly in the following order:

Metrics to help determine which resource is constrained (IO, CPU)
Metrics to narrow down which AC queues are seeing requests waiting
More advanced metrics about the system health (goroutine scheduler,
L0 sublevels, etc.)

dbconsole: include better names and descriptions for overload page
This patch improves the metric descriptions for the metrics on the
overload page.

Fixes #120853.

Release note (ui change): The overload page now includes descriptions for all
metrics.

dbconsole: additional higher granularity metrics for overload

This patch adds additional metrics to the overload page that allow for
more granular look at the system:

cr.store.storage.l0-sublevels
cr.node.go.scheduler_latency-p99.9

Informs #121572.

Release note (ui change): Two additional metrics on the overload page
for better visibility into overloaded resources:

cr.store.storage.l0-sublevels
cr.node.go.scheduler_latency-p99.9

dbconsole: split Admission Queue graphs to avoid overcrowding

Informs #121572.

Release note (ui change): There are now 4 graphs for Admission Queue
Delay:

Foreground (regular) CPU work
Store (IO) work
Background (elastic) CPU work
Replication Admission Control, store overload on replicas

dbconsole: add elastic store metric to the overload page

This patch uses the new sperated elastic-stores metrics for queing
delay from #123890.

Informs #121572.

Release note (ui change): The Admission Queueing Delay – Store chart
now separates elastic (background) work from the regular foreground
work.

dbconsole: add elastic io token exhausted duration to overload page

This patch adds the metric elastic_io_tokens_exhausted_duration.kv
introduced in #124078.

Informs #121572.

Release note (ui change): The Admission IO Tokens Exhausted chart now
separates elastic and regular io work.

cockroach-teamcity · 2024-05-02T21:48:53Z

This change is

aadityasondhi · 2024-05-06T17:49:17Z

Some screenshots of the new page (in order of how the metrics show on the page).

https://docs.google.com/document/d/1txZ6zcAYR_8fnqR6LR53vCZ1dM6sZh0a52VUg_QIs78/edit?usp=sharing

sumeerbhola

Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @aadityasondhi and @abarganier)

pkg/ui/workspaces/db-console/src/views/cluster/containers/nodeGraphs/dashboards/overload.tsx line 39 at r4 (raw file):

      tenantSource={tenantSource}
      showMetricsInTooltip={true}
      tooltip={`CPU utilization as measured by the host, displayed per node.`}

CPU utilization for the CRDB process as measured ....

pkg/ui/workspaces/db-console/src/views/cluster/containers/nodeGraphs/dashboards/overload.tsx line 57 at r4 (raw file):

      tenantSource={tenantSource}
      showMetricsInTooltip={true}
      tooltip={`Duration of CPU slots exhaustion regular work, in microseconds. This metric indicates whether the CPU is overloaded and how long do we experience token exhaustion for regular work.`}

in microseconds/sec. ... we experience slot exhaustion for regular CPU work.

pkg/ui/workspaces/db-console/src/views/cluster/containers/nodeGraphs/dashboards/overload.tsx line 80 at r4 (raw file):

      tenantSource={tenantSource}
      showMetricsInTooltip={true}
      tooltip={`Duration of IO token exhaustion, in microseconds per second. This metric indicates whether the disk is overloaded and how long do we experience token exhaustion.`}

whether the store is overloaded ...

pkg/ui/workspaces/db-console/src/views/cluster/containers/nodeGraphs/dashboards/overload.tsx line 99 at r4 (raw file):

      sources={storeSources}
      tenantSource={tenantSource}
      tooltip={`A 1-normalized IO overload score. A value above 1 indicates that the store is overloaded.`}

We don't want to change the definition of the metric yet, but the interpretation of this needs to change, and we can capture that in this comment. Something like:

Historically, a 1-normalized IO overload score. Currently, admission control attempts to maintain a score of 0.5.

pkg/ui/workspaces/db-console/src/views/cluster/containers/nodeGraphs/dashboards/overload.tsx line 140 at r4 (raw file):

    <LineGraph
      title="Flow Tokens Wait Time: 99th percentile"

We should consider using the term "queuing delay" in this and other queue delay metrics. Consider calling this:

Flow Tokens Queueing Delay: 99th percentile

pkg/ui/workspaces/db-console/src/views/cluster/containers/nodeGraphs/dashboards/overload.tsx line 289 at r4 (raw file):

      sources={nodeSources}
      tenantSource={tenantSource}
      tooltip={`P99 scheduling latency for goroutines. A value above 1ms typically indicates high load.`}

... indicates high load that causes elastic CPU work to be throttled.

pkg/ui/workspaces/db-console/src/views/cluster/containers/nodeGraphs/dashboards/overload.tsx line 311 at r4 (raw file):

      sources={nodeSources}
      tenantSource={tenantSource}
      tooltip={`P99.9 scheduling latency for goroutines. A value above 1ms typically indicates high load.`}

we shouldn't say "A value above 1ms typically indicates high load" since we don't really know whether it indicates high load. This is just useful in realizing whether some high percentile user observed latency could be due to goroutine scheduling latency.

pkg/ui/workspaces/db-console/src/views/cluster/containers/nodeGraphs/dashboards/overload.tsx line 333 at r4 (raw file):

      sources={nodeSources}
      tenantSource={tenantSource}
      tooltip={`The number of Goroutines waiting per CPU. A value above 32 typically indicates high load.`}

We should say: A value above 32 (sampled at 1ms) is used by admission control to throttle regular CPU work.

Instead of saying "A value above 32 typically indicates high load" since the former is more specific. The latter may result in someone thinking < 32 is not high load.

pkg/ui/workspaces/db-console/src/views/cluster/containers/nodeGraphs/dashboards/overload.tsx line 351 at r4 (raw file):

      sources={storeSources}
      tenantSource={tenantSource}
      tooltip={`Number of sublevels in L0 of the LSM. A value above 20 typically indicates that the store is overloaded.`}

above 10

(which happens to correspond to a score of 0.5)

aadityasondhi

TFTR! I have made the changes PTAL. I also added a new commit to split the queue graphs as we discussed offline.

Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @abarganier and @sumeerbhola)

pkg/ui/workspaces/db-console/src/views/cluster/containers/nodeGraphs/dashboards/overload.tsx line 57 at r4 (raw file):

Previously, sumeerbhola wrote…

in microseconds/sec. ... we experience slot exhaustion for regular CPU work.

Done.

pkg/ui/workspaces/db-console/src/views/cluster/containers/nodeGraphs/dashboards/overload.tsx line 80 at r4 (raw file):

Previously, sumeerbhola wrote…

whether the store is overloaded ...

Done.

pkg/ui/workspaces/db-console/src/views/cluster/containers/nodeGraphs/dashboards/overload.tsx line 99 at r4 (raw file):

Previously, sumeerbhola wrote…

We don't want to change the definition of the metric yet, but the interpretation of this needs to change, and we can capture that in this comment. Something like:

Historically, a 1-normalized IO overload score. Currently, admission control attempts to maintain a score of 0.5.

Done.

pkg/ui/workspaces/db-console/src/views/cluster/containers/nodeGraphs/dashboards/overload.tsx line 140 at r4 (raw file):

Previously, sumeerbhola wrote…

We should consider using the term "queuing delay" in this and other queue delay metrics. Consider calling this:

Flow Tokens Queueing Delay: 99th percentile

Done.

pkg/ui/workspaces/db-console/src/views/cluster/containers/nodeGraphs/dashboards/overload.tsx line 289 at r4 (raw file):

Previously, sumeerbhola wrote…

... indicates high load that causes elastic CPU work to be throttled.

Done.

pkg/ui/workspaces/db-console/src/views/cluster/containers/nodeGraphs/dashboards/overload.tsx line 311 at r4 (raw file):

Previously, sumeerbhola wrote…

we shouldn't say "A value above 1ms typically indicates high load" since we don't really know whether it indicates high load. This is just useful in realizing whether some high percentile user observed latency could be due to goroutine scheduling latency.

Done.

pkg/ui/workspaces/db-console/src/views/cluster/containers/nodeGraphs/dashboards/overload.tsx line 333 at r4 (raw file):

Previously, sumeerbhola wrote…

We should say: A value above 32 (sampled at 1ms) is used by admission control to throttle regular CPU work.

Instead of saying "A value above 32 typically indicates high load" since the former is more specific. The latter may result in someone thinking < 32 is not high load.

Done.

pkg/ui/workspaces/db-console/src/views/cluster/containers/nodeGraphs/dashboards/overload.tsx line 351 at r4 (raw file):

Previously, sumeerbhola wrote…

above 10

(which happens to correspond to a score of 0.5)

Done.

sumeerbhola

Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @aadityasondhi and @abarganier)

pkg/ui/workspaces/db-console/src/views/cluster/containers/nodeGraphs/dashboards/overload.tsx line 180 at r10 (raw file):

    <LineGraph
      title="Admission Queueing Delay – Store"

I just noticed there is a problem here in our code instrumentation. The metrics passed to initWorkQueue for the regular WorkQueue and elastic WorkQueue for a store are the same. So this will include queueing delay for elastic and regular. I think it is worth making a note here and mention in the tooltip that to break this down one should look at the per priority metrics.

I think we should fix this in the code and backport. And we could avoid plotting the p99 for the elastic store WorkQueue since that queueing should all be logical queueing due to replication AC, and the physical queueing effect of that is already reflected in kvadmission.flow_controller.elastic_wait_duration-p99. Though thinking some more, since the flow controller metric is at the source and the WorkQueue metric is at the destination, both can be useful -- the latter tells us the cause, and the former about who is affected.

pkg/ui/workspaces/db-console/src/views/cluster/containers/nodeGraphs/dashboards/overload.tsx line 326 at r10 (raw file):

      sources={nodeSources}
      tenantSource={tenantSource}
      tooltip={`P99 scheduling latency for goroutines. A high value here indicates high load that causes background (elastic) CPU work to be throttled.`}

A value above 1ms here indicates high load ...

pkg/ui/workspaces/db-console/src/views/cluster/containers/nodeGraphs/dashboards/overload.tsx line 348 at r10 (raw file):

      sources={nodeSources}
      tenantSource={tenantSource}
      tooltip={`P99.9 scheduling latency for goroutines. A high value here indicates high load that causes background (elastic) CPU work to be throttled.`}

something generic like the following, since we don't use p99.9 in AC decision making:
A high value here can be indicative of high tail latency in various queries.

sumeerbhola

Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @aadityasondhi and @abarganier)

aadityasondhi · 2024-05-17T17:16:37Z

@sumeerbhola could you please take a look at the last two commits. They are follow ups from #124078 and #123890.

Just adding those metrics in with their regular counterparts.

blathers-crl · 2024-05-17T18:43:04Z

Your pull request contains more than 1000 changes. It is strongly encouraged to split big PRs into smaller chunks.

It looks like your PR touches SQL parser code but doesn't add or edit parser tests. Please make sure you add or edit parser tests if you edit the parser.

_{🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.}

aadityasondhi · 2024-05-17T18:47:54Z

Apologies for the noise. Bad rebase.

sumeerbhola

Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @aadityasondhi and @abarganier)

pkg/ui/workspaces/db-console/src/views/cluster/containers/nodeGraphs/dashboards/overload.tsx line 155 at r39 (raw file):

    <LineGraph
      title="Admission Queueing Delay – Foreground (Regular) CPU"

nit: I have a mild preference for putting "p99" in the titles too. Since not everyone looks at the tooltip.

In investigations, we have found that the following charts are not useful and frequently cause confusion: - Admission work rate - Admission Delay rate - Requests Waiting For Flow Tokens Informs cockroachdb#121572 Release note (ui change): This patch removes "Admission Delay Rate", "Admission Work Rate", and "Requests Waiting For Flow Tokens". These charts often cause confusion and are not useful for general overload investigations.

This patch reorders the existing metrics in a more usable order: 1. Metrics to help determine which resource is constrained (IO, CPU) 2. Metrics to narrow down which AC queues are seeing requests waiting 3. More advanced metrics about the system health (goroutine scheduler, L0 sublevels, etc.) Informs cockroachdb#121572. Release note (ui change): Reordering of metrics on the overload page to help categorizing them better. They are roughly in the following order: 1. Metrics to help determine which resource is constrained (IO, CPU) 2. Metrics to narrow down which AC queues are seeing requests waiting 3. More advanced metrics about the system health (goroutine scheduler, L0 sublevels, etc.)

This patch improves the metric descriptions for the metrics on the overload page. Fixes cockroachdb#120853. Release note (ui change): The overload page now includes descriptions for all metrics.

This patch adds additional metrics to the overload page that allow for more granular look at the system: - cr.store.storage.l0-sublevels - cr.node.go.scheduler_latency-p99.9 Informs cockroachdb#121572. Release note (ui change): Two additional metrics on the overload page for better visibility into overloaded resources: - cr.store.storage.l0-sublevels - cr.node.go.scheduler_latency-p99.9

Informs cockroachdb#121572. Release note (ui change): There are now 4 graphs for Admission Queue Delay: 1. Foreground (regular) CPU work 2. Store (IO) work 3. Background (elastic) CPU work 4. Replication Admission Control, store overload on replicas

This patch uses the new sperated `elastic-stores` metrics for queing delay from cockroachdb#123890. Informs cockroachdb#121572. Release note (ui change): The `Admission Queueing Delay – Store` chart now separates elastic (background) work from the regular foreground work.

This patch adds the metric `elastic_io_tokens_exhausted_duration.kv` introduced in cockroachdb#124078. Informs cockroachdb#121572. Release note (ui change): The `Admission IO Tokens Exhausted` chart now separates elastic and regular io work.

aadityasondhi

Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @abarganier and @sumeerbhola)

pkg/ui/workspaces/db-console/src/views/cluster/containers/nodeGraphs/dashboards/overload.tsx line 155 at r39 (raw file):

Previously, sumeerbhola wrote…

nit: I have a mild preference for putting "p99" in the titles too. Since not everyone looks at the tooltip.

Done.

aadityasondhi · 2024-05-21T18:25:52Z

TFTR!

bors r=sumeerbhola

craig · 2024-05-21T18:51:24Z

Build succeeded:

aadityasondhi force-pushed the 20240502.overload-page-fixes branch from a2c9f5e to 68ca397 Compare May 2, 2024 21:48

aadityasondhi changed the title ~~20240502.overload page fixes~~ dbconsole: overload page improvements May 2, 2024

aadityasondhi force-pushed the 20240502.overload-page-fixes branch from 68ca397 to 499c4e5 Compare May 6, 2024 17:48

aadityasondhi marked this pull request as ready for review May 6, 2024 17:49

aadityasondhi requested a review from a team as a code owner May 6, 2024 17:49

aadityasondhi requested review from abarganier and sumeerbhola and removed request for a team May 6, 2024 17:49

sumeerbhola requested changes May 7, 2024

View reviewed changes

aadityasondhi force-pushed the 20240502.overload-page-fixes branch 2 times, most recently from beb82cc to cce4580 Compare May 7, 2024 21:29

aadityasondhi commented May 7, 2024

View reviewed changes

aadityasondhi requested a review from sumeerbhola May 7, 2024 21:58

aadityasondhi force-pushed the 20240502.overload-page-fixes branch 2 times, most recently from 94b0dac to ba86f54 Compare May 8, 2024 14:52

sumeerbhola approved these changes May 8, 2024

View reviewed changes

aadityasondhi added the backport-24.1.x Flags PRs that need to be backported to 24.1. label May 8, 2024

aadityasondhi force-pushed the 20240502.overload-page-fixes branch from ba86f54 to 0289a10 Compare May 8, 2024 21:10

sumeerbhola approved these changes May 9, 2024

View reviewed changes

aadityasondhi force-pushed the 20240502.overload-page-fixes branch from 0289a10 to 8a475d7 Compare May 17, 2024 16:42

aadityasondhi force-pushed the 20240502.overload-page-fixes branch from d47a6d6 to f77fbf4 Compare May 17, 2024 18:42

aadityasondhi requested review from a team as code owners May 17, 2024 18:42

aadityasondhi requested review from andyyang890 and removed request for a team May 17, 2024 18:43

aadityasondhi force-pushed the 20240502.overload-page-fixes branch from f77fbf4 to 7258e95 Compare May 17, 2024 18:54

aadityasondhi removed request for a team, nameisbhaskar, vidit-bhat, dt and andyyang890 May 17, 2024 18:55

sumeerbhola approved these changes May 21, 2024

View reviewed changes

aadityasondhi added 7 commits May 21, 2024 14:23

dbconsole: include better names and descriptions for overload page

01c989a

This patch improves the metric descriptions for the metrics on the overload page. Fixes cockroachdb#120853. Release note (ui change): The overload page now includes descriptions for all metrics.

aadityasondhi force-pushed the 20240502.overload-page-fixes branch from 7258e95 to c397d35 Compare May 21, 2024 18:23

aadityasondhi commented May 21, 2024

View reviewed changes

craig bot merged commit 02ca30c into cockroachdb:master May 21, 2024
21 of 22 checks passed

blathers-crl bot mentioned this pull request May 21, 2024

release-24.1: dbconsole: overload page improvements #124509

Merged

aadityasondhi deleted the 20240502.overload-page-fixes branch May 21, 2024 19:32

aadityasondhi mentioned this pull request May 21, 2024

admission: revamp db console overload page to have useful metrics #121572

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

dbconsole: overload page improvements #123522

dbconsole: overload page improvements #123522

aadityasondhi commented May 2, 2024 •

edited

cockroach-teamcity commented May 2, 2024

aadityasondhi commented May 6, 2024

sumeerbhola left a comment

aadityasondhi left a comment

sumeerbhola left a comment

sumeerbhola left a comment

aadityasondhi commented May 17, 2024

blathers-crl bot commented May 17, 2024

aadityasondhi commented May 17, 2024

sumeerbhola left a comment

aadityasondhi left a comment

aadityasondhi commented May 21, 2024

craig bot commented May 21, 2024

dbconsole: overload page improvements #123522

dbconsole: overload page improvements #123522

Conversation

aadityasondhi commented May 2, 2024 • edited

cockroach-teamcity commented May 2, 2024

aadityasondhi commented May 6, 2024

sumeerbhola left a comment

Choose a reason for hiding this comment

aadityasondhi left a comment

Choose a reason for hiding this comment

sumeerbhola left a comment

Choose a reason for hiding this comment

sumeerbhola left a comment

Choose a reason for hiding this comment

aadityasondhi commented May 17, 2024

blathers-crl bot commented May 17, 2024

aadityasondhi commented May 17, 2024

sumeerbhola left a comment

Choose a reason for hiding this comment

aadityasondhi left a comment

Choose a reason for hiding this comment

aadityasondhi commented May 21, 2024

craig bot commented May 21, 2024

aadityasondhi commented May 2, 2024 •

edited