
New taskq stats #16171

Open · robn wants to merge 4 commits into master from taskq-stats
Conversation

@robn robn commented May 7, 2024

Motivation and Context

I keep trying to improve performance by looking for bottlenecks with external tools, and I find something, and fix it, only to find that that thing goes away, but performance doesn't change, or changes in ways I don't expect. So I spend more and more time wondering what even is happening inside some of these black boxes. Then I decided I'd had enough, and made a start on bringing those mysteries out into the light.

So here's round 1: taskq stats. They're at the heart of so many things, and I want to know what they're up to.

Description

This gives every taskq three sets of stats (a rough sketch of one possible layout follows the list):

  • gauges (up/down counts) of current state: thread count and activity, task count and activity
  • counters of events since creation: threads created/destroyed, tasks dispatched/executed/delayed, thread wakes & sleeps
  • min/max and rolling averages of time tasks have spent in each "stage" of the pipeline (enqueue/wait/dequeue/execute)
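As a rough sketch of how these might be grouped in code, here's a guess using the existing OpenZFS wmsum_t counter type; none of the names below come from the patch itself, they're just illustrative:

#include <sys/wmsum.h>

/* Illustrative layout only: names are assumptions, not the patch's fields. */
typedef struct taskq_stats {
	/* gauges: current state, adjusted up and down as things change */
	wmsum_t	ts_threads_active;
	wmsum_t	ts_threads_idle;
	wmsum_t	ts_tasks_pending;
	wmsum_t	ts_tasks_delayed;

	/* counters: events since the taskq was created, only ever increase */
	wmsum_t	ts_threads_created;
	wmsum_t	ts_tasks_dispatched;
	wmsum_t	ts_tasks_executed;
	wmsum_t	ts_thread_wakeups;

	/* time stats (min/max and rolling averages) are sketched further down */
} taskq_stats_t;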

Each taskq gets its own kstat node in /proc/spl/kstat/taskq:

# ls /proc/spl/kstat/taskq
arc_prune.0	     summary	    z_null_int.0	z_upgrade.0
dbu_evict.0	     z_cl_int.0     z_null_iss.0	z_vdev_file.0
dp_sync_taskq.0       z_cl_iss.0     z_prefetch.0	z_wr_int.0
dp_zil_clean_taskq.0  z_flush_int.0  z_rd_int.0		z_wr_iss.0
spl_delay_taskq.0     z_flush_iss.0  z_rd_iss.0		z_zrele.0
spl_dynamic_taskq.0   z_fr_int.0     z_trim_int.0	z_zvol.0
spl_kmem_cache.0      z_fr_iss.0     z_trim_iss.0	zvol_tq-0.0
spl_system_taskq.0    z_metaslab.0   z_unlinked_drain.0

Within, we find all sorts of useful things:

$ cat /proc/spl/kstat/taskq/dp_sync_taskq.0
197 1 0x01 53 14416 1290432846656093 1291816178703629
name                            type data
threads_max                     4    8
entry_pool_min                  4    8
entry_pool_max                  4    2147483647
threads_active                  4    0
threads_idle                    4    8
threads_total                   4    8
tasks_pending                   4    0
tasks_priority                  4    0
tasks_total                     4    0
tasks_delayed                   4    0
entries_free                    4    8
threads_created                 4    8
threads_destroyed               4    0
tasks_dispatched                4    175270
tasks_dispatched_delayed        4    0
tasks_executed_normal           4    173948
tasks_executed_priority         4    1322
tasks_executed                  4    175270
tasks_delayed_requeued          4    0
tasks_cancelled                 4    0
thread_wakeups                  4    126095
thread_wakeups_nowork           4    78498
thread_sleeps                   4    126103
enqueue_time_min                4    0
enqueue_time_max                4    38
enqueue_time_avg_10             4    5
enqueue_time_avg_100            4    5
enqueue_time_avg_1000           4    4
enqueue_time_avg_10000          4    0
wait_time_min                   4    0
wait_time_max                   4    42347
wait_time_avg_10                4    2
wait_time_avg_100               4    10
wait_time_avg_1000              4    1487
wait_time_avg_10000             4    3102
dequeue_time_min                4    0
dequeue_time_max                4    45
dequeue_time_avg_10             4    0
dequeue_time_avg_100            4    0
dequeue_time_avg_1000           4    1
dequeue_time_avg_10000          4    0
execute_time_min                4    0
execute_time_max                4    44618
execute_time_avg_10             4    0
execute_time_avg_100            4    2
execute_time_avg_1000           4    207
execute_time_avg_10000          4    441
task_time_min                   4    2
task_time_max                   4    44694
task_time_avg_10                4    8
task_time_avg_100               4    18
task_time_avg_1000              4    1698
task_time_avg_10000             4    3544

The meaning of most of these should be self-explanatory. The first set (down to entries_free) reflects current state; the second set (down to thread_sleeps) counts events since the taskq was created.
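For a sense of how each per-taskq node might be wired up, here's a minimal sketch of the usual OpenZFS named-kstat pattern: kstat_create() plus a ks_update callback that only sums the wmsum counters when someone actually reads the file. The stat names, the taskq_stats_t type and the helper names are carried over from the hypothetical sketch above, not taken from the patch.

/* Sketch only; needs <sys/kstat.h>, <sys/kmem.h>, <sys/taskq.h>, <sys/wmsum.h>. */
typedef struct taskq_kstats {
	kstat_named_t	tqks_threads_active;
	kstat_named_t	tqks_tasks_pending;
	kstat_named_t	tqks_tasks_dispatched;
	/* ... one kstat_named_t per stat ... */
} taskq_kstats_t;

static taskq_kstats_t taskq_kstats_template = {
	{ "threads_active",	KSTAT_DATA_UINT64 },
	{ "tasks_pending",	KSTAT_DATA_UINT64 },
	{ "tasks_dispatched",	KSTAT_DATA_UINT64 },
};

static int
taskq_kstats_update(kstat_t *ksp, int rw)
{
	taskq_stats_t *ts = ksp->ks_private;
	taskq_kstats_t *tqks = ksp->ks_data;

	if (rw == KSTAT_WRITE)
		return (EACCES);

	/* only a read of the file pays the cost of summing the wmsums */
	tqks->tqks_threads_active.value.ui64 =
	    wmsum_value(&ts->ts_threads_active);
	tqks->tqks_tasks_pending.value.ui64 =
	    wmsum_value(&ts->ts_tasks_pending);
	tqks->tqks_tasks_dispatched.value.ui64 =
	    wmsum_value(&ts->ts_tasks_dispatched);

	return (0);
}

static void
taskq_kstats_create(taskq_t *tq, taskq_stats_t *ts, int instance)
{
	char name[64];
	kstat_t *ksp;

	/* node named "name.instance", matching the listing above */
	(void) snprintf(name, sizeof (name), "%s.%d", tq->tq_name, instance);

	ksp = kstat_create("taskq", 0, name, "misc", KSTAT_TYPE_NAMED,
	    sizeof (taskq_kstats_t) / sizeof (kstat_named_t),
	    KSTAT_FLAG_VIRTUAL);
	if (ksp == NULL)
		return;

	ksp->ks_private = ts;
	ksp->ks_update = taskq_kstats_update;
	ksp->ks_data = kmem_alloc(sizeof (taskq_kstats_t), KM_SLEEP);
	memcpy(ksp->ks_data, &taskq_kstats_template, sizeof (taskq_kstats_t));
	kstat_install(ksp);
	/* (a matching destroy path would kstat_delete() and free ks_data) */
}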

The _time_ group is the most interesting, and probably the most controversial. Each entry (taskq_ent_t) now gets a timestamp at each "step" of its journey through the taskq:

  • when the dispatch function is called
  • when it is actually added to a queue
  • when a thread starts taking an entry off the queue
  • when the task in the entry is called
  • when the task completes

The deltas between consecutive steps are then computed and named: enqueue, wait, dequeue, execute.
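To make that concrete, here's a hedged sketch; the taskq_ent_times_t struct and tqent_ts_* names are hypothetical stand-ins for whatever the patch actually adds to taskq_ent_t, with each field set via gethrtime() at the corresponding point in dispatch and in the worker thread loop:

/* Hypothetical timestamp fields, one per step; gethrtime() nanoseconds. */
typedef struct taskq_ent_times {
	hrtime_t	tqent_ts_dispatch;	/* taskq_dispatch() entered */
	hrtime_t	tqent_ts_queued;	/* entry added to a queue */
	hrtime_t	tqent_ts_dequeue;	/* thread starts taking it off */
	hrtime_t	tqent_ts_start;		/* task function called */
	hrtime_t	tqent_ts_done;		/* task function returned */
} taskq_ent_times_t;

static void
taskq_ent_times_commit(const taskq_ent_times_t *t)
{
	hrtime_t enqueue = t->tqent_ts_queued  - t->tqent_ts_dispatch;
	hrtime_t wait    = t->tqent_ts_dequeue - t->tqent_ts_queued;
	hrtime_t dequeue = t->tqent_ts_start   - t->tqent_ts_dequeue;
	hrtime_t execute = t->tqent_ts_done    - t->tqent_ts_start;
	hrtime_t total   = t->tqent_ts_done    - t->tqent_ts_dispatch;

	/*
	 * Each delta would then be fed into the per-type min/max and
	 * rolling averages (see the averaging sketch below); the casts
	 * just keep this fragment self-contained.
	 */
	(void) enqueue; (void) wait; (void) dequeue;
	(void) execute; (void) total;
}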

For each type, the min/max delta since the epoch is recorded, and the delta is then mixed into a moving average over the last 10, 100, 1000 and 10000 tasks. This makes it much easier to see the signs of, e.g., queue lock contention (higher average enqueue/dequeue times) or too few worker threads (higher average wait times).

A "total" time from dispatch to completion is also recorded in the same way.

A summary list is also included, showing all taskqs in the system with a few key values for each. The idea is to give a "quick glance" state of the world at any given moment, and invite further digging:

# cat /proc/spl/kstat/taskq/summary
3 0 0x01 1 0 8886444295 786666941903
                     | threads (current) | task on queue (current) | task time (µs)  (last 100 tasks)
taskq name           | tot [act idl] max |  pend [ norm  high] dly | avg    [enq    wait  deq   exec]
-------------------- | ----------------- | ----------------------- | --------------------------------
z_unlinked_drain.0   |   1 [  0   1]   4 |     0 [    0     0]   0 |      0 [   0      0    0      0]
z_zrele.0            |   1 [  0   1]   4 |     0 [    0     0]   0 |      0 [   0      0    0      0]
dp_zil_clean_taskq.0 |   4 [  0   4]   4 |     0 [    0     0]   0 |     10 [   4      4    0      1]
dp_sync_taskq.0      |   1 [  0   1]   1 |     0 [    0     0]   0 |    154 [   2     44    0    108]
z_upgrade.0          |   1 [  0   1]   4 |     0 [    0     0]   0 |     87 [   0      0    0     86]
z_prefetch.0         |   1 [  0   1]   4 |     0 [    0     0]   0 |      0 [   0      0    0      0]
z_metaslab.0         |   2 [  0   2]   2 |     0 [    0     0]   0 |     30 [   2     27    0      0]
...

Finally, this removes the old /proc/spl/taskq and taskq-all files.

Design and implementation notes

I'm aware of how busy taskqs get, and I've tried my best not to introduce additional bottlenecks. Gauge and counter stats are all regular wmsum_t types, and are only summed up if the kstats are actually read. The rolling averages are recorded once at the end of each task, and do not lock the taskq itself to do so, which can mean some trampling, but they are not intended to be precise. I've thrown some of my performance test workloads at it (the ones I'm trying to improve) and these changes don't appear to move the needle, so I feel at least okay about it. It'd be good to get more testing. Importantly, reading the stats does not take the taskq lock, since the stats are held separately.
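To illustrate the hot-path cost: wmsum_t is OpenZFS's existing write-mostly counter (per-CPU backed), so wmsum_add() is cheap on the dispatch/execute paths and the expensive wmsum_value() summation only happens in the kstat read callback. A minimal sketch using the hypothetical taskq_stats_t from earlier:

/* Sketch only: the ts_* fields are the hypothetical ones from above. */
static void
taskq_stats_dispatch(taskq_stats_t *ts)
{
	wmsum_add(&ts->ts_tasks_dispatched, 1);	/* counter: only ever grows */
	wmsum_add(&ts->ts_tasks_pending, 1);	/* gauge: one more task queued */
}

static void
taskq_stats_task_done(taskq_stats_t *ts)
{
	wmsum_add(&ts->ts_tasks_pending, -1);	/* gauge back down */
	wmsum_add(&ts->ts_tasks_executed, 1);
}

Reads go through wmsum_value() in the kstat update callback sketched earlier, so they never need to touch the taskq's own lock.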

The time stats have been useful already (a couple of interesting quirks discovered which I'll chase down later). They still don't quite feel right to me. I'm not sure if min/max is ever going to be useful. The averages undoubtedly are useful, but "last N tasks" is only useful during sustained operation, which many taskqs don't see. I'm actually not very good at math (moving average is my only trick 😅), so I'm hoping someone will say "oh, this is a perfect use for Numberwang!"

I'm not entirely sold on the summary list. So far, the "threads" column has been unnecessary and the "tasks on queue" column has been great (though I can take or leave "delay"). The averages column should be the most interesting, but there's no single granularity that works for all taskq types: high-throughput queues like z_wr and z_rd really need 10000, while rarely-used taskqs might be better served by 10, and there's everything in between. I did think a bit about choosing the most likely range based on total tasks, but again, that's tricky without some notion of time, and taskqs don't have a heartbeat; I would prefer to keep it that way.

Your ideas for other stats and presentation are welcome, as I would like this to be actually good. Yes, I am inviting bikeshedding 😇

Removing the old stats might not be wanted. I personally don't see the point in keeping them; they show some different things, but they're mostly about taskq configuration (flags etc.), which isn't especially useful to the reader and can't be seen anyway. When taskqs are busy, dumping their tasks is quite pointless and doesn't seem to work well anyway. Finally, the old implementation is pretty invasive, messing around inside the taskq itself. But if someone wants to keep them, it won't particularly hurt anything; it'll just be a mess.

Finally, I would like to do something similar for FreeBSD, but since we use a lot more of the native taskqueues there, it might take some effort to get anything useful. I wouldn't hold this PR up just for that, but if you know more about this area I would like to hear about it.

How Has This Been Tested?

A lot by hand, including performance tests. A full ZTS run has been completed, but since it doesn't use anything from /proc/spl/taskq I wasn't expecting it to show much and indeed, it did not.

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Performance enhancement (non-breaking change which improves efficiency)
  • Code cleanup (non-breaking change which makes code smaller or more readable)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Library ABI change (libzfs, libzfs_core, libnvpair, libuutil and libzfsbootenv)
  • Documentation (a change to man pages or other documentation)


@rincebrain

A counter for fpu_enter/fpu_exit (or just more general critical section enter/exit) calls in the taskq might also be informative, as would knowing whether your taskq triggers kmem cache growth, and I think you still have the current task in the relevant pieces of code to do it. (The only fun part might be referencing that code from inside the spl code, like why you can't easily use dbgmsg from in there.)

@robn robn commented May 7, 2024

@rincebrain

A counter for fpu_enter/fpu_exit (or just more general critical section enter/exit) calls in the taskq might also be informative, as would knowing whether your taskq triggers kmem cache growth, and I think you still have the current task in the relevant pieces of code to do it.

Hmm. I hadn't really thought much about counting what the task itself does; this was all about internal taskq behaviour.

But yes, in general I think it might be interesting to know when a task did something like block, sleep, be pre-empted, or otherwise make the thread unavailable to other tasks. I probably won't try it in this PR though.

(unless I'm completely misunderstanding you of course. If so, tell me, and say "well obviously you just tap your belt buckle thrice", and then I'll do the thing)

@rincebrain

Conceptually, it didn't strike me as that different a thing to track, since you're monitoring stats about how much time the tasks spent enqueued/dequeuing/waiting/delayed, and I sometimes spend a bunch of time on that question, so "how many times / how long did I spend inside or waiting on fpu sections" and the like naturally interested me.

But no, you got the idea.

@behlendorf behlendorf added the Status: Code Review Needed (Ready for review and testing) label May 10, 2024
@robn robn mentioned this pull request May 16, 2024
@robn robn force-pushed the taskq-stats branch 3 times, most recently from c7c5184 to 92f2287 Compare May 21, 2024 23:38
@behlendorf behlendorf left a comment

Aside from the level of granularity this looks nice. I've wanted for a long time to have better visibility into the taskqs.

robn added 4 commits June 8, 2024 08:51
For spl-taskq to use the kstats infrastructure, it has to be available
first.

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Sponsored-by: Syneto
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
This exposes a variety of per-taskq stats under /proc/spl/kstat/taskq,
one file per taskq, named for the taskq name.instance.

These include a small amount of info about the taskq config, the current
state of the threads and queues, and various counters for thread and
queue activity since the taskq was created.

To assist with decrementing queue size counters, the list an entry is on
is encoded in spare bits in the entry flags.

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Sponsored-by: Syneto
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
This adds /proc/spl/kstat/taskq/summary, which attempts to show a
useful subset of stats for all taskqs in the system.

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Sponsored-by: Syneto
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
These had minimal useful information for the admin, didn't work properly
in some places, and knew far too much about taskq internals.

With the new stats available, these should never be needed anymore.

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Sponsored-by: Syneto
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>

@robn robn commented Jun 7, 2024

Last push removes the timing stats.

# cat /proc/spl/kstat/taskq/summary
3 0 0x01 1 0 640838657 117862991354
                     | threads           | tasks on queue
taskq name           | tot [act idl] max |  pend [ norm  high] dly
-------------------- | ----------------- | -----------------------
z_vdev_file.0        |   1 [  0   1]  16 |     0 [    0     0]   0
dbu_evict.0          |   1 [  0   1]   1 |     0 [    0     0]   0
arc_prune.0          |   1 [  0   1]   1 |     0 [    0     0]   0
zvol_tq-0.0          |   1 [  0   1]  32 |     0 [    0     0]   0
spl_kmem_cache.0     |   1 [  0   1]   4 |     0 [    0     0]   0
spl_dynamic_taskq.0  |   1 [  0   1]   1 |     0 [    0     0]   0
spl_delay_taskq.0    |   1 [  0   1]   4 |     0 [    0     0]   0
spl_system_taskq.0   |   1 [  0   1]  64 |     0 [    0     0]   0
# cat /proc/spl/kstat/taskq/spl_kmem_cache.0
4 1 0x01 23 6256 640915734 240317768164
name                            type data
threads_max                     4    4
entry_pool_min                  4    32
entry_pool_max                  4    2147483647
threads_active                  4    0
threads_idle                    4    1
threads_total                   4    1
tasks_pending                   4    0
tasks_priority                  4    0
tasks_total                     4    0
tasks_delayed                   4    0
entries_free                    4    32
threads_created                 4    1
threads_destroyed               4    0
tasks_dispatched                4    969
tasks_dispatched_delayed        4    0
tasks_executed_normal           4    969
tasks_executed_priority         4    0
tasks_executed                  4    969
tasks_delayed_requeued          4    0
tasks_cancelled                 4    0
thread_wakeups                  4    969
thread_wakeups_nowork           4    0
thread_sleeps                   4    970

I'll keep working on measuring throughput in a future PR; this one I think is good to go.
