
New taskq stats #16171

Open · robn wants to merge 4 commits into master from taskq-stats
Conversation

@robn robn commented May 7, 2024

Motivation and Context

I keep trying to improve performance by looking for bottlenecks with external tools, and I find something, and fix it, only to find that that thing goes away, but performance doesn't change, or changes in ways I don't expect. So I spend more and more time wondering what even is happening inside some of these black boxes. Then I decided I'd had enough, and made a start on bringing those mysteries out into the light.

So here's round 1: taskq stats. They're at the heart of so many things, and I want to know what they're up to.

Description

This gives every taskq three sets of stats (a rough sketch of one possible layout follows the list):

  • gauges (up/down counts) of current state: thread count and activity, task count and activity
  • counters of events since creation: threads created/destroyed, tasks dispatched/executed/delayed, thread wakes & sleeps
  • min/max and rolling averages of time tasks have spent in each "stage" of the pipeline (enqueue/wait/dequeue/execute)
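As a rough sketch of how these might be grouped in code, here's a guess using the existing OpenZFS wmsum_t counter type; none of the names below come from the patch itself, they're just illustrative:

#include <sys/wmsum.h>

/* Illustrative layout only: names are assumptions, not the patch's fields. */
typedef struct taskq_stats {
	/* gauges: current state, adjusted up and down as things change */
	wmsum_t	ts_threads_active;
	wmsum_t	ts_threads_idle;
	wmsum_t	ts_tasks_pending;
	wmsum_t	ts_tasks_delayed;

	/* counters: events since the taskq was created, only ever increase */
	wmsum_t	ts_threads_created;
	wmsum_t	ts_tasks_dispatched;
	wmsum_t	ts_tasks_executed;
	wmsum_t	ts_thread_wakeups;

	/* time stats (min/max and rolling averages) are sketched further down */
} taskq_stats_t;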

Each taskq gets its own kstat node in /proc/spl/kstat/taskq:

# ls /proc/spl/kstat/taskq
arc_prune.0	     summary	    z_null_int.0	z_upgrade.0
dbu_evict.0	     z_cl_int.0     z_null_iss.0	z_vdev_file.0
dp_sync_taskq.0       z_cl_iss.0     z_prefetch.0	z_wr_int.0
dp_zil_clean_taskq.0  z_flush_int.0  z_rd_int.0		z_wr_iss.0
spl_delay_taskq.0     z_flush_iss.0  z_rd_iss.0		z_zrele.0
spl_dynamic_taskq.0   z_fr_int.0     z_trim_int.0	z_zvol.0
spl_kmem_cache.0      z_fr_iss.0     z_trim_iss.0	zvol_tq-0.0
spl_system_taskq.0    z_metaslab.0   z_unlinked_drain.0

Within, we find all sorts of useful things:

$ cat /proc/spl/kstat/taskq/dp_sync_taskq.0
197 1 0x01 53 14416 1290432846656093 1291816178703629
name                            type data
threads_max                     4    8
entry_pool_min                  4    8
entry_pool_max                  4    2147483647
threads_active                  4    0
threads_idle                    4    8
threads_total                   4    8
tasks_pending                   4    0
tasks_priority                  4    0
tasks_total                     4    0
tasks_delayed                   4    0
entries_free                    4    8
threads_created                 4    8
threads_destroyed               4    0
tasks_dispatched                4    175270
tasks_dispatched_delayed        4    0
tasks_executed_normal           4    173948
tasks_executed_priority         4    1322
tasks_executed                  4    175270
tasks_delayed_requeued          4    0
tasks_cancelled                 4    0
thread_wakeups                  4    126095
thread_wakeups_nowork           4    78498
thread_sleeps                   4    126103
enqueue_time_min                4    0
enqueue_time_max                4    38
enqueue_time_avg_10             4    5
enqueue_time_avg_100            4    5
enqueue_time_avg_1000           4    4
enqueue_time_avg_10000          4    0
wait_time_min                   4    0
wait_time_max                   4    42347
wait_time_avg_10                4    2
wait_time_avg_100               4    10
wait_time_avg_1000              4    1487
wait_time_avg_10000             4    3102
dequeue_time_min                4    0
dequeue_time_max                4    45
dequeue_time_avg_10             4    0
dequeue_time_avg_100            4    0
dequeue_time_avg_1000           4    1
dequeue_time_avg_10000          4    0
execute_time_min                4    0
execute_time_max                4    44618
execute_time_avg_10             4    0
execute_time_avg_100            4    2
execute_time_avg_1000           4    207
execute_time_avg_10000          4    441
task_time_min                   4    2
task_time_max                   4    44694
task_time_avg_10                4    8
task_time_avg_100               4    18
task_time_avg_1000              4    1698
task_time_avg_10000             4    3544

The meaning of most of these should be self-explanatory. The first set (down to entries_free) reflects current state; the second set (down to thread_sleeps) counts events since the taskq was created.
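For a sense of how each per-taskq node might be wired up, here's a minimal sketch of the usual OpenZFS named-kstat pattern: kstat_create() plus a ks_update callback that only sums the wmsum counters when someone actually reads the file. The stat names, the taskq_stats_t type and the helper names are carried over from the hypothetical sketch above, not taken from the patch.

/* Sketch only; needs <sys/kstat.h>, <sys/kmem.h>, <sys/taskq.h>, <sys/wmsum.h>. */
typedef struct taskq_kstats {
	kstat_named_t	tqks_threads_active;
	kstat_named_t	tqks_tasks_pending;
	kstat_named_t	tqks_tasks_dispatched;
	/* ... one kstat_named_t per stat ... */
} taskq_kstats_t;

static taskq_kstats_t taskq_kstats_template = {
	{ "threads_active",	KSTAT_DATA_UINT64 },
	{ "tasks_pending",	KSTAT_DATA_UINT64 },
	{ "tasks_dispatched",	KSTAT_DATA_UINT64 },
};

static int
taskq_kstats_update(kstat_t *ksp, int rw)
{
	taskq_stats_t *ts = ksp->ks_private;
	taskq_kstats_t *tqks = ksp->ks_data;

	if (rw == KSTAT_WRITE)
		return (EACCES);

	/* only a read of the file pays the cost of summing the wmsums */
	tqks->tqks_threads_active.value.ui64 =
	    wmsum_value(&ts->ts_threads_active);
	tqks->tqks_tasks_pending.value.ui64 =
	    wmsum_value(&ts->ts_tasks_pending);
	tqks->tqks_tasks_dispatched.value.ui64 =
	    wmsum_value(&ts->ts_tasks_dispatched);

	return (0);
}

static void
taskq_kstats_create(taskq_t *tq, taskq_stats_t *ts, int instance)
{
	char name[64];
	kstat_t *ksp;

	/* node named "name.instance", matching the listing above */
	(void) snprintf(name, sizeof (name), "%s.%d", tq->tq_name, instance);

	ksp = kstat_create("taskq", 0, name, "misc", KSTAT_TYPE_NAMED,
	    sizeof (taskq_kstats_t) / sizeof (kstat_named_t),
	    KSTAT_FLAG_VIRTUAL);
	if (ksp == NULL)
		return;

	ksp->ks_private = ts;
	ksp->ks_update = taskq_kstats_update;
	ksp->ks_data = kmem_alloc(sizeof (taskq_kstats_t), KM_SLEEP);
	memcpy(ksp->ks_data, &taskq_kstats_template, sizeof (taskq_kstats_t));
	kstat_install(ksp);
	/* (a matching destroy path would kstat_delete() and free ks_data) */
}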

The _time_ group is the most interesting, and probably the most controversial. Each entry (taskq_ent_t) now gets a timestamp at each "step" of its journey through the taskq:

  • when the dispatch function is called
  • when it is actually added to a queue
  • when a thread starts taking an entry off the queue
  • when the task in the entry is called
  • when the task completes

The deltas between consecutive steps are then computed and named: enqueue, wait, dequeue, execute.
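To make that concrete, here's a hedged sketch; the taskq_ent_times_t struct and tqent_ts_* names are hypothetical stand-ins for whatever the patch actually adds to taskq_ent_t, with each field set via gethrtime() at the corresponding point in dispatch and in the worker thread loop:

/* Hypothetical timestamp fields, one per step; gethrtime() nanoseconds. */
typedef struct taskq_ent_times {
	hrtime_t	tqent_ts_dispatch;	/* taskq_dispatch() entered */
	hrtime_t	tqent_ts_queued;	/* entry added to a queue */
	hrtime_t	tqent_ts_dequeue;	/* thread starts taking it off */
	hrtime_t	tqent_ts_start;		/* task function called */
	hrtime_t	tqent_ts_done;		/* task function returned */
} taskq_ent_times_t;

static void
taskq_ent_times_commit(const taskq_ent_times_t *t)
{
	hrtime_t enqueue = t->tqent_ts_queued  - t->tqent_ts_dispatch;
	hrtime_t wait    = t->tqent_ts_dequeue - t->tqent_ts_queued;
	hrtime_t dequeue = t->tqent_ts_start   - t->tqent_ts_dequeue;
	hrtime_t execute = t->tqent_ts_done    - t->tqent_ts_start;
	hrtime_t total   = t->tqent_ts_done    - t->tqent_ts_dispatch;

	/*
	 * Each delta would then be fed into the per-type min/max and
	 * rolling averages (see the averaging sketch below); the casts
	 * just keep this fragment self-contained.
	 */
	(void) enqueue; (void) wait; (void) dequeue;
	(void) execute; (void) total;
}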

For each type, the min/max delta since the epoch is recorded, and the delta is then mixed into a moving average over the last 10, 100, 1000 and 10000 tasks. This makes it much easier to see the signs of, e.g., queue lock contention (higher average enqueue/dequeue times) or too few worker threads (higher average wait times).

A "total" time from dispatch to completion is also recorded in the same way.

A summary list is also included, showing all taskqs in the system with a few key values for each. The idea is to give a "quick glance" state of the world at any given moment, and invite further digging:

# cat /proc/spl/kstat/taskq/summary
3 0 0x01 1 0 8886444295 786666941903
                     | threads (current) | task on queue (current) | task time (µs)  (last 100 tasks)
taskq name           | tot [act idl] max |  pend [ norm  high] dly | avg    [enq    wait  deq   exec]
-------------------- | ----------------- | ----------------------- | --------------------------------
z_unlinked_drain.0   |   1 [  0   1]   4 |     0 [    0     0]   0 |      0 [   0      0    0      0]
z_zrele.0            |   1 [  0   1]   4 |     0 [    0     0]   0 |      0 [   0      0    0      0]
dp_zil_clean_taskq.0 |   4 [  0   4]   4 |     0 [    0     0]   0 |     10 [   4      4    0      1]
dp_sync_taskq.0      |   1 [  0   1]   1 |     0 [    0     0]   0 |    154 [   2     44    0    108]
z_upgrade.0          |   1 [  0   1]   4 |     0 [    0     0]   0 |     87 [   0      0    0     86]
z_prefetch.0         |   1 [  0   1]   4 |     0 [    0     0]   0 |      0 [   0      0    0      0]
z_metaslab.0         |   2 [  0   2]   2 |     0 [    0     0]   0 |     30 [   2     27    0      0]
...

Finally, this removes the old /proc/spl/taskq and taskq-all files.

Design and implementation notes

I'm aware of how busy taskqs get, and I've tried my best not to introduce additional bottlenecks. Gauge and counter stats are all regular wmsum_t types, and are only summed up if the kstats are actually read. The rolling averages are recorded once at the end of each task, and do not lock the taskq itself to do so, which can mean some trampling, but they are not intended to be precise. I've thrown some of my performance test workloads at it (the ones I'm trying to improve) and these changes don't appear to move the needle, so I feel at least okay about it. It'd be good to get more testing. Importantly, reading the stats does not take the taskq lock, since the stats are held separately.
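To illustrate the hot-path cost: wmsum_t is OpenZFS's existing write-mostly counter (per-CPU backed), so wmsum_add() is cheap on the dispatch/execute paths and the expensive wmsum_value() summation only happens in the kstat read callback. A minimal sketch using the hypothetical taskq_stats_t from earlier:

/* Sketch only: the ts_* fields are the hypothetical ones from above. */
static void
taskq_stats_dispatch(taskq_stats_t *ts)
{
	wmsum_add(&ts->ts_tasks_dispatched, 1);	/* counter: only ever grows */
	wmsum_add(&ts->ts_tasks_pending, 1);	/* gauge: one more task queued */
}

static void
taskq_stats_task_done(taskq_stats_t *ts)
{
	wmsum_add(&ts->ts_tasks_pending, -1);	/* gauge back down */
	wmsum_add(&ts->ts_tasks_executed, 1);
}

Reads go through wmsum_value() in the kstat update callback sketched earlier, so they never need to touch the taskq's own lock.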

The time stats have been useful already (a couple of interesting quirks discovered which I'll chase down later). They still don't quite feel right to me. I'm not sure if min/max is ever going to be useful. The averages undoubtedly are useful, but "last N tasks" is only useful during sustained operation, which many taskqs don't see. I'm actually not very good at math (moving average is my only trick 😅), so I'm hoping someone will say "oh, this is a perfect use for Numberwang!"

I'm not entirely sold on the summary list. So far, the "threads" column has been unnecessary and the "tasks on queue" column has been great (though I can take or leave "delay"). The averages column should be the most interesting, but there's no single granularity that works for all taskq types: high-throughput queues like z_wr and z_rd really need 10000, while rarely-used taskqs might be better served by 10, and there's everything in between. I did think a bit about choosing the most likely range based on total tasks, but again, that's tricky without some notion of time, and taskqs don't have a heartbeat; I would prefer to keep it that way.

Your ideas for other stats and presentation are welcome, as I would like this to be actually good. Yes, I am inviting bikeshedding 😇

Removing the old stats might not be wanted. I personally don't see the point in keeping them; they show some different things, but they're mostly about taskq configuration (flags etc.), which isn't especially useful to the reader and can't be seen anyway. When taskqs are busy, dumping their tasks is quite pointless and doesn't seem to work well anyway. Finally, the old implementation is pretty invasive, messing around inside the taskq itself. But if someone wants to keep them, it won't particularly hurt anything; it'll just be a mess.

Finally, I would like to do something similar for FreeBSD, but since we use a lot more of the native taskqueues there, it might take some effort to get anything useful. I wouldn't hold this PR up just for that, but if you know more about this area I would like to hear about it.

How Has This Been Tested?

A lot by hand, including performance tests. A full ZTS run has been completed, but since it doesn't use anything from /proc/spl/taskq I wasn't expecting it to show much and indeed, it did not.

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Performance enhancement (non-breaking change which improves efficiency)
  • Code cleanup (non-breaking change which makes code smaller or more readable)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Library ABI change (libzfs, libzfs_core, libnvpair, libuutil and libzfsbootenv)
  • Documentation (a change to man pages or other documentation)


@rincebrain

A counter for fpu_enter/fpu_exit (or just more general critical section enter/exit) calls in the taskq might also be informative, as would knowing whether your taskq triggers kmem cache growth, and I think you still have the current task in the relevant pieces of code to do it. (The only fun part might be referencing that code from inside the spl code, like why you can't easily use dbgmsg from in there.)

@robn robn commented May 7, 2024

@rincebrain

A counter for fpu_enter/fpu_exit (or just more general critical section enter/exit) calls in the taskq might also be informative, as would knowing whether your taskq triggers kmem cache growth, and I think you still have the current task in the relevant pieces of code to do it.

Hmm. I hadn't really thought much about counting what the task itself does; this was all about internal taskq behaviour.

But yes, in general I think it might be interesting to know when a task did something like block, sleep, be pre-empted, or otherwise make the thread unavailable to other tasks. I probably won't try it in this PR though.

(unless I'm completely misunderstanding you of course. If so, tell me, and say "well obviously you just tap your belt buckle thrice", and then I'll do the thing)

@rincebrain

Conceptually, it didn't strike me as that different a thing to track, since you're monitoring stats about how much time the tasks spent enqueued/dequeuing/waiting/delayed, and I sometimes spend a bunch of time on that question, so "how many times / how long did I spend inside or waiting on fpu sections" and the like naturally interested me.

But no, you got the idea.

@behlendorf behlendorf added the Status: Code Review Needed (Ready for review and testing) label May 10, 2024
@robn robn mentioned this pull request May 16, 2024
@robn robn force-pushed the taskq-stats branch 3 times, most recently from c7c5184 to 92f2287 Compare May 21, 2024 23:38
@behlendorf behlendorf left a comment

Aside from the level of granularity this looks nice. I've wanted for a long time to have better visibility into the taskqs.

robn added 4 commits June 8, 2024 08:51
For spl-taskq to use the kstats infrastructure, it has to be available
first.

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Sponsored-by: Syneto
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
This exposes a variety of per-taskq stats under /proc/spl/kstat/taskq,
one file per taskq, named for the taskq name.instance.

These include a small amount of info about the taskq config, the current
state of the threads and queues, and various counters for thread and
queue activity since the taskq was created.

To assist with decrementing queue size counters, the list an entry is on
is encoded in spare bits in the entry flags.

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Sponsored-by: Syneto
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
This adds /proc/spl/kstat/taskq/summary, which attempts to show a
useful subset of stats for all taskqs in the system.

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Sponsored-by: Syneto
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
These had minimal useful information for the admin, didn't work properly
in some places, and knew far too much about taskq internals.

With the new stats available, these should never be needed anymore.

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Sponsored-by: Syneto
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>

@robn robn commented Jun 7, 2024

Last push removes the timing stats.

# cat /proc/spl/kstat/taskq/summary
3 0 0x01 1 0 640838657 117862991354
                     | threads           | tasks on queue
taskq name           | tot [act idl] max |  pend [ norm  high] dly
-------------------- | ----------------- | -----------------------
z_vdev_file.0        |   1 [  0   1]  16 |     0 [    0     0]   0
dbu_evict.0          |   1 [  0   1]   1 |     0 [    0     0]   0
arc_prune.0          |   1 [  0   1]   1 |     0 [    0     0]   0
zvol_tq-0.0          |   1 [  0   1]  32 |     0 [    0     0]   0
spl_kmem_cache.0     |   1 [  0   1]   4 |     0 [    0     0]   0
spl_dynamic_taskq.0  |   1 [  0   1]   1 |     0 [    0     0]   0
spl_delay_taskq.0    |   1 [  0   1]   4 |     0 [    0     0]   0
spl_system_taskq.0   |   1 [  0   1]  64 |     0 [    0     0]   0
# cat /proc/spl/kstat/taskq/spl_kmem_cache.0
4 1 0x01 23 6256 640915734 240317768164
name                            type data
threads_max                     4    4
entry_pool_min                  4    32
entry_pool_max                  4    2147483647
threads_active                  4    0
threads_idle                    4    1
threads_total                   4    1
tasks_pending                   4    0
tasks_priority                  4    0
tasks_total                     4    0
tasks_delayed                   4    0
entries_free                    4    32
threads_created                 4    1
threads_destroyed               4    0
tasks_dispatched                4    969
tasks_dispatched_delayed        4    0
tasks_executed_normal           4    969
tasks_executed_priority         4    0
tasks_executed                  4    969
tasks_delayed_requeued          4    0
tasks_cancelled                 4    0
thread_wakeups                  4    969
thread_wakeups_nowork           4    0
thread_sleeps                   4    970

I'll keep working on measuring throughput in a future PR; this one I think is good to go.
