vdev queue stats #16200

Open · robn wants to merge 5 commits into master

Conversation

@robn (Contributor) commented May 16, 2024

Motivation and Context

Part of my ongoing quest to understand what's happening inside the box (previously).

This time, it's counters showing what vdev_queue is up to.

Description

Adds a bunch of wmsum_t counters to every vdev_queue instance for a real device. These show the current count of IOs queued and in-flight (total and broken down by class), total IOs in/out over the lifetime of the queue, and basic aggregation counters.

The counters are exposed under /proc/spl/kstat/zfs/<pool>/vdev/<guid>/queue on Linux, or kstat.zfs.<pool>.vdev.<guid>.misc.queue on FreeBSD.

# zpool status -g
  pool: tank
 state: ONLINE
config:

	NAME                      STATE     READ WRITE CKSUM
	tank                      ONLINE       0     0     0
	 11293794978541385724    ONLINE       0     0     0
	   13809318117615536196  ONLINE       0     0     0
	   1868205675291292825   ONLINE       0     0     0
	   815484099661475330    ONLINE       0     0     0
	   14246512426141088651  ONLINE       0     0     0

errors: No known data errors

# ls -l /proc/spl/kstat/zfs/tank/vdev/*/queue
-rw-r--r-- 1 root root 0 May 16 06:27 /proc/spl/kstat/zfs/tank/vdev/13809318117615536196/queue
-rw-r--r-- 1 root root 0 May 16 06:27 /proc/spl/kstat/zfs/tank/vdev/14246512426141088651/queue
-rw-r--r-- 1 root root 0 May 16 06:27 /proc/spl/kstat/zfs/tank/vdev/1868205675291292825/queue
-rw-r--r-- 1 root root 0 May 16 06:27 /proc/spl/kstat/zfs/tank/vdev/815484099661475330/queue

# cat /proc/spl/kstat/zfs/tank/vdev/13809318117615536196/queue
20 1 0x01 45 12240 3024876135 13088804505
name                            type data
io_queued                       4    0
io_syncread_queued              4    0
io_syncwrite_queued             4    0
io_asyncread_queued             4    0
io_asyncwrite_queued            4    0
io_scrub_queued                 4    0
io_removal_queued               4    0
io_initializing_queued          4    0
io_trim_queued                  4    0
io_rebuild_queued               4    0
io_active                       4    0
io_syncread_active              4    0
io_syncwrite_active             4    0
io_asyncread_active             4    0
io_asyncwrite_active            4    0
io_scrub_active                 4    0
io_removal_active               4    0
io_initializing_active          4    0
io_trim_active                  4    0
io_rebuild_active               4    0
io_enqueued_total               4    236036
io_syncread_enqueued_total      4    11
io_syncwrite_enqueued_total     4    13054
io_asyncread_enqueued_total     4    0
io_asyncwrite_enqueued_total    4    222971
io_scrub_enqueued_total         4    0
io_removal_enqueued_total       4    0
io_initializing_enqueued_total  4    0
io_trim_enqueued_total          4    0
io_rebuild_enqueued_total       4    0
io_dequeued_total               4    236036
io_syncread_dequeued_total      4    11
io_syncwrite_dequeued_total     4    13054
io_asyncread_dequeued_total     4    0
io_asyncwrite_dequeued_total    4    222971
io_scrub_dequeued_total         4    0
io_removal_dequeued_total       4    0
io_initializing_dequeued_total  4    0
io_trim_dequeued_total          4    0
io_rebuild_dequeued_total       4    0
io_aggregated_total             4    37902
io_aggregated_data_total        4    107667
io_aggregated_read_gap_total    4    0
io_aggregated_write_gap_total   4    0
io_aggregated_shrunk_total      4    0

FreeBSD:

$ sysctl kstat.zfs.tank.vdev.3686087381038636139.misc.queue
kstat.zfs.tank.vdev.3686087381038636139.misc.queue.io_aggregated_shrunk_total: 41
kstat.zfs.tank.vdev.3686087381038636139.misc.queue.io_aggregated_write_gap_total: 0
kstat.zfs.tank.vdev.3686087381038636139.misc.queue.io_aggregated_read_gap_total: 10
kstat.zfs.tank.vdev.3686087381038636139.misc.queue.io_aggregated_data_total: 109
kstat.zfs.tank.vdev.3686087381038636139.misc.queue.io_aggregated_total: 20
kstat.zfs.tank.vdev.3686087381038636139.misc.queue.io_rebuild_dequeued_total: 0
kstat.zfs.tank.vdev.3686087381038636139.misc.queue.io_trim_dequeued_total: 0
kstat.zfs.tank.vdev.3686087381038636139.misc.queue.io_initializing_dequeued_total: 0
kstat.zfs.tank.vdev.3686087381038636139.misc.queue.io_removal_dequeued_total: 0
kstat.zfs.tank.vdev.3686087381038636139.misc.queue.io_scrub_dequeued_total: 69
kstat.zfs.tank.vdev.3686087381038636139.misc.queue.io_asyncwrite_dequeued_total: 192
kstat.zfs.tank.vdev.3686087381038636139.misc.queue.io_asyncread_dequeued_total: 1
kstat.zfs.tank.vdev.3686087381038636139.misc.queue.io_syncwrite_dequeued_total: 23
kstat.zfs.tank.vdev.3686087381038636139.misc.queue.io_syncread_dequeued_total: 42
kstat.zfs.tank.vdev.3686087381038636139.misc.queue.io_dequeued_total: 327
kstat.zfs.tank.vdev.3686087381038636139.misc.queue.io_rebuild_enqueued_total: 0
kstat.zfs.tank.vdev.3686087381038636139.misc.queue.io_trim_enqueued_total: 0
kstat.zfs.tank.vdev.3686087381038636139.misc.queue.io_initializing_enqueued_total: 0
kstat.zfs.tank.vdev.3686087381038636139.misc.queue.io_removal_enqueued_total: 0
kstat.zfs.tank.vdev.3686087381038636139.misc.queue.io_scrub_enqueued_total: 69
kstat.zfs.tank.vdev.3686087381038636139.misc.queue.io_asyncwrite_enqueued_total: 192
kstat.zfs.tank.vdev.3686087381038636139.misc.queue.io_asyncread_enqueued_total: 1
kstat.zfs.tank.vdev.3686087381038636139.misc.queue.io_syncwrite_enqueued_total: 23
kstat.zfs.tank.vdev.3686087381038636139.misc.queue.io_syncread_enqueued_total: 42
kstat.zfs.tank.vdev.3686087381038636139.misc.queue.io_enqueued_total: 327
kstat.zfs.tank.vdev.3686087381038636139.misc.queue.io_rebuild_active: 0
kstat.zfs.tank.vdev.3686087381038636139.misc.queue.io_trim_active: 0
kstat.zfs.tank.vdev.3686087381038636139.misc.queue.io_initializing_active: 0
kstat.zfs.tank.vdev.3686087381038636139.misc.queue.io_removal_active: 0
kstat.zfs.tank.vdev.3686087381038636139.misc.queue.io_scrub_active: 0
kstat.zfs.tank.vdev.3686087381038636139.misc.queue.io_asyncwrite_active: 0
kstat.zfs.tank.vdev.3686087381038636139.misc.queue.io_asyncread_active: 0
kstat.zfs.tank.vdev.3686087381038636139.misc.queue.io_syncwrite_active: 0
kstat.zfs.tank.vdev.3686087381038636139.misc.queue.io_syncread_active: 0
kstat.zfs.tank.vdev.3686087381038636139.misc.queue.io_active: 0
kstat.zfs.tank.vdev.3686087381038636139.misc.queue.io_rebuild_queued: 0
kstat.zfs.tank.vdev.3686087381038636139.misc.queue.io_trim_queued: 0
kstat.zfs.tank.vdev.3686087381038636139.misc.queue.io_initializing_queued: 0
kstat.zfs.tank.vdev.3686087381038636139.misc.queue.io_removal_queued: 0
kstat.zfs.tank.vdev.3686087381038636139.misc.queue.io_scrub_queued: 0
kstat.zfs.tank.vdev.3686087381038636139.misc.queue.io_asyncwrite_queued: 0
kstat.zfs.tank.vdev.3686087381038636139.misc.queue.io_asyncread_queued: 0
kstat.zfs.tank.vdev.3686087381038636139.misc.queue.io_syncwrite_queued: 0
kstat.zfs.tank.vdev.3686087381038636139.misc.queue.io_syncread_queued: 0
kstat.zfs.tank.vdev.3686087381038636139.misc.queue.io_queued: 0

Notes

The actual stats part is pretty unremarkable, being little more than the normal "sums & stats" boilerplate. They perhaps don't technically need to be wmsum_t, since all the changes are made under vq_lock anyway, but it follows a common pattern, and part of why I want this is to assist with removing or greatly reducing the scope of vq_lock, so wmsum_t is where they'll need to end up anyway.
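
For a sense of the shape of it, a minimal sketch of the pattern (struct, field and function names here are illustrative, not the exact ones in the patch):

#include <sys/zio.h>
#include <sys/wmsum.h>

/*
 * Illustrative only: one gauge and one lifetime total per IO class,
 * all wmsum_t, bumped under vq_lock.
 */
typedef struct vdev_queue_sums {
	wmsum_t	vqs_queued[ZIO_PRIORITY_NUM_QUEUEABLE];
	wmsum_t	vqs_enqueued_total[ZIO_PRIORITY_NUM_QUEUEABLE];
} vdev_queue_sums_t;

static void
vdev_queue_sums_enqueue(vdev_queue_sums_t *vqs, const zio_t *zio)
{
	/* caller holds vq_lock, so plain uint64_ts would also work */
	wmsum_add(&vqs->vqs_queued[zio->io_priority], 1);
	wmsum_add(&vqs->vqs_enqueued_total[zio->io_priority], 1);
}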

The more interesting part of the PR is in the SPL kstats changes. These could be a separate PR, even two, but since they have no other application (yet) it seems fair to leave them here so at least there's something to test with. (I will however make them separate PRs on request).

The main part is allowing for multi-level kstat module names. I want this so I can bolt sub-object stats (like individual vdevs) under the pool stats, as you see. For Linux it's not really complex, just a little more housekeeping. For FreeBSD, every kstat has its own "view" of the tree anyway, attached to the sysctl context, so it's quite trivial, as no cleanup code is required.
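
To make that concrete, a hedged sketch of what a caller can do with a nested module name (spa, vdev_guid, stats and nstats are stand-ins, and the exact name construction in the patch may differ); on Linux this is what ends up under /proc/spl/kstat/zfs/<pool>/vdev/<guid>/queue as shown above:

#include <sys/spa.h>
#include <sys/kstat.h>

static void
vdev_queue_kstat_sketch(spa_t *spa, uint64_t vdev_guid,
    kstat_named_t *stats, uint_t nstats)
{
	char module[256];
	kstat_t *ksp;

	/* multi-level module name: becomes a directory path in procfs */
	(void) snprintf(module, sizeof (module), "zfs/%s/vdev/%llu",
	    spa_name(spa), (u_longlong_t)vdev_guid);

	ksp = kstat_create(module, 0, "queue", "misc", KSTAT_TYPE_NAMED,
	    nstats, KSTAT_FLAG_VIRTUAL);
	if (ksp != NULL) {
		ksp->ks_data = stats;	/* caller-owned kstat_named_t array */
		kstat_install(ksp);
	}
}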

The name reuse thing, meanwhile, is the least invasive solution I could find to an annoying structural problem that came up. Every vdev_t has a vdev_queue_t that isn't easily decoupled, and now every vdev_queue_t creates some stats. During import, a tree of vdev_ts is created with the untrusted config, and then a second set with the trusted config. Both of these register kstats with the same names. The effective policy that falls out of the implementations is that the first to claim the name wins, so the untrusted vdev tree gets them. Once the pool is imported though, that tree is discarded. The trusted tree remains and becomes the active pool, but at that point it never got to register its kstats, and the original ones are gone.

Reordering the import is not really possible, as the two trees briefly coexist to copy "updated" values from the untrusted tree to the trusted one (e.g. device paths that have changed since the last import). There's no comfortable way I could find to know where in the process we are, so that stats creation could be held off until the live tree comes up. There are other options, like delaying kstat creation until first use, but in all these cases it felt dangerous to be mucking around in pool and vdev initialisation just to satisfy a quirk of the kstats system.

So instead, I effectively just changed the policy from first-wins to last-wins, and it all works out ok. There's probably a better structured "correct" way to sort it out, but I'll leave that for the eventual stats subsystem rewrite that of course is now buzzing in the back of my head 😇.

How Has This Been Tested?

Mostly through repeated pool create -> IO -> scrub -> export -> import -> IO -> export -> unload cycles, on both Linux and FreeBSD. Once the numbers looked good and things stopped complaining about replacement names and/or panicking, I declared it good.

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Performance enhancement (non-breaking change which improves efficiency)
  • Code cleanup (non-breaking change which makes code smaller or more readable)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Library ABI change (libzfs, libzfs_core, libnvpair, libuutil and libzfsbootenv)
  • Documentation (a change to man pages or other documentation)

Checklist:

@tonyhutter (Contributor) commented:

zpool iostat -q will show you instantaneous queue levels. Would it make sense to add a zpool iostat -q --totals to display the totals (rather than a separate kstat)?

robn added 5 commits May 22, 2024 09:41
Module names are mapped directly to directory names in procfs, but
nothing is done to create the intermediate directories, or remove them.
This makes it impossible to sensibly present kstats about sub-objects.

This commit loops through '/'-separated names in the full module name,
creates a separate module for each, and hooks them up with a parent
pointer and child counter, and then unrolls this on the other side when
deleting a module.

Sponsored-by: Klara, Inc.
Sponsored-by: Syneto
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
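
In sketch form (helper names and the parent/child fields are hypothetical, not the actual SPL code), the create side looks something like the following; deletion walks back up the parent pointers, decrementing the child counters and removing any module that ends up empty:

#include <sys/kstat.h>

static kstat_module_t *
kstat_module_find_or_create(const char *name)
{
	kstat_module_t *parent = NULL, *module = NULL;
	char buf[256], *p = buf, *comp;

	(void) strlcpy(buf, name, sizeof (buf));
	while ((comp = strsep(&p, "/")) != NULL) {
		/* reuse an existing module for this component, or make one */
		module = kstat_module_lookup(parent, comp);
		if (module == NULL) {
			module = kstat_module_add(parent, comp);
			module->ksm_parent = parent;		/* parent pointer */
			if (parent != NULL)
				parent->ksm_nchildren++;	/* child counter */
		}
		parent = module;
	}
	return (module);
}
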
Previously, if a kstat proc name already existed, the old one would be
kept. This makes it so the old one is discarded and the new one kept.

Arguably, a collision like this shouldn't ever happen, but during
import multiple vdev_t (and so vdev_queue_t, and so vdev_queue stats)
can exist at the same time for the same guid. There's no real way to
tell which is which without substantial refactoring in the import and
vdev init codepaths, which is probably worthwhile but not for today.

Sponsored-by: Klara, Inc.
Sponsored-by: Syneto
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
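
Roughly (helper names here are hypothetical, not the actual SPL functions), the policy change amounts to:

static void
ksm_entry_install(kstat_module_t *module, kstat_t *ksp)
{
	/* last-wins: discard any entry already registered under this name */
	kstat_t *old = ksm_entry_lookup(module, ksp->ks_name);

	if (old != NULL)
		ksm_entry_remove(module, old);

	ksm_entry_add(module, ksp);	/* newest registration wins */
}
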
This extends the existing special-case for zfs/poolname to split and
create any number of intermediate sysctl names, so that multi-level
module names are possible.

Sponsored-by: Klara, Inc.
Sponsored-by: Syneto
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
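
A rough sketch of the split (function name, buffer size and flags are illustrative, and error handling is omitted):

#include <sys/param.h>
#include <sys/kernel.h>
#include <sys/sysctl.h>

static struct sysctl_oid *
kstat_sysctl_make_path(struct sysctl_ctx_list *ctx,
    struct sysctl_oid *root, const char *module)
{
	struct sysctl_oid *node = root;
	char buf[256], *p = buf, *comp;

	(void) strlcpy(buf, module, sizeof (buf));
	while ((comp = strsep(&p, "/")) != NULL) {
		/* add (or descend into) one sysctl node per component */
		node = SYSCTL_ADD_NODE(ctx, SYSCTL_CHILDREN(node),
		    OID_AUTO, comp, CTLFLAG_RW | CTLFLAG_MPSAFE,
		    NULL, "");
		if (node == NULL)
			break;
	}
	return (node);
}
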
Normally, when trying to add a sysctl name that already exists, the
kernel rejects it with a warning. This changes the code to search for a
sysctl with the wanted name in the same root. If it exists, it is destroyed,
allowing the new one to go in.

Arguably, a collision like this shouldn't ever happen, but during
import multiple vdev_t (and so vdev_queue_t, and so vdev_queue stats)
can exist at the same time for the same guid. There's no real way to
tell which is which without substantial refactoring in the import and
vdev init codepaths, which is probably worthwhile but not for today.

Sponsored-by: Klara, Inc.
Sponsored-by: Syneto
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
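
Something like the following, assuming FreeBSD's sysctl_remove_name() to look up and destroy a child by name; details differ from the actual change:

static struct sysctl_oid *
kstat_sysctl_node_replace(struct sysctl_ctx_list *ctx,
    struct sysctl_oid *parent, const char *name)
{
	/* drop any existing sibling with the same name, recursively */
	(void) sysctl_remove_name(parent, name, 1, 1);

	/* the new node can now be created without a duplicate warning */
	return (SYSCTL_ADD_NODE(ctx, SYSCTL_CHILDREN(parent), OID_AUTO,
	    name, CTLFLAG_RW | CTLFLAG_MPSAFE, NULL, ""));
}
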
Adding a bunch of gauges and counters to show in-flight and total IOs,
with per-class breakdowns, and some aggregation counters.

Sponsored-by: Klara, Inc.
Sponsored-by: Syneto
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
@behlendorf added the Status: Code Review Needed (Ready for review and testing) label on May 29, 2024