Add a diagnostic kstat for obtaining pool status #16026

Open: wants to merge 1 commit into master

Conversation

@don-brady (Contributor):

Motivation and Context

A process hung in the pool can be left holding the spa config lock or the spa namespace lock. If an admin then tries to observe the status of the pool using the traditional 'zpool status', that command can itself hang waiting on one of the locks held by the stuck process. It would be nice to observe pool status in this scenario without the risk of the inquiry hanging.

Description

Exploring Solutions

  1. Ignore the locks. Here the admin knows that the pool state is not changing (they are not adding/removing disks, changing the config, etc.), so we can traverse the data structures without the protection normally required to guard against them changing underneath us. Two methods come to mind:
  • (a) Add a new kstat entry that reports the pool status in JSON and ignores any locking when gathering the pool stats.
  • (b) Add an '--unsafe' (or '--ignore-locking') flag to 'zpool status' which tells the kernel to ignore locking when gathering the pool stats.

  2. Infer that a lock is stuck (held for an extended period) and conclude that locking is not required to read the pool stats. This is a variant of option 1, where the source code, rather than the admin, determines that it is safe to ignore locking since the pool configuration cannot be changing.

  3. Refactor the spa code to use finer-grained locking, perhaps with reader/writer locks in lieu of mutexes, to alleviate the obvious points of lock contention when a pool gets stuck, and stop holding these global-scope locks across disk I/O.

This change implements option 1a -- adding a kstat at zfs/<pool>/status.json which ignores any locking. This kstat can be used for investigations when pools are in a hung state while holding the global locks required for a traditional 'zpool status' to proceed. A sketch of the registration is shown below.
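For reference, a minimal sketch of how such a lockless per-pool kstat could be wired up. kstat_create(), kstat_set_raw_ops(), kmem_asprintf(), and KSTAT_FLAG_NO_HEADERS are existing OpenZFS SPL interfaces; spa_generate_json_stats() stands in for the PR's JSON rendering code and is hypothetical here. This is not the PR's actual code:

/*
 * Sketch only: register a raw kstat at
 * /proc/spl/kstat/zfs/<pool>/status.json whose data callback renders
 * pool status as JSON without taking spa_namespace_lock or the spa
 * config lock.
 */
static int
spa_status_json_data(char *buf, size_t size, void *data)
{
	spa_t *spa = (spa_t *)data;

	/* Deliberately no spa_config_enter() or mutex_enter() here. */
	return (spa_generate_json_stats(spa, buf, size)); /* hypothetical */
}

static void *
spa_status_json_addr(kstat_t *ksp, loff_t n)
{
	/* Single-record raw kstat: index 0 maps to the spa_t. */
	return (n == 0 ? ksp->ks_private : NULL);
}

void
spa_status_json_kstat_init(spa_t *spa)
{
	char *module = kmem_asprintf("zfs/%s", spa_name(spa));
	kstat_t *ksp = kstat_create(module, 0, "status.json", "misc",
	    KSTAT_TYPE_RAW, 0, KSTAT_FLAG_VIRTUAL);

	if (ksp != NULL) {
		ksp->ks_private = spa;
		ksp->ks_flags |= KSTAT_FLAG_NO_HEADERS;
		kstat_set_raw_ops(ksp, NULL, spa_status_json_data,
		    spa_status_json_addr);
		kstat_install(ksp);
	}
	kmem_strfree(module);
}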

NOTE: This kstat is not safe to use while pools are in the process of configuration changes (e.g., adding/removing devices). Therefore, this kstat is not intended to be a general replacement or alternative to using 'zpool status'.

Sponsored-by: Wasabi Technology, Inc.
Sponsored-by: Klara Inc.

How Has This Been Tested?

  1. Added a new zpool_status_kstat_pos test to validate the JSON output.
  2. Manually tested the kstat with various pool configurations in various states of health.
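For example, one quick way to sanity-check that the kstat emits well-formed JSON (jq exits nonzero on a parse error):

$ sudo cat /proc/spl/kstat/zfs/tank/status.json | jq empty && echo valid
valid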

Sample kstat output (degraded mirror):

$ sudo cat /proc/spl/kstat/zfs/tank/status.json | jq
{
  "status_json_version": 4,
  "scl_config_lock": true,
  "scan_error": 0,
  "scan_stats": {
    "func": "RESILVER",
    "state": "FINISHED",
    "start_time": 1710901463,
    "end_time": 1710901463,
    "to_examine": 165888,
    "examined": 165888,
    "processed": 316416,
    "errors": 0,
    "pass_exam": 0,
    "pass_start": 1711398055,
    "pass_scrub_pause": 0,
    "pass_scrub_spent_paused": 0,
    "pass_issued": 0,
    "issued": 113664
  },
  "state": "DEGRADED",
  "version": 5000,
  "name": "tank",
  "txg": 255,
  "pool_guid": 6645716700149381000,
  "errata": 0,
  "hostname": "discovery1",
  "com.delphix:has_per_vdev_zaps": true,
  "features_for_read": {
    "com.delphix:hole_birth": true,
    "com.delphix:embedded_data": true,
    "com.klarasystems:vdev_zaps_v2": true
  },
  "load_info": {
    "hostname": "discovery1",
    "enabled_feat": {
      "org.illumos:edonr": 0,
      "com.delphix:redaction_list_spill": 0,
      "org.zfsonlinux:large_dnode": 0,
      "com.delphix:bookmark_written": 0,
      "org.illumos:sha512": 0,
      "org.openzfs:raidz_expansion": 0,
      "org.freebsd:zstd_compress": 0,
      "com.joyent:multi_vdev_crash_dump": 0,
      "org.illumos:lz4_compress": 1,
      "org.illumos:skein": 0,
      "com.delphix:redaction_bookmarks": 0,
      "com.datto:encryption": 0,
      "com.delphix:extensible_dataset": 1,
      "com.datto:bookmark_v2": 0,
      "com.delphix:head_errlog": 1,
      "com.klarasystems:vdev_zaps_v2": 1,
      "com.delphix:hole_birth": 1,
      "com.delphix:redacted_datasets": 0,
      "org.open-zfs:large_blocks": 0,
      "com.delphix:embedded_data": 1,
      "org.openzfs:blake3": 0,
      "com.delphix:device_removal": 0,
      "org.openzfs:draid": 0,
      "com.delphix:empty_bpobj": 0,
      "com.delphix:obsolete_counts": 0,
      "org.zfsonlinux:allocation_classes": 0,
      "org.zfsonlinux:project_quota": 1,
      "org.openzfs:device_rebuild": 0,
      "com.delphix:livelist": 0,
      "com.fudosecurity:block_cloning": 0,
      "com.datto:resilver_defer": 0,
      "com.delphix:enabled_txg": 36,
      "com.delphix:spacemap_v2": 1,
      "com.delphix:zpool_checkpoint": 0,
      "org.zfsonlinux:userobj_accounting": 1,
      "org.openzfs:zilsaxattr": 0,
      "com.delphix:bookmarks": 0,
      "com.delphix:async_destroy": 0,
      "com.delphix:log_spacemap": 1,
      "com.delphix:spacemap_histogram": 18,
      "com.joyent:filesystem_limits": 0
    },
    "can_rdonly": true,
    "rewind_txg_ts": 1711139069,
    "seconds_of_rewind": -1711139069,
    "verify_meta_errors": 0,
    "verify_data_errors": 0
  },
  "spa_props": {
    "name": {
      "source": "ZPROP_SRC_NONE",
      "value": "tank"
    },
    "size": {
      "source": "ZPROP_SRC_NONE",
      "value": 503316480
    },
    "allocated": {
      "source": "ZPROP_SRC_NONE",
      "value": 167424
    },
    "free": {
      "source": "ZPROP_SRC_NONE",
      "value": 503149056
    },
    "checkpoint": {
      "source": "ZPROP_SRC_NONE",
      "value": 0
    },
    "fragmentation": {
      "source": "ZPROP_SRC_NONE",
      "value": 0
    },
    "expandsize": {
      "source": "ZPROP_SRC_NONE",
      "value": 0
    },
    "readonly": {
      "source": "ZPROP_SRC_NONE",
      "value": 0
    },
    "capacity": {
      "source": "ZPROP_SRC_NONE",
      "value": 0
    },
    "dedupratio": {
      "source": "ZPROP_SRC_NONE",
      "value": 100
    },
    "bcloneused": {
      "source": "ZPROP_SRC_NONE",
      "value": 0
    },
    "bclonesaved": {
      "source": "ZPROP_SRC_NONE",
      "value": 0
    },
    "bcloneratio": {
      "source": "ZPROP_SRC_NONE",
      "value": 100
    },
    "health": {
      "source": "ZPROP_SRC_NONE",
      "value": 6
    },
    "version": {
      "source": "ZPROP_SRC_DEFAULT",
      "value": 5000
    },
    "load_guid": {
      "source": "ZPROP_SRC_NONE",
      "value": 10978401526797912000
    },
    "freeing": {
      "source": "ZPROP_SRC_NONE",
      "value": 0
    },
    "leaked": {
      "source": "ZPROP_SRC_NONE",
      "value": 0
    },
    "guid": {
      "source": "ZPROP_SRC_NONE",
      "value": 6645716700149381000
    },
    "maxblocksize": {
      "source": "ZPROP_SRC_NONE",
      "value": 16777216
    },
    "maxdnodesize": {
      "source": "ZPROP_SRC_NONE",
      "value": 16384
    }
  },
  "initial_load_time": [
    1711398055,
    453276435
  ],
  "error_count": 0,
  "suspended": false,
  "feature_stats": {
    "com.delphix:async_destroy": 0,
    "com.delphix:empty_bpobj": 0,
    "org.illumos:lz4_compress": 1,
    "com.joyent:multi_vdev_crash_dump": 0,
    "com.delphix:spacemap_histogram": 21,
    "com.delphix:enabled_txg": 36,
    "com.delphix:hole_birth": 1,
    "com.delphix:extensible_dataset": 1,
    "com.delphix:embedded_data": 1,
    "com.delphix:bookmarks": 0,
    "com.joyent:filesystem_limits": 0,
    "org.open-zfs:large_blocks": 0,
    "org.zfsonlinux:large_dnode": 0,
    "org.illumos:sha512": 0,
    "org.illumos:skein": 0,
    "org.illumos:edonr": 0,
    "org.zfsonlinux:userobj_accounting": 1,
    "com.datto:encryption": 0,
    "org.zfsonlinux:project_quota": 1,
    "com.delphix:device_removal": 0,
    "com.delphix:obsolete_counts": 0,
    "com.delphix:zpool_checkpoint": 0,
    "com.delphix:spacemap_v2": 1,
    "org.zfsonlinux:allocation_classes": 0,
    "com.datto:resilver_defer": 0,
    "com.datto:bookmark_v2": 0,
    "com.delphix:redaction_bookmarks": 0,
    "com.delphix:redacted_datasets": 0,
    "com.delphix:bookmark_written": 0,
    "com.delphix:log_spacemap": 1,
    "com.delphix:livelist": 0,
    "org.openzfs:device_rebuild": 0,
    "org.freebsd:zstd_compress": 0,
    "org.openzfs:draid": 0,
    "org.openzfs:zilsaxattr": 0,
    "com.delphix:head_errlog": 1,
    "org.openzfs:blake3": 0,
    "com.fudosecurity:block_cloning": 0,
    "com.klarasystems:vdev_zaps_v2": 1,
    "com.delphix:redaction_list_spill": 0,
    "org.openzfs:raidz_expansion": 0
  },
  "vdev_tree": {
    "type": "root",
    "id": 0,
    "guid": 6645716700149381000,
    "vdev_children": 1,
    "children": [
      {
        "type": "mirror",
        "id": 0,
        "guid": 10500366840367100000,
        "asize": 519569408,
        "ashift": 9,
        "offline": false,
        "faulted": false,
        "degraded": false,
        "removed": false,
        "not_present": false,
        "is_log": false,
        "state": "DEGRADED",
        "vs_scan_removing": false,
        "vs_noalloc": false,
        "vs_resilver_deferred": false,
        "resilver_repair": "none",
        "initialize_state": {
          "vs_initialize_state": "VDEV_INITIALIZE_NONE",
          "vs_initialize_bytes_done": 0,
          "vs_initialize_bytes_est": 0,
          "vs_initialize_action_time": 0
        },
        "trim_state": {
          "vs_trim_state": "VDEV_UNTRIMMED",
          "vs_trim_action_time": 0,
          "vs_trim_bytes_done": 0,
          "vs_trim_bytes_est": 0
        },
        "read_errors": 0,
        "write_errors": 0,
        "checksum_errors": 0,
        "slow_ios": 0,
        "trim_errors": 0,
        "vdev_children": 2,
        "children": [
          {
            "type": "file",
            "id": 0,
            "guid": 135275325128040140,
            "asize": 519569408,
            "ashift": 9,
            "whole_disk": true,
            "offline": false,
            "faulted": false,
            "degraded": false,
            "removed": false,
            "not_present": false,
            "is_log": false,
            "path": "/var/tmp/zfs-vdev",
            "state": "HEALTHY",
            "vs_scan_removing": false,
            "vs_noalloc": false,
            "vs_resilver_deferred": false,
            "resilver_repair": "none",
            "initialize_state": {
              "vs_initialize_state": "VDEV_INITIALIZE_NONE",
              "vs_initialize_bytes_done": 0,
              "vs_initialize_bytes_est": 0,
              "vs_initialize_action_time": 0
            },
            "trim_state": {
              "vs_trim_state": "VDEV_UNTRIMMED",
              "vs_trim_action_time": 0,
              "vs_trim_bytes_done": 0,
              "vs_trim_bytes_est": 0
            },
            "read_errors": 0,
            "write_errors": 0,
            "checksum_errors": 0,
            "slow_ios": 0,
            "trim_errors": 0
          },
          {
            "type": "file",
            "id": 1,
            "guid": 15451542793245470000,
            "asize": 0,
            "ashift": 0,
            "whole_disk": true,
            "offline": false,
            "faulted": false,
            "degraded": false,
            "removed": false,
            "not_present": true,
            "is_log": false,
            "path": "/var/tmp/zfs-vdev2",
            "state": "UNAVAIL",
            "vs_scan_removing": false,
            "vs_noalloc": false,
            "vs_resilver_deferred": false,
            "resilver_repair": "none",
            "initialize_state": {
              "vs_initialize_state": "VDEV_INITIALIZE_NONE",
              "vs_initialize_bytes_done": 0,
              "vs_initialize_bytes_est": 0,
              "vs_initialize_action_time": 0
            },
            "trim_state": {
              "vs_trim_state": "VDEV_UNTRIMMED",
              "vs_trim_action_time": 0,
              "vs_trim_bytes_done": 0,
              "vs_trim_bytes_est": 0
            },
            "read_errors": 0,
            "write_errors": 0,
            "checksum_errors": 0,
            "slow_ios": 0,
            "trim_errors": 0
          }
        ]
      }
    ]
  }
}
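As one illustration of consuming this output, the per-vdev error counters can be pulled out with jq (pool name and paths as in the sample above):

$ sudo cat /proc/spl/kstat/zfs/tank/status.json | \
    jq '[.vdev_tree | .. | objects | select(has("read_errors")) |
        {path, state, read_errors, write_errors, checksum_errors}]'

On the degraded mirror above, this yields one entry per vdev, including the UNAVAIL /var/tmp/zfs-vdev2 leaf.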

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Performance enhancement (non-breaking change which improves efficiency)
  • Code cleanup (non-breaking change which makes code smaller or more readable)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Library ABI change (libzfs, libzfs_core, libnvpair, libuutil and libzfsbootenv)
  • Documentation (a change to man pages or other documentation)

Commit message:

This kstat output does not require taking the spa_namespace lock, as is the case for 'zpool status'. It can be used for investigations when pools are in a hung state while holding global locks required for a traditional 'zpool status' to proceed.

This kstat is not safe to use while pools are in the process of configuration changes (e.g., adding/removing devices). Therefore, this kstat is not intended to be a general replacement or alternative to using 'zpool status'.

Sponsored-by: Wasabi Technology, Inc.
Sponsored-by: Klara Inc.

Co-authored-by: Don Brady <don.brady@klarasystems.com>
Signed-off-by: Don Brady <don.brady@klarasystems.com>
@behlendorf added the Status: Code Review Needed (Ready for review and testing) label on Mar 26, 2024.
@tonyhutter (Contributor):

Overall, I think JSON output is a good thing. It's something that's been on the ZFS wishlist since the dawn of time, and there have been some aborted attempts over the years to implement it.

Just some initial thoughts before I look at the code:

  1. Even though this is a diagnostic kstat, everyone is going to want it for normal use since it's the only way to get JSON output. It will become a set-in-stone API. And if we're going to do a JSON API, we should just take the plunge and add it to the zpool commands proper. Having the JSON go through zpool status also allows us to later tack on the zpool status -c|-s fields to the JSON output, which would be nice.

  2. If the desire is for a lockless option to get zpool status info (either in JSON or regular zpool status), then I like your 1b option to have a zpool status --ignore-locking. So you could do zpool status --ignore-locking or zpool status --ignore-locking --json depending on what you want. JSON in zpool status also allows you to get output from multiple pools in one command.

  3. We may want to begin by just exposing the limited, historical zpool status|get fields, and go from there. That would give us 90% of what people want, while initially limiting the scope to well-known, documented fields. It also gets us out of the business of API-ifying esoteric internal variables like:

  "seconds_of_rewind": -1711139069,
...
  "initial_load_time": [
    1711398055,
    453276435
  ],
...
   "id": 0,

@geoffamey:

One major advantage that I see to this being a lockless kstat file is the ability for it to be parsed and used by metrics exporters like node_exporter, which currently uses the /proc/spl/kstat/zfs/<pool>/state file (among others).

Frequently, these tools are deployed in containers, where /proc/ is visible, and they are required to operate without calling third-party binaries or even having the zfs userspace utilities installed.
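For instance, a containerized exporter with only /proc mounted could derive pool health from the proposed kstat with nothing more than a JSON parser, and no zfs binaries installed (pool name assumed to be tank, jq standing in for the exporter's parser):

$ jq -r '.state' /proc/spl/kstat/zfs/tank/status.json
DEGRADED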

@tonyhutter (Contributor):

> Having the JSON go through zpool status also allows us to later tack on the zpool status -c|-s fields to the JSON output, which would be nice.
> ...
> JSON in zpool status also allows you to get output from multiple pools in one command.

> One major advantage that I see to this being a lockless kstat file is the ability for it to be parsed and used by metrics exporters like node_exporter, which currently uses the /proc/spl/kstat/zfs/<pool>/state file

Just thinking out loud - we could potentially combine the two, and have zpool status --json grab the output from each of the pools in /proc/spl/kstat/zfs/<pool>/status.json, while tacking on any JSON from zpool status -c|-s output.

@usaleem-ix (Contributor):

I see we are adding a new print interface for JSON (jprint.h and jprint.c), and there is also an nvlist_to_json in spa_json_stats.c. I am just curious whether all the information could be collected in nvlists and printed in JSON format using the existing nvlist_print_json from libnvpair, instead of adding new JSON print infrastructure.

If there are good reasons to add a new interface for JSON, perhaps it would be better to make it available in userspace too, so it can be utilized there as well.

The reason I am bringing this up here is that I am also working on something similar to this, adding JSON output for things like zfs get, zfs list, zpool get, zpool list and so on. I am currently using existing libnvpair infrastructure to collect information in nvlists and later print it in JSON format using nvlist_print_json.
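For illustration, a minimal userspace sketch of that approach: collect values into an nvlist and emit JSON with libnvpair's existing nvlist_print_json(). The property names here are placeholders, not the PR's actual fields:

/* Sketch: emit JSON via libnvpair's existing nvlist_print_json(). */
#include <stdio.h>
#include <libnvpair.h>

int
main(void)
{
	nvlist_t *nvl;

	if (nvlist_alloc(&nvl, NV_UNIQUE_NAME, 0) != 0)
		return (1);

	/* Placeholder properties standing in for collected pool stats. */
	(void) nvlist_add_string(nvl, "name", "tank");
	(void) nvlist_add_uint64(nvl, "version", 5000);
	(void) nvlist_add_boolean_value(nvl, "suspended", B_FALSE);

	/* Existing libnvpair JSON printer; no new infrastructure needed. */
	(void) nvlist_print_json(stdout, nvl);
	(void) printf("\n");

	nvlist_free(nvl);
	return (0);
}

Built against libnvpair (cc demo.c -lnvpair), this prints the nvlist as compact JSON.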
