Special failsafe feature #16185

Open · wants to merge 1 commit into master from special_failsafe

Conversation

tonyhutter
Contributor

@tonyhutter tonyhutter commented May 10, 2024

Motivation and Context

Allow your special allocation class vdevs ('special' and 'dedup') to fail without data loss.

Description

Special failsafe is a new feature that allows your special allocation class vdevs ('special' and 'dedup') to fail without losing any data. It works by automatically backing up all special data to the main pool. This has the added benefit that you can safely create pools with non-matching alloc class redundancy (like a mirrored pool with a single special device).

This behavior is controlled via two properties:

  1. feature@special_failsafe - This feature flag enables the special failsafe subsystem. It prevents the backed-up pool from being imported read/write on an older version of ZFS that does not support special failsafe.

  2. special_failsafe - This pool property is the main on/off switch to control special failsafe. If you want to use special failsafe, simply turn it on either at creation time or with zpool set prior to adding a special alloc class device. After special devices have been added, you can either leave the property on or turn it off, but once it's off you can't turn it back on again.
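For illustration, a minimal usage sketch based on the description above (pool and device names are hypothetical, and the exact syntax may differ from the final implementation):

```sh
# Option 1: enable special failsafe at pool creation time, then add a
# single, non-redundant special device to a mirrored pool.
zpool create -o special_failsafe=on tank mirror sda sdb
zpool add tank special nvme0n1

# Option 2: enable it on an existing pool before adding the special device.
zpool set special_failsafe=on tank
zpool add tank special nvme0n1

# Check the pool property and the associated feature flag.
zpool get special_failsafe,feature@special_failsafe tank

# The property can be turned off later, but once off it cannot be
# turned back on again.
zpool set special_failsafe=off tank
```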

Note that special failsafe may create a performance penalty over pure alloc class writes due to the extra backup copy write to the pool. Alloc class reads should not be affected as they always read from DVA 0 first (the copy of the data on the special device). It can also inflate disk usage on dRAID pools.

Closes: #15118

Note: This is a simpler, more elegant version of my older PR: #16073

How Has This Been Tested?

Test cases added

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Performance enhancement (non-breaking change which improves efficiency)
  • Code cleanup (non-breaking change which makes code smaller or more readable)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Library ABI change (libzfs, libzfs_core, libnvpair, libuutil and libzfsbootenv)
  • Documentation (a change to man pages or other documentation)

Checklist:

@tonyhutter tonyhutter force-pushed the special_failsafe branch 5 times, most recently from f73ddb3 to 0b0ecbb on May 14, 2024 at 22:47
@tonyhutter tonyhutter force-pushed the special_failsafe branch 2 times, most recently from 912657d to 0819183 on May 21, 2024 at 22:06
@Haravikk

Thanks so much for working on this!

Just wanted to clarify: if a special device is removed or fails (with no redundancy), does the feature flag remain enabled? It sounds like it shouldn't need to remain enabled (and would therefore need to be re-enabled before adding a replacement, non-redundant special device), but I could be misunderstanding what's going on behind the scenes.

I'm guessing that in this case (special device fails or is removed and then replaced) the new special device will be fully empty, and no existing data will be copied onto it by any means (either preemptively or reactively)? That's absolutely fine if so, especially if it makes it easier to get the feature implemented at all; I just wanted to check, as I didn't see any notes about this case – though it might be worth mentioning in the documentation either way.

@tonyhutter
Contributor Author

@Haravikk

if a special device is removed or fails (with no redundancy), does the feature flag remain enabled?

Yes, the SPA_FEATURE_SPECIAL_FAILSAFE feature flag will remain "active" if a special device fails or is removed, which I think is what you're referring to. This matches the behavior of SPA_FEATURE_ALLOCATION_CLASSES.

@tonyhutter tonyhutter force-pushed the special_failsafe branch 4 times, most recently from 49d50d4 to feec657 on June 4, 2024 at 00:47
@vaclavskala
Contributor

What is the difference between this feature and adding an SSD L2ARC with secondarycache=metadata?
In both cases writes will be limited by the speed of the main pool, and reads will be handled by the SSD. The L2ARC needs some time to fill, but with persistent L2ARC that is no longer a problem.
And when the L2ARC/special device is smaller than the metadata size, L2ARC can even be faster because it will hold only the hot metadata.
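For context, the setup being described here would look something like the following sketch (pool and device names are hypothetical):

```sh
# Add an SSD as an L2ARC (cache) device and restrict it to metadata,
# instead of adding it as a dedicated special vdev.
zpool add tank cache nvme0n1
zfs set secondarycache=metadata tank
```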

@tonyhutter
Contributor Author

@vaclavskala There's a lot of overlap, and in many use cases you could use either L2ARC+secondarycache=metadata, or special+special_failsafe interchangeably. There are some differences:

  1. This PR gives you more flexibility. Consider if you could only buy two NVMe drives to speed up your redundant pool. Prior to this PR, you could not use one NVMe for L2ARC and one NVMe for special, since special wouldn't be redundant enough. With this PR you have that option. You could then have a pool with hot large blocks on L2ARC while still guaranteeing that all metadata read operations will be fast thanks to special.

  2. L2ARC doesn't let you separate dedup data from metadata, whereas special alloc class devices do. If you have a heavily dedup'd pool, it may make more sense to dedicate all your NVMe to dedup+special_failsafe rather than L2ARC.

  3. You can set special_small_blocks on individual datasets to direct small data blocks to special, but with L2ARC you only get the less granular secondarycache=[all|none|metadata]. (See the sketch after this list.)

  4. You can set l2arc_exclude_special to have L2ARC exclude special data. This could be useful if you're using L2ARC together with special + special_small_blocks.

  5. special_failsafe is per-pool, but persistent L2ARC is controlled by a module parameter (l2arc_rebuild_enabled).

  6. You may be super paranoid about your special/dedup data and simply want another copy on the pool. That way you have alloc class device data on two different mediums: NVMe (special/dedup) and HDDs (main pool). So if your NVMe PCIe switch goes down during a firmware update, you can still import the pool from the HDDs without downtime.

  7. One downside of L2ARC is that its headers take up ARC memory. From man/man4/zfs.4:

l2arc_meta_percent=33% (uint)

Percent of ARC size allowed for L2ARC-only headers. Since L2ARC buffers are not evicted on
memory pressure, too many headers on a system with an irrationally large L2ARC can render it
slow or unusable. This parameter limits L2ARC writes and rebuilds to achieve the target.

The "irrationally large" comment here makes me think we can't just scale the L2ARC to be arbitrarily large (unlike special).

@Haravikk

Haravikk commented Jun 4, 2024

I would maybe also add to that list:

  1. The contents of the special device are a lot more predictable – if properly sized and configured, and added at creation time, a special device is guaranteed to contain all special blocks, so these will always be accessed from the faster device. Compare this to ARC/L2ARC: we don't actually have a lot of control over what stays in ARC/L2ARC beyond metadata only, all, or nothing, and it tends not to retain infrequently used records for very long, so you'll almost certainly have to go to other devices to retrieve those. There are two cases I like to use that illustrate the benefits of this:
    • Loading the contents of infrequently accessed directories – this can also be thought of as find performance, as a find search may require you to stat every entry in a directory (or directory tree), plus extended attributes in some cases. Unless the bulk of these are in ARC/L2ARC this process can be extremely slow, as it's pretty much a worst case for spinning disks (lots of often randomly distributed, tiny records). If your workload includes anything like this then you want that offloaded to an SSD.
    • ZVOLs can be tuned nicely using a special device; since a ZVOL stores "blocks" of a predictable size (effectively a minimum record size for most blocks), you can exclude them while storing everything else (ZFS' metadata) on the special device. While this will be pretty much the same as an L2ARC set to secondarycache=metadata, again you can guarantee that it's all there on the special device, and never gets evicted. This means that operations for your ZVOL(s) should predictably send all "block" activity to your main pool, and all other activity to the special device – though obviously not total separation in the context of special failsafe (since the metadata is also written through to the rest of the pool).
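A rough sketch of the ZVOL case above (names and sizes are hypothetical; with special_small_blocks left at its default of 0, only metadata is allocated on the special vdev while the ZVOL's data blocks stay on the main pool):

```sh
# Create a ZVOL whose fixed-size data "blocks" stay on the main pool;
# with special_small_blocks at the default of 0, only metadata goes to
# the special device.
zfs create -V 100G -o volblocksize=16K tank/vm-disk

# Other datasets can still push their small data blocks to the
# special device if desired.
zfs set special_small_blocks=32K tank/projects
```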

Of course, if the special device is improperly sized, L2ARC may be better/more adaptable. But with the proposed special failsafe you should actually have the option of trying the special device first; if you determine that it's too small, you can remove it and re-add the drive as an L2ARC instead.

Successfully merging this pull request may close these issues:

  • Write-through special device