Write-through special device #15118

Open
Haravikk opened this issue Jul 28, 2023 · 6 comments · May be fixed by #16185
Labels: Type: Feature (Feature request or new feature)

Comments

Haravikk commented Jul 28, 2023

Describe the feature you would like to see added to OpenZFS

The idea is to allow a pool's special devices to be configured in a "write-through" mode, such that any record assigned to a special vdev is also written to a non-special vdev. As a result the special device would not require the same redundancy guarantee as the rest of the pool; it could be a single disk and still be perfectly safe, in the same way as a cache device.

Initially this would simply be an option when adding the device; in this way a "write-through" special device is no different from a normal one, in that you will see no benefit until special records start being written.

At a later time (possibly as a separate feature) it would be good to also be able to change a special device's mode. Switching an existing special device to "write-through" mode would cause all records currently stored on it to be copied to non-special vdevs, after which it could be lost without any data loss. Removing "write-through" mode would stop the copying of special records to non-special vdevs (but any existing copies would remain in place, as with a change of copies=N).
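
For illustration only, a hypothetical command sequence for such a mode might look like the sketch below. The `writethrough` option name is invented here and is not existing zpool syntax; `tank` and `nvme0n1` are placeholders.

```sh
# Hypothetical syntax only -- "writethrough" is not a real zpool option.
# Add a single, non-redundant special device in write-through mode:
zpool add -o writethrough=on tank special nvme0n1

# Later (possibly a separate feature): switch an existing special device to
# write-through mode, copying its current records back onto the normal vdevs:
zpool set writethrough=on tank nvme0n1
```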

How will this feature improve OpenZFS?

This will allow the use of a special vdev with less (or no) redundancy compared to the rest of the pool, without the risk of data loss, while still accelerating reads of special records (metadata, small files, etc.).

Additional context

This feature is an alternative to #15051 (record size limit for ARC/L2ARC). While #15051 would be easier to implement, this one is the more "correct" of the two: it also accelerates reads of smaller records, but in a more predictable way, since (with a suitably sized special device) it can guarantee that all such records are accelerated, rather than just whichever ones happen to remain in ARC/L2ARC long enough to be used.

There is also a side issue, #15226, for filling special vdevs, which is important but not critical for the write-through case: loss of a write-through special device should be safe (no data lost), but once it is replaced there will be no special records on the new device, so performance is lost instead. Filling is therefore important to "reload/resilver" the new device.

Haravikk added the Type: Feature label on Jul 28, 2023
rincebrain (Contributor) commented

Making a special mirror mode that would write new data out redundantly with a special case of copies=2-like behavior might be feasible, conceivably - rewriting it retroactively, I'd put my money on "not unless someone spends a fortune, and probably not then either".

amotin (Member) commented Aug 3, 2023

Special vdev may be used to solve two problems -- speed and space efficiency. From the speed perspective the proposed copies=2 logic may make sense, especially since a lot of metadata already uses copies=2, and I don't think ZFS ever reads the second copy if the first is OK. It should not be a huge deal to write the second copy to normal vdevs, but reconstruction of a lost special vdev is not supported at this time. From the space-efficiency perspective, though, it makes no sense: if the main vdev is unable to store small objects efficiently, like dRAID, then it just can't, and a redundant special vdev is the only way to do it efficiently.
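
As a rough way to see the existing copies behaviour described above on a test pool (standard OpenZFS tooling; the pool, dataset, and file names are placeholders):

```sh
# Ask for two copies of user data on a dataset (much metadata already keeps extra copies):
zfs set copies=2 tank/important

# Write a file, sync, then dump its dnode; blocks written with copies=2 carry
# two DVAs, and a read is normally satisfied from the first healthy copy:
dd if=/dev/urandom of=/tank/important/testfile bs=128k count=8
zpool sync tank
zdb -ddddd tank/important $(stat -c %i /tank/important/testfile)
```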

Haravikk (Author) commented Aug 31, 2023

Special vdev may be used to solve two problems -- speed and space efficiency. From the speed perspective the proposed copies=2 logic may make sense, especially since a lot of metadata already uses copies=2, and I don't think ZFS ever reads the second copy if the first is OK. It should not be a huge deal to write the second copy to normal vdevs, but reconstruction of a lost special vdev is not supported at this time. From the space-efficiency perspective, though, it makes no sense: if the main vdev is unable to store small objects efficiently, like dRAID, then it just can't, and a redundant special vdev is the only way to do it efficiently.

Reconstruction isn't really a priority anyway; I'll tweak the original post to make that clearer. Currently, when you add a special device you gain no benefit whatsoever until you start storing records that qualify for it, so there's no real need for the "write-through" version to be any different initially. I was just thinking in terms of what happens if you do have one and it fails: you wouldn't lose data, but even after replacing it you would lose performance, which isn't the case when a mirrored special device loses a drive (other than the cost of resilvering).

"Reloading/preloading" a special device could probably be handled as a separate feature if necessary, since both cases would benefit from the ability to load up a special device from an existing pool that doesn't have one.

GregorKopka (Contributor) commented

Basically this asks for #13460 (comment)

Maybe with the addition that all reads from eligible datasets resulting in cache misses are automatically written to this special vdev.

Haravikk (Author) commented Oct 2, 2023

While there's similarity, I'm not sure it's really the same thing; the advantage of special devices is that, sized correctly, they provide a guarantee about the location of the special data and therefore predictable performance.

While "pinning" data in ARC is interesting, it has issues with potentially making the ARC less efficient (infrequently used pinned data would prevent the ARC being used for more frequently used data) and actually specifying the data to pin (and keep it updated in a useful way), is complex. I've submitted a similar but distinct proposal as #15051 to allow cache behaviour to be set using a record size; this is much simpler and would make it possible to prevent large records from filling up the ARC so that more smaller records can be retained, but while this will help it doesn't provide the same guarantee that a correctly sized and configured special device does.

In general a special device is almost always superior to L2ARC, but the downside is the need for redundancy. This means L2ARC could have an edge when adding a minimum of two SSDs is assumed, since a special device needs at least a mirror to guarantee recovery from a single disk failure, while L2ARC doesn't care about disk failure at all. But when the number of disks and the redundancy are identical, the performance of a special device is much more predictable once it is filled (hence the need for #15226).
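
For context, the existing knobs that decide what lands on a special vdev and how full it is are shown below (pool and dataset names are placeholders):

```sh
# Make small file blocks (in addition to metadata) eligible for the special vdev;
# records at or below this size go to the special allocation class while it has space:
zfs set special_small_blocks=32K tank/data

# Check per-vdev capacity, including how full the special vdev is:
zpool list -v tank
```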

GregorKopka (Contributor) commented

I agree that special devices are superior to L2ARC in basically all aspects, except that they can't be removed and they can't accelerate access to data already existing on-disk.

The part you might have overlooked in the comment:

But I guess it would make more sense to implement this feature as a new 'cache' vdev type (call it C2 for this mental exercise) that, while being removable at any time (like L2), employs random-access writes by using space maps (similar to how normal data vdevs work), with a persistent lookup table (hosted on the C2 vdev itself) that translates from pool DVA into C2 LBA. Add a reader process that loads that lookup table on pool import to create C2 headers in the ARC (like persistent ARC). Then a writer process (that can be triggered e.g. by changing a dataset property or a zpool subcommand) that will scan a specified dataset (or all that are marked for 'cache') to copy all eligible data from permanent pool vdevs onto the C2, so that new devices can be filled with data already existing in the pool (not only new writes). Last would be extending the read path in the same places L2 is hooked, a little patch to the L2 feeder logic to ignore data that is already in C2 (to avoid double-caching), and an SPA hook (I guess) to invalidate and free deleted data.

Also:

While "pinning" data in ARC is interesting,...

I did not suggest anything like that.

tonyhutter linked a pull request on May 10, 2024 that will close this issue
tonyhutter added a commit to tonyhutter/zfs that referenced this issue May 14, 2024
Special failsafe is a feature that allows your special allocation
class vdevs ('special' and 'dedup') to fail without losing any data.  It
works by automatically backing up all special data to the pool.  This
has the added benefit that you can safely create pools with non-matching
alloc class redundancy (like a mirrored pool with a single special
device).

This behavior is controlled via two properties:

1. feature@special_failsafe - This feature flag enables the special
   failsafe subsystem.  It prevents the backed-up pool from being
   imported read/write on an older version of ZFS that does not
   support special failsafe.

2. special_failsafe - This pool property is the main on/off switch
   to control special failsafe.  If you want to use special failsafe
   simply turn it on either at creation time or with `zpool set` prior
   to adding a special alloc class device.  After special devices have
   been added, you can either leave the property on or turn it
   off, but once it's off you can't turn it back on again.

Note that special failsafe may create a performance penalty over pure
alloc class writes due to the extra backup copy write to the pool.
Alloc class reads should not be affected as they always read from DVA 0
first (the copy of the data on the special device).  It can also inflate
disk usage on dRAID pools.

Closes: openzfs#15118

Signed-off-by: Tony Hutter <hutter2@llnl.gov>
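
Based on the description above, usage would presumably look like the following; the `special_failsafe` property comes from the linked, not-yet-merged pull request, and the pool/device names are placeholders:

```sh
# Enable the failsafe at creation time, so a single non-redundant special device
# is safe in an otherwise mirrored pool:
zpool create -o special_failsafe=on tank mirror sda sdb special nvme0n1

# Or on an existing pool, turn it on prior to adding the special device:
zpool set special_failsafe=on tank
zpool add tank special nvme0n1
```
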
tonyhutter added further commits to tonyhutter/zfs referencing this issue, each with the same message as above, on May 21 (twice), May 28, May 29, May 30, Jun 4 (twice), and Jun 6, 2024.