Write-through special device #15118

Open
Haravikk opened this issue Jul 28, 2023 · 6 comments · May be fixed by #16185
Labels: Type: Feature (Feature request or new feature)

Comments

Haravikk commented Jul 28, 2023

Describe the feature you would like to see added to OpenZFS

The idea is to allow a pool's special devices to be configured in a "write-through" mode, such that any record assigned to a special vdev is also written to a non-special vdev. As a result the special device would not require the same redundancy guarantee as the rest of the pool; it could be a single disk and still be perfectly safe, in the same way as a cache device.

Initially this would simply be an option when adding the device; in this way a "write-through" special device is no different from a normal one, in that you will see no benefit until special records start being written.

At a later time (possibly as a separate feature) it would be good to also be able to change a special device's mode. Switching an existing special device to "write-through" mode would cause all records currently stored on it to be copied to non-special vdevs, after which it could be lost without any data loss. Removing "write-through" mode would stop the copying of special records to non-special vdevs (but any existing copies would remain in place, as with a change of copies=N).
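
For illustration only, a hypothetical command sequence for such a mode might look like the sketch below. The `writethrough` option name is invented here and is not existing zpool syntax; `tank` and `nvme0n1` are placeholders.

```sh
# Hypothetical syntax only -- "writethrough" is not a real zpool option.
# Add a single, non-redundant special device in write-through mode:
zpool add -o writethrough=on tank special nvme0n1

# Later (possibly a separate feature): switch an existing special device to
# write-through mode, copying its current records back onto the normal vdevs:
zpool set writethrough=on tank nvme0n1
```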

How will this feature improve OpenZFS?

This will allow the use of a special vdev with less (or no) redundancy compared to the rest of the pool, without the risk of data loss, while still accelerating reads of special records (metadata, small files, etc.).

Additional context

This feature is an alternative to #15051 (record size limit for ARC/L2ARC). While #15051 would be easier to implement, this one is the more "correct" of the two: it also accelerates reads of smaller records, but in a more predictable way, since (with a suitably sized special device) it can guarantee that all such records are accelerated, rather than just whichever ones happen to remain in ARC/L2ARC long enough to be used.

There is also a side issue, #15226, for filling special vdevs, which is important but not critical for the write-through case: loss of a write-through special device should be safe (no data lost), but once it is replaced there will be no special records on the new device, so performance is lost instead. Filling is therefore important to "reload/resilver" the new device.

Haravikk added the Type: Feature label on Jul 28, 2023
rincebrain (Contributor) commented

Making a special mirror mode that would write new data out redundantly with a special case of copies=2-like behavior might be feasible, conceivably - rewriting it retroactively, I'd put my money on "not unless someone spends a fortune, and probably not then either".

amotin (Member) commented Aug 3, 2023

Special vdev may be used to solve two problems -- speed and space efficiency. From the speed perspective the proposed copies=2 logic may make sense, especially since a lot of metadata already uses copies=2, and I don't think ZFS ever reads the second copy if the first is OK. It should not be a huge deal to write the second copy to normal vdevs, but reconstruction of a lost special vdev is not supported at this time. From the space-efficiency perspective, though, it makes no sense: if the main vdev is unable to store small objects efficiently, like dRAID, then it just can't, and a redundant special vdev is the only way to do it efficiently.
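
As a rough way to see the existing copies behaviour described above on a test pool (standard OpenZFS tooling; the pool, dataset, and file names are placeholders):

```sh
# Ask for two copies of user data on a dataset (much metadata already keeps extra copies):
zfs set copies=2 tank/important

# Write a file, sync, then dump its dnode; blocks written with copies=2 carry
# two DVAs, and a read is normally satisfied from the first healthy copy:
dd if=/dev/urandom of=/tank/important/testfile bs=128k count=8
zpool sync tank
zdb -ddddd tank/important $(stat -c %i /tank/important/testfile)
```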

Haravikk (Author) commented Aug 31, 2023

Special vdev may be used to solve two problems -- speed and space efficiency. From the speed perspective the proposed copies=2 logic may make sense, especially since a lot of metadata already uses copies=2, and I don't think ZFS ever reads the second copy if the first is OK. It should not be a huge deal to write the second copy to normal vdevs, but reconstruction of a lost special vdev is not supported at this time. From the space-efficiency perspective, though, it makes no sense: if the main vdev is unable to store small objects efficiently, like dRAID, then it just can't, and a redundant special vdev is the only way to do it efficiently.

Reconstruction isn't really a priority anyway; I'll tweak the original post to make that clearer. Currently, when you add a special device you gain no benefit whatsoever until you start storing records that qualify for it, so there's no real need for the "write-through" version to be any different initially. I was just thinking in terms of what happens if you do have one and it fails: you wouldn't lose data, but even after replacing it you would lose performance, which isn't the case when a mirrored special device loses a drive (other than the cost of resilvering).

"Reloading/preloading" a special device could probably be handled as a separate feature if necessary, since both cases would benefit from the ability to load up a special device from an existing pool that doesn't have one.

GregorKopka (Contributor) commented

Basically this asks for #13460 (comment)

Maybe with the addition that all reads from eligible datasets resulting in cache misses are automatically written to this special vdev.

Haravikk (Author) commented Oct 2, 2023

While there's similarity, I'm not sure it's really the same thing; the advantage of special devices is that, sized correctly, they provide a guarantee about the location of the special data and therefore predictable performance.

While "pinning" data in ARC is interesting, it has issues with potentially making the ARC less efficient (infrequently used pinned data would prevent the ARC being used for more frequently used data) and actually specifying the data to pin (and keep it updated in a useful way), is complex. I've submitted a similar but distinct proposal as #15051 to allow cache behaviour to be set using a record size; this is much simpler and would make it possible to prevent large records from filling up the ARC so that more smaller records can be retained, but while this will help it doesn't provide the same guarantee that a correctly sized and configured special device does.

In general a special device is almost always superior to L2ARC, but the downside is the need for redundancy. This means L2ARC could have an edge when adding a minimum of two SSDs is assumed, since a special device needs at least a mirror to guarantee recovery from a single disk failure, while L2ARC doesn't care about disk failure at all. But when the number of disks and the redundancy are identical, the performance of a special device is much more predictable once it is filled (hence the need for #15226).
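
For context, the existing knobs that decide what lands on a special vdev and how full it is are shown below (pool and dataset names are placeholders):

```sh
# Make small file blocks (in addition to metadata) eligible for the special vdev;
# records at or below this size go to the special allocation class while it has space:
zfs set special_small_blocks=32K tank/data

# Check per-vdev capacity, including how full the special vdev is:
zpool list -v tank
```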

GregorKopka (Contributor) commented

I agree that special devices are superior to L2ARC in basically all aspects, except that they can't be removed and they can't accelerate access to data already existing on-disk.

The part you might have overlooked in the comment:

But I guess it would make more sense to implement this feature as a new 'cache' vdev type (call it C2 for this mental exercise) that, while being removable at any time (like L2), employs random-access writes by using space maps (similar to how normal data vdevs work), with a persistent lookup table (hosted on the C2 vdev itself) that translates from pool DVA into C2 LBA. Add a reader process that loads that lookup table on pool import to create C2 headers in the ARC (like persistent ARC). Then a writer process (that can be triggered e.g. by changing a dataset property or a zpool subcommand) that will scan a specified dataset (or all that are marked for 'cache') to copy all eligible data from permanent pool vdevs onto the C2, so that new devices can be filled with data already existing in the pool (not only new writes). Last would be extending the read path in the same places L2 is hooked, a little patch to the L2 feeder logic to ignore data that is already in C2 (to avoid double-caching), and an SPA hook (I guess) to invalidate and free deleted data.

Also:

While "pinning" data in ARC is interesting,...

I did not suggest anything like that.

tonyhutter linked a pull request on May 10, 2024 that will close this issue
tonyhutter added a commit to tonyhutter/zfs that referenced this issue May 14, 2024
Special failsafe is a feature that allows your special allocation
class vdevs ('special' and 'dedup') to fail without losing any data.  It
works by automatically backing up all special data to the pool.  This
has the added benefit that you can safely create pools with non-matching
alloc class redundancy (like a mirrored pool with a single special
device).

This behavior is controlled via two properties:

1. feature@special_failsafe - This feature flag enables the special
   failsafe subsystem.  It prevents the backed-up pool from being
   imported read/write on an older version of ZFS that does not
   support special failsafe.

2. special_failsafe - This pool property is the main on/off switch
   to control special failsafe.  If you want to use special failsafe
   simply turn it on either at creation time or with `zpool set` prior
   to adding a special alloc class device.  After special devices have
   been added, you can either leave the property on or turn it
   off, but once it's off you can't turn it back on again.

Note that special failsafe may create a performance penalty over pure
alloc class writes due to the extra backup copy write to the pool.
Alloc class reads should not be affected as they always read from DVA 0
first (the copy of the data on the special device).  It can also inflate
disk usage on dRAID pools.

Closes: openzfs#15118

Signed-off-by: Tony Hutter <hutter2@llnl.gov>
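
Based on the description above, usage would presumably look like the following; the `special_failsafe` property comes from the linked, not-yet-merged pull request, and the pool/device names are placeholders:

```sh
# Enable the failsafe at creation time, so a single non-redundant special device
# is safe in an otherwise mirrored pool:
zpool create -o special_failsafe=on tank mirror sda sdb special nvme0n1

# Or on an existing pool, turn it on prior to adding the special device:
zpool set special_failsafe=on tank
zpool add tank special nvme0n1
```
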
tonyhutter added further commits to tonyhutter/zfs referencing this issue, each with the same message as above, on May 21 (twice), May 28, May 29, May 30, Jun 4 (twice), and Jun 6, 2024.