Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

block: add support for discard #329

Open
g2p opened this issue Mar 7, 2019 · 20 comments
Open

block: add support for discard #329

g2p opened this issue Mar 7, 2019 · 20 comments

Comments

@g2p
Copy link

g2p commented Mar 7, 2019

The idea is to support operations compatible with mirage-block-unix's discard:

discard sector n signals that the n sectors starting at sector are no longer needed and the contents may be discarded. Note the contents may not actually be deleted: this is not a "secure erase".

Which I would update to say this (matching the Unix implementation):

discard sector n signals that the n sectors starting at sector are no longer needed and the contents may be discarded. Future reads will return zeroes. Note that the contents may not actually be impossible to recover: this is not a "secure erase".

@mato
Copy link
Member

mato commented Mar 7, 2019

Yes. You probably want to look at #325 first though.

@g2p
Copy link
Author

g2p commented Mar 7, 2019

Since this interface should be sector aligned, should we break consistency and take offsets and length as sectors? Would it simplify the checking the spt backend would need to do in case the implementation is specialized there (re #325 (comment))?

@mato
Copy link
Member

mato commented Mar 7, 2019

That depends on what the backend implementation needs to do. As I've never needed to invoke a discard-type op from Linux, I don't know.

The Solo5 interface I would prefer is:

solo5_result_t solo5_block_discard(solo5_off_t offset, size_t size)

with the usual restrictions (offset must be block-size-aligned, size must be a multiple of block-size).

If you find out what the backend(s) actually need(s) to do to implement a discard (i.e. at least the hvt tender and spt bindings) I'll be able to tell you more.

@g2p
Copy link
Author

g2p commented Mar 7, 2019

With spt bindings on Linux, this will use fallocate with FALLOC_FL_PUNCH_HOLE and fallback to zeroing manually if the filesystem doesn't support it.

I'm not sure about tenders/hvt because that should just validate and pass params?

I've also looked at virtio (spec amendment); it would use the VIRTIO_BLK_T_WRITE_ZEROES command with the unmap bit set (with some fallbacks which are easy to add). The offsets and lengths are expressed in 512-byte units, but the alignment requirement may be greater.

@g2p
Copy link
Author

g2p commented Mar 7, 2019

(just to document the virtio fallbacks while the spec is fresh on my mind: if VIRTIO_BLK_F_WRITE_ZEROES is not supported, do the whole thing manually. Otherwise, call VIRTIO_BLK_T_WRITE_ZEROES with unmap set to whether VIRTIO_BLK_F_DISCARD is supported)

@mato
Copy link
Member

mato commented Mar 8, 2019

With spt bindings on Linux, this will use fallocate with FALLOC_FL_PUNCH_HOLE and fallback to zeroing manually if the filesystem doesn't support it.

That man page does not actually mention anything about the call issuing a discard (or TRIM) to the hardware. Do you have a source for this?

Anyway, so the Linux backends need to issue a fallocate(fd, FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE, offset, len) call. For spt, the seccomp filter needs to ensure that SYS_fallocate can only be called with that particular value of mode. If I understand the behaviour as documented correctly, the filter should not need to check offset or len.

For hvt, you need to create a hypercall and back it with that implementation.

Before you actually start on this, can you check with the *BSD folks you have around at the retreat what the equivalent calls are there?

@mato
Copy link
Member

mato commented Mar 8, 2019

Oh. One more thing, just re-reading the manpage. It only talks about filesystem-backed fds. How do we do the equivalent if the fd is pointing directly to a block device (this is a valid and important use-case!).

@g2p
Copy link
Author

g2p commented Mar 8, 2019

Linux also implements it for block devices:
See blkdev_fallocate in https://github.com/torvalds/linux/blob/master/fs/block_dev.c

At any rate, the logic for this will be the same regardless of what the backing store is; if ENOTSUP is returned, write zeroes manually.

@g2p
Copy link
Author

g2p commented Mar 8, 2019

I don't think this exists on BSD per this recent thread: https://freebsd-arch.freebsd.narkive.com/JKZnzc4p/hole-punching-trim-etc
But since we need fallback logic regardless, that is not too problematic.

@g2p
Copy link
Author

g2p commented Mar 8, 2019

That man page does not actually mention anything about the call issuing a discard (or TRIM) to the hardware. Do you have a source for this?

That's implementation and configuration (mount options, actual device capabilities) dependent.
It's also an optimization and an implementation detail.

@mato
Copy link
Member

mato commented Mar 8, 2019

At any rate, the logic for this will be the same regardless of what the backing store is; if ENOTSUP is returned, write zeroes manually.

Right, just make sure that you do that in a single pwrite() call, that way you won't have to modify the pwrite part of the seccomp filter for spt.

@mato
Copy link
Member

mato commented Mar 9, 2019

@g2p Moving your questions from Slack here, as I'd like the design discussion to be in one place:

What would I use to allocate a buffer of zeroes from the bindings side? [...] mem_ialloc_pages + memset?

Correct.

The buffer size would be bounded, but to a fairly large value. I'm planning on using 1M for transfer efficiency, or the block size, whichever is greater.

I don't think keeping a 1M buffer around is an acceptable cost to pay just for the sake of being able to emulate discard.

Stepping back a bit, do we need to emulate it at all? Why? In my mind, discard is just an "optimization" for SSDs and/or thinly provisioned storage. If the host doesn't support it, can't it just be a no-op?

@g2p
Copy link
Author

g2p commented Mar 9, 2019

I need it to actually ensure later reads return zeroes.
If there's an error code that can unambiguously mean ENOTSUP, I'd be okay moving any fallback emulation to a layer higher than Solo5. Or an out parameter. Unfortunately it can't be done with a block_info flag as some of the implementations are based on ENOTSUP return codes, which means you don't know ahead of time.

@g2p
Copy link
Author

g2p commented Mar 9, 2019

With further thought, I'd be good with flags that ask for zeroing or not, in addition to the out parameter and higher-level fallback logic mentioned above.
I'm not sure if I would expose the additional flags to the mirage interface.
They might be good for future-proofing, I guess, but will trigger breakage on the mirage-block-unix interface.

@mato
Copy link
Member

mato commented Mar 10, 2019

I need it to actually ensure later reads return zeroes.

Ok. It seems to me then that the simplest possible interface at the Solo5 layer is that mentioned in my previous comment (#329 (comment)), with the addition of a SOLO5_R_EOPNOTSUPP to solo5_result_t, which solo5_block_discard() should return if the host/implementation does not support the operation.

In other words, no emulation, and the application flow becomes:

r = solo5_block_discard(offset, size);
if (r == SOLO5_R_EOPNOTSUPP) {
    /* It is now the application's responsibility to fall back to writing zeroes to (offset, size), if it desires that behaviour */
}

I'd be good with flags that ask for zeroing or not,

I'm not sure what you mean by that, but if you mean adding a "is discard supported" flag to the struct returned by solo5_block_info(), I don't think there is anything the application gains in the above case?

@g2p
Copy link
Author

g2p commented Mar 10, 2019

I like the EOPNOTSUPP approach very much, will do that.

Re an explicit "ask for zeroing" flag (as opposed to discarding and having uninitialized data appear), it's not something I plan to use, and after looking at the complexity of implementing both code paths in virtio, I don't think it's worth it. We'll either offer safe and fast zeroing, or let the application deal.

@mato
Copy link
Member

mato commented Mar 10, 2019

Sounds good to me.

@g2p
Copy link
Author

g2p commented Mar 10, 2019

I've started testing my virtio implementation, but it looks like Qemu doesn't support discard for virtio-blk, only for scsi. I don't know of another supported VMM on Linux so I'll drop that code for now. Still making progress on the spt/hvt backends.

@g2p
Copy link
Author

g2p commented Mar 10, 2019

@g2p
Copy link
Author

g2p commented Mar 10, 2019

I've built and run crosvm, but I only get messages from early boot, I don't think it can run Solo5's kernel.
So the virtio bits will have to be dropped for now.

g2p added a commit to g2p/solo5 that referenced this issue Mar 11, 2019
@g2p g2p mentioned this issue Mar 11, 2019
g2p added a commit to g2p/solo5 that referenced this issue Mar 11, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants