
Enable block operations for more than 1 block #325

Open

ricarkol opened this issue Feb 19, 2019 · 12 comments

Comments

@ricarkol
Collaborator

ricarkol commented Feb 19, 2019

This discussion started in PR #315

The issue is that the solo5 block interface is limited to one block at a time, and the block is too small. The problem with that is that it's inefficient: it usually involves too many VM exits for hvt, or too many syscalls for spt. The following experiment quantifies that. It shows the time in seconds to read a 1GB file sequentially, for an increasing block size, on both hvt and spt (*):

block size (bytes)    hvt (s)    spt (s)
512                   5.690      0.463
1024                  2.881      0.473
4096                  0.776      0.171
8192                  0.434      0.126
16384                 0.259      0.103

spt is already pretty fast at 512 bytes, but it can be 4x faster by increasing the block size to 8192. The point of this experiment is not necessarily to increase the block size, but to allow multi-block requests instead. Filesystems already try to perform I/O in larger blocks (usually 4096 bytes): it would be nice not to split them.

(*) Details of the experiment. This is the unikernel: https://gist.github.com/ricarkol/b66c899edd96fd7f8fb2fbaeabad0694#file-solo5-blksize-test-c. This is how the block size was changed: https://gist.github.com/ricarkol/b66c899edd96fd7f8fb2fbaeabad0694#file-blk-size-diff. The host was Ubuntu 18.04.1 running on an Intel(R) Xeon(R) CPU E3-1270 v6 @ 3.80GHz. The disk was stored as a file on an SSD formatted with ext4 (caches were not dropped before each experiment).
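For reference, the benchmark loop has roughly the following shape (a minimal sketch, not the linked gist; it assumes the handle-less block API Solo5 had at the time):

```c
#include "solo5.h"

/* Sketch: read the whole device sequentially one block at a time, so
 * every solo5_block_read() costs one hvt exit / one spt syscall. */
static void read_sequential(void)
{
    struct solo5_block_info bi;
    solo5_block_info(&bi);

    static uint8_t buf[16384];          /* assumed >= bi.block_size */
    for (solo5_off_t off = 0; off < bi.capacity; off += bi.block_size) {
        if (solo5_block_read(off, buf, bi.block_size) != SOLO5_R_OK)
            solo5_exit(1);
    }
}
```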

@mato
Member

mato commented Feb 20, 2019

The issue is that the solo5 block interface is limited to one block at a time, and the block is too small. The problem with that is that it's inefficient: it usually involves too many VM exits for hvt, or too many syscalls for spt. The following experiment quantifies that. [...]

The following comments from #315 are also relevant to this discussion:

Regarding atomicity guarantees and the "ideal block size" at the solo5_block API level: I'd have to study the problem in depth, so I can't give any informed opinion right now.

So, concentrating on the core of the issue that can be addressed immediately, which is allowing block I/O in multiples of the block size per request: in order to do that, the following needs to be done:

  1. hvt, spt: If the solo5 virtual block device is backed by a file on the host, we need to ensure that the guest cannot write beyond the end of the file. I.e., in any pwrite(fd, buf, count, offset), ensure that (offset + count) <= capacity holds, accounting for overflow(!); see the sketch after this list.
    • for hvt, this is straightforward as the tender can do the check in the hypercall implementation
    • for spt, I don't think this kind of filter can be expressed with libseccomp, but I may be wrong (google around for code using libseccomp ...). If not, it would imply dropping libseccomp and writing and installing the BPF filter manually, which is a fair chunk of work.
  2. virtio: AFAIR our virtio-block code does not support writes of >1 sector, so it would need to be updated to do that, and tested.
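As an illustration of the check in point 1, the hvt tender's hypercall implementation could do something like this (a minimal sketch; the names are mine, not the tender's actual code):

```c
#include <stdbool.h>
#include <stdint.h>

/* Return true iff [offset, offset + count) lies within the backing file.
 * Written in subtraction form so that (offset + count) cannot overflow. */
static bool blk_range_ok(uint64_t offset, uint64_t count, uint64_t capacity)
{
    if (offset > capacity)
        return false;
    return count <= capacity - offset;
}
```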

Summary: We can enable I/O at >1 block, aligned to the block size at the block layer, but the above points need to be resolved. Regarding atomicity guarantees, I don't think enabling this would make things any "worse" (or less ill-defined) than they are now.

However, someone needs to do the above work. I don't have time for this in the foreseeable future, but am happy to review patches.

@minad

minad commented Mar 26, 2019

@mato In #341 I just did what you proposed and dropped to BPF. The solution I have there seems to pass basic tests.

@minad mentioned this issue Mar 26, 2019
@minad mentioned this issue Apr 10, 2019
@mato
Member

mato commented Jun 24, 2019

I've thought about this a bit more, especially the implications for spt's seccomp filter in the light of multiple device support. As mentioned previously, the following

in any pwrite(fd, buf, count, offset) ensure that (offset + count) <= capacity holds, accounting for overflow

cannot be expressed using libseccomp.

However, we can express rules that verify:

  1. (count <= X), where X is a multiple of the block size AND
  2. (offset <= (capacity - block_size)).

So, if we were to define the "maximum I/O unit" as, say, X = 16kB, then the most a malicious unikernel could write beyond (i.e. extend) the end of the file would be (X - block_size) bytes. While "annoying", this is not exactly a fatal security or DoS issue.
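A minimal sketch of those two rules with the libseccomp API (fd, block_size, capacity and max_io are illustrative parameters, not Solo5's actual configuration):

```c
#include <seccomp.h>
#include <stdint.h>

/* Sketch: allow pwrite64 on the block device fd only when
 *   count  (arg 2) <= max_io, and
 *   offset (arg 3) <= capacity - block_size.
 * This cannot enforce (offset + count) <= capacity, hence the possible
 * (max_io - block_size) overrun described above. */
static int install_blk_filter(int fd, uint64_t block_size,
                              uint64_t capacity, uint64_t max_io)
{
    scmp_filter_ctx ctx = seccomp_init(SCMP_ACT_KILL);
    if (ctx == NULL)
        return -1;
    int rc = seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(pwrite64), 3,
                              SCMP_A0(SCMP_CMP_EQ, (scmp_datum_t)fd),
                              SCMP_A2(SCMP_CMP_LE, max_io),
                              SCMP_A3(SCMP_CMP_LE, capacity - block_size));
    if (rc == 0)
        rc = seccomp_load(ctx);
    seccomp_release(ctx);
    return rc;
}
```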

@ricarkol What do you think? What I'm essentially saying is that we punt on trying for a 100% correct solution for now. (See also my comments about hand-written BPF at #352 (comment) )

@cfcs

cfcs commented Jun 24, 2019

@mato I think that sounds like a reasonable solution. I'm not too fond of the idea of having to deal with BPF macro code here either.

Some other semi-related points:

  1. Upstreaming a patch to libseccomp (src/gen_bpf.c looks ok-ish to modify) to let it compare one syscall argument against another as the "data"/operand would be useful to a lot of people, and would solve this problem too.
  2. For Linux (where this is relevant for spt) there's setrlimit(RLIMIT_FSIZE, ..), which doesn't deal with individual files, but can be used to set an upper bound on the size of all files the process writes; see the sketch after this list.
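For what it's worth, point 2 would look roughly like this (a sketch; note the limit is process-wide, not per file descriptor):

```c
#include <sys/resource.h>

/* Cap the size of any file this process may create or extend. Writes
 * past the limit fail (by default the process receives SIGXFSZ). */
static int cap_file_size(rlim_t capacity)
{
    struct rlimit rl = { .rlim_cur = capacity, .rlim_max = capacity };
    return setrlimit(RLIMIT_FSIZE, &rl);
}
```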

@ricarkol
Collaborator Author

@mato, sounds good. And X = 16kB seems like a good "maximum I/O unit". Additionally, we could also waste the last 16kB, and have the second rule as:

(offset <= (capacity - X)).

@cfcs, changing libseccomp would be even better.

@cfcs

cfcs commented Jun 25, 2019

@ricarkol making the accessible amount smaller works well in some cases (preventing the allocation of extra extents; preventing attempts to write past block device limits, which would result in an error anyway), but it works very poorly in others (particularly when the unikernel expects an 8k (<= 16k) device and finds itself unable to write anything at all).

I think the latter is less intuitive than the former and likely to result in more bug reports. This is clearly a trade-off, though, and the proper solution would be to generate the correct BPF code that we all agree should be there.

@mato
Member

mato commented Jun 25, 2019

@ricarkol:

I agree with @cfcs that "wasting" the last X kB seems like asking for trouble; it's very counter-intuitive. Also, I expect the seccomp issue to be fixed properly at some point in the future, and would not want to change the behaviour again then.

Regarding the actual value of X (the maximum I/O unit): I think we should make it part of the API, by adding it to struct solo5_block_info and the block device properties (see the sketch below). This means we can allow implementations (bindings) to impose different limits, which in the short term will be useful at least for virtio (where we'd have to rework the code a bit to allow writes of >1 block). In the longer term, I can also foresee cases where we would want some implementation-defined limit in place.
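Sketched against the struct as it stood then (the max_io_size field name is hypothetical, not an agreed API):

```c
#include <stdint.h>

typedef uint64_t solo5_off_t;   /* Solo5's 64-bit offset type */

struct solo5_block_info {
    solo5_off_t capacity;       /* Capacity of the block device, in bytes */
    solo5_off_t block_size;     /* Minimum I/O unit, in bytes */
    solo5_off_t max_io_size;    /* Proposed: maximum bytes per single
                                   request, a multiple of block_size */
};
```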

Regarding the actual "maximum I/O unit" value for hvt and spt, any preferences? Based on your experiments, 4k or 8k seems reasonable.

@mato
Member

mato commented Jun 25, 2019

@g2p Any comments on this, especially regarding an optimum "maximum I/O unit" size for wodan?

@g2p

g2p commented Jun 26, 2019

Wodan would ideally use a much larger I/O size of up to 4MiB (the size of an SSD's erase block).
I'd be fine with wasting any unaligned tail of the device, because Wodan wouldn't use it.

@mato
Member

mato commented Jun 27, 2019

@g2p The thing is, as things stand today all our APIs involve copying. So, ignoring the issues with seccomp, a large (e.g. 4MiB) block size is not practical as it would need at least that much fixed buffer space in the Solo5 layer (i.e. per unikernel). If you want to pack large numbers of unikernels on a single server, those numbers would quickly add up. Now, 512 bytes is obviously too small, which is why I suggested a compromise of 4k or 8k.

@dinosaure
Collaborator

#528 improves the situation: at least we now have an argument that lets us specify how large the chunk is. However, as @mato pointed out, a question remains for the hvt and spt tenders, where a check must be made against possible out-of-bounds accesses relative to the block size. Currently, even with #528, we don't do such a check, at least for spt.

I merged #528 because it unblocks some performance issues with our access to block devices, but I will not consider this issue solved. Moreover, we should definitely dig deeper on this side now, because it breaks the conservative design of Solo5 a bit.

@hannesm
Contributor

hannesm commented Nov 4, 2022

Hi, I think with #528 we're doing great. The seccomp rules now allow pread64/pwrite64 only on the specific file descriptor (that was never in question), with the length (ARG2) required to equal block_basic.block_size, and the offset (ARG3) required to be <= capacity - block_size. Thus we can never read or write beyond the end of the file (especially with the additional check in block_attach that the file size is actually a multiple of the block size).
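Those rules have roughly this shape (a sketch with illustrative variable names; note the exact-length match, unlike the <= max_io variant sketched earlier in this thread):

```c
/* pread64/pwrite64 are permitted only on the device fd, only with an
 * exact length of block_size, and only at in-bounds offsets. */
seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(pread64), 3,
                 SCMP_A0(SCMP_CMP_EQ, (scmp_datum_t)fd),
                 SCMP_A2(SCMP_CMP_EQ, (scmp_datum_t)block_size),
                 SCMP_A3(SCMP_CMP_LE, capacity - block_size));
seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(pwrite64), 3,
                 SCMP_A0(SCMP_CMP_EQ, (scmp_datum_t)fd),
                 SCMP_A2(SCMP_CMP_EQ, (scmp_datum_t)block_size),
                 SCMP_A3(SCMP_CMP_LE, capacity - block_size));
```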
