Switch Allocator to act more like an owned arena #467

Open
Joeoc2001 opened this issue Dec 18, 2023 · 8 comments

Comments

@Joeoc2001

Currently, the Allocator of a message::Builder functions as a malloc/free-style API. This is fine when the message to be encoded resides in host memory that we fully control, but for some use cases this API is overly restrictive.

Specifically, for my use case, Cap'n Proto should be perfect for sharing data with a resident WebAssembly instance, as the data does not need to be copied and can simply be encoded directly into the WebAssembly instance's memory. However, doing so requires the blocks of memory in the instance to be allocated by a function exported by the module, which may be fallible, and which requires a mutable borrow over the host memory to be invoked, violating Rust's aliasing rules for the mutable pointer returned by the allocator.

An alternative, partially proposed in #78, would be to have the allocator and relevant sub-builders deal with slices instead of raw pointers. Further to that discussion, I would propose that each builder keep integer offsets into the segments, and have builders borrow the segments only for the period in which they are writing data. This would allow the allocation and deallocation methods to have mutable access to all of the segments, and moving the underlying buffers (as may happen if a WebAssembly instance opts to grow its memory on an alloc invocation) would not invalidate any existing pointers.
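As a rough sketch of what "offsets plus short-lived borrows" could look like (all names here are hypothetical, not the crate's actual API), a builder would hold a segment index and a byte offset, and the segment memory would be borrowed only for the duration of a single write:

```rust
// Hypothetical sketch: builders store offsets, not raw pointers.
struct Arena {
    // Owned segment buffers; these may reallocate and move on growth
    // without invalidating any Location.
    segments: Vec<Vec<u8>>,
}

#[derive(Clone, Copy)]
struct Location {
    segment: usize, // index into `segments`
    offset: usize,  // byte offset within that segment
}

impl Arena {
    // The segment is borrowed only for the duration of the write, so the
    // arena is free to move buffers between writes.
    fn write_u64(&mut self, loc: Location, value: u64) {
        self.segments[loc.segment][loc.offset..loc.offset + 8]
            .copy_from_slice(&value.to_le_bytes());
    }

    fn read_u64(&self, loc: Location) -> u64 {
        let mut bytes = [0u8; 8];
        bytes.copy_from_slice(&self.segments[loc.segment][loc.offset..loc.offset + 8]);
        u64::from_le_bytes(bytes)
    }
}
```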

Performance

Based on zero testing, I would guess that the borrowing from the owned allocator would be mostly inlined away, aside from the one extra level of indirection added by the offset+slice combo. I'd expect most CPUs to handle this fairly smoothly, since the prefetcher should notice that the underlying segment is contiguous, but benchmarks would be required.

Backwards compatibility

Since the Allocator already acts like it owns the data it allocates, a trait covering temporary borrows of slices only while they are written (as described above) would be strictly more general. A new trait could therefore be introduced, with an auto-implementation from existing Allocators to the new trait, meaning no existing code would break.
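To make the backwards-compatibility claim concrete, here is one way the auto-implementation could be shaped, via a wrapper type. Both traits and all names are hypothetical stand-ins, not the crate's actual API:

```rust
use std::slice;

/// Stand-in for the existing allocator shape: hands out raw pointers to
/// zeroed memory that it logically owns and promises not to move.
trait OldAllocator {
    fn allocate_segment(&mut self, min_words: u32) -> (*mut u8, u32);
}

/// The proposed, more general trait: callers get mutable access to a
/// segment only for the duration of a single closure call.
trait BorrowingAllocator {
    fn alloc(&mut self, min_words: u32) -> u32; // returns a segment id
    fn with_segment<R>(&mut self, id: u32, f: impl FnOnce(&mut [u8]) -> R) -> R;
}

/// Adapter giving every OldAllocator a BorrowingAllocator view, so
/// existing allocator implementations keep working unchanged.
struct Compat<A> {
    inner: A,
    segments: Vec<(*mut u8, usize)>, // (ptr, len in bytes)
}

impl<A: OldAllocator> BorrowingAllocator for Compat<A> {
    fn alloc(&mut self, min_words: u32) -> u32 {
        let (ptr, words) = self.inner.allocate_segment(min_words);
        self.segments.push((ptr, words as usize * 8));
        (self.segments.len() - 1) as u32
    }

    fn with_segment<R>(&mut self, id: u32, f: impl FnOnce(&mut [u8]) -> R) -> R {
        let (ptr, len) = self.segments[id as usize];
        // Sound only because OldAllocator promises the memory stays put.
        f(unsafe { slice::from_raw_parts_mut(ptr, len) })
    }
}

/// Trivial OldAllocator for demonstration: leaks zeroed heap memory.
struct LeakyAlloc;
impl OldAllocator for LeakyAlloc {
    fn allocate_segment(&mut self, min_words: u32) -> (*mut u8, u32) {
        let v = vec![0u8; min_words as usize * 8].leak();
        (v.as_mut_ptr(), min_words)
    }
}
```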

Are there any thoughts on this? For now, I intend to encode in a host buffer and then copy the segments into the instance memory, but this obviously isn't ideal when the underlying protocol is designed to be 0-copy.

@dwrensha
Member

I would propose that each builder keep integer offsets into the segments in the builder, and have them borrow the segments only for the period in which they are writing the data.

This would eliminate the main source of unsafe in capnproto-rust, so it does sound appealing. I expect that the performance cost would be significant, but there's no way to know for sure without actually doing some benchmarking. Note that we currently do dynamic dispatch on the Allocator trait -- I think that might mean that a lot of stuff won't get inlined.

A safety vs performance analysis here with benchmark numbers would probably be interesting to much of the broader Rust community.

@dwrensha
Member

It would be relatively straightforward to change the interface so that capnp::message::Allocator::allocate_segment returns an error rather than panicking on out-of-memory. (Note that downstream code would then need to insert ? in a bunch of places, so making the change comes with a cost.)
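A sketch of what that fallible interface could look like, with a toy implementation. The trait shape, error type, and `Budget` allocator are hypothetical stand-ins, not the crate's actual API:

```rust
// Hypothetical error type for out-of-memory.
#[derive(Debug, PartialEq)]
struct AllocError;

trait FallibleAllocator {
    /// Returns (segment id, allocated words) or an error, instead of
    /// panicking when allocation fails.
    fn allocate_segment(&mut self, min_words: u32) -> Result<(u32, u32), AllocError>;
}

/// Toy allocator with a fixed word budget.
struct Budget {
    words_left: u32,
}

impl FallibleAllocator for Budget {
    fn allocate_segment(&mut self, min_words: u32) -> Result<(u32, u32), AllocError> {
        if min_words > self.words_left {
            return Err(AllocError);
        }
        self.words_left -= min_words;
        Ok((0, min_words))
    }
}

/// Downstream code then propagates the failure with `?` -- the cost
/// mentioned above.
fn first_segment<A: FallibleAllocator>(a: &mut A) -> Result<u32, AllocError> {
    let (id, _words) = a.allocate_segment(1)?;
    Ok(id)
}
```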

@Joeoc2001 would such a change on its own help in your use cases at all?

@Joeoc2001
Author

I'm not sure that'd solve my use case specifically, unfortunately. The issue with WebAssembly allocators under this crate's current API design isn't just that the module-provided allocators can fail to allocate. Ultimately, the allocator is just arbitrary instructions provided by the loaded module, so the module (and host) may do anything, including moving or remapping the module memory within host memory. That makes the current pointer-based implementation UB in the presence of a malicious, buggy, or otherwise unexpected allocator implementation.

@ObsidianMinor

ultimately, the allocator is just arbitrary instructions provided by the loaded module, so the module (and host) may do anything, including moving or remapping the module memory within host memory, making the current pointer-based implementation UB for a malicious, buggy or unexpected allocator implementation

If the host can just drop the memory out from under your feet (even if you're holding a mutable reference to it) that doesn't sound like a stable place to perform complex write operations anyway. It might seriously be faster to copy segments directly into the wasm memory instead of having to do at least 3 redirects every time you want to write something somewhere.

In fact, how would this work given that the arena stores pointers to the start of each segment? Are we implying the arena doesn't actually store segments anymore? So now it's dynamic, which makes it actually 4 redirects.

  • You'd have to ask the capnp allocator where the segment currently is.
  • The capnp allocator offsets into the module memory to get the start of the segment memory.
  • Then you'd offset into the segment to reach the start of the builder.
  • Then you'd offset into the data you're building to write the value.

Now you could coalesce a lot of this I suppose and redefine the allocator trait to look something like

pub trait Allocator {
    fn alloc(&self, amount: u32) -> (SegmentId, u32);
    fn write_bit(&self, segment: SegmentId, slot: u64, value: bool);
    fn write_u8(&self, segment: SegmentId, slot: u64, value: u8);
    fn write_u16(&self, segment: SegmentId, slot: u64, value: u16);
    fn write_u32(&self, segment: SegmentId, slot: u64, value: u32);
    fn write_u64(&self, segment: SegmentId, slot: u64, value: u64);
}

But once again, every single time you write anywhere you'll hit a redirect. The compiler most likely won't be able to optimize writes in any way.

@Joeoc2001
Author

Joeoc2001 commented Feb 18, 2024

that doesn't sound like a stable place to perform complex write operations anyway

I think there's an important distinction between 'can' and 'will' here. The module's alloc method isn't known to the Rust compiler at compile time, since we load the WebAssembly module at run time, so any 0-copy encoder into a module's memory built on the current Cap'n Proto API will be unsound: a malicious module could do very weird things every time a new segment is allocated. However, any sane and reasonable module won't do any of this, so considering the situation unstable is more a comment on the source of the modules being run than on the use case as a whole. The common case should be amenable to optimisation.

So now it's dynamic, which makes it actually 4 redirects.

Potentially, but probably not for most WebAssembly host implementations, if they can guarantee that the instance memory is contiguous in host memory. Then it'd just be a double offset, where the Arena stores offsets into the instance memory giving the starts of segments, instead of raw pointers. Also, the segment locations could only change at the point of calling alloc, so all writes to a single segment before it fills will use the same static offset to map from instance to host memory.
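The double-offset scheme described above could be sketched like this (all names hypothetical; a `Vec<u8>` stands in for the instance's linear memory): the arena records each segment's start as an offset into the instance memory and resolves it against the current host buffer only at write time, so the stored offsets stay valid even if the memory grows and moves.

```rust
struct WasmArena {
    instance_memory: Vec<u8>,   // stand-in for the module's linear memory
    segment_starts: Vec<usize>, // byte offset of each segment's start
}

impl WasmArena {
    fn write_u64(&mut self, segment: usize, slot: usize, value: u64) {
        let base = self.segment_starts[segment]; // offset 1: segment start
        let at = base + slot * 8;                // offset 2: slot in segment
        self.instance_memory[at..at + 8].copy_from_slice(&value.to_le_bytes());
    }
}
```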

My ideal trait would look something like this:

pub trait Allocator {
    type Segment;
    fn alloc(&mut self, amount: u32) -> (Self::Segment, u32);
    fn write_bit(&mut self, segment: Self::Segment, slot: u64, value: bool);
    fn write_u8(&mut self, segment: Self::Segment, slot: u64, value: u8);
    fn write_u16(&mut self, segment: Self::Segment, slot: u64, value: u16);
    fn write_u32(&mut self, segment: Self::Segment, slot: u64, value: u32);
    fn write_u64(&mut self, segment: Self::Segment, slot: u64, value: u64);
}

Then the current Allocator implementations could implement Segment to be *const u8, adding no overhead to existing implementations. Or they could implement it to be usize, and have the Allocator have a Vec<Box<[u8]>> inside, allowing all unsafe code to be removed potentially at the cost of an extra offset calculation on each write (depending on what Rustc can optimise it down to).
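The all-safe variant could look something like this, a sketch assuming `Segment = usize` indexing into a `Vec<Box<[u8]>>` (the trait is trimmed to one write method for brevity; all names are hypothetical):

```rust
trait Allocator {
    type Segment: Copy;
    fn alloc(&mut self, words: u32) -> (Self::Segment, u32);
    fn write_u64(&mut self, segment: Self::Segment, slot: u64, value: u64);
}

struct SafeArena {
    segments: Vec<Box<[u8]>>,
}

impl Allocator for SafeArena {
    type Segment = usize;

    fn alloc(&mut self, words: u32) -> (usize, u32) {
        self.segments
            .push(vec![0u8; words as usize * 8].into_boxed_slice());
        (self.segments.len() - 1, words)
    }

    fn write_u64(&mut self, segment: usize, slot: u64, value: u64) {
        // The extra per-write offset calculation mentioned above:
        // slot number -> byte index within the segment.
        let at = slot as usize * 8;
        self.segments[segment][at..at + 8].copy_from_slice(&value.to_le_bytes());
    }
}
```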

@ObsidianMinor

ObsidianMinor commented Feb 18, 2024

My ideal trait would look something like this:

pub trait Allocator {
    type Segment;
    fn alloc(&mut self, amount: u32) -> (Self::Segment, u32);
    fn write_bit(&mut self, segment: Self::Segment, slot: u64, value: bool);
    fn write_u8(&mut self, segment: Self::Segment, slot: u64, value: u8);
    fn write_u16(&mut self, segment: Self::Segment, slot: u64, value: u16);
    fn write_u32(&mut self, segment: Self::Segment, slot: u64, value: u32);
    fn write_u64(&mut self, segment: Self::Segment, slot: u64, value: u64);
}

This would require every capnp builder type to be generic over the Allocator itself, which would inherently cause every downstream user type to be generic over the Allocator as well. It doesn't really lend itself to ease of use.
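A minimal illustration of that "infection": once the builder is generic over its allocator, every downstream type that stores a builder must carry the parameter too. All names here are hypothetical.

```rust
struct Builder<A> {
    allocator: A,
}

// A downstream user type is forced to become generic as well:
struct MyEncoder<A> {
    builder: Builder<A>,
}

impl<A> MyEncoder<A> {
    fn new(allocator: A) -> Self {
        MyEncoder { builder: Builder { allocator } }
    }
}
```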

@dwrensha
Member

This would require every capnp builder type to be generic over the Allocator itself,

That is something I've considered before, and similarly on the read side having every Reader type be generic over ReaderSegments. I expect that this would improve performance by eliminating a bunch of dynamic dispatch. It would also let us remove the unaligned and sync_reader feature flags:

unaligned = []

sync_reader = []

because that information could live in the traits instead.
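For instance, the alignment guarantee could become an associated const on the segment-source trait itself. This is a hypothetical shape, not the crate's actual ReaderSegments trait:

```rust
trait ReaderSegments {
    /// True if all segments are guaranteed 8-byte aligned; an unaligned
    /// source would set this to false, replacing the crate-wide
    /// `unaligned` cargo feature.
    const ALIGNED: bool;

    fn get_segment(&self, id: u32) -> Option<&[u8]>;
}

struct AlignedVecSegments {
    segments: Vec<Vec<u64>>, // u64 storage guarantees 8-byte alignment
}

impl ReaderSegments for AlignedVecSegments {
    const ALIGNED: bool = true;

    fn get_segment(&self, id: u32) -> Option<&[u8]> {
        self.segments.get(id as usize).map(|s| {
            // Reinterpret &[u64] as &[u8]; shown with a manual cast to
            // stay dependency-free (a crate like bytemuck would also work).
            unsafe { std::slice::from_raw_parts(s.as_ptr() as *const u8, s.len() * 8) }
        })
    }
}
```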

It doesn't really lend itself to ease of use.

Yep, that's the cost. My guess is that the cost outweighs the benefits. But still -- it could be interesting to investigate more deeply to get a better sense of just what it would look like.

@dwrensha
Member

dwrensha commented Feb 19, 2024

The bug described in #484 is maybe tangentially related to the above discussion. The bug would have been avoided if Allocator had a get_segment(u32) method. (Currently Arena assumes that segment memory does not move.)
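A sketch of that suggested shape (hypothetical names): the arena re-queries the allocator for a segment's current location on every access, instead of caching a pointer that goes stale when the backing memory moves. The toy implementation below packs all segments into one `Vec<u8>` that may reallocate on growth, exactly the kind of movement that a cached pointer would miss.

```rust
trait Allocator {
    fn allocate_segment(&mut self, min_words: u32) -> u32; // returns segment id
    fn get_segment(&mut self, id: u32) -> &mut [u8];       // current location
}

struct GrowableArena {
    buf: Vec<u8>,                  // may reallocate (and move) on growth
    segments: Vec<(usize, usize)>, // (start, len) in bytes
}

impl Allocator for GrowableArena {
    fn allocate_segment(&mut self, min_words: u32) -> u32 {
        let start = self.buf.len();
        self.buf.resize(start + min_words as usize * 8, 0); // may move buf
        self.segments.push((start, min_words as usize * 8));
        (self.segments.len() - 1) as u32
    }

    fn get_segment(&mut self, id: u32) -> &mut [u8] {
        // Resolved fresh on every call, so movement of `buf` is harmless.
        let (start, len) = self.segments[id as usize];
        &mut self.buf[start..start + len]
    }
}
```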

3 participants