Refactor AICI to use WebAssembly Component Model #84

Draft
wants to merge 2 commits into main
Conversation

@AaronFriel commented Mar 30, 2024

This is a significant change to the AICI Runtime (host) and AICI Controller (guest) to use WASI components. As part of this change, a significant amount of unsafe code is removed and the protocol is simplified to remove the need for "BLOB" types and side-channels.

The protocol is documented in `wit/controller.wit`, and a minimal WASI runtime is provided to AICI controllers.

Some notes:

  • The AICI runtime no longer directly reads and writes to the guest's memory. Instead, the guest provides a `Runner` resource (using WebAssembly Component terminology), which exposes the low-level protocol to the host as a constructor and trait with methods (a rough sketch follows this list).
  • The Blob protocols are removed entirely, replaced by the `Runner` resource. This and other side-channels for communicating with the runtime, e.g. allowed tokens (logit biases) outside of `MidProcessResult`, are removed.
  • The (Variable) Storage and Tokenizer protocols are separate WebAssembly Components, which can be versioned independently of the runtime.
  • Types are changed to be consistent with the WebAssembly interface, e.g. `SeqId` is used in far more places to avoid casts.
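
To make the shape of this concrete, here is a rough guest-side sketch only; the authoritative interface is `wit/controller.wit`, and the type and method names below are assumptions rather than the PR's actual generated bindings:

// Sketch only, assuming wit-bindgen-style generated traits; the real interface
// and names live in wit/controller.wit and the generated bindings.
struct MyRunner {
    // Per-sequence guest state; the host never reads or writes this memory directly.
    tokens: Vec<u32>,
}

impl MyRunner {
    // The host constructs the resource with an argument string instead of
    // writing a config blob into guest memory.
    fn new(module_arg: String) -> Self {
        let _ = module_arg;
        MyRunner { tokens: Vec::new() }
    }

    // The low-level protocol becomes plain methods on the resource: structured
    // input in, structured result out, so no blob side-channel is needed. Here
    // the return value is just a list of allowed token ids, standing in for the
    // logit biases that previously traveled outside MidProcessResult.
    fn mid_process(&mut self, new_tokens: Vec<u32>) -> Vec<u32> {
        self.tokens.extend_from_slice(&new_tokens);
        self.tokens.clone()
    }
}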

@AaronFriel (Author) commented Mar 30, 2024

@mmoskal There are still a few TODOs here, I believe:

  • Logging (stdout/stderr?) from the controllers
  • Handling errors, panics
  • Update: no change vs. upstream. Clock: I see a SPECTRE/MELTDOWN mitigation in the upstream main; I'm not sure it will be straightforward to override when using the WASI runtime's builtins.
  • Update: fixed. Build errors after rebasing.

@AaronFriel (Author)

PR updated with some refactors, among which is an ergonomic improvement to exporting guests that fixes running `cargo build` from the workspace root.

It seemed to me that until the WASI Preview 2 target fully lands, the controllers may need to be built as libraries with type `cdylib`, though I couldn't find anything definitive. Between that and some of the machinery used by `export!()`, compiling those crates for, e.g., linux-x86-64, would error with `cc`.

The improved export macro hides the machinery:

#[macro_export]
macro_rules! export {
    ($ty:ident) => {
        #[doc(hidden)]
        #[cfg(target_arch = "wasm32")]
        $crate::bindings::export!($ty with_types_in $crate::bindings);
    };
}
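
A hypothetical controller crate then exports its implementation with a single line (the type name below is illustrative):

// Hypothetical usage; `MyRunner` is a placeholder type. Because the macro body
// is gated on #[cfg(target_arch = "wasm32")], a plain `cargo build` on a native
// target (e.g. from the workspace root) compiles the crate as an ordinary
// library and skips the wasm-only export machinery that previously errored in `cc`.
export!(MyRunner);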

@squillace

Hi @AaronFriel, I LOOOOOVVVVEEEEE this. My team does a bunch of the infrastructure work upstream supporting wasm components, and I'd like to see how to help bring this in to the project. 🖖

@emrekiciman (Collaborator)

Thanks very much for this PR, @AaronFriel! And thanks @squillace for helping review!

@squillace

@AaronFriel don't fret, we'll get here. You submitted when we had KubeCon, followed by Easter, followed by the heavens being swallowed by the moon. People are returning at the end of this week....

@AaronFriel (Author)

@squillace I'm in no rush, and pleased to see your review when you're able!

Sorry if I did anything to nag you - I don't think I triggered anything on my end since posting the PR?

@squillace

Nope, I just don't like not communicating on PRs when someone is trying to help do the right thing.

@mmoskal (Member) commented Apr 10, 2024

This looks great, from my not-very-well-informed POV!

Unfortunately, I'm in the middle of some work items that may affect this. In particular, I'm dropping the pre/post callbacks and only leaving the mid callback. It looks like we would be unable to run the pre/post fast enough, especially with speculative decoding (I have not considered that in the past).

I also want to support native controllers, which is probably relevant here.

This may take a few weeks to finish and is quite high priority for us here.

@AaronFriel (Author)

The 1-token penalty in #68 seems very reasonable for the capabilities offered in AICI. I'm not intimately familiar with the workings of the rLLM implementations beyond what was necessary for this PR, but from your notes it sounds like blocking the LLM holds up an entire batch, effectively a pipeline stall?

@mmoskal (Member) commented Apr 10, 2024

In non-speculative implementations, the pre/post happens on the "critical path" of sampling: after we get the logits from the GPU but before we start the next step on the GPU (the next step needs the sampled tokens from the current step). Thus, the GPU sits idle while we hold the entire batch.

Now, in principle it would be possible to work on two sets of sequences and swap them (running pre/post on one while the other is already computing logits on the GPU, and vice versa). The problem with this is that it needs 2x memory for the KV cache, which is typically the limiting factor for batch size and thus throughput. It may be possible for the draft model, though, in case it's too fast even with the new only-mid strategy.
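
A toy sketch of that two-batch swap, purely illustrative and not code from the rLLM implementations:

// Purely illustrative: while the GPU computes logits for one batch, the CPU
// runs pre/post for the other, at the cost of keeping two KV caches resident.
struct Batch;

fn launch_logits(_batch: &Batch) {
    // stand-in for kicking off the GPU step for this batch
}

fn run_pre_post(_batch: &mut Batch) {
    // stand-in for the CPU-side pre/post controller callbacks
}

fn step(active: &mut Batch, other: &mut Batch) {
    launch_logits(active); // GPU busy on `active`
    run_pre_post(other);   // meanwhile, controllers run for `other`
    // ...then sample `active`, and the two batches swap roles next step
}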

@AaronFriel (Author) commented Apr 10, 2024

Does vLLM have a mechanism for adding and removing sequences from batches, or would it be simpler in AICI to effectively `Promise.race()` the WASM, and if `mid_process` exceeds the deadline, to treat it as a de-facto fork?

That is, never allow AICI to block the LLM, but you might generate logits you throw away. In those situations, backtrack and resume?
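
Very roughly, and only as a sketch of the idea (here `tokio::time::timeout` stands in for whatever mechanism the runtime would actually use):

use std::time::Duration;
use tokio::time::timeout;

// Sketch of the "never block the batch" idea; names are illustrative, not the
// AICI runtime's actual API. If the controller misses the deadline, the engine
// keeps generating unconstrained and applies the late result by backtracking
// and resuming, i.e. a de-facto fork.
async fn race_mid_process(deadline: Duration) {
    let controller = async {
        // stand-in for awaiting the wasm controller's mid_process result
    };
    match timeout(deadline, controller).await {
        Ok(()) => {
            // arrived in time: apply the logit biases to this sampling step
        }
        Err(_elapsed) => {
            // deadline missed: let the batch proceed; backtrack and resume
            // once the controller's result eventually lands
        }
    }
}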

Taking this to #68 though, because it sounds like this PR is blocked on understanding that discussion.
