
[Proposal] Refactor the mid-level and high-level implementations of LLamaSharp #684

AsakusaRinne opened this issue Apr 21, 2024 · 5 comments
AsakusaRinne commented Apr 21, 2024

Introduction

This proposal requires a lot of work and will introduce many breaking changes. It should be discussed in detail before it's merged into the master branch. Any suggestions will be appreciated! FYI @martindevans @SignalRT

This proposal is inspired by vllm and already has a prototype implementation in #683. Though it's far from complete, the main ideas are already visible there. If you want to learn more about this proposal, please follow the example in that PR and give it a try. The example does not have a good UI to show the progress of parallel inference, but it does execute multiple sequences at the same time; you can set breakpoints in LLM.RunEngine to confirm that.

Motivations

In the very early stages of LLamaSharp, the LLamaModel class handled everything about the model, including loading, state, inference and the high-level API. After v0.4.1, it was split into LLamaWeights, LLamaExecutor and ChatSession, where LLamaExecutor is the mid-level API to run the model and ChatSession is the high-level API.

Though this design once worked well for both developers and users, its issues have become increasingly evident over time. The main problems are described as follows.

  • Batched inference is not user-friendly: As you can see in Parallel Inferencing? #623, it requires users to understand how it works and to write a lot of code. What's more, even though the low-level API was added nearly half a year ago, few users have actually made use of it! Clearly, we need to provide easy-to-use APIs for it, because batched inference is a huge performance improvement.
  • Mid-level and high-level APIs need to be improved: We currently provide executors as the mid-level APIs. Though this design works well for chatting (batched inference aside), it does not support text completion very well. As for the high-level APIs, I believe we should follow the style of the OpenAI APIs and [semantic-kernel](https://github.com/microsoft/semantic-kernel). However, the current design of the mid-level APIs makes this difficult to implement. Related issues: Garbled output from model in Unity #178, SemanticKernel ChatCompletion is Stateless #614, Create HTTP API server and provide API like OAI #269.
  • The current abstractions create unnecessary difficulties for developers: From my experience of PR review, only a few core developers understand the whole design and most of the details of LLamaSharp. Developing mid-level and high-level APIs often requires the developer to understand how LLamaSharp works with llama.cpp, even though some of the processes involved, such as scheduling batched inference, sampling the logits and deciding whether inference should stop, are not related to the llama.cpp backend at all. We should shield the llama.cpp backend as much as possible in the mid-level APIs, so that it's easier for new contributors to add features or fix bugs in the mid- and high-level APIs. It will also make it easier for us to borrow ideas from other good LLM projects, such as transformers, vllm and ollama.

Design

The full design is shown below.
[Diagram: LLamaSharp refactor design]

The llama.cpp backend part is shown below (see #670 for the auto-downloading proposal).
[Diagram: llama.cpp backend design]

The design is still separated into low-level, mid-level and high-level APIs. However, the low-level part contains multiple backends.

Don't get me wrong, I am not going to introduce other backends now (though it's possible). The purpose of this design is to better abstract the llama.cpp-related part. Thus, the mid-level implementations will only need to use a handful of APIs from the llama.cpp model runner, tokenizer and kv-cache manager. Some logic, such as scheduling, sampling and stopping, can be kept independent of the backend.

Here is the explanation of the newly introduced components; a rough interface sketch follows the list.

  1. Model Runner: the low-level class. Things related to llama.cpp inference should be put here. It's also possible to make a different runner with another library as the backend, though I don't think I will do that in the near future.
  2. LLM Engine: a mid-level class. It defines how to process a request and generate the response, which is the core of the mid-level APIs.
  3. KvCache Engine: a mid-level class. It defines how to manage the model state (kv-cache).
  4. Scheduler: a mid-level class. It defines how to schedule the requests and create the batch for inference.
  5. Sequence: a mid-level class, which is the abstraction of a text-completion request in the mid-level APIs.
  6. Sampling methods: mid-level APIs, responsible for sampling tokens from logits.
  7. Stopping criteria: mid-level APIs, which define when sequence generation should stop.
  8. Server Engine: a high-level class providing efficient APIs for users to build their own LLM server. Its key feature is continuous batching.
  9. Text completion / chat session: simple high-level classes based on the LLM engine that provide easy-to-use APIs and support parallel inference.
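
To make the list above more concrete, here is a rough sketch of how some of these mid-level abstractions (Sequence, Scheduler, sampling methods and stopping criteria) might look in C#. Every name and signature below is illustrative only; the real interfaces in #683 may differ.

using System;
using System.Collections.Generic;

// A text-completion request tracked by the engine (illustrative only).
public sealed class Sequence
{
    public int Id { get; init; }
    public IReadOnlyList<int> PromptTokens { get; init; } = Array.Empty<int>();
    public List<int> OutputTokens { get; } = new();
    public bool IsFinished { get; set; }
}

// Decides which sequences go into the next batch (continuous batching).
public interface IScheduler
{
    void Add(Sequence sequence);
    IReadOnlyList<Sequence> NextBatch();
}

// Samples the next token from the logits of one sequence.
public interface ISamplingMethod
{
    int Sample(ReadOnlySpan<float> logits, IReadOnlyList<int> previousTokens);
}

// Decides whether a sequence should stop generating.
public interface IStoppingCriteria
{
    bool ShouldStop(Sequence sequence);
}

Note that none of this touches llama.cpp directly: only the Model Runner needs to call into the backend, which is exactly the separation the diagram aims for.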

Text completion APIs

Here is what the text-completion APIs will look like (only the key elements are shown).

class LLM
{
    static LLM WithParams(TextCompletionParams param);

    static LLM FromBackend(LLamaModelParams param);

    RequestOutput[] Generate(IEnumerable<string> prompts, StoppingCriteria? stoppingCriteria = null, MultiModalData? multiModalData = null);

    AsyncRequestOutput[] GenerateAsync(IEnumerable<string> prompts, StoppingCriteria? stoppingCriteria = null, MultiModalData? multiModalData = null);
}

When using it, the code will look like the following.

var llm = LLM.WithParams(...).FromBackend(...);
string[] inputs = { "text1...", "text2...", "text3..." };
var outputs = llm.Generate(inputs);
// The outputs are generated with batched inference.
foreach (var output in outputs)
{
    // Deal with the output.
}
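
A hedged sketch of how the async variant might be consumed, assuming each AsyncRequestOutput can be awaited for its final result and exposes the generated text; that shape is not fixed yet.

// Assumption: AsyncRequestOutput is awaitable and exposes the generated text.
var llm = LLM.WithParams(...).FromBackend(...);
string[] inputs = { "text1...", "text2...", "text3..." };

// All requests are scheduled into the same batch and run concurrently.
var outputs = llm.GenerateAsync(inputs);
foreach (var output in outputs)
{
    var result = await output;   // completes when this sequence finishes
    Console.WriteLine(result.Text);
}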

As for the server-related APIs, I'll update them after further investigation.

Conclusion

This proposal refactors most of the current mid-level and high-level designs. Breaking changes are its major risk. However, it seems that the current executors could be implemented with the mid-level APIs provided in this proposal: LLM is essentially a StatelessExecutor with a scheduler and better abstractions, while InteractiveExecutor could be implemented with LLMEngine + KvCacheManager, because LLM chatting can be regarded as text completion with roles and kv-cache management (see the sketch below). In this way, it's possible for us to make the changes smoothly.
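
As a sketch of that migration path (all types and members below are assumptions about the proposed mid-level API, not existing LLamaSharp code):

using System.Collections.Generic;

// Hedged sketch: LLMEngine, Sequence and their members are assumed shapes of the
// proposed mid-level API; an InteractiveExecutor-like chat is just a thin wrapper.
public sealed class InteractiveChat
{
    private readonly LLMEngine _engine;
    private readonly Sequence _conversation;

    public InteractiveChat(LLMEngine engine)
    {
        _engine = engine;
        // One persistent sequence whose kv-cache is kept by the KvCache engine between turns.
        _conversation = engine.CreateSequence();
    }

    public async IAsyncEnumerable<string> ChatAsync(string userMessage)
    {
        // Append the new turn; the cached prefix is reused instead of being re-evaluated.
        _engine.Append(_conversation, userMessage);

        await foreach (var token in _engine.StreamAsync(_conversation))
            yield return token;
    }
}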

It will have such a wide impact that I won't rush it. I'll leave enough time for the community to discuss it and correct the unreasonable parts. It's completely okay to drop it if most users and developers don't like it.

I would prefer to aim at making LLamaSharp a library that runs LLMs efficiently with easy-to-use APIs, rather than a simple wrapper around llama.cpp. That's also why we've spent a lot of time on performance improvements and dynamic native library selection. If we can agree on this, I believe we'll work it out soon. :)

Again, any suggestions and discussions about this proposal will be appreciated. 🤗

SignalRT (Collaborator) commented

@AsakusaRinne The overall idea seems good to me. But I have the following observations:

  1. Designing the API starting from the highest-level APIs seems right to me. If we don't define the high-level APIs before prototyping the solution, we will end up having to force the mid-level APIs to accommodate the high-level APIs.
  2. Any use of the library beyond local / desktop usage will require scaling the solution via a web API and multiple LLM instances to serve requests. That means the client used to access the web API should be a first-class citizen of the library.
  3. I think we need to provide a template system. One of the most complex things for everybody starting to use LLMs is the time it takes before they understand the right way to use a model.

I will begin to provide feedback on the prototype.


AsakusaRinne commented Apr 21, 2024

Any use of the library beyond local / desktop usage will require scaling the solution via a web API and multiple LLM instances to serve requests. That means the client used to access the web API should be a first-class citizen of the library.

Agreed. Currently we ask users to build the web API from the mid-level APIs themselves, and it's difficult for them to apply batched inference. The Server Engine in this proposal is meant to provide a class that deals with parallel inference and, as you said, multiple LLM instances, making it easy for users to build a high-performance web API.
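
To illustrate the direction (nothing here exists yet; ServerEngine, its options and CompletionRequest are placeholder names), an ASP.NET Core minimal-API server on top of it could look roughly like this:

using Microsoft.AspNetCore.Builder;
using Microsoft.AspNetCore.Http;

var builder = WebApplication.CreateBuilder();
var app = builder.Build();

// Placeholder: one engine instance handling continuous batching internally.
var serverEngine = new ServerEngine(new ServerEngineOptions
{
    ModelPath = "model.gguf",        // assumed option name
    MaxConcurrentSequences = 32,     // upper bound on the running batch
});

// Requests may arrive at any time; the scheduler folds them into the running batch.
app.MapPost("/v1/completions", async (CompletionRequest request) =>
{
    var result = await serverEngine.CompleteAsync(request.Prompt);
    return Results.Json(result);
});

app.Run();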

I think we need to provide a template system. One of the most complex things for everybody starting to use LLMs is the time it takes before they understand the right way to use a model.

That's a good idea. However, it seems that #670 and this proposal will consume all my free time, so I'm afraid I won't be available for it in the next 3 months. If you find it helpful to modify some parts of this proposal for it, I'll be more than happy to help and discuss it with you. :)

martindevans (Collaborator) commented

Batched inference is not user-friendly

That's mostly because it's not designed to be 😆

The BatchedExecutor is the "minimum viable product" for exposing the low-level primitives to C# in a safe way - the main idea is that there should never be a reason to use the lower-level APIs, because BatchedExecutor exposes everything in a safer way without any speed cost. I think that's mostly done: the current API does not contain any pointers, doesn't expose any operations that can lead to memory leaks, and lifts the fairly primitive llama.cpp API into a higher-level object-oriented API.

My intention with the BatchedExecutor has always been that most end users don't use it directly; instead it acts as the foundation that all of the higher-level APIs can be built on. For example, something like the current executors could be written so that each wraps a single Conversation object, and multiple different executors could all use the same batch, which transparently speeds everything up.

I haven't been pushing for anyone to use it until recently because I've only just reached feature parity with the addition of loading/saving individual conversations in #681!

Mid level APIs

Going from this diagram I would say BatchedExecutor can currently provide:

  • LLM Engine: it runs the LLM, so I guess it does this 😁
  • Sequence: A Conversation is a sequence.
  • KV Cache Manager: Individual conversations can be forked (sharing cache), rewound (dropping cache items), shifted (freeing up some cache space), and there is an API for arbitrary KV manipulations for people who know what they're doing.
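
For anyone who hasn't tried it, a minimal sketch of the shape of the current API, simplified from the BatchedExecutor examples in the repo (written from memory, so exact overloads may differ from the latest release):

using LLama;
using LLama.Batched;
using LLama.Common;

var parameters = new ModelParams("model.gguf");
using var model = LLamaWeights.LoadFromFile(parameters);
using var executor = new BatchedExecutor(model, parameters);

// Each Conversation is one sequence; they all share the same underlying batch.
var a = executor.Create();
a.Prompt(executor.Context.Tokenize("Question: ..."));

// One Infer() call evaluates the pending work of every conversation in the batch.
await executor.Infer();

// Forking shares the already-evaluated kv-cache, so the common prefix is never re-evaluated.
var b = a.Fork();

// Sampling then happens per conversation from the logits it exposes
// (e.g. via an ISamplingPipeline); details omitted here.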

Thoughts on the other parts of that diagram:

Sampling

There is the entire sampling pipeline API I developed (see here) which I think serves this purpose. A sampling pipeline can be put together by implementing ISamplingPipeline and calling the various sampling methods. This gives direct access to the logits (so you could implement an entirely custom sampler if you wanted), but it is also easy to use by just chaining some methods together (e.g. here's the default pipeline, which does a lot of things but is still fairly understandable).
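
As a small standalone illustration of what "direct access to the logits" enables (deliberately not written against the real ISamplingPipeline signature - just plain arrays):

using System;
using System.Linq;

// Temperature sampling over a raw logits array: softmax with temperature, then
// draw a token index proportionally to its probability.
static int SampleWithTemperature(float[] logits, float temperature, Random rng)
{
    var max = logits.Max();                                   // subtract max for numerical stability
    var weights = logits.Select(l => Math.Exp((l - max) / temperature)).ToArray();
    var total = weights.Sum();

    var threshold = rng.NextDouble() * total;
    var cumulative = 0.0;
    for (var i = 0; i < weights.Length; i++)
    {
        cumulative += weights[i];
        if (cumulative >= threshold)
            return i;
    }
    return logits.Length - 1;                                 // fallback for rounding error
}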

Scheduler

This is a tricky one that I haven't done any work on. I assume you mean something that schedules when inference is run, to maximise the work done in a single batch while minimising latency? That's probably the hardest part of batched inference: you need to bring all the work together into a batch before calling infer, and it definitely needs some kind of higher-level system to help schedule it.
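
As a rough illustration of that idea (all types here are hypothetical, just to pin it down):

using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Threading.Tasks;

// Hypothetical continuous-batching loop; Sequence, 'engine' and 'running' are placeholders.
var pending = new ConcurrentQueue<Sequence>();   // requests can arrive at any time
var batch = new List<Sequence>();
var maxBatchSize = 16;

while (running)
{
    // 1. Admit newly arrived sequences up to the batch budget.
    while (batch.Count < maxBatchSize && pending.TryDequeue(out var sequence))
        batch.Add(sequence);

    if (batch.Count == 0) { await Task.Delay(1); continue; }

    // 2. One decode step for every active sequence in the batch.
    await engine.DecodeAsync(batch);

    // 3. Retire finished sequences immediately so new requests can take their
    //    slots on the next step - the "continuous" part of continuous batching.
    batch.RemoveAll(s => s.IsFinished);
}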

Stopping Criteria

Not something I've worked on much at all, since it comes after inference and sampling which have been my main focus. Definitely something we need though!

Other Things

I think some other things I would add to the "mid level" API list would be:

Templating. We need the low-level implementation of templating - taking some text and transforming it into alternative text according to the template.

We probably also need the higher-level implementation (something like ChatSession/ChatHistory) which represents the history in an object-oriented way and can be manipulated in ways that make sense at a lower level (e.g. rewind, fork and shift can all be done at the high level and map down into low-level KV manipulations).
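
To pin down the low-level half, a hedged sketch of a pure text-to-text template (the interface name and the ChatML format choice here are just illustrative, not a proposal for a specific API):

using System.Collections.Generic;
using System.Text;

// Hypothetical low-level templating piece: pure text-to-text, no model or context involved.
public interface IPromptTemplate
{
    // Renders (role, content) turns into the exact string the model was trained on.
    string Render(IEnumerable<(string Role, string Content)> turns, bool addGenerationPrompt);
}

public sealed class ChatMlTemplate : IPromptTemplate
{
    public string Render(IEnumerable<(string Role, string Content)> turns, bool addGenerationPrompt)
    {
        var sb = new StringBuilder();
        foreach (var (role, content) in turns)
            sb.Append("<|im_start|>").Append(role).Append('\n')
              .Append(content).Append("<|im_end|>\n");

        if (addGenerationPrompt)
            sb.Append("<|im_start|>assistant\n");

        return sb.ToString();
    }
}

The higher-level ChatSession/ChatHistory object would then only deal with structured turns and delegate the final string rendering to something like this.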

Embeddings. There seem to be a lot of changes coming in how llama.cpp handles embeddings - generative models, embedding models, pooling techniques, etc. Our current LLamaEmbedder is very primitive; at the very least it could be turned into something that uses a batch to generate lots of embeddings at once, much faster than it does currently.

High Level APIs

I think these would probably be better off split into separate packages? Our current high-level APIs have become a bit of a mess over time as the low level has shifted underneath them; splitting them into separate packages somewhat prevents that from becoming an issue in the future.

That would leave LLamaSharp providing the core things that everyone needs (low and mid level), and then separate special-purpose packages providing other specific use cases, e.g. individual nuget packages for:

  • Chat
  • OpenAI Style API
  • Semantic Kernel
  • Kernel Memory
  • RAG
  • Web backend

AsakusaRinne (Collaborator, Author) commented

Going from this diagram I would say BatchedExecutor can currently provide:
LLM Engine: it runs the LLM, so I guess it does this 😁
Sequence: A Conversation is a sequence.
KV Cache Manager: Individual conversations can be forked (sharing cache), rewound (dropping cache items), shifted (freeing up some cache space), and there is an API for arbitrary KV manipulations for people who know what they're doing.

Yes, in my prototype I referred to the implementation of LLamaBatch. It's lucky that there is some code I could borrow from!

Scheduler: I assume you mean something that schedules when inference is run, to maximise the work done in a single batch while minimising latency?

Yes, and it's also responsible for continuous batching. I think that's important for building LLM servers because requests may arrive at any time.

I think some other things I would add to the "mid level" API list would be...

I could try to figure out how to make the embedding APIs better as this proposal moves forward. However, I currently have no idea about the template system. To reduce duplicated work and refactoring, I think we'd better keep the prototype in Experimental until we have taken into account all the potentially major features (if this proposal is approved). 😄

I think these would probably be better off split into separate packages?

In my opinion, I would keep the text-completion and chat-completion classes in the main package and put the others in separate packages, such as the Server Engine, the OpenAI-style APIs and RAG. As you can see in #683, LLM (text-completion) is only a very thin wrapper around LLMEngine. :)

martindevans (Collaborator) commented

(Just to note I haven't looked at #683 yet. I wasn't suggesting things that should be added to that specific PR, just the general direction of the project overall for the next 12 months!)
