
feat: add experimental refactorings. #683

Conversation

@AsakusaRinne (Collaborator) commented Apr 21, 2024

This is a prototype implementation for #684. It shows what the design will look like and demonstrates the feasibility of the proposal.

TODO:

  • KV-cache manager.
  • Support streaming.
  • More reasonable implementation of LLama tokenizer.
  • More reasonable implementation of Sampling methods.
  • Providing more stopping criteria.
  • Add APIs for users to create LLM.
  • Add a server engine, as described in the proposal.
  • Support multi-modal inputs.
  • Support beam-search.
  • Support Grammar.
  • Add unit tests and more examples.

How to run

I have provided an example named Experimental LLM. It shows how to use a very simple API to run batched inference over multiple inputs. Since streaming isn't supported yet, it will take a while before any output appears on the console.

Plan

I'm not asking to merge this PR now, since it's not in a good state yet and would also introduce breaking changes. I'll follow the steps below to move on.

  1. Discuss with the core development team and the community in detail.
  2. Add the features under the Experimental namespace first, and try my best to keep the other parts unchanged.
  3. Include the experimental features in the new release and get feedback from users.
  4. Replace the executors and ChatSession with the new features, then mark them as obsolete if necessary.
  5. Move the features from the Experimental namespace to the root namespace.

Completing the whole TODO list requires a large amount of work, so I'd like to discuss it with you first and hope we can agree on it after some improvements to the proposal. It will proceed much faster if you would like to develop it together. :)

About the code

The code is under the Experimental namespace. Note that some parts are not well done and were only implemented to get a simple example running. So, please ignore the naive implementations of LLamaGreedySamplingMethod, LLamaTokenizer and SequenceLengthStoppingCriteria.

Collaborator

This is my main concern. Assuming this is the highest-level API running locally and the class used to implement the web API, I would build an interface similar to the Ollama one. I wouldn't build the interface on string prompts but on "message" prompts, to make the system extensible and support llava or other kinds of models. The message concept would allow support for any type of model.

If we don't think about this kind of interface from the very beginning, I think we will fail our users.

public interface IOllamaApiClient
	{
		/// <summary>
		/// Gets or sets the name of the model to run requests on.
		/// </summary>
		string SelectedModel { get; set; }

		/// <summary>
		/// Sends a request to the /api/chat endpoint
		/// </summary>
		/// <param name="chatRequest">The request to send to Ollama</param>
		/// <param name="streamer">
		/// The streamer that receives parts of the answer as they are streamed by the Ollama endpoint.
		/// Can be used to update the user interface while the answer is still being generated.
		/// </param>
		/// <param name="cancellationToken">The token to cancel the operation with</param>
		/// <returns>List of the returned messages including the previous context</returns>
		Task<IEnumerable<Message>> SendChat(ChatRequest chatRequest, IResponseStreamer<ChatResponseStream> streamer, CancellationToken cancellationToken = default);

		/// <summary>
		/// Sends a request to the /api/copy endpoint to copy a model
		/// </summary>
		/// <param name="request">The parameters required to copy a model</param>
		/// <param name="cancellationToken">The token to cancel the operation with</param>
		Task CopyModel(CopyModelRequest request, CancellationToken cancellationToken = default);

		/// <summary>
		/// Sends a request to the /api/create endpoint to create a model
		/// </summary>
		/// <param name="request">The parameters for the model to create</param>
		/// <param name="streamer">
		/// The streamer that receives status updates as they are streamed by the Ollama endpoint.
		/// Can be used to update the user interface while the operation is running.
		/// </param>
		/// <param name="cancellationToken">The token to cancel the operation with</param>
		Task CreateModel(CreateModelRequest request, IResponseStreamer<CreateStatus> streamer, CancellationToken cancellationToken = default);

		/// <summary>
		/// Sends a request to the /api/delete endpoint to delete a model
		/// </summary>
		/// <param name="model">The name of the model to delete</param>
		/// <param name="cancellationToken">The token to cancel the operation with</param>
		Task DeleteModel(string model, CancellationToken cancellationToken = default);

		/// <summary>
		/// Sends a request to the /api/embeddings endpoint to generate embeddings
		/// </summary>
		/// <param name="request">The parameters to generate embeddings for</param>
		/// <param name="cancellationToken">The token to cancel the operation with</param>
		Task<GenerateEmbeddingResponse> GenerateEmbeddings(GenerateEmbeddingRequest request, CancellationToken cancellationToken = default);

		/// <summary>
		/// Sends a request to the /api/generate endpoint to get a completion
		/// </summary>
		/// <param name="request">The parameters to generate a completion</param>
		/// <param name="cancellationToken">The token to cancel the operation with</param>
		/// <returns>
		/// A context object that holds the conversation history.
		/// Should be reused for further calls to this method to keep a chat going.
		/// </returns>
		Task<ConversationContextWithResponse> GetCompletion(GenerateCompletionRequest request, CancellationToken cancellationToken = default);

		/// <summary>
		/// Sends a request to the /api/tags endpoint to get all models that are available locally
		/// </summary>
		/// <param name="cancellationToken">The token to cancel the operation with</param>
		Task<IEnumerable<Model>> ListLocalModels(CancellationToken cancellationToken = default);

		/// <summary>
		/// Sends a request to the /api/pull endpoint to pull a new model
		/// </summary>
		/// <param name="request">The request parameters</param>
		/// <param name="streamer">
		/// The streamer that receives status updates as they are streamed by the Ollama endpoint.
		/// Can be used to update the user interface while the operation is running.
		/// </param>
		/// <param name="cancellationToken">The token to cancel the operation with</param>
		Task PullModel(PullModelRequest request, IResponseStreamer<PullStatus> streamer, CancellationToken cancellationToken = default);

		/// <summary>
		/// Sends a request to the /api/push endpoint to push a new model
		/// </summary>
		/// <param name="request">The request parameters</param>
		/// <param name="streamer">
		/// The streamer that receives status updates as they are streamed by the Ollama endpoint.
		/// Can be used to update the user interface while the operation is running.
		/// </param>
		/// <param name="cancellationToken">The token to cancel the operation with</param>
		Task PushModel(PushRequest request, IResponseStreamer<PushStatus> streamer, CancellationToken cancellationToken = default);

		/// <summary>
		/// Sends a request to the /api/show endpoint to show the information of a model
		/// </summary>
		/// <param name="model">The name of the model the get the information for</param>
		/// <param name="cancellationToken">The token to cancel the operation with</param>
		/// <returns>The model information</returns>
		Task<ShowModelResponse> ShowModelInformation(string model, CancellationToken cancellationToken = default);

		/// <summary>
		/// Sends a request to the /api/generate endpoint to get a completion and streams the returned chunks to a given streamer
		/// that can be used to update the user interface in real-time.
		/// </summary>
		/// <param name="request">The parameters to generate a completion for</param>
		/// <param name="streamer">
		/// The streamer that receives parts of the answer as they are streamed by the Ollama endpoint.
		/// Can be used to update the user interface while the answer is still being generated.
		/// </param>
		/// <param name="cancellationToken">The token to cancel the operation with</param>
		/// <returns>
		/// A context object that holds the conversation history.
		/// Should be reused for further calls to this method to keep a chat going.
		/// </returns>
		Task<ConversationContext> StreamCompletion(GenerateCompletionRequest request, IResponseStreamer<GenerateCompletionResponseStream> streamer, CancellationToken cancellationToken = default);
	}

Collaborator Author

Yes, the current implementation is simplified to reduce complexity, but multi-modal support is necessary. Which kind of abstraction would you prefer for multi-modal support? I'm considering the following two structures for Sequence.

  1. Add a MultimodalData member to Sequence. It provides an interface to get the embeddings. The multi-modal model, however, would not be included in LLMEngine (a rough sketch of this option follows below).
  2. Integrate the multi-modal model into LLMEngine, or add a MultiModalLLMEngine class. In this way, the input modal data would be raw data, such as images and audio.
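
For reference, a minimal sketch of option 1 (all names here are hypothetical and not code from this PR): Sequence only carries an abstract multi-modal payload and asks it for embeddings, while the multi-modal model itself stays outside LLMEngine.

using System;
using System.Collections.Generic;

public interface IMultimodalData
{
    // Embeddings that the model runner can feed into the batch alongside text tokens.
    ReadOnlyMemory<float> GetEmbeddings();
}

public sealed class MultimodalSequenceSketch
{
    // Prompt/output tokens as usual.
    public List<int> TokenIds { get; } = new();

    // Optional multi-modal payload; null for plain-text requests.
    public IMultimodalData? MultimodalData { get; init; }
}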

Collaborator

In my opinion LLMEngine should manage any type of model, and the interface should be able to handle this. Having some abstraction such as Message (the properties are just examples) would allow us to manage what we have right now (text and images) and to extend that in the future as needed.

	public class Message
	{
		public Message(ChatRole role, string content, string[] images)
		{
			Role = role;
			Content = content;
			Images = images;
		}

		public Message(ChatRole role, string[] images)
		{
			Role = role;
			Images = images;
		}

		public Message(ChatRole? role, string content)
		{
			Role = role;
			Content = content;
		}

		// We need this for json deserialization
		public Message()
		{
		}

		/// <summary>
		/// The role of the message, either system, user or assistant
		/// </summary>
		[JsonPropertyName("role")]
		public ChatRole? Role { get; set; }

		/// <summary>
		/// The content of the message
		/// </summary>
		[JsonPropertyName("content")]
		public string Content { get; set; }

		/// <summary>
		/// Base64-encoded images (for multimodal models such as llava)
		/// </summary>
		[JsonPropertyName("images")]
		public string[] Images { get; set; }
	}

For me that kind of wrapper would be better than the plain text/strings proposed in this PR:

public void AddRequest(string requestId, string? prompt,...

Whether we use the same structure or another one in the middle tier of the library is, in my opinion, less important, because I think the main contract for the user will be LLMEngine; anyone working on the middle or lower tiers of the library will need a deeper knowledge of its inner workings anyway. Keeping the upper interface as something stable that can be extended but not changed in each version is important for building tools on top of LLamaSharp.

Collaborator Author

It sounds nice to add a Message abstraction to hold both text and multimodal data. However, I don't think ChatRole should be part of it. LLMEngine is supposed to deal with any kind of request, mainly text-completion and chat-completion, single requests and multiple requests. Adding ChatRole to it seems to go against the purpose of LLMEngine.

I think that the main contract for the user will be LLMEngine.

In fact, I'm afraid it's still too difficult to use for users who have little experience with LLMs, especially if they also have little experience in programming. I prefer asking them to use higher-level APIs, like the LLM, ChatLLM and ServerEngine in the proposal, and to leave LLMEngine to those who are experienced with LLMs or LLamaSharp and want to customize their own high-level APIs. Either way, I agree that the APIs of LLMEngine should be stable.
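
To illustrate the split being discussed, here is a hypothetical sketch (names are illustrative, not code from this PR): the engine-level input stays role-free, and the chat role only appears in a higher-level chat API built on top of LLMEngine.

public class EngineInput
{
    public string? Text { get; set; }       // plain text part, if any
    public string[]? Images { get; set; }   // e.g. base64 images for multi-modal models
}

public enum ChatRole { System, User, Assistant }

public class ChatMessage
{
    public ChatRole Role { get; set; }              // only meaningful for chat-completion
    public EngineInput Content { get; set; } = new();
}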

@martindevans (Member) commented Apr 21, 2024

To be honest I think this PR needs to be split into many smaller PRs, which we can consider one by one. It's overwhelming at the moment, with so many new and somewhat unrelated things all added at once!

(Edit: I will do a full review when I have time later)

@AsakusaRinne (Collaborator Author)

@martindevans Actually I don't expect this PR to be merged. It's only a prototype to demonstrate my proposal. With this work, it at least proves that the proposal is possible to implement. It's far from complete (maybe 30%), so I'd like to discuss with you all to make sure I'm not on the wrong track before I move on. :)

Note: to make the basic text-completion example work ASAP, I used some naive implementations in sampling and detokenization. The main idea is manifested in LLMEngine, Sequence, Scheduler and ModelRunner.

/// <summary>
/// Method to sample the model output.
/// </summary>
public interface ISamplingMethod
Member

This looks similar to ISamplingPipeline; is it basically that?

Collaborator Author

Yes, I renamed it because I wasn't sure whether it would end up slightly different. Using a different name avoids introducing modifications in the non-experimental part.

/// <param name="result"></param>
/// <param name="skipSpecialTokens"></param>
/// <returns>The consumed tokens for decoding.</returns>
int ConvertIdsToText(IEnumerable<int> tokenIds, out string result, bool skipSpecialTokens = false);
Member

We can't have an API like this. It's the same problem as the old DeTokenize method, solved by the StreamingTokenDecoder.

A single character may be several tokens. So, for example, say the tokens [1, 2, 3] produce the character A: decoding [1, 2] will produce a broken string, and then decoding [3] will produce another broken string.
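
For illustration, here is a rough sketch of the incremental-decoding idea behind StreamingTokenDecoder (not the actual LLamaSharp implementation): buffer the raw bytes of each token and only emit text once the buffered bytes form complete UTF-8 characters.

using System.Text;

public sealed class IncrementalUtf8Decoder
{
    private readonly Decoder _decoder = Encoding.UTF8.GetDecoder();
    private readonly StringBuilder _output = new();

    // tokenBytes: the raw bytes of a single token (e.g. looked up from the vocabulary).
    public void Add(byte[] tokenBytes)
    {
        var chars = new char[Encoding.UTF8.GetMaxCharCount(tokenBytes.Length)];
        // The Decoder keeps incomplete multi-byte sequences internally until the rest
        // of the bytes arrive with a later token, so no broken characters are emitted.
        var written = _decoder.GetChars(tokenBytes, 0, tokenBytes.Length, chars, 0, flush: false);
        _output.Append(chars, 0, written);
    }

    // Returns all complete text decoded so far and clears the buffer.
    public string Read()
    {
        var text = _output.ToString();
        _output.Clear();
        return text;
    }
}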

Collaborator Author

There is an IncrementalDecodingOffset member on Sequence. I'm trying to find a way to integrate the streaming decoder better into this design but haven't managed it yet. The biggest problem with using it directly is that it rents some memory which needs to be released. Though adding StreamingTokenDecoder as a member of Sequence could make it release automatically, it's a bit weird for Sequence to own a StreamingTokenDecoder, because it should be part of the tokenizer. Besides, Sequence, as a mid-level class which may be used frequently by users, is not supposed to deal with logic related to LLamaContext, while a LLamaContext is required to initialize a StreamingTokenDecoder.

{
// TODO: `ApplyChatTemplate` API

// TODO: Batched Encode?
Member

As far as I know there are no batched tokenization APIs in llama.cpp that we could use.

Contributor

I'd be all in to allow bypassing llama.cpp's built-in tokenizer -- no reason to lock the architecture with that as a dependency.

/// <summary>
/// Stores the data, status, and other information of a sequence.
/// </summary>
public sealed class Sequence
Member

Would this be implemented as a higher-level wrapper around a Conversation object?

Collaborator Author

Though Conversation already provides some abilities for dealing with batched inference and sequences, I prefer not to wrap around it because it requires a BatchedExecutor, which in turn requires a LLamaContext. In this proposal I'd like to make Sequence a class with no coupling to llama.cpp. Instead, I put everything related to llama.cpp in the model runner and the tokenizer.

/// The logprobs of the output token.
/// (Token id -> logP(x_i+1 | x_0, ..., x_i))
/// </summary>
public float[]? Logprobs { get; init; }
Member

If this is used for every output token it would mean a huge number of allocations (an array of VocabSize floats for every token)!

Collaborator Author

Yes, I'm considering something like this: only fetching it from the native side when it's needed.

namespace LLama.Experimental.Core
{
// TODO: This is only the most simple implementation to run the test now. We should replace it in the future.
public class AntipromptStoppingCriteria: IStoppingCriteria
Member

This seems to be the same as the AntiPromptProcessor?

Collaborator Author

Yes, but IStoppingCriteria is more flexible: it allows the sequence to be stopped because of text, tokens, length or any other reason (as long as the user can implement it).
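
For example, a hypothetical sketch of that flexibility (the actual IStoppingCriteria in this PR may have a different shape; these names are illustrative):

using System;
using System.Collections.Generic;
using System.Linq;

public interface IStoppingCriteriaSketch
{
    // Decide whether generation for a sequence should stop.
    bool ShouldStop(IReadOnlyList<int> tokenIds, string decodedText);
}

// Stop once the sequence has generated a maximum number of tokens.
public sealed class MaxLengthCriteria : IStoppingCriteriaSketch
{
    private readonly int _maxTokens;
    public MaxLengthCriteria(int maxTokens) => _maxTokens = maxTokens;

    public bool ShouldStop(IReadOnlyList<int> tokenIds, string decodedText)
        => tokenIds.Count >= _maxTokens;
}

// Stop once the decoded text ends with any of the given antiprompt strings.
public sealed class AntipromptTextCriteria : IStoppingCriteriaSketch
{
    private readonly string[] _antiprompts;
    public AntipromptTextCriteria(params string[] antiprompts) => _antiprompts = antiprompts;

    public bool ShouldStop(IReadOnlyList<int> tokenIds, string decodedText)
        => _antiprompts.Any(a => decodedText.EndsWith(a, StringComparison.Ordinal));
}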

namespace LLama.Experimental.Core.LLamaCpp
{
// TODO: This is only the most simple implementation to run the example. It should be replaced in the future.
public class LLamaGreedySamplingMethod: ISamplingMethod
Member

(I think I may have mentioned this already in another reply, but I can't find it now).

This looks the same as the GreedySamplingPipeline. And more generally I guess ISamplingMethod is basically the same as ISamplingPipeline?

Collaborator Author

Yes, they are very similar. I renamed it only because I don't want to change the current implementations due to some slight differences.

/// <summary>
/// Input special for <see cref="LLamaCppRunner"/>.
/// </summary>
public class LLamaCppRunnerInput
Member

Is this the same as LLamaBatch?

Collaborator Author

Yes, for the same reason as above: I don't want to break the current implementations. Instead, I want to keep everything in the Experimental namespace.


internal static IEnumerable<T> SkipLastImpl<T>(IEnumerable<T> source, int count)
{
    return source.Take(source.Count() - count);
}
Member

This enumerates source twice.
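
For reference, a sketch of a single-pass alternative (illustrative only, not the code in the PR): buffer count items in a queue and yield an item only after count newer items have been seen, so source is enumerated exactly once.

using System.Collections.Generic;

internal static class EnumerableSketch
{
    internal static IEnumerable<T> SkipLastSinglePass<T>(IEnumerable<T> source, int count)
    {
        var buffer = new Queue<T>(count + 1);
        foreach (var item in source)
        {
            buffer.Enqueue(item);
            if (buffer.Count > count)
                yield return buffer.Dequeue(); // safe to yield: `count` newer items exist
        }
    }
}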

@martindevans (Member)

A few parts here seem to be duplicates of things we've already got, but that's something we can resolve later.

LLMEngine, Sequence, Scheduler and ModelRunner.

I'm not 100% sure I understand the purpose of the various bits. Here's my understanding:

  • LLMEngine: process requests into responses. Responsible for the complete pipeline from text in, tokenization, scheduling into batches, inference, sampling, detokenization and output text.
  • Sequence: a single stream of text (very similar to a Conversation)
  • Scheduler: Takes a load of requests and decides how to pack them into a batch for execution.
  • ModelRunner: Low level interface for inference

Please correct me if I'm wrong :)

Thoughts

To me the LLMEngine feels like it's packing too much responsibility into one class. It'll either be over-abstracted, to try and match every use case, or it won't satisfy all the different needs of users.

With the various bits I've added over the last few months (e.g. BatchedExecutor, StreamingTokenDecoder, AntipromptProcessor, ISamplingPipeline) I've tried to expose all of these capabilities in ways that can be assembled together however you want, without losing any of the power or performance of the lower level API. Personally I think that's a better way to design the middle level than to try and squeeze everything into one class.

The idea of a scheduler is quite interesting. Handling the required prompt->infer->sample flow over multiple conversations is a tricky-to-use part of the BatchedExecutor API at the moment. A scheduler which can decide when to run inference and sampling, or even unload entire conversations to free up KV cache space, is a possible solution, although it is going to be extremely complicated!

ModelRunner I'm a bit confused about - It seems to be a way to support entirely different backends to llama.cpp? That sounds like it would introduce a huge amount of complexity into the codebase!

@AsakusaRinne (Collaborator Author) commented Apr 22, 2024

LLMEngine, Sequence, Scheduler and ModelRunner.

  • LLMEngine: Its most important ideas are 1) abstracting the method to run one step of inference and 2) hiding the details of the scheduler and the model runner. In the current executors, the one-step inference and the outer loop are mixed together in the InferAsync method, making it less extensible. When it comes to continuous batching, the abstraction of one-step inference is obviously significant.
  • Scheduler: It's also responsible for scheduling to get good performance when there is a large number of requests.

To me the LLMEngine feels like it's packing too much responsibility into one class. It'll either be over abstract, to try and match every use-case, or it won't satisfy all the different needs of users.

I think it's already quite a simple wrapper over the scheduler and the model runner? As you can see here, it only does three things: calling the scheduler, calling the model runner and processing the outputs (the only logic written inside LLMEngine). It's not responsible for the whole inference, only one step.
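
For illustration, a hypothetical sketch of that one-step flow (all names are placeholders rather than the PR's actual types):

using System.Collections.Generic;

public sealed class SequenceSketch
{
    public List<int> TokenIds { get; } = new();
}

public interface ISchedulerSketch
{
    // Pick the sequences that should run in this step.
    IReadOnlyList<SequenceSketch> Schedule();
}

public interface IModelRunnerSketch
{
    // One forward pass plus sampling for the scheduled sequences.
    IReadOnlyList<int> RunOneBatch(IReadOnlyList<SequenceSketch> batch);
}

public sealed class EngineStepSketch
{
    private readonly ISchedulerSketch _scheduler;
    private readonly IModelRunnerSketch _runner;

    public EngineStepSketch(ISchedulerSketch scheduler, IModelRunnerSketch runner)
        => (_scheduler, _runner) = (scheduler, runner);

    // One step: call the scheduler, call the model runner, process the outputs.
    public void Step()
    {
        var batch = _scheduler.Schedule();          // 1. scheduling
        var sampled = _runner.RunOneBatch(batch);   // 2. inference + sampling
        for (var i = 0; i < batch.Count; i++)       // 3. process outputs (stopping criteria would run here)
            batch[i].TokenIds.Add(sampled[i]);
    }
}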

With the various bits I've added over the last few months (e.g. BatchedExecutor, StreamingTokenDecoder, AntipromptProcessor, ISamplingPipeline) I've tried to expose all of these capabilities in ways that can be assembled together however you want, without losing any of the power or performance of the lower level API. Personally I think that's a better way to design the middle level than to try and squeeze everything into one class.

I'm not sure if you've misunderstood me; that's also what I think. I believe we'll reach an agreement on the direction and only diverge on some details. 😄

  • BatchedExecutor: IModelRunner will cover its functionalities.
  • StreamingTokenDecoder: I will introduce it but haven't come up with a good way yet, as described in the previous comments.
  • AntipromptProcessor: IStoppingCriteria is the corresponding thing, though I think it's more flexible. The name is adopted from transformers, one of the most popular LLM libraries.
  • ISamplingPipeline: ISamplingMethod is effectively an alias of it (with possibly slight differences), because I want to keep all the modifications in the Experimental namespace.

ModelRunner I'm a bit confused about - It seems to be a way to support entirely different backends to llama.cpp? That sounds like it would introduce a huge amount of complexity into the codebase!

As described in #684, I'm not going to introduce any new backend now, though it would be possible with the proposal. The purpose of the model runner abstraction is to separate things into two kinds: those related to llama.cpp and those not related to llama.cpp. The backbone of our current design, especially the mid-level (the executors), mainly follows the llama.cpp examples. TBH I don't think the llama.cpp examples are good enough to be our only reference. They are examples, not a library with well-designed APIs, and coupling too deeply with them may prevent us from making LLamaSharp better. In this proposal, as you can see here, the LLamaCppModelRunner only executes one batch and gets the logits before it returns. This will make it easy for us to refer to other popular LLM inference libraries to improve the non-llama.cpp-related parts.

A scheduler which can decide when to run inference and sampling or even to unload entire conversations to free up KV cache space is a possible solution, although it is going to be extremely complicated!

Yes, that will be complicated, so we don't want to mix llama.cpp-related logic into it, which would further increase the complexity. 😆


The proposal is a big change compared with the current design, so please feel free to ask me if anything is unclear. I'm open on the design and will only move on once we've reached an agreement after some modifications to it. :)

@Lyrcaxis (Contributor) left a comment

Looks solid! I have a few suggestions:

  1. Add abstractions for Modals:
public interface ILlamaModal {
    bool NeedsEvaluation(IContext context, out int neededTokens);
    void Evaluate(IBatch target);
}

This could allow for more straightforward mid-level implementation & flexibility.
Modals can then have standalone functionality shareable between models.

  2. Oof:
public interface ILLamaSymbol { } // Future-proof & can handle multimodality
public struct Token : ILLamaSymbol { }      // Size == 1
public struct Image : ILLamaSymbol { }      // Size == model.DecodedImageSize
public struct Embedding : ILLamaSymbol { }  // Size == model.EmbeddingSize

// Cache: Dictionary<LLamaPos, ILLamaSymbol> cache;
// Cache: ToString() => [..]

Will allow universal and robust control of the cache with just references. (e.g. context.Remove(image))
Iffy but promising imo, especially considering Meta's multimodality promise.

  3. Only expose Batch & Cache handling via a core Context class.
public class Context {
    ILLamaSymbol[] KVCache; // Contains offsets & is easily convertible to Dictionary.
    public ILLamaSymbol[] this[Range range] => ...; // For QOL like `context[512..]`

    IReadOnlyList<IBatch> batches;   // Name 'Batch' would make more sense instead of 'Sequence' imo
    public IBatch AddBatch() {       // Could register callbacks to the new batch automatically,
        var batch = default(IBatch); // ..handle switch of control (e.g. if modified externally)
        // .. initialize, register callbacks, pass references, etc.
        return batch; // There should be no other way to register a batch to this context.
    }
}

The idea here is that this'd be a global handle to allow various 'plays' from different levels.
.. and of course that it'd have the whole representation of a lightweight cache accessible in C#.


/// <summary>
/// Method to sample the model output.
/// </summary>
public interface ISamplingMethod
Contributor

I like the flexibility of SamplingMetaData replacing the current ISamplingPipeline's lastTokens! A lot of info could be passed that way. A slightly more flexible design could be this:

public interface ISampler<T> { T Sample(in Span<float> logits, SamplingMetaData metadata = null); }
public class BaseTokenSampler : ISampler<LLamaToken> { ... }
public interface ISamplerOption {
    bool ShouldApply(in Span<float> logits, SamplingMetaData data) => true;
    void Apply(ref Span<float> logits, SamplingMetaData data);
}

And for the highest level something like samplingParams.CreateSampler() could unify these.

Collaborator Author

Thanks a lot for your suggestions! I like the ILLamaSymbol approach you mentioned. Actually Martin has made some similar changes before (for example LLamaToken and LLamaSeqId). I skipped these abstractions to rush this prototype and will add them later.

Add abstractions for Modals

This looks a bit confusing to me; could you please tell me more about your idea? Who should call NeedsEvaluation, and when?

IReadOnlyList<IBatch> batches; // Name 'Batch' would make more sense instead of 'Sequence' imo

From my point of view they are two entirely different things. A Sequence is a collection of semantically continuous information whose length grows during inference, while a Batch is the data fed into the model in one step of inference. A Batch might contain data from multiple sequences, as you can see in llama.cpp's llama_batch.

Contributor

I'm glad you like it! I actually made a POC for the modals with some draft code; I'll share it tomorrow, but one idea is to call them from the high-level generator that has access to model.modals to digest the prompt.
Modals don't need to have state and would mostly be logic containers (e.g. ImageModal would process pending image links in the LLamaBatch, and TextModal would finally process the tokens).

Regarding Batch and Sequence: from an inference perspective it just makes more sense to me for a 'Batch' to be a 'batch' for the user to process/consume, rather than a 'batch' for the model's internals, but I get that it's an established term for that right now :p Anyway, whenever I wrote 'batch' above I meant what's established as 'sequence'!

Collaborator Author

It's certainly okay to expose the batch to users, but I prefer to expose it in low-level APIs (ModelRunner in this draft) because the batch is directly related to llama.cpp.

Member

SamplingMetaData

I like the idea of SamplingMetaData containing more info than just the raw lastTokens that ISamplingPipeline has at the moment (although I'd probably try to come up with another name). It would be a nice simple PR if someone wants to adjust the ISamplingPipeline interface to accept something like that (even if it only wraps up lastTokens for now, it's a good place to add more things in the future).
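
A hypothetical sketch of such a wrapper (the name, the members and the adjusted Sample signature below are assumptions, not the library's actual API; LLamaToken is assumed to be in scope):

using System;
using System.Collections.Generic;

public sealed class SamplingContext
{
    // For now it only wraps the recently sampled tokens...
    public IReadOnlyList<LLamaToken> LastTokens { get; init; } = Array.Empty<LLamaToken>();

    // ...but more fields (current sequence, step index, etc.) could be added later
    // without changing the sampling interface again.
}

// An adjusted pipeline method could then take the wrapper instead of a raw token span, e.g.:
// LLamaToken Sample(SafeLLamaContextHandle ctx, Span<float> logits, SamplingContext context);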

samplingParams.CreateSampler()

I don't like this idea though. One of the motivations of creating ISamplingPipeline was that we have/had configurable sampling and it's a mess - see ISamplingParams. Many of those properties don't mean anything depending on the value of other properties. Adding new sampling stages needs even more properties. Stages are not re-orderable so even with all this it's not powerful enough. New/custom sampling stages cannot be added at all.

In general I don't think designs that try to have a single all-powerful config object are very good. I think it's almost always simpler and more powerful to expose primitives that can be chained together in a nice simple chain of calls like this.

@Lyrcaxis (Contributor) commented Apr 23, 2024

@AsakusaRinne It's certainly okay to expose batch to users, but I prefer to expose it in low-level APIs

We agree ^^ I now regret the naming I used lol.. I meant to suggest we wrap it all up inside a Context class for all the other low-level and middle-level classes.

@martindevans I don't like this idea though. [..] - see ISamplingParams. [..] it's not powerful enough.

You're right! ISamplerOption in a dynamic list would fix that in my mind, but I get where you're coming from.
I also had a chain design in mind, but it would shift control during creation, somewhat like so:
var sampler = p.With(new CustomTemperatureSamplerOption()).With(new BaseTopPSamplerOption(0.95f)).ToSampler();

But I understand how that might be considered boilerplate/bloat and leaving it up to a pipeline might be best here.

Member

Before I pushed the current sampler API I experimented with a few different designs; one encapsulated every sampler operation into an object (with all the relevant parameters), so a pipeline would just be a list of sampler stage objects. I didn't find it very good to work with: in reality sampler stages are not always completely isolated, so it was a very leaky abstraction.

@AsakusaRinne (Collaborator Author)

ModelRunner I'm a bit confused about - It seems to be a way to support entirely different backends to llama.cpp? That sounds like it would introduce a huge amount of complexity into the codebase!

@martindevans As an addition to my previous reply, I think we already have multiple backends (I noticed this when I was working on the auto-download proposal). llava_shared could be regarded as a new backend because it has nearly all the properties a new backend has:

  • It has a separate native library file.
  • It has some P/Invoke APIs as bindings.
  • It is responsible for something other than the text generation of the llama.cpp backend.

If I'm not misunderstanding it, llava_shared was added mainly because of the CLIP model used for LLaVA. So, fortunately, it's lightweight enough and did not require too many modifications in LLamaSharp.

What I said before is that I don't want to introduce an entirely different backend, for example rwkv.cpp or TensorRT-LLM. However, if there's an audio-modal model similar to LLaVA, shall we support it too? If the answer is yes, I think we should take that into account.

@zsogitbe (Contributor)

Rinne, your proposal seems very complicated. This of course does not mean that it is not good, but that it can maybe still be simplified.

What I would like to see in the future is an LLM engine which can process (schedule) parallel requests based on some priority rules, available memory (GPU), etc. Each request should have full state info which can be swapped in/out quickly when needed. I see this swapping as simply saving the state into memory (with maybe an option to save it into a DB/file for later processing) and sending it back to the requester (ref), then freeing memory and loading the other request's state if any, processing, etc. In the case of the same request, just keep everything going. I would not use stateful/stateless executors anymore because they are redundant with the former logic (one type of 'executor' is enough).

I am not sure what you mean by batching, but if you mean splitting the input into chunks which fit into the context, then I would also do this simply by processing a whole request first (processing all chunks of one request sequentially without stopping) and then allowing the next request. Some libraries do this chunking already (for example, Semantic Kernel). Maybe it is also an option to use SK as a lower-level engine. SK already schedules several tasks (text completion, chat, etc.)... only the final scheduling of the LLM engine is missing (memory management, etc.), which could be added as a plugin to SK.

All details about KV cache management, sampling, etc. should be hidden and manageable by options in the LLM engine (but preferably set to good values with no need to change).

Sequence, conversation, message, text... I think that it would be better to have just one type, for example, called 'message' (byte[] ?), because that is what is going to the LLM. Based on the request this message will be simple text, chat history, image, etc.

So, as a summary, I think that you should think in reverse (from the point of view of the application which uses the library) and see how to make all of this much simpler and much more efficient.

@AsakusaRinne (Collaborator Author)

Each request should have full state info which can be swapped in/out quickly when needed. I see this swapping as simply saving the state into memory (may have an option to save it into a DB/file for later processing) and sending it back to the requester (ref). Then, freeing memory and loading the other request's state if any, processing, etc. In case of the same request just keep everything going.

Please refer to this code in the PR. The scheduler has three queues: waiting, running and swapped. When a new request comes in, it is pushed into waiting. In a scheduling step, requests in waiting are moved to running if there are enough resources. When a request with higher priority comes in and running is already full, some requests may be swapped out. I think it matches what you're looking for, though I haven't implemented the swap in/out yet.
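
For illustration, a hypothetical sketch of that three-queue scheduling step (the names, the priority rule and the capacity check are all illustrative, not the PR's implementation):

using System.Collections.Generic;
using System.Linq;

public sealed class RequestSketch
{
    public int Priority { get; init; }
}

public sealed class SchedulerSketch
{
    private readonly Queue<RequestSketch> _waiting = new();
    private readonly List<RequestSketch> _running = new();
    private readonly List<RequestSketch> _swapped = new();
    private readonly int _maxRunning;

    public SchedulerSketch(int maxRunning) => _maxRunning = maxRunning;

    // New requests always enter the waiting queue first.
    public void AddRequest(RequestSketch request) => _waiting.Enqueue(request);

    // One scheduling step: promote waiting requests while there is room, and swap out
    // a lower-priority running request when a higher-priority one is waiting.
    public IReadOnlyList<RequestSketch> Schedule()
    {
        while (_waiting.Count > 0 && _running.Count < _maxRunning)
            _running.Add(_waiting.Dequeue());

        if (_waiting.Count > 0 && _running.Count >= _maxRunning)
        {
            var incoming = _waiting.Peek();
            var victim = _running.OrderBy(r => r.Priority).FirstOrDefault();
            if (victim is not null && incoming.Priority > victim.Priority)
            {
                _running.Remove(victim);
                _swapped.Add(victim);              // swap out: saving the state would happen here
                _running.Add(_waiting.Dequeue());
            }
        }

        return _running;
    }
}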

About batching...

I mean the technique of processing multiple sequences (conversations) at the same time to improve throughput, which was first introduced to llama.cpp in this PR.


From my point of view, the design is not really complicated. As you can see here, only 6 components are abstracted (including Sequence), of which the scheduler is the only one that's entirely new. However, it does require a lot of work and introduces breaking changes, so I opened this draft PR for discussion. :)

@zsogitbe (Contributor)

Rinne, the scheduler stuff sounds good!

About batching, this is a bit confusing. I think that what they are discussing in ggerganov/llama.cpp#3749 is internal batching of tokens on the GPU, not several requests/conversations. I think that processing multiple conversations at the same time should mean that the scheduler executes the conversations sequentially, not in parallel, by swapping only the context state (if the model is the same). That is not batching but scheduling, though I may be misunderstanding something.
I think that having several sequences/conversations in one batch is very complicated to manage and prone to error. This probably means that one part of the GPU memory is one request and another part is another request and they are processed at once!? And are we sure that both requests have the same parameters?

@martindevans (Member)

the design is not really complicated

I have to disagree, I'm afraid, and it's leaving me very conflicted about this proposal. It tries to introduce multiple new concepts all at once, with a very high level of abstraction, hiding/abstracting away important details, and it's all built on top of a very unstable foundation (llama.cpp changes fast). I honestly can't imagine this all coming together successfully with the current approach.

Rather than trying to build this all at once almost as an entirely new library in the experimental namespace I would rather see this done in very small parts, step by step, building on top of the primitives we have rather than duplicating large amounts of them.

For example referencing back to some of my earlier comments:

  • LLamaBatch/LLamaCppRunnerInput
  • ISamplingPipeline/ISamplingMethod
  • AntipromptProcessor/IStoppingCriteria

All of these are almost equivalent. If there are improvements to them to be made, let's develop those improvements in isolation and get them added to the existing interfaces!

multiple backends

I agree something like llava could be regarded as a separate backend; it provides new capabilities that could probably be fenced off into a set of extensions, and the core library wouldn't need to be aware of it at all.

What I was really asking about, though, was multiple backends for the same thing, e.g. llama.cpp for text LLMs but then also a separate backend (e.g. exllama) for text LLMs. I would be very strongly against this: we would have to abstract away a lot of the details of the backend to produce a "lowest common denominator" API between all the backends.

Further thoughts

Let's define some of the various levels we have in LLamaSharp at the moment.

At the moment we have:

  • Very low level APIs (raw PInvoke calls to a C-API). In general no one should ever use these.
  • Low-ish level APIs (e.g. various SafeHandle objects). The goal is that this should be the lowest level API - this level should be as powerful and as performant as the lower level, but slightly safer.
  • Mid level APIs (e.g. ISamplingPipeline, BatchedExecutor). These are designed to present as much of the power and performance of the lower levels as possible, but wrapped up in a C# idiomatic API. Not for general consumption, but probably the level an advanced user with a decent understanding of LLM internals should use.
  • High level APIs (e.g. various executors such as StatelessExecutor). The default API a new user who just wants text in -> text out should use. It's ok to lose some of the power that lower levels have, if it makes it simpler to use.

Given these levels, does it make sense to split the very low level and low-ish level APIs out of LLamaSharp into their own package (e.g. LLamaSharp.LowLevel)? These are basically just "pure" llama.cpp, exposed to C#. Anyone who wants to build a library on llama.cpp and not use all of our high-level stuff could use this package.

This user profile actually describes... me! Personally I want decent bindings to llama.cpp so I can build other libraries based on LLMs. The very high level stuff doesn't really interest me that much; it abstracts away too much for my taste.

@martindevans (Member)

About batching, this is a bit confusing ... I think that having several sequences/conversations in one batch is very complicated to manage and prone to error

@zsogitbe "Batching" in this sense means running inference over a number of separate "sequences" all at once, in one pass over the model you can produce continuations for all sequences at once. This is significantly faster than running inference for each sequence individually because LLMs are normally memory bandwidth bound, not compute bound (e.g. running 8 sequences at once might cost you about 15% speed, for 8x as many tokens, very very rough estimate because it depends a lot on your hardware). This is all exposed in the BatchedExecutor right now, have a look at some of the "batched" examples to see some of the cool things you can do with it (batching, parallel eval, forking, rewinding etc).

Each sequence needs a certain amount of memory in the KV cache (to store its entire history), so scheduling multiple sequences is all about packing as many of them as possible into that space. This is made extra complex by the fact that sequences can (and often do) share bits of the KV cache! For example, if you evaluate the system prompt (just once) you can then fork that sequence into e.g. 8 conversations which all generate different continuations but only consume the KV cache memory for the system prompt once. It is made even more complex by the fact that if you save a sequence, unload it from the KV cache and then immediately reload it, it will no longer share memory, so there's potentially a very high cost to unloading a sequence the first time you do it.
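
To make the memory accounting concrete, here is an illustrative sketch (hypothetical type, not the BatchedExecutor API): a shared prefix such as the system prompt is stored once in the KV cache and reused by every fork, so each fork only pays for its own continuation tokens.

using System.Collections.Generic;
using System.Linq;

public sealed class KvCacheAccountingSketch
{
    public int SharedPrefixTokens { get; init; }                         // e.g. the system prompt
    public IReadOnlyList<int> PerForkTokens { get; init; } = new int[0]; // continuation length per fork

    // Total KV cache slots used: the shared prefix is only counted once.
    public int TotalCacheSlots() => SharedPrefixTokens + PerForkTokens.Sum();
}

// Example: a 500-token system prompt forked into 8 conversations of 100 new tokens each
// uses 500 + 8 * 100 = 1300 cache slots, rather than 8 * (500 + 100) = 4800.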

@AsakusaRinne (Collaborator Author) commented Apr 23, 2024

strongly against multiple LLM backends

I'm not going to introduce any other LLM backend. It's only a design intended to make things clearer (from my point of view).

Rather than trying to build this all at once almost as an entirely new library in the experimental namespace I would rather see this done in very small parts, step by step, building on top of the primitives we have rather than duplicating large amounts of them.

That's certainly okay. This PR is a draft to demonstrate my idea.

does it make sense to split LLamaSharp into a new package (e.g. LLamaSharp.LowLevel) that contains the very low level and low-ish level APIs into their own package?

Don't you think this idea is similar to the LLamaCppModelRunner abstraction in my proposal? :) To me it's good to split it into more packages, such as the ones below (please ignore the possibly poor naming).

  • LLamaSharp.Backend: hold the native library files. (Could be optional with auto-download)
  • LLamaSharp.Native: includes all the things directly coupled with llama.cpp (very low level APIs and low-ish level APIs).
  • LLamaSharp.Engine: responsible for mid-level and some high-level APIs (at least we should provide basic text-completion).
  • LLamaSharp.OpenAI, LLamaSharp.semantic-kernel...etc.: integrations along with high-level APIs.

TBH I don't think BatchedExecutor should be part of LLamaSharp.Engine. It's reasonable to have an executor in it, but not the BatchedExecutor, because it's directly related to llama.cpp. Though batching is a great technique to improve throughput, users of mid/high level APIs are unlikely to care whether we are using batching or normal inference; text-generation quality and speed are their main concerns.

Edit: making BatchedExecutor internal in LLamaSharp.Engine sounds reasonable to me. What I mean is that we should avoid exposing details of the llama.cpp backend in mid/high level APIs.

I want decent bindings to llama.cpp so I can build other libraries based on LLMs. The very high level stuff doesn't really interest me that much, it abstracts away too much for my taste.

I believe you are not alone. Splitting the package seems like a good idea, though it requires a lot of work. :)

@zsogitbe (Contributor)

  • LLamaSharp.Backend: hold the native library files. (Could be optional with auto-download)
  • LLamaSharp.Native: includes all the things directly coupled with llama.cpp (very low level APIs and low-ish level APIs).
  • LLamaSharp.Engine: responsible for mid-level and some high-level APIs (at least we should provide basic text-completion).
  • LLamaSharp.OpenAI, LLamaSharp.semantic-kernel...etc.: integrations along with high-level APIs.

TBH I don't think BatchedExecutor should be a part of LLamaSharp.Engine. It's reasonable to have an executor in it, but not the BatchedExecutor because it's related with llama.cpp directly. Though batching is a great technology to improve the throughput, users of mid/high level APIs are likely not to care whether we are using batching or normal inference. The text-generation quality and the speed are the most concerns from them.

This sounds good, but I don't like the LLamaSharp.OpenAI part. I would leave KernelMemory and SemanticKernel in separate packages as they are now. I would not go in the direction of OpenAI, because that is a completely different story. KernelMemory and SemanticKernel are not OpenAI-related, but they can work with OpenAI.

It is important that LLamaSharp.Engine is very stable and does not need to change often when llama.cpp changes! The interface between LLamaSharp.Native and LLamaSharp.Engine should be designed in such a clever way that LLamaSharp.Engine can become as stable a product as possible.

I do not believe in batching for desktop applications, because we adjust the context size, the number of layers uploaded to the GPU, etc. to optimize GPU memory usage. In this scenario batching becomes unimportant. In a cloud server environment, where you have several GPUs and a lot of (expensive!) memory, batching is interesting. It is thus important that batching does not become the default, but a choice of execution. However, I still do not really understand what happens when two requests have different inference parameters (different temperature, topP, etc.) and you process them in one batch together... I expect many errors here.

I believe in automatic scheduling of model processing, which would then also mean automatic (GPU) memory management. I am doing the GPU memory management manually now: freeing the GPU, loading another model, etc. This should be done automatically and optimally (difficult!). This is where I expect the most from Rinne.

@martindevans (Member)

I do not believe in batching for Desktop applications because we adjust the context size, number of layers uploaded to the GPU, etc to optimize GPU memory usage. In this scenario batching becomes not important.

Batching is important for desktop usage too; it provides a speedup in any scenario except:

  1. You are compute bound (tiny model or a very slow device)
  2. You only have one sequence

Even desktop apps with a single user chatting to a model might have multiple sequences (e.g. internal summarisation of history, multiple agents for reasoning etc)

I think you, and maybe Rinne, might be misunderstanding the BatchedExecutor. It isn't only for batching. It provides a better (safer) API over the low-level primitives of llama.cpp and exposes things like KV cache manipulation as well (rewind, fork, shift etc). My long-term design goal as I've been working on it is for it to act as the foundation of all future executors (even ones with a single conversation thread).

@zsogitbe (Contributor)

Yes, maybe we don't understand exactly what you mean/want to do. What I mean is that I am filling my GPU's memory completely with one request, so there is no more room for an extra sequence. I think that in this case batching (if I understand you correctly) will only cause overhead which will slow down execution. But maybe you could make a very simple example with two requests in parallel and then we can test it and see what happens: a test where you run the 2 requests sequentially and measure the time, and then run the same with batching and measure the time. We will then have a good idea of whether it is interesting.

@martindevans (Member)

@zsogitbe If you're completely filling the KV cache with a single request then batching can't help; it can only parallelise as far as it can fit sequences into the cache.

You still get the advantage of a better API though; for example, if your context is full there's a nice simple ShiftLeft(N) method to free up some space without re-evaluating anything. At the moment this kind of shifting is hardcoded into the executors and you have no control over it. This is what I mean about it not just being about batching: it's a good foundation even for a single sequence. There's also no overhead to using it like this; it's using the same low-level API as the other executors.

Another nice thing about batching is you can share the cache, so if you're almost filling the cache there are some potential use cases. For example if you had a long conversation filling 90% of the cache you could fork it (costing no extra memory), ask the LLM for a summary in that fork (using up extra memory only for the summary) and then discard that fork (getting that memory back).

@AsakusaRinne (Collaborator Author)

@martindevans It seems that ideas from more people are needed to converge this discussion. Your in-progress work sounds coupled with the current design of the master branch, while mine is relatively independent of it. What about introducing a LLamaSharp.Experimental package? I'll put my work in it and warn users that we don't guarantee its stability. That would give us several benefits:

  1. I don't have to be blocked on my experiment. Some design problems won't become apparent until I've finished enough of the work. As the work progresses, I could open small PRs to optimize some current implementations/APIs, while keeping the controversial parts in the experimental package.
  2. You can also leave the possible refactor alone for now and concentrate on your work.
  3. This pattern is similar to the package separation we discussed above. We could treat it as an experiment before we actually decide to do so. Even regardless of this, keeping an experimental package doesn't sound bad. It's not my own package; it welcomes experimental features from all developers.

@AsakusaRinne (Collaborator Author)

To clarify: I was not saying we should work out two solutions and let one defeat the other. It seems that @martindevans is more concerned about low/mid level APIs, while I prefer the mid/high level APIs. Some of the work from one of us might eventually be dropped, but the rest could be put together. I believe it won't be a waste of time.

@zsogitbe (Contributor)

You still get the advantage of a better API though, for example if your context is full there's a nice simple ShiftLeft(N) method to free up some space without re-evaluating anything. At this moment this kind of shifting is hardcoded into the executors and you have no control over it. This is what i mean about it not just being about batching, it's a good foundation even for a single sequence. There's also no overhead to using it like this, it's using the same low level API as the other executors.

Another nice thing about batching is you can share the cache, so if you're almost filling the cache there are some potential use cases. For example if you had a long conversation filling 90% of the cache you could fork it (costing no extra memory), ask the LLM for a summary in that fork (using up extra memory only for the summary) and then discard that fork (getting that memory back).

I have a very strong feeling that this should not be called batching but maybe load balancing, and that it should incorporate scheduling... the whole automatic management of processing. I also think that this should happen automatically, to prevent users from needing a PhD to use it :)

@zsogitbe (Contributor)

I feel a little bit of conflict here. It is maybe not my place to suggest a solution, but I will try anyway. You can always ignore my suggestion :).

I think that the best way to proceed would be for Rinne to implement scheduling and Martin to implement load balancing (batching, as Martin calls it), and then these two would be merged together into a magic solution which would be the subject of many talks around the world, because it would be such a great improvement to LLamaSharp.

Rinne could further improve his experimental solution and we would cherry-pick each improvement and add it to the current solution. If deeper refactoring is needed later because, for example, the code becomes a bit less logical due to the additions, then it could be done based on the already-working modules (which will be much easier than reworking the whole thing at once now, without knowing whether the improvements work well and are necessary).

@AsakusaRinne (Collaborator Author)

I will move this work to the experimental package in the future, so I'm going to close this draft PR. However, please feel free to leave comments here.

Our opinions differ in many ways, but this is exactly the purpose of opening this draft PR -- to hear different voices and avoid being isolated in my own mind. Thank you all for your comments!

Labels: break change, enhancement