LLM Server

LLM Server is a Ruby Rack API that hosts the llama.cpp binary in memory(1) and provides an endpoint for text completion using the configured Language Model (LLM).

(1) The server now introduces an interactive configuration key. By default this value is set to true. I have found this mode works well with models like Llama, Open Llama, and Vicuna. Other models, like Orca, tend to hallucinate in interactive mode, but turning interactive mode off and loading the model on each request works for Orca, especially the smaller 3b model, which responds very fast.

Overview

LLM Server serves as a convenient wrapper for the llama.cpp binary, allowing you to interact with it through a simple API. It exposes a single endpoint that accepts text input and returns the completion generated by the Language Model.

The llama.cpp process is kept in memory to provide a better experience. Use any Language Model supported by llama.cpp. Please look at the configuration section of the server to set up your model.

Prerequisites

To use LLM Server, ensure that you have the following components installed:

  • Ruby (version 3.2.2 or higher)
  • A llama.cpp binary. The llama.cpp repository has instructions to build the binary
  • A Language Model (LLM) compatible with the llama.cpp binary. Hugging Face is a good place to look for a model

Getting Started

Follow these steps to set up and run the LLM Server:

  1. Clone the LLM Server repository:
$ git clone https://github.com/your-username/llm-server.git
  2. Change to the project directory:
$ cd llm-server
  3. Install the required dependencies:
$ bundle install
  4. Copy the file config/config.yml.sample to config/config.yml. The sample file is a template to configure your models. See below for more information.

  5. Start the server:

$ bin/server

This will start the server on the default port (9292). Export a PORT environment variable before starting the server to use a different port. The Puma server starts in single mode with one thread to protect the llama.cpp process from parallel inferences; requests are enqueued and served first in, first out.

Configuration

Before looking into server configuration, remember that you need at least one Large Language Model compatible with llama.cpp.

Place your models inside the ./models folder.

Update the configuration file to better fit your model.

current_model: "vic-13b-1.3"
llama_bin: "../llama.cpp/main"
models_path: "./models"

models:
  "orca-3b":
    model: "orca-mini-3b.ggmlv3.q4_0.bin"
    interactive: false
    strip_before: "respuesta: "
    parameters: >
      -n 2048 -c 2048 --top_k 40 --temp 0.1 --repeat_penalty 1.2 -t 6 -ngl 1
    timeout: 90
  "vic-13b-1.3":
    model: "vicuna-13b-v1.3.0.ggmlv3.q4_0.bin"
    suffix: "Asistente:"
    reverse_prompt: "Usuario:"
    parameters: >
      -n 2048 -c 2048 --top_k 10000 --temp 0 --repeat_penalty 1.2 -t 4 -ngl 1
    timeout: 90

The models key allows you to configure one or more models to be used by the server. Note that the server is not going to use all of them at the same time; only the model referenced by current_model is loaded.

To configure a model, use a unique, descriptive name, e.g. open-llama-7b. Then add the following parameters:

  • model: The file name for that model.
  • suffix: A string appended after the prompt. This is required for interactive mode.
  • reverse_prompt: Halts generation at this string and returns control in interactive mode. This is required for interactive mode.
  • interactive: Tells the server how to load the model. When true, the model is loaded in interactive mode and kept in memory. When false, the model is loaded on each request, which works fine for small models. By default, this value is true.
  • strip_before: When running the model in non-interactive mode, use this to strip any unwanted text from the response.
  • parameters: The parameters passed to the llama.cpp process to load and run your model. The model should be executed as interactive to take advantage of being kept in memory all the time. See the llama.cpp documentation to learn what other parameters can be passed to the process.
  • timeout: Tells the server how much time in seconds to wait for the model to produce a response before it assumes the model didn't respond.

The first three keys tell the server how to start the Large Language Model process.

  • current_model: The key of a model defined under models. This is the model to be executed by the server.
  • llama_bin: Points to the llama.cpp binary, relative to the server path.
  • models_path: The path where models are saved, relative to the server path.

API Documentation

The API is simple: you send a JSON object as payload and receive a JSON object as response. You can include the Accept and Content-Type headers in every request with the value application/json, or you can omit them; the server will assume that value for both.

If your request has different values for Accept or Content-Type, you will receive a 406 - Not Acceptable status code.

Requesting an endpoint that is not available will produce a 404 - Not Found response. In case of trouble with the Large Language Model, you will receive a 503 - Service Unavailable status code.
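
For reference, here is a minimal sketch of calling the server directly with Ruby's standard Net::HTTP library. It assumes the server is running locally on the default port and that a successful completion returns a 200 status code. For most applications the llm_client gem described below is the more convenient option.

require "net/http"
require "json"
require "uri"

# Assumes the server is running locally on the default port (9292).
uri = URI("http://localhost:9292/completion")

request = Net::HTTP::Post.new(uri)
request["Accept"] = "application/json"
request["Content-Type"] = "application/json"
request.body = JSON.generate(prompt: "Who created Ruby language?")

response = Net::HTTP.start(uri.host, uri.port) do |http|
  # Give the model enough time to answer; keep this above the configured timeout.
  http.read_timeout = 120
  http.request(request)
end

case response.code.to_i
when 200
  body = JSON.parse(response.body)
  puts "Model: #{body["model"]}"
  puts "Response: #{body["response"]}"
when 406
  warn "Accept and Content-Type must be application/json"
when 404
  warn "Endpoint not found"
when 503
  warn "The Large Language Model is unavailable"
else
  warn "Unexpected status: #{response.code}"
end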

Text Completion

Endpoint: POST /completion

Request Body: The request body should contain a JSON object with the following key:

  • prompt: The input text for which completion is requested.

Example request body:

{
  "prompt": "Who created Ruby language?"
}

Response: The response will be a JSON object containing the completion generated by the LLM and the model used.

Example response body:

{
  "model": "vicuna-13b-v1.3.0.ggmlv3.q4_0.bin",
  "response": "The Ruby programming language was created by Yukihiro Matsumoto in the late 1990s. He wanted to create a simple, intuitive and dynamic language that could be used for various purposes such as web development, scripting and data analysis."
}

Examples

Here"s an example using curl to make a completion request:

curl -X POST -H "Content-Type: application/json" -d '{"prompt":"Who created Ruby language?"}' http://localhost:9292/completion

The response will be:

{
  "model": "vicuna-13b-v1.3.0.ggmlv3.q4_0.bin",
  "response": "The Ruby programming language was created by Yukihiro Matsumoto in the late 1990s. He wanted to create a simple, intuitive and dynamic language that could be used for various purposes such as web development, scripting and data analysis."
}

Feel free to modify the request body and experiment with different input texts or provide a more complex prompt for the model.

The client

There is a gem llm_client that you can use to interact with the LLM Server.

Here is an example of how to use the gem.

result = LlmClient.completion("Who is the creator of Ruby language?")

if result.success?
  puts "Completions generated successfully"
  response = result.success
  puts "Status: #{response.status}"
  puts "Body: #{response.body}"
  puts "Headers: #{response.headers}"
  calculated_response = response.body[:response]
  puts "Calculated Response: #{calculated_response}"
else
  puts "Failed to generate completions"
  error = result.failure
  puts "Error: #{error}"
end

Contributing

Bug reports and pull requests are welcome on GitHub at https://github.com/mariochavez/llm_server. This project is intended to be a safe, welcoming space for collaboration, and contributors are expected to adhere to the code of conduct.

License

The gem is available as open source under the terms of the MIT License.

Code of Conduct

Everyone interacting in the Llm Server project's codebases, issue trackers, chat rooms and mailing lists is expected to follow the code of conduct.

Conclusion

LLM Server provides a simple way to interact with the llama.cpp binary and leverage the power of your configured Language Model. You can integrate this server into your applications to facilitate text completion tasks.
