ScaleLLM: An efficient LLM Inference solution

ScaleLLM is a cutting-edge inference system engineered for large language models (LLMs), meticulously designed to meet the demands of production environments. It extends its support to a wide range of popular open-source models, including Llama3, Gemma, Bloom, GPT-NeoX, and more.

ScaleLLM is currently undergoing active development. We are fully committed to consistently enhancing its efficiency while also incorporating additional features. Feel free to explore our Roadmap for more details.

News:

[03/2024] - Added support for advanced features: CUDA graph, prefix cache, chunked prefill, and speculative decoding.
[11/2023] - First release with support for popular open-source models.

Key Features

High Efficiency: Excels in high-performance LLM inference, leveraging state-of-the-art techniques and technologies like Flash Attention, Paged Attention, Continuous batching, and more.
Tensor Parallelism: Utilizes tensor parallelism for efficient model execution.
OpenAI-compatible API: An efficient Go REST API server that is compatible with the OpenAI API.
Huggingface models: Seamless integration with most popular HF models, supporting safetensors.
Customizable: Offers flexibility for customization to meet your specific needs, and provides an easy way to add new models.
Production Ready: Engineered with production environments in mind, ScaleLLM is equipped with robust system monitoring and management features to ensure a seamless deployment experience.

Supported Models

Models	Tensor Parallel	Quantization	Chat API	HF models examples
Aquila	Yes	Yes	Yes	BAAI/Aquila-7B, BAAI/AquilaChat-7B
Bloom	Yes	Yes	No	bigscience/bloom
Baichuan	Yes	Yes	Yes	baichuan-inc/Baichuan2-7B-Chat
ChatGLM3	Yes	Yes	Yes	THUDM/chatglm3-6b
Gemma	Yes	Yes	Yes	google/gemma-2b
GPT_j	Yes	Yes	No	EleutherAI/gpt-j-6b
GPT_NeoX	Yes	Yes	No	EleutherAI/gpt-neox-20b
GPT2	Yes	Yes	No	gpt2
InternLM	Yes	Yes	Yes	internlm/internlm-7b
Llama3/2	Yes	Yes	Yes	meta-llama/Meta-Llama-3-8B-Instruct, meta-llama/Meta-Llama-3-8B, meta-llama/Llama-2-7b
Mistral	Yes	Yes	Yes	mistralai/Mistral-7B-v0.1
MPT	Yes	Yes	Yes	mosaicml/mpt-30b
Phi2	Yes	Yes	No	microsoft/phi-2
Qwen	Yes	Yes	Yes	Qwen/Qwen-72B-Chat
Yi	Yes	Yes	Yes	01-ai/Yi-6B, 01-ai/Yi-34B-Chat-4bits, 01-ai/Yi-6B-200K

If your model is not included in the supported list, we are more than willing to assist you. Please feel free to create a request for adding a new model on GitHub Issues.

Getting Started

The easiest way to get started with our project is by using the official Docker images. If you don't have Docker installed, please follow the installation instructions for your platform. Below, you will find a list of all available Docker images for our project:

Docker Image	cuda 12.1	cuda 11.8
scalellm	Yes	No
scalellm_cu118	No	Yes
scalellm-gateway	-	-
chatbot-ui	-	-

Docker Installation

You can download and install Docker from the official website: Docker Installation. To use GPUs in Docker, you also need to install the NVIDIA Container Toolkit.

ScaleLLM server

Once Docker is installed, you can run the ScaleLLM container with the latest image using the following commands:

docker pull docker.io/vectorchai/scalellm:latest
docker run -it --gpus=all --net=host --shm-size=1g \
  -v $HOME/.cache/huggingface/hub:/models \
  -e HF_MODEL_ID=meta-llama/Meta-Llama-3-8B-Instruct \
  -e DEVICE=cuda:0 \
  docker.io/vectorchai/scalellm:latest --logtostderr

This command starts the Docker container with GPU support and various configuration options.

HF_MODEL_ID specifies which Hugging Face model you want to run.
HF_MODEL_REVISION specifies which Hugging Face model revision you want to run. By default, it is set to "main".
DEVICE specifies the device on which this model should run. By default, it is set to "auto", using all available GPUs. You can also specify specific GPUs by using "cuda:0,cuda:1", or use CPU by using "cpu".
HF_MODEL_ALLOW_PATTERN specifies which types of files are allowed to be downloaded. By default, it will be configured automatically based on tensor type. Only use this option if the default configuration is not working for you.
HUGGING_FACE_HUB_TOKEN specifies the Hugging Face access token for gated models. Pass it to the container with -e HUGGING_FACE_HUB_TOKEN=$HUGGING_FACE_HUB_TOKEN.
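The environment variables above can also be assembled programmatically. The following is a minimal sketch, not part of ScaleLLM: the helper names `scalellm_docker_cmd` and `as_shell` are hypothetical, and only the flags, image name, and variables shown in this section are used.

```python
import os
import shlex


def scalellm_docker_cmd(model_id, device="auto", revision=None, hf_token=None,
                        cache_dir=None,
                        image="docker.io/vectorchai/scalellm:latest"):
    """Build the `docker run` argument list from the options described above."""
    if cache_dir is None:
        # Same host directory as in the README command.
        cache_dir = os.path.expanduser("~/.cache/huggingface/hub")

    env = {"HF_MODEL_ID": model_id, "DEVICE": device}
    if revision is not None:
        env["HF_MODEL_REVISION"] = revision
    if hf_token is not None:
        env["HUGGING_FACE_HUB_TOKEN"] = hf_token

    cmd = ["docker", "run", "-it", "--gpus=all", "--net=host", "--shm-size=1g",
           "-v", f"{cache_dir}:/models"]
    for name, value in env.items():
        cmd += ["-e", f"{name}={value}"]
    cmd += [image, "--logtostderr"]
    return cmd


def as_shell(cmd):
    """Render the argument list as a copy-pasteable shell command."""
    return shlex.join(cmd)
```

The returned list can be passed to `subprocess.run`, or printed with `as_shell` for inspection before running it by hand.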

Warning

The Docker image tagged 'latest' moves to a new version with each release. To use the latest image, you may need to pull it again or pull a specific version tag.

Two versions of the Docker image are provided, one for CUDA 12.1 and one for CUDA 11.8. Please choose the right image for your environment.

NCCL might fall back to using the host memory if NVLink or PCI is not available. To allow NCCL to use the host memory, we added '--shm-size=1g' to the docker run command.

Although ScaleLLM supports both CPU and GPU, we recommend using GPU for better performance. CPU support is mainly for debugging and testing purposes, so the performance might be sub-optimal.

Ports and Endpoints

After running the Docker container, two ports are exposed:

Port 8888 for the gRPC server:

The gRPC server listens on 0.0.0.0:8888 by default. You can use gRPC to interact with the service.

Port 9999 for the HTTP server:

A simple HTTP server for instrumentation listens on 0.0.0.0:9999 by default. It provides endpoints for managing and monitoring the service:
- Use curl localhost:9999/health to check the health status of the service.
- Use curl localhost:9999/metrics to export Prometheus metrics.
- Use curl localhost:9999/gflags to list all available gflags for configuration.
- More to come.
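The monitoring endpoints can also be queried from a script. The sketch below uses only the Python standard library; the helper names are hypothetical, and the parser assumes the /metrics output follows the Prometheus text format without trailing timestamps.

```python
import urllib.request


def monitor_url(path, host="localhost", port=9999):
    """Build the URL of a monitoring endpoint such as /health or /metrics."""
    return f"http://{host}:{port}/{path.lstrip('/')}"


def fetch(path, timeout=5.0):
    """Return (HTTP status, body text) for a monitoring endpoint."""
    with urllib.request.urlopen(monitor_url(path), timeout=timeout) as resp:
        return resp.status, resp.read().decode("utf-8")


def parse_metrics(text):
    """Parse Prometheus text output into {"name{labels}": value}.

    Comment lines (# HELP, # TYPE) are skipped. Assumes no timestamps.
    """
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        name, _, value = line.rpartition(" ")
        if not name:
            continue
        try:
            samples[name] = float(value)
        except ValueError:
            continue  # ignore lines that are not "name value"
    return samples
```

With the server running, `fetch("health")` returns the status of the health check, and `parse_metrics(fetch("metrics")[1])` gives a dictionary of the exported samples.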

Rest API Server

You can also start a REST API gateway with the latest image using the following commands:

docker pull docker.io/vectorchai/scalellm-gateway:latest
docker run -it --net=host \
  docker.io/vectorchai/scalellm-gateway:latest --logtostderr

The REST API Server is available on localhost:8080. You can use REST API requests to interact with the system. Check out the Usage Examples section for more details.

Chatbot UI

A local Chatbot UI is also available on localhost:3000. You can start it with the latest image using the following commands:

docker pull docker.io/vectorchai/chatbot-ui:latest
docker run -it --net=host \
  -e OPENAI_API_HOST=http://127.0.0.1:8080 \
  -e OPENAI_API_KEY=YOUR_API_KEY \
  docker.io/vectorchai/chatbot-ui:latest

Docker Compose

Using Docker Compose is the easiest way to run ScaleLLM with all the services together. If you don't have Docker Compose installed, please follow the installation doc for your platform.

curl https://raw.githubusercontent.com/vectorch-ai/ScaleLLM/main/scalellm.yml -sSf > scalellm_compose.yml
HF_MODEL_ID=meta-llama/Meta-Llama-3-8B-Instruct DEVICE=cuda docker compose -f ./scalellm_compose.yml up

You will get the following running services:

Chatbot UI on port 3000: localhost:3000
ScaleLLM gRPC server on port 8888: localhost:8888
ScaleLLM HTTP server for monitoring on port 9999: localhost:9999
ScaleLLM REST API server on port 8080: localhost:8080
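Model loading can take a while, so it may help to wait until all four ports accept connections before sending requests. The following is a rough sketch using only the standard library; the service labels in `SERVICES` are hypothetical names, while the ports are the ones listed above.

```python
import socket
import time

# Ports from the service list above; the keys are just labels.
SERVICES = {"chatbot-ui": 3000, "grpc": 8888, "monitoring": 9999, "rest-api": 8080}


def port_open(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def wait_for_services(services, host="localhost", deadline_s=60.0, interval_s=1.0):
    """Poll until every port is reachable or the deadline expires.

    Returns the set of service labels that are still unreachable.
    """
    end = time.monotonic() + deadline_s
    pending = set(services)
    while pending:
        pending = {name for name in pending if not port_open(host, services[name])}
        if not pending or time.monotonic() >= end:
            break
        time.sleep(interval_s)
    return pending
```

Calling `wait_for_services(SERVICES)` returns an empty set once everything is up. An open port only shows that the process is listening; the /health endpoint on port 9999 is the better signal that the model is ready.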

Usage Examples

Chat Completions

You can get chat completions with the following example:

curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Meta-Llama-3-8B-Instruct",
    "messages": [
      {
        "role": "system",
        "content": "You are a helpful assistant."
      },
      {
        "role": "user",
        "content": "Hello!"
      }
    ]
  }'

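The same request with the openai Python package:

import os

import openai  # this example uses the pre-1.0 interface of the openai package

openai.api_base = "http://localhost:8080/v1"
# The pre-1.0 client raises an error when no key is set, so fall back to a
# placeholder (assumes the local gateway does not validate the key).
openai.api_key = os.environ.get("OPENAI_API_KEY", "EMPTY")

# List available models
print("==== Available models ====")
models = openai.Model.list()
print(models)

model = "meta-llama/Meta-Llama-3-8B-Instruct"

# Request a streamed chat completion
completion = openai.ChatCompletion.create(
    model=model,
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Hello"},
    ],
    max_tokens=256,
    stream=True,
)

# Print each content delta as it arrives
print(f"==== Model: {model} ====")
for chunk in completion:
    content = chunk["choices"][0]["delta"].get("content")
    if content:
        print(content, end="", flush=True)
print()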
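If you prefer not to depend on the openai package, the streamed response can be handled with the standard library. The sketch below assumes the gateway streams in the same server-sent-events format as the OpenAI API ("data: {...}" lines ending with "data: [DONE]"); the helper names are hypothetical.

```python
import json


def chat_request_body(model, user_text,
                      system_text="You are a helpful assistant.", stream=True):
    """Encode a chat completion request like the curl example above."""
    return json.dumps({
        "model": model,
        "messages": [
            {"role": "system", "content": system_text},
            {"role": "user", "content": user_text},
        ],
        "stream": stream,
    }).encode("utf-8")


def iter_stream_content(lines):
    """Yield content deltas from an iterable of SSE lines (str or bytes)."""
    for raw in lines:
        line = raw.decode("utf-8") if isinstance(raw, bytes) else raw
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip blank keep-alive lines and other fields
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break  # end-of-stream marker
        chunk = json.loads(payload)
        content = chunk["choices"][0].get("delta", {}).get("content")
        if content:
            yield content
```

To use it, POST the body to http://localhost:8080/v1/chat/completions with a Content-Type: application/json header via `urllib.request.urlopen` and pass the response object to `iter_stream_content`; the response iterates line by line.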

Completions

For regular completions, you can use this example:

curl http://localhost:8080/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Meta-Llama-3-8B-Instruct",
    "prompt": "hello",
    "max_tokens": 32,
    "temperature": 0.7,
    "stream": true
  }'

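The same request with the openai Python package:

import os

import openai  # this example uses the pre-1.0 interface of the openai package

openai.api_base = "http://localhost:8080/v1"
# The pre-1.0 client raises an error when no key is set, so fall back to a
# placeholder (assumes the local gateway does not validate the key).
openai.api_key = os.environ.get("OPENAI_API_KEY", "EMPTY")

# List available models
print("==== Available models ====")
models = openai.Model.list()
print(models)

model = "meta-llama/Meta-Llama-3-8B-Instruct"

# Request a streamed text completion
completion = openai.Completion.create(
    model=model,
    prompt="hello",
    max_tokens=256,
    temperature=0.7,
    stream=True,
)

# Print each text fragment as it arrives
print(f"==== Model: {model} ====")
for chunk in completion:
    content = chunk["choices"][0].get("text")
    if content:
        print(content, end="", flush=True)
print()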

Advanced Features

CUDA Graph

CUDA Graph can improve performance by reducing kernel launch overhead. ScaleLLM enables CUDA Graph for decoding by default. You can also specify which batch sizes to capture with the --cuda_graph_batch_sizes flag.

For example:

docker run -it --gpus=all --net=host --shm-size=1g \
  -v $HOME/.cache/huggingface/hub:/models \
  -e HF_MODEL_ID=meta-llama/Meta-Llama-3-8B-Instruct \
  docker.io/vectorchai/scalellm:latest --logtostderr --enable_cuda_graph --cuda_graph_batch_sizes=1,2,4,8

The limitations of CUDA Graph could cause problems during development and debugging. If you encounter any issues related to it, you can disable CUDA Graph by setting the --enable_cuda_graph=false flag.
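To give an intuition for why the captured batch sizes matter: a CUDA graph replays a fixed set of kernels for a fixed shape, so a runtime needs some rule for mapping an incoming batch onto a captured one. The sketch below shows one generic rule (round up to the nearest captured size and pad, otherwise run without a graph). It is an illustration only and is not taken from ScaleLLM's source; `pick_graph_batch_size` is a hypothetical name.

```python
import bisect


def pick_graph_batch_size(batch_size, captured_sizes):
    """Return the smallest captured size >= batch_size.

    Returns None when the batch is larger than every captured size, meaning
    the step would have to run without a CUDA graph.
    """
    sizes = sorted(captured_sizes)
    i = bisect.bisect_left(sizes, batch_size)
    return sizes[i] if i < len(sizes) else None
```

Under this rule, with sizes 1,2,4,8 a decode batch of 3 sequences would be padded to 4, while a batch of 9 would fall outside the captured range.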

Prefix Cache

The KV cache is a technique that caches intermediate key/value (KV) states to avoid redundant computation during LLM inference. Prefix cache extends this idea by allowing KV caches with the same prefix to be shared among different requests.

ScaleLLM supports Prefix Cache and enables it by default. You can disable it by setting the --enable_prefix_cache=false flag.
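The idea of sharing KV state by prefix can be sketched with a toy block table. This is a conceptual illustration, not ScaleLLM's implementation: the class name, the block size of 4 tokens, and the use of integer block ids in place of real KV tensors are all assumptions.

```python
class PrefixCache:
    """Toy prefix cache: maps each full block of a token prefix to a block id."""

    def __init__(self, block_size=4):
        self.block_size = block_size
        self._blocks = {}  # tuple(tokens[:end]) -> block id
        self._next_id = 0

    def match(self, tokens):
        """Return ids of cached blocks covering the longest cached prefix."""
        hits = []
        for end in range(self.block_size, len(tokens) + 1, self.block_size):
            block_id = self._blocks.get(tuple(tokens[:end]))
            if block_id is None:
                break  # the prefix diverges here; later blocks cannot match
            hits.append(block_id)
        return hits

    def insert(self, tokens):
        """Register every full block of tokens; return how many were new."""
        new = 0
        for end in range(self.block_size, len(tokens) + 1, self.block_size):
            key = tuple(tokens[:end])
            if key not in self._blocks:
                self._blocks[key] = self._next_id
                self._next_id += 1
                new += 1
        return new
```

A request whose prompt starts with an already cached system prompt would get hits from `match` for the shared blocks and only needs to compute KV states for the remaining tokens.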

Chunked Prefill

Chunked Prefill splits a long user prompt into multiple chunks and fills the remaining slots in each batch with decodes. This can improve decoding throughput and avoid the long stalls that hurt the user experience. However, it may slightly increase Time to First Token (TTFT). ScaleLLM supports Chunked Prefill, and its behavior can be controlled by setting the following flags:

--max_tokens_per_batch: The maximum tokens for each batch, default is 512.
--max_seqs_per_batch: The maximum sequences for each batch, default is 128.
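How the two limits interact can be shown with a small scheduling sketch. It is a simplified model rather than ScaleLLM's scheduler: `build_batch` is a hypothetical function, and the ordering (decodes first, then prompt chunks from the leftover token budget) is an assumption.

```python
def build_batch(decode_seqs, prefill_remaining,
                max_tokens_per_batch=512, max_seqs_per_batch=128):
    """Plan one step as a list of (seq_id, num_tokens).

    decode_seqs: ids of sequences in the decode phase (one token each).
    prefill_remaining: dict of seq_id -> prompt tokens not yet processed.
    """
    batch = []
    budget = max_tokens_per_batch

    # Each running decode consumes one token of the budget.
    for seq_id in decode_seqs:
        if budget == 0 or len(batch) == max_seqs_per_batch:
            break
        batch.append((seq_id, 1))
        budget -= 1

    # Prompts take whatever budget is left, one chunk at a time.
    for seq_id, remaining in prefill_remaining.items():
        if budget == 0 or len(batch) == max_seqs_per_batch:
            break
        chunk = min(remaining, budget)
        if chunk > 0:
            batch.append((seq_id, chunk))
            budget -= chunk
    return batch
```

With the default budget of 512 tokens, two running decodes, and a 1000-token prompt, the first step processes 510 prompt tokens and the second step the remaining 490, so the decodes keep producing tokens while the prompt is being prefilled.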

Speculative Decoding

Speculative Decoding is a commonly used technique to speed up LLM inference without changing the output distribution. During inference, it uses a cheaper approximation to generate speculative tokens, which are then validated by the target model. For now, ScaleLLM supports Speculative Decoding with a draft model that generates draft tokens. It can be enabled by configuring a draft model and setting the number of speculative steps.

For example:

docker run -it --gpus=all --net=host --shm-size=1g \
  -v $HOME/.cache/huggingface/hub:/models \
  -e HF_DRAFT_MODEL_ID=google/gemma-2b-it \
  -e HF_MODEL_ID=google/gemma-7b-it \
  docker.io/vectorchai/scalellm:latest --logtostderr --num_speculative_tokens=5 --device=cuda:0 --draft_device=cuda:0
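The draft-and-verify loop can be sketched in a few lines. This shows the greedy variant only, where a draft token is accepted if it equals the target model's own greedy choice; the sampling variant uses a rejection-sampling rule that is not shown here. The function name and the toy model callables are hypothetical.

```python
def speculative_step(prefix, draft_next, target_next, num_speculative_tokens):
    """Run one greedy speculative-decoding step and return the new tokens.

    draft_next / target_next: callables mapping a token list to the next
    token that model would pick greedily.
    """
    # 1. The draft model proposes tokens one by one.
    draft = []
    ctx = list(prefix)
    for _ in range(num_speculative_tokens):
        token = draft_next(ctx)
        draft.append(token)
        ctx.append(token)

    # 2. The target model checks each position. A real engine scores all
    #    positions in a single forward pass; this loop is for clarity.
    accepted = []
    ctx = list(prefix)
    for token in draft:
        expected = target_next(ctx)
        if expected != token:
            accepted.append(expected)  # first mismatch: keep the target's token
            return accepted
        accepted.append(token)
        ctx.append(token)

    # 3. Every draft token matched, so the target adds one more token.
    accepted.append(target_next(ctx))
    return accepted
```

Each step therefore yields between 1 and num_speculative_tokens + 1 tokens, and the result is always what the target model would have produced on its own under greedy decoding.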

Quantization

Quantization is a crucial process for reducing the memory footprint of models. ScaleLLM offers support for two quantization techniques: Accurate Post-Training Quantization (GPTQ) and Activation-aware Weight Quantization (AWQ), with seamless integration into the following libraries: autogptq, exllama, exllamav2, and awq.

By default, exllamav2 is used for GPTQ 4-bit quantization. You can choose a specific implementation with the "--qlinear_gptq_impl" option, which accepts exllama, exllamav2, or auto.
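As a rough guide to the memory savings, weight storage scales with the number of bits per weight. The sketch below is plain arithmetic, not a ScaleLLM API: the function name is hypothetical, and the optional per-group metadata (scales and zero points) is an assumed parameter whose real size depends on the quantization format.

```python
def weight_bytes(num_params, bits_per_weight, group_size=None, meta_bits_per_group=0):
    """Estimate the bytes needed to store model weights.

    group_size / meta_bits_per_group model the extra scale and zero-point
    data stored once per group of weights. Activations and the KV cache
    are not included.
    """
    effective_bits = bits_per_weight
    if group_size:
        effective_bits += meta_bits_per_group / group_size
    return num_params * effective_bits / 8
```

For a model with 8 billion parameters, this gives about 16 GB at 16 bits per weight and about 4 GB at 4 bits, before per-group metadata is counted.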

Limitations

There are several known limitations we are looking to address in the coming months, including:

Only GPUs newer than the Turing architecture are supported.

Contributing

If you have any questions or want to contribute, please don't hesitate to ask in our "Discussions" forum or join our "Discord" chat room. We welcome your input and contributions to make ScaleLLM even better. Please follow CONTRIBUTING.md to get started.

Acknowledgements

The following open-source projects have been used in this project, either in their original form or modified to meet our needs:

License

This project is released under the Apache 2.0 license.
