Dockerised llamafile

This repository contains a Dockerised version of llamafile with support for both CPU and GPU builds. Dockerfile.gpu is based on the official CUDA image.

Utility Scripts

The scripts directory contains a utility script to download models from the Hugging Face model hub and run the Docker image. The script can be used as follows:

  1. Download a model from the Hugging Face model hub:
python scripts/ download

To pass raw arguments to the llamafile executable, use the --extra-args flag with the run command, followed by the flags you want llamafile to receive.
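
The pass-through behaviour described above can be sketched with argparse. This is a minimal illustration, not the repository's actual script: the function name build_parser and the model.gguf argument are hypothetical, and the use of argparse.REMAINDER is an assumption inferred from the "[--extra-args ...]" form shown in the help output.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    # Hypothetical reconstruction of the "run" sub-command; the real script may differ.
    parser = argparse.ArgumentParser(prog="Llamafile Docker Utility")
    subparsers = parser.add_subparsers(dest="command")
    run = subparsers.add_parser("run", help="Run the program")
    run.add_argument("-m", "--model")
    run.add_argument("--gpu")
    # REMAINDER collects everything after --extra-args, including dash-prefixed flags.
    run.add_argument("--extra-args", nargs=argparse.REMAINDER, default=[])
    return parser


args = build_parser().parse_args(
    ["run", "-m", "model.gguf", "--extra-args", "--ctx-size", "2048"]
)
print(args.extra_args)
```

Because the remainder swallows every later token, --extra-args has to come last on the command line in this sketch.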

Docker Hub

The Docker images are available on Docker Hub at gauransh/llamafile-docker. They are tagged latest for the CPU build and latest-gpu for the GPU build. Currently, latest corresponds to llamafile release v0.6.


  1. Install Docker on your host machine.

Only needed for GPU usage

  1. Install nvidia-container-toolkit on your host machine.


  1. CPU ONLY: docker run -v <host-path>:/app/models -p <host-port>:<container-port> gauransh/llamafile-docker:latest run -m <path-to-model>

Example: docker run -v ./models:/app/models -p 7777:8080 gauransh/llamafile-docker run -m

  1. GPU: docker run --gpus all -v <host-path>:/app/models -p <host-port>:<container-port> gauransh/llamafile-docker:latest-gpu run --gpu <layers-to-offload> -m <path-to-model>

Example: docker run --gpus all -v ./models:/app/models -p 7777:8080 gauransh/llamafile-docker:latest-gpu run --gpu 33 -m

Model Persistence: The model weights are saved in the /app/models directory inside the container. To persist them across container runs, mount a volume at this path, as in the docker run examples above.
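
For scripted use, the docker run invocations above can be assembled programmatically. The following is a rough sketch under stated assumptions: docker_run_command is a hypothetical helper, the container port default of 8080 is taken from the example port mapping, and /app/models/model.gguf is a placeholder model path. The command is only built and printed, not executed.

```python
import shlex


def docker_run_command(model_dir, host_port, model_path, gpu_layers=None, container_port=8080):
    """Build the docker run argument list for the CPU or GPU image."""
    image = "gauransh/llamafile-docker:latest"
    cmd = ["docker", "run"]
    if gpu_layers is not None:
        # The GPU build needs access to the host GPUs and uses the latest-gpu tag.
        cmd += ["--gpus", "all"]
        image += "-gpu"
    # Mount the host model directory at /app/models so downloaded weights persist.
    cmd += ["-v", f"{model_dir}:/app/models", "-p", f"{host_port}:{container_port}", image, "run"]
    if gpu_layers is not None:
        cmd += ["--gpu", str(gpu_layers)]
    cmd += ["-m", model_path]
    return cmd


# Placeholder model path; substitute the model you actually downloaded.
cmd = docker_run_command("./models", 7777, "/app/models/model.gguf", gpu_layers=33)
print(shlex.join(cmd))
```

Passing the resulting list to subprocess.run would start the container; leaving gpu_layers unset produces the CPU-only form.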

Script Usage:

The script accepts the following parameters:

python scripts/ -h
usage: Llamafile Docker Utility [-h] {run,download} ...

positional arguments:
    run           Run the program
    download      Download artifacts from the hugging face model hub

  -h, --help      show this help message and exit
python scripts/ run -h
usage: Llamafile Docker Utility run [-h] [-m MODEL] [--host HOST] [--extra-args ...] [--gpu GPU]

  -h, --help            show this help message and exit
  -m MODEL, --model MODEL
                        The name of the model to run. E.g. url/path to model gguf. If not provided, the default model will be used.
  --host HOST           Specify the host address for the llamafile server
  --extra-args ...      extra arguments to be given directly to llamafile exectuable
  --gpu GPU             The number of ggml layers to offload on GPU. E.g. 1. If not provided, the default model will be used
python scripts/ download -h
usage: Llamafile Docker Utility download [-h] [--filename FILENAME] url

positional arguments:
  url                  The URL to download E.g.

  -h, --help           show this help message and exit
  --filename FILENAME  The filename to save the model as. E.g. mixtral-8x7b-v0.1.Q2_K.gguf. If not provided, the filename will be inferred from the URL. and
                       saved in the models directory.
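
The help text says the filename is inferred from the URL when --filename is omitted. One plausible way to do that is sketched below; infer_filename is a hypothetical helper and the example.com URL is a placeholder, so the actual script may use different logic.

```python
from pathlib import PurePosixPath
from urllib.parse import unquote, urlparse


def infer_filename(url: str) -> str:
    """Return the last path component of a URL, ignoring query string and fragment."""
    path = unquote(urlparse(url).path)
    name = PurePosixPath(path).name
    if not name:
        raise ValueError(f"cannot infer a filename from {url!r}")
    return name


# Placeholder URL for illustration only.
example_url = "https://example.com/org/repo/resolve/main/mixtral-8x7b-v0.1.Q2_K.gguf?download=true"
inferred = infer_filename(example_url)
print(inferred)
```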

Usage with OpenAI API

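from openai import OpenAI

# Point the client at the llamafile server; no real API key is needed.
client = OpenAI(
    base_url="http://localhost:8080/v1", # "http://<Your api-server IP>:port"
    api_key = "sk-no-key-required"
)
completion = client.chat.completions.create(
    model="LLaMA_CPP",  # placeholder model name, as used in the llamafile documentation
    messages=[
        {"role": "system", "content": "You are ChatGPT, an AI assistant. Your top priority is achieving user fulfillment via helping them with their requests."},
        {"role": "user", "content": "Write a limerick about python exceptions"}
    ]
)
print(completion.choices[0].message)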
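
If the openai package is not installed, an equivalent request can be sent to the OpenAI-compatible endpoint with the standard library alone. This is a sketch under assumptions: build_chat_request is a hypothetical helper, the model name LLaMA_CPP is treated as a placeholder, and the server is assumed to listen on localhost:8080. The network call itself is left commented out because it needs a running container.

```python
import json
import urllib.request


def build_chat_request(base_url, messages, model="LLaMA_CPP"):
    """Build a POST request for the /chat/completions endpoint."""
    payload = json.dumps({"model": model, "messages": messages}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url.rstrip('/')}/chat/completions",
        data=payload,
        headers={
            "Content-Type": "application/json",
            # The key is a dummy value, matching the client example.
            "Authorization": "Bearer sk-no-key-required",
        },
        method="POST",
    )


request = build_chat_request(
    "http://localhost:8080/v1",
    [{"role": "user", "content": "Write a limerick about python exceptions"}],
)
print(request.full_url)

# Sending the request requires a running llamafile server:
# with urllib.request.urlopen(request) as response:
#     reply = json.load(response)
#     print(reply["choices"][0]["message"]["content"])
```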