ggml

Tensor library for machine learning

Note that this project is under active development.
Some of the development is currently happening in the llama.cpp and whisper.cpp repos.

Features

Written in C
16-bit float support
Integer quantization support (4-bit, 5-bit, 8-bit, etc.)
Automatic differentiation
ADAM and L-BFGS optimizers
Optimized for Apple Silicon
Uses AVX / AVX2 intrinsics on x86 architectures
Uses VSX intrinsics on ppc64 architectures
No third-party dependencies
Zero memory allocations during runtime
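
To make the integer quantization feature more concrete, here is a minimal sketch of symmetric 4-bit block quantization in plain C. It is an illustration only and does not use the ggml API or reproduce ggml's actual storage layout: the names block_q4, quantize_block_q4 and dequantize_block_q4, and the block size of 32, are assumptions made for this example.

```c
#include <stddef.h>
#include <stdint.h>

#define BLOCK_SIZE 32 /* values per block (illustrative choice) */

/* Hypothetical block layout: one shared scale plus 32 packed 4-bit codes. */
typedef struct {
    float   scale;                 /* step size shared by the whole block */
    uint8_t codes[BLOCK_SIZE / 2]; /* two 4-bit codes per byte */
} block_q4;

/* Round to the nearest integer, ties away from zero. */
static int round_nearest(float v) {
    return (int)(v + (v >= 0.0f ? 0.5f : -0.5f));
}

/* Clamp a quantization level to the symmetric range [-7, 7]. */
static int clamp_level(int q) {
    return q < -7 ? -7 : (q > 7 ? 7 : q);
}

/* Quantize BLOCK_SIZE floats into one block. */
static void quantize_block_q4(const float *x, block_q4 *out) {
    /* Find the largest magnitude in the block. */
    float amax = 0.0f;
    for (size_t i = 0; i < BLOCK_SIZE; i++) {
        float a = x[i] < 0.0f ? -x[i] : x[i];
        if (a > amax) amax = a;
    }

    /* Map [-amax, amax] onto the integer levels [-7, 7]. */
    out->scale = amax / 7.0f;
    float inv = out->scale != 0.0f ? 1.0f / out->scale : 0.0f;

    for (size_t i = 0; i < BLOCK_SIZE; i += 2) {
        /* Shift levels by 8 so each code fits in an unsigned nibble. */
        int q0 = clamp_level(round_nearest(x[i] * inv)) + 8;
        int q1 = clamp_level(round_nearest(x[i + 1] * inv)) + 8;
        out->codes[i / 2] = (uint8_t)(q0 | (q1 << 4));
    }
}

/* Reconstruct approximate floats from one block. */
static void dequantize_block_q4(const block_q4 *in, float *y) {
    for (size_t i = 0; i < BLOCK_SIZE; i += 2) {
        uint8_t b = in->codes[i / 2];
        y[i]     = (float)((int)(b & 0x0F) - 8) * in->scale;
        y[i + 1] = (float)((int)(b >> 4) - 8) * in->scale;
    }
}
```

With this layout, each value is reconstructed to within about half a step (scale / 2) of the original. On typical platforms a block takes 20 bytes instead of the 128 bytes needed for 32 single-precision floats.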

Updates

Whisper inference (example)

With ggml you can efficiently run Whisper inference on the CPU.

Memory requirements:

Model	Disk	Memory
tiny	75 MB	~280 MB
base	142 MB	~430 MB
small	466 MB	~1.0 GB
medium	1.5 GB	~2.6 GB
large	2.9 GB	~4.7 GB

GPT inference (example)

With ggml you can efficiently run GPT-2 and GPT-J inference on the CPU.

Here is how to run the example programs:

# Build ggml + examples
git clone https://github.com/ggerganov/ggml
cd ggml
mkdir build && cd build
cmake ..
make -j4 gpt-2 gpt-j

# Run the GPT-2 small 117M model
../examples/gpt-2/download-ggml-model.sh 117M
./bin/gpt-2 -m models/gpt-2-117M/ggml-model.bin -p "This is an example"

# Run the GPT-J 6B model (requires 12GB disk space and 16GB CPU RAM)
../examples/gpt-j/download-ggml-model.sh 6B
./bin/gpt-j -m models/gpt-j-6B/ggml-model.bin -p "This is an example"

# Install Python dependencies
python3 -m pip install -r ../requirements.txt

# Run the Cerebras-GPT 111M model
# Download from: https://huggingface.co/cerebras
python3 ../examples/gpt-2/convert-cerebras-to-ggml.py /path/to/Cerebras-GPT-111M/
./bin/gpt-2 -m /path/to/Cerebras-GPT-111M/ggml-model-f16.bin -p "This is an example"

The inference speeds that I get for the different models on my 32GB MacBook M1 Pro are as follows:

Model	Size	Time / Token
GPT-2	117M	5 ms
GPT-2	345M	12 ms
GPT-2	774M	23 ms
GPT-2	1558M	42 ms
---	---	---
GPT-J	6B	125 ms
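
The per-token times above convert to throughput by dividing 1000 by the latency in milliseconds. For example, 125 ms per token for GPT-J corresponds to 8 tokens per second, and 5 ms per token for the smallest GPT-2 model corresponds to 200 tokens per second. A small helper, with the hypothetical name tokens_per_second, could look like this:

```c
/* Convert a per-token latency in milliseconds to tokens per second. */
static double tokens_per_second(double ms_per_token) {
    return 1000.0 / ms_per_token;
}
```

These figures only restate the table; actual throughput depends on the hardware, thread count, and prompt length.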

For more information, check out the corresponding programs in the examples folder.

Using Metal (only with GPT-2)

For GPT-2 models, offloading to the GPU is possible. Note that it will not improve inference performance, but it will reduce power consumption and free up the CPU for other tasks.

To enable GPU offloading on macOS:

cmake -DGGML_METAL=ON -DBUILD_SHARED_LIBS=Off ..

# pass -ngl N to offload N layers to the GPU
./bin/gpt-2 -t 4 -ngl 100 -m models/gpt-2-117M/ggml-model.bin -p "This is an example"

Using cuBLAS

# fix the path to point to your CUDA compiler
cmake -DGGML_CUBLAS=ON -DCMAKE_CUDA_COMPILER=/usr/local/cuda-12.1/bin/nvcc ..

Using CLBlast

cmake -DGGML_CLBLAST=ON ..

Resources

GGML - Large Language Models for Everyone: a description of the GGML format provided by the maintainers of the llm Rust crate, which provides Rust bindings for GGML.
marella/ctransformers: Python bindings for GGML models.
go-skynet/go-ggml-transformers.cpp: Golang bindings for GGML models.
smspillaz/ggml-gobject: GObject-introspectable wrapper for use of GGML on the GNOME platform.
