LLaMa-int4 inference (proof of concept, not for production use)

This is built on a hard fork of the gq branch of the ggml C++ library. The commit history was lost and is not fully available. This code may not work on systems other than a MacBook or iMac.

Dependencies

PyTorch: https://pytorch.org/get-started/locally/
Transformers: https://github.com/huggingface/transformers/#installation
SentencePiece: https://github.com/google/sentencepiece#python-module

You might also need cmake, which can be installed either via brew or from source.

Usage for LLaMa inference

Clone the repo and enter it: git clone https://github.com/NolanoOrg/llama-int4-mac && cd llama-int4-mac
Place the LLaMa model weights in ../llama/save/7B, relative to the current folder.
Create a config.json file inside ../llama/save/7B (relative to the current folder) with the following keys, adjusting the values for your model: {"vocab_size": 32000, "n_positions": 2048, "n_embd": 4096, "n_hddn": 11008, "n_head": 32, "n_layer": 32, "rotary_dim": 64}
Convert LLaMa to ggml format: cd examples/llama && python3 convert-h5-to-ggml.py ../../../llama/save/7B/ 1 (the final argument selects the precision: 1 denotes fp16, 0 denotes fp32).
Return to the repository root and create the build directory if it is not already present: cd ../.. && mkdir build
Build the quantizer and the inference binary: cd build && cmake .. && make llama-quantize && make llama
Quantize the model: mkdir ../models/ && ./bin/llama-quantize ../../llama/save/7B/llama-f32.binf16.bin ../models/llama7B-0-quant4.bin 2
Switch to the Python app directory (cd ../app) and edit the prompt in tok_prompt.py.
Run the model: python3 tok_prompt.py | ../build/bin/llama --model_path ../models/llama7B-0-quant4.bin --vocab ../vocab/llama_vocab_clean.txt -n [NO_OF_TOKENS_TO_GENERATE]
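
The config.json step above can also be done with a few lines of Python instead of by hand. This is only a sketch: the helper name write_config is hypothetical and not part of this repository, and the values are the 7B ones listed in the step above, so change them for other model sizes.

```python
import json
from pathlib import Path

# Keys and values copied from the config.json step above (7B model).
LLAMA_7B_CONFIG = {
    "vocab_size": 32000,
    "n_positions": 2048,
    "n_embd": 4096,
    "n_hddn": 11008,
    "n_head": 32,
    "n_layer": 32,
    "rotary_dim": 64,
}


def write_config(model_dir, config=LLAMA_7B_CONFIG):
    """Write config.json into an existing model directory and return its path.

    write_config is a hypothetical helper for illustration only.
    """
    path = Path(model_dir) / "config.json"
    path.write_text(json.dumps(config, indent=2) + "\n")
    return path
```

From the repository root, calling write_config("../llama/save/7B") would place the file next to the model weights, assuming that directory already exists.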

Changes to the original codebase

Most of the codebase for LLaMa is in examples/llama.

I edited src/ggml.c and its header file to add new activation functions and RMSNorm and to fix the RoPE embedding, among other things.
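
For readers unfamiliar with RMSNorm, here is a minimal pure-Python sketch of what the operation computes on a single vector. It is not the code in src/ggml.c; it follows the standard definition of RMSNorm, and the function name rms_norm and the eps default are assumptions made for illustration.

```python
import math


def rms_norm(x, weight, eps=1e-6):
    """Divide x by its root mean square, then scale element-wise by weight.

    Illustrative only: eps and the function name are assumptions,
    not values taken from this repository.
    """
    mean_sq = sum(v * v for v in x) / len(x)
    inv_rms = 1.0 / math.sqrt(mean_sq + eps)
    return [v * inv_rms * w for v, w in zip(x, weight)]
```

Unlike LayerNorm, this does not subtract the mean, so with unit weights the output has a root mean square of approximately 1 but keeps the input's direction.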

I also made changes to examples/utils.h and examples/utils.cpp to add the LLaMa model.

Credits:

This codebase is based on the ggml library.

License

MIT
