
LLM Serving and Inference

This repository showcases examples of utilizing libraries like llamafile for efficient deployment of large language models (LLMs) on consumer-grade CPU hardware, emphasizing high-throughput and memory-efficient inference.

Introduction  •  Demo Notebooks  •  References  •  Issues  •  TODOs

Introduction

Open-source large language models (LLMs) are democratizing a variety of applications, but most of these models still demand large amounts of memory and compute (e.g., GPUs). To address these fundamental challenges, a growing number of libraries and frameworks for LLM inference and serving are being developed.

This repository demonstrates several of these packages, which offer low latency, high throughput, and cost-effectiveness. The notebooks in this repository demonstrate how to:

  • Execute LLMs on CPUs instead of GPU hardware.
  • Execute quantized Llama-2 models that are ~4GB in size (see the download sketch after this list).
  • Obtain hidden dimension embeddings from GGUF models.
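For reference, quantized GGUF weights of this kind can be fetched from the Hugging Face Hub. Below is a minimal sketch; the repository and file names are illustrative examples, not necessarily the exact artifacts used in the notebooks.

```python
# Minimal sketch: download a ~4 GB, 4-bit quantized Llama-2 GGUF file for CPU inference.
# The repo_id and filename are illustrative; substitute the model you want.
from huggingface_hub import hf_hub_download

model_path = hf_hub_download(
    repo_id="TheBloke/Llama-2-7B-Chat-GGUF",  # example GGUF repository
    filename="llama-2-7b-chat.Q4_K_M.gguf",   # Q4_K_M quantization (roughly 4 GB)
)
print(model_path)  # local cache path to the downloaded weights
```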

Demo Notebooks

llamafile

llamafile lets you distribute and run LLMs with a single binary file. It turns LLM weights into runnable llama.cpp binaries using Cosmopolitan Libc, so one executable runs on six operating systems and can use either CPUs or GPUs. The following notebooks show examples of how to call and execute LLMs using the llamafile library. The files in the llamafile-assets folder were downloaded from here.


[Demo: 8-core CPU executing the llamafile command-line binary with Mistral-7B]

1.) llamafile command-line binary: this notebook demonstrates how to execute jartine/mistral-7b.llamafile from the command line and then save the model's output to a text file (a sketch of this pattern follows below).

2.) llamafile with external model weights: this notebook demonstrates how to execute an LLM downloaded in the .GGUF file format using llamafile-main and then save the model's output to a text file.
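As a rough sketch of the pattern both notebooks follow (the binary names, flags, and file paths below are assumptions and may differ depending on the llamafile version you download), the model can be invoked as a subprocess and its output written to disk:

```python
# Minimal sketch: run a llamafile binary as a subprocess and save the generation
# to a text file. Assumes the binary has been downloaded and made executable
# (chmod +x); flag names follow llama.cpp conventions and may vary by version.
import subprocess

cmd = [
    "./mistral-7b-instruct.llamafile",        # self-contained llamafile binary (assumed name)
    "-p", "What is the capital of France?",   # prompt
    "-n", "128",                              # max tokens to generate
]
# Variant for notebook 2: use the small runner binary with external GGUF weights, e.g.
# cmd = ["./llamafile-main", "-m", "llama-2-7b-chat.Q4_K_M.gguf", "-p", "...", "-n", "128"]

result = subprocess.run(cmd, capture_output=True, text=True)
with open("llm_output.txt", "w") as f:        # hypothetical output path
    f.write(result.stdout)
```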

LLaMa.cpp

LLaMa.cpp (or LLaMa C++) provides a lighter, more portable alternative to heavyweight frameworks. Developed by Georgi Gerganov, it implements Meta's LLaMa architecture in efficient C/C++ and has grown one of the most active open-source communities around LLM inference [Source].

1.) llama.cpp embeddings: this notebook demonstrates how to get hidden-dimension embeddings from a single pass through a GGUF model (see the sketch below). Once the embeddings are available, they can be used for several ML/AI techniques such as classification, text similarity, clustering, etc.
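A minimal sketch of the idea, using the llama-cpp-python bindings rather than the notebook's exact approach (the model path and input text are placeholders):

```python
# Minimal sketch: load a GGUF model with embeddings enabled and extract a
# hidden-dimension embedding vector for a piece of text.
from llama_cpp import Llama

llm = Llama(
    model_path="llama-2-7b-chat.Q4_K_M.gguf",  # placeholder path to a local GGUF file
    embedding=True,                            # enable embedding extraction
)

vector = llm.embed("LLM inference on consumer-grade CPUs")
print(len(vector))  # hidden dimension, e.g. 4096 for a 7B Llama model
```

The resulting vectors can then be fed into any downstream classifier, similarity search, or clustering routine.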

References

  1. File Formats:
  2. Libraries and Frameworks:
  3. Articles and Blogs:

Issues

This repository will be maintained on a best-effort basis. If you face any issues or want to make improvements, please raise an Issue or submit a Pull Request. 😃

TODOs

  • Feel free to raise an Issue for a feature you would like to see added.

Liked the work? Please give a star!
