# LLM functionalities

This container provides the functionality required for the LLM to work. You can find the code for the Alpaca-7B model based on LLaMA that we currently use here. The container automatically downloads the fine-tuned model from S3.

This container contains the llm_runner, which other system components call with prompts. We decided to host the LLM in a separate Docker container so that the required GPUs can be allocated to it specifically. The LLaMA model requires about ?? of GPU RAM to run. We found that AWS EC2 G5 instances work best for running this model with a relatively high inference speed (between 0.5 and 2.0 seconds, depending on the number of tokens).

## LLM runner

When the container spins up, the LLM is loaded into GPU memory if a GPU is detected. If no GPU is found, the LLM is not loaded and the container quits gracefully. Loading the model into memory usually takes about 3 minutes.
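A minimal sketch of this startup behaviour, assuming PyTorch is used for GPU detection; the `load_llm` stub and the model directory are hypothetical placeholders, not the actual loader in this repository:

```python
import sys

import torch


def load_llm(model_dir: str):
    """Hypothetical loader stub; the real runner loads the fine-tuned
    Alpaca-7B weights that were downloaded from S3."""
    raise NotImplementedError


def main():
    # Quit gracefully when no GPU is available, as described above.
    if not torch.cuda.is_available():
        print("No GPU detected, skipping LLM load and shutting down.")
        sys.exit(0)

    # Loading the model into GPU memory takes roughly 3 minutes.
    model = load_llm("models/alpaca_llm")  # directory name is illustrative


if __name__ == "__main__":
    main()
```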

By hosting the model in a separate Docker container, we can flexibly update the prompts sent to the model from the other functionalities without waiting for the model to load into memory again. In a Kubernetes deployment, we can therefore restart the other containers without taking this container down.

Any module in the other Docker containers can call the model either with a single request (call_model) or with a batch request (batch_call_model); a usage sketch follows below. Several other functionalities in the system currently call the model this way.
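A hedged sketch of how another module might call the runner; only the function names come from this README, while the import path and the exact signatures are assumptions:

```python
# Hypothetical import path; only call_model and batch_call_model
# are named in this README.
from llm_runner import call_model, batch_call_model

# Single request: one prompt in, one completion out (assumed signature).
answer = call_model("Summarise the user's last three messages.")

# Batch request: a list of prompts processed in one call (assumed signature).
answers = batch_call_model([
    "Classify the sentiment of this review.",
    "Extract all named entities from this paragraph.",
])
```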


## Required models

If you want to define new models that the container has to load, add the model name to model_requirements.txt. For example, this is how the file currently looks:

```
alpaca_llm
```

The download_required_artefacts.sh script runs on startup of the Docker container and checks whether all required models are available. If any are missing and the container is authenticated with AWS, it downloads them.
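A rough Python sketch of the check-and-download logic that the shell script implements (the actual implementation is download_required_artefacts.sh); the bucket name, key layout, and local model directory are illustrative assumptions:

```python
from pathlib import Path

import boto3

MODEL_DIR = Path("models")          # assumed local model directory
BUCKET = "example-llm-artefacts"    # hypothetical S3 bucket name


def ensure_required_models():
    # Uses the ambient AWS credentials; without authentication the
    # download calls below would fail, mirroring the script's behaviour.
    s3 = boto3.client("s3")
    required = Path("model_requirements.txt").read_text().split()
    for name in required:
        target = MODEL_DIR / name
        if target.exists():
            continue  # model already present locally
        # Download the missing model artefact from S3; the key layout
        # is an assumption for illustration only.
        MODEL_DIR.mkdir(parents=True, exist_ok=True)
        s3.download_file(BUCKET, f"{name}.tar.gz", str(target) + ".tar.gz")


if __name__ == "__main__":
    ensure_required_models()
```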