update rag readme and app #91

Merged 1 commit on Mar 22, 2024
Binary file modified assets/rag_ui.png
117 changes: 100 additions & 17 deletions rag-langchain/README.md
@@ -1,34 +1,117 @@
# RAG (Retrieval Augmented Generation) Chat Application

This demo provides a simple recipe to help developers start building out their own custom RAG (Retrieval Augmented Generation) applications. It consists of three main components: the Model Service, the Vector Database, and the AI Application.

There are a few options today for local model serving, but this recipe will use [`llama-cpp-python`](https://github.com/abetlen/llama-cpp-python) and its OpenAI-compatible Model Service. A Containerfile for building this Model Service is provided in the repo at [`playground/Containerfile`](/playground/Containerfile).

In order for the LLM to interact with our documents, we need them stored and available in a form that lets us retrieve the small subset of them relevant to our query. To do this we employ a Vector Database alongside an embedding model. The embedding model converts our documents into numerical representations (vectors) so that similarity searches can be performed efficiently; the Vector Database stores these vectors and makes them available to the LLM. In this recipe we will use [ChromaDB](https://docs.trychroma.com/) as our Vector Database.
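
The flow is small enough to see end to end. The snippet below is a minimal sketch of the store-and-retrieve step, assuming a ChromaDB server on `localhost:8000` and the `BAAI/bge-base-en-v1.5` embedding model used later in this recipe; the documents, ids, and query are purely illustrative.

```python
import chromadb
from chromadb.utils import embedding_functions

# Connect to a ChromaDB server assumed to be listening on localhost:8000
client = chromadb.HttpClient(host="localhost", port=8000)

# The embedding model converts text to vectors; ChromaDB invokes it on add and query
embedder = embedding_functions.SentenceTransformerEmbeddingFunction(
    model_name="BAAI/bge-base-en-v1.5")

collection = client.get_or_create_collection("demo_collection",
                                             embedding_function=embedder)

# Store a couple of illustrative documents as vectors
collection.add(
    ids=["doc-1", "doc-2"],
    documents=["Podman runs OCI containers without a daemon.",
               "ChromaDB stores embeddings for similarity search."],
)

# Retrieve the chunk most relevant to a query
results = collection.query(query_texts=["How are documents retrieved?"], n_results=1)
print(results["documents"])
```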

Our AI Application will connect to our Model Service via its OpenAI-compatible API. In this example we rely on [Langchain's](https://python.langchain.com/docs/get_started/introduction) python package to simplify communication with our Model Service, and we use [Streamlit](https://streamlit.io/) for our UI layer. An example of the RAG application UI is shown below.

![](/assets/rag_ui.png)
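
Under the hood, the connection from Langchain to the Model Service looks roughly like the sketch below. This is not the application code itself, just a minimal example that assumes the `langchain-openai` package and the Model Service from this recipe listening on port 8001.

```python
from langchain_openai import ChatOpenAI

# Point Langchain's OpenAI-compatible client at the local Model Service;
# llama-cpp-python does not validate the API key, so any placeholder works.
llm = ChatOpenAI(base_url="http://localhost:8001/v1",
                 api_key="sk-no-key-required",
                 temperature=0.2)

print(llm.invoke("In one sentence, what is retrieval augmented generation?").content)
```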

# Build the Application

To build this application we will need two models (an LLM and an embedding model), a Vector Database, a Model Service, and an AI Application.

* [Download models](#download-models)
* [Deploy the Vector Database](#deploy-the-vector-database)
* [Build the Model Service](#build-the-model-service)
* [Deploy the Model Service](#deploy-the-model-service)
* [Build the AI Application](#build-the-ai-application)
* [Deploy the AI Application](#deploy-the-ai-application)
* [Interact with the AI Application](#interact-with-the-ai-application)

### Download models

If you are just getting started, we recommend using [Mistral-7B-Instruct](https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.1). It is a well-performing mid-sized model with an Apache-2.0 license. To use it with our Model Service, the model needs to be converted and quantized into the [GGUF format](https://github.com/ggerganov/ggml/blob/master/docs/gguf.md). There are a number of ways to get a GGUF version of Mistral-7B, but the simplest is to download a pre-converted one from [huggingface.co](https://huggingface.co): https://huggingface.co/TheBloke/Mistral-7B-Instruct-v0.1-GGUF. There are several quantization levels to choose from, but we recommend `Q4_K_M`.

The recommended model can be downloaded using the code snippet below:

```bash
cd models
wget https://huggingface.co/TheBloke/Mistral-7B-Instruct-v0.1-GGUF/resolve/main/mistral-7b-instruct-v0.1.Q4_K_M.gguf
cd ../
```
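
If you prefer to stay in Python, roughly the same download can be done with `huggingface_hub`; the repo and filename below simply mirror the `wget` command above.

```python
from huggingface_hub import hf_hub_download

# Download the quantized GGUF file into the local models/ directory
hf_hub_download(repo_id="TheBloke/Mistral-7B-Instruct-v0.1-GGUF",
                filename="mistral-7b-instruct-v0.1.Q4_K_M.gguf",
                local_dir="models")
```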

_A full list of supported open models is forthcoming._

In addition to the LLM, RAG applications also require an embedding model to convert documents between natural language and vector representations. For this demo we will use [`BAAI/bge-base-en-v1.5`](https://huggingface.co/BAAI/bge-base-en-v1.5); it is a fairly standard model for this use case and has an MIT license.

The code snippet below can be used to pull a copy of the `BAAI/bge-base-en-v1.5` embedding model and store it in your `models/` directory.

```python
from huggingface_hub import snapshot_download
snapshot_download(repo_id="BAAI/bge-base-en-v1.5",
                  cache_dir="models/",
                  local_files_only=False)
```

### Deploy the Vector Database

To deploy the Vector Database service locally, simply use the existing ChromaDB image.

```bash
podman pull chromadb/chroma
```
```bash
podman run --rm -it -p 8000:8000 chromadb/chroma
```

This Vector Database is ephemeral and will need to be re-populated each time the container restarts. When implementing RAG in production, you will want a long-running and backed-up Vector Database.
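
Before moving on, you can confirm the Vector Database is reachable with a quick check like the one below (assuming the default port mapping used above).

```python
import chromadb

# heartbeat() returns a timestamp if the server is reachable, otherwise it raises
client = chromadb.HttpClient(host="localhost", port=8000)
print(client.heartbeat())
```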


### Build the Model Service

The complete instructions for building and deploying the Model Service can be found in the [playground model-service document](../playground/README.md).

The Model Service can be built from the root directory with the following code snippet:

```bash
podman build -t llamacppserver playground/
```


### Deploy the Model Service

The complete instructions for building and deploying the Model Service can be found in the [playground model-service document](../playground/README.md).

The local Model Service relies on a volume mount from the host to access the model files. You can start your local Model Service using the following podman command:
```bash
podman run --rm -it \
-p 8001:8001 \
-v Local/path/to/locallm/models:/locallm/models \
-e MODEL_PATH=models/<model-filename> \
-e HOST=0.0.0.0 \
-e PORT=8001 \
llamacppserver
```
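
Once the container is up, you can sanity-check the OpenAI-compatible endpoint with the `openai` Python client. The model name passed below is illustrative; llama-cpp-python serves whichever model file you mounted.

```python
from openai import OpenAI

# llama-cpp-python does not check the API key, but the client requires a value
client = OpenAI(base_url="http://localhost:8001/v1", api_key="sk-no-key-required")

# List the model(s) the service has loaded
print([m.id for m in client.models.list().data])

# Request a short completion to confirm end-to-end inference works
response = client.chat.completions.create(
    model="mistral-7b-instruct-v0.1.Q4_K_M.gguf",  # illustrative; match your mounted model
    messages=[{"role": "user", "content": "Say hello in five words."}],
)
print(response.choices[0].message.content)
```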

### Build the AI Application

Now that the Model Service is running, we want to build and deploy our AI Application. Use the provided Containerfile to build the AI Application image in the `rag-langchain/` directory.

```bash
podman build -t rag rag-langchain/ -f rag-langchain/builds/Containerfile
```

### Deploy the AI Application

Make sure the Model Service and the Vector Database are up and running before starting this container image. When starting the AI Application container image, we need to direct it to the correct `MODEL_SERVICE_ENDPOINT`. This could be any appropriately hosted Model Service (running locally or in the cloud) that exposes an OpenAI-compatible API. In our case the Model Service is running inside the podman machine, so we need to provide it with the appropriate address, `10.88.0.1`. The same goes for the Vector Database: make sure `VECTORDB_HOST` is set to `10.88.0.1` for communication within the podman virtual machine.

You will also need to volume mount the `models/` directory so the application can access the embedding model, and the `data/` directory from which it pulls documents to populate the Vector Database.

The following podman command can be used to run your AI Application:

```bash
podman run --rm -it -p 8501:8501 \
  -e MODEL_SERVICE_ENDPOINT=http://10.88.0.1:8001/v1 \
  -e VECTORDB_HOST=10.88.0.1 \
  -v Local/path/to/locallm/models/:/rag/models \
  -v Local/path/to/locallm/data:/rag/data \
  rag
```

### Interact with the AI Application

Everything should now be up and running, with the RAG application available at [`http://localhost:8501`](http://localhost:8501). By using this recipe and getting this starting point established, users should have an easier time customizing and building their own LLM-enabled RAG applications.
37 changes: 17 additions & 20 deletions rag-langchain/rag_app.py
@@ -18,24 +18,21 @@
import argparse
import pathlib

model_service = os.getenv("MODEL_SERVICE_ENDPOINT","http://0.0.0.0:8001/v1")
chunk_size = os.getenv("CHUNK_SIZE", 150)
embedding_model = os.getenv("EMBEDDING_MODEL","BAAI/bge-base-en-v1.5")
vdb_host = os.getenv("VECTORDB_HOST", "0.0.0.0")
vdb_port = os.getenv("VECTORDB_PORT", "8000")
vdb_name = os.getenv("VECTORDB_NAME", "test_collection")


vectorDB_client = HttpClient(host=vdb_host,
                             port=vdb_port,
                             settings=Settings(allow_reset=True,))

def clear_vdb():
    global vectorDB_client
    vectorDB_client.delete_collection(vdb_name)
    print("clearing DB")

def is_text_file(file_path):
@@ -59,16 +56,16 @@ def get_files():
### populate the DB ####
os.environ["TOKENIZERS_PARALLELISM"] = "false"

embedding_func = embedding_functions.SentenceTransformerEmbeddingFunction(model_name=embedding_model)
e = SentenceTransformerEmbeddings(model_name=embedding_model)

collection = vectorDB_client.get_or_create_collection(vdb_name,
                                                      embedding_function=embedding_func)
if collection.count() < 1 and data != None:
    print("populating db")
    raw_documents = TextLoader(f'{data}').load()
    text_splitter = CharacterTextSplitter(separator = ".",
                                          chunk_size=int(chunk_size),
                                          chunk_overlap=0)
    docs = text_splitter.split_documents(raw_documents)
    for doc in docs:
@@ -91,7 +88,7 @@ def get_files():
st.chat_message(msg["role"]).write(msg["content"])

db = Chroma(client=vectorDB_client,
            collection_name=vdb_name,
            embedding_function=e
            )
retriever = db.as_retriever(threshold=0.75)