
Commit

v0.1.0 Candidate
See #6 for details
matteocargnelutti committed Mar 20, 2024
1 parent 2232f48 commit c6537d3
Showing 31 changed files with 1,930 additions and 1,152 deletions.
68 changes: 52 additions & 16 deletions .env.example
@@ -1,36 +1,61 @@
#-------------------------------------------------------------------------------
# LLM APIs settings
#-------------------------------------------------------------------------------
# NOTE:
# - WARC-GPT can use both OpenAI and Ollama at the same time, but needs at least one of the two.
# - Ollama is one of the simplest ways to get started running models locally: https://ollama.ai/
OLLAMA_API_URL="http://localhost:11434"

#OPENAI_API_KEY=""
#OPENAI_ORG_ID=""

# NOTE: OPENAI_BASE_URL can be used to interact with OpenAI-compatible providers.
# For example:
# - https://huggingface.co/blog/tgi-messages-api
# - https://docs.vllm.ai/en/latest/getting_started/quickstart.html#using-openai-completions-api-with-vllm
# Make sure to specify both OPENAI_BASE_URL and OPENAI_COMPATIBLE_MODEL when doing so.
#OPENAI_BASE_URL=""
#OPENAI_COMPATIBLE_MODEL=""

#-------------------------------------------------------------------------------
# Text Completion Prompts
#-------------------------------------------------------------------------------
# NOTE: {history}, {rag} and {request} are reserved keywords.
# NOTE: An earlier version of WARC-GPT used ALL CAPS CONTEXT and QUESTION keywords.
TEXT_COMPLETION_BASE_PROMPT = "
{history}
You are a helpful assistant.
{rag}
Request: {request}
Helpful response (plain text, no markdown):
"

# NOTE: Injected into the BASE prompt when relevant.
# Inspired by LangChain's default RAG prompt. {context} is a reserved keyword.
TEXT_COMPLETION_RAG_PROMPT = "
Here is context to help you fulfill the user's request:
{context}
----------------
Context comes from web pages that were captured as part of a web archives collection.
When possible, use context to answer the question asked by the user.
If you don't know the answer, just say that you don't know, don't try to make up an answer.
Ignore context if it is empty or irrelevant.
Cite and quote your sources whenever possible. Use their number (for example: [1]) and / or URL to reference them.
"

# NOTE: Injected into the BASE prompt when relevant. {history} is a reserved keyword.
TEXT_COMPLETION_HISTORY_PROMPT = "
Here is a summary of the conversation thus far:
{history}
----------------
"

#-------------------------------------------------------------------------------
# Paths
@@ -54,6 +79,17 @@ VECTOR_SEARCH_QUERY_PREFIX="query: " # Can be used to add a prefix to text embeddings.
VECTOR_SEARCH_TEXT_SPLITTER_CHUNK_OVERLAP=25 # Determines, for a given chunk of text, how many tokens must overlap with adjacent chunks.
VECTOR_SEARCH_SEARCH_N_RESULTS=4 # How many entries should the vector search return?
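
As an aside, a toy sketch of what the overlap and query-prefix settings mean in practice. The chunk size of 512 tokens is an arbitrary assumption for illustration, and this is not WARC-GPT's actual splitter:

```python
# Toy illustration of chunk overlap and query prefixing.
CHUNK_SIZE = 512          # arbitrary assumption for this sketch
CHUNK_OVERLAP = 25        # VECTOR_SEARCH_TEXT_SPLITTER_CHUNK_OVERLAP
QUERY_PREFIX = "query: "  # VECTOR_SEARCH_QUERY_PREFIX

def split_with_overlap(tokens: list[str]) -> list[list[str]]:
    """Each chunk shares CHUNK_OVERLAP tokens with the chunk before it."""
    step = CHUNK_SIZE - CHUNK_OVERLAP
    return [tokens[i:i + CHUNK_SIZE] for i in range(0, len(tokens), step)]

def prepare_query(text: str) -> str:
    """Embedding models such as intfloat/e5-large-v2 expect queries to carry a prefix."""
    return QUERY_PREFIX + text
```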

#-------------------------------------------------------------------------------
# Basic Rate Limiting
#-------------------------------------------------------------------------------
# NOTE:
# - This set of variables allows for applying rate-limiting to individual API routes.
# - See https://flask-limiter.readthedocs.io/en/stable/ for details and syntax.
RATE_LIMIT_STORAGE_URI="memory://"
API_MODELS_RATE_LIMIT="1/second"
API_SEARCH_RATE_LIMIT="120 per 1 hour"
API_COMPLETE_RATE_LIMIT="60 per 1 hour"
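
For reference, a minimal sketch of how such variables can be wired into Flask-Limiter (assumed wiring in the flask-limiter 3.x style, shown for a single route; the route body is a placeholder, not WARC-GPT's actual code):

```python
# Minimal Flask-Limiter wiring sketch (flask-limiter 3.x style).
import os

from flask import Flask, jsonify
from flask_limiter import Limiter
from flask_limiter.util import get_remote_address

app = Flask(__name__)

limiter = Limiter(
    get_remote_address,  # rate-limit per client IP
    app=app,
    storage_uri=os.environ.get("RATE_LIMIT_STORAGE_URI", "memory://"),
)

@app.route("/api/models")
@limiter.limit(os.environ.get("API_MODELS_RATE_LIMIT", "1/second"))
def get_models():
    return jsonify([])  # placeholder body
```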

#-------------------------------------------------------------------------------
# Hugging Face's tokenizer settings
#-------------------------------------------------------------------------------
80 changes: 40 additions & 40 deletions README.md
@@ -5,7 +5,7 @@
More info:
- <a href="https://lil.law.harvard.edu/blog/2024/02/12/warc-gpt-an-open-source-tool-for-exploring-web-archives-with-ai/">"WARC-GPT: An Open-Source Tool for Exploring Web Archives Using AI"</a>. Feb 12 2024 - _lil.law.harvard.edu_

https://github.com/harvard-lil/warc-gpt/assets/625889/8ea3da4a-62a1-4ffa-a510-ef3e35699237


---
@@ -54,18 +54,21 @@ poetry install

## Configuring the application

This program uses environment variables to handle settings.
Copy `.env.example` into a new `.env` file and edit it as needed.

```bash
cp .env.example .env
```

See details for individual settings in [.env.example](.env.example).

**A few notes:**
- WARC-GPT can interact with both the [OpenAI API](https://platform.openai.com/docs/introduction) and [Ollama](https://ollama.ai) for local inference.
- Both can be used at the same time, but at least one is needed.
- By default, the program will try to communicate with Ollama's API at `http://localhost:11434`.
- It is also possible to use OpenAI's client to interact with compatible providers, such as [Hugging Face's Messages API](https://huggingface.co/blog/tgi-messages-api) or [vLLM](https://docs.vllm.ai/en/latest/getting_started/quickstart.html#using-openai-completions-api-with-vllm). To do so, set values for both the `OPENAI_BASE_URL` and `OPENAI_COMPATIBLE_MODEL` environment variables, as shown in the sketch after this list.
- Prompts can be edited directly in the configuration file.
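
As a rough illustration of that compatible-provider setup, the following sketch assumes the `openai` Python package v1+ and that both environment variables are set; it is an example, not WARC-GPT's internal code:

```python
# Sketch: pointing the OpenAI client at an OpenAI-compatible endpoint,
# mirroring the OPENAI_BASE_URL / OPENAI_COMPATIBLE_MODEL settings.
import os

from openai import OpenAI

client = OpenAI(
    base_url=os.environ["OPENAI_BASE_URL"],  # e.g. a vLLM or TGI endpoint
    api_key=os.environ.get("OPENAI_API_KEY", "not-needed"),  # some providers ignore the key
)

completion = client.chat.completions.create(
    model=os.environ["OPENAI_COMPATIBLE_MODEL"],
    messages=[{"role": "user", "content": "Hello!"}],
)
print(completion.choices[0].message.content)
```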

[☝️ Summary](#summary)

@@ -105,7 +108,7 @@ poetry run flask run

Once the server is started, the application's web UI should be available at `http://localhost:5000`.

Unless RAG search is disabled in settings, the system will try to find relevant excerpts in its knowledge base - populated ahead of time using WARC files and the `ingest` command - to answer the questions it is asked.

The interface also automatically handles a basic chat history, allowing for few-shot / chain-of-thought prompting.

@@ -118,49 +121,46 @@
### [GET] /api/models
Returns a list of available models as JSON.
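
For example (hypothetical client-side usage, assuming a local instance on port 5000 as described above):

```python
# List available models.
import requests

models = requests.get("http://localhost:5000/api/models").json()
print(models)
```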

### [POST] /api/search
Performs a search against the vector store for a given `message`.

<details>
<summary><strong>Accepts a JSON body with the following properties:</strong></summary>

- `message`: User prompt (required)

</details>

<details>
<summary><strong>Returns a JSON array of objects containing the following properties:</strong></summary>

- `[].warc_filename`: Filename of the WARC file the excerpt comes from.
- `[].warc_record_content_type`: Can start with either `text/html` or `application/pdf`.
- `[].warc_record_id`: Individual identifier of the WARC record within the WARC file.
- `[].warc_record_date`: Date at which the WARC record was created.
- `[].warc_record_target_uri`: Target URI of the WARC record.
- `[].warc_record_text`: Text excerpt.

</details>
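
A hypothetical usage sketch, again assuming a local instance on port 5000:

```python
# Search the vector store for excerpts relevant to a message.
import requests

results = requests.post(
    "http://localhost:5000/api/search",
    json={"message": "What does this collection say about web archiving?"},
).json()

for excerpt in results:
    print(excerpt["warc_record_target_uri"], "-", excerpt["warc_record_text"][:80])
```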

### [POST] /api/complete
Uses an LLM to generate a text completion.

<details>
<summary><strong>Accepts a JSON body with the following properties:</strong></summary>

- `model`: One of the models `/api/models` lists (required)
- `message`: User prompt (required)
- `temperature`: Defaults to 0.0
- `max_tokens`: If provided, caps number of tokens that will be generated in response.
- `search_results`: Array, output of `/api/search`.
- `history`: A list of chat completion objects representing the chat history. Each object must contain `user` and `content`.

</details>


Returns a raw text stream as output.
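
A hypothetical usage sketch, reusing the `results` array from the `/api/search` example above. The model name is a placeholder; pick one returned by `/api/models`:

```python
# Stream a completion grounded in previously retrieved search results.
import requests

response = requests.post(
    "http://localhost:5000/api/complete",
    json={
        "model": "ollama/mistral",  # placeholder; use a value from /api/models
        "message": "What does this collection say about web archiving?",
        "temperature": 0.0,
        "search_results": results,  # output of /api/search (see above)
    },
    stream=True,
)

for chunk in response.iter_content(chunk_size=None, decode_unicode=True):
    print(chunk, end="", flush=True)
```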

[☝️ Summary](#summary)

---