LLM Evaluation & Testing quick start with Promptfoo
Better, Faster, Cheaper Prompts with LLM Testing & Evaluation
- In the EXPERIMENT branch we're modernizing things a little bit:
- Run commands explicitly from package.json
- Use the promptfoo library and additional node modules inside code (for Groq, dotenv, etc.)
- Use .env and dotenv
- Last LLM Standing Wins Video Walk Through. In this video we build a visual LLM benchmarking tool on top of Promptfoo.
- Check out the brief Video Tutorial where we highlight the key features of Promptfoo and how to get started with this repo.
- Compare Gemini Pro vs GPT-3.5 Turbo with Promptfoo.
- Monitor the performance of Local, On Device LLMs with prompt testing
- To get started with OpenAI, set your OPENAI_API_KEY environment variable.
export OPENAI_API_KEY=<your key>
- OpenAI Setup Docs
- To get started with Gemini, set your VERTEX_API_KEY and VERTEX_PROJECT_ID environment variables.
export VERTEX_API_KEY=<your key>
export VERTEX_PROJECT_ID=<your google cloud project id>
- Gemini Setup Docs
- To set up Anthropic, set your ANTHROPIC_API_KEY environment variable.
export ANTHROPIC_API_KEY=<your key>
- Anthropic Setup Docs
- To set up Groq, set your GROQ_API_KEY environment variable.
export GROQ_API_KEY=<your key>
- Groq Setup Docs
- Install promptfoo globally
npm install -g promptfoo
- Install Docs
- cd into the directory you want to test
- Run `promptfoo eval` to evaluate
- Install promptfoo locally in your project
npm install promptfoo
- Install Docs
- Update the package.json file to include the following scripts:
"scripts": { "eval": "promptfoo eval -c ./path/to/your/promptfooconfig.yaml", "view": "promptfoo view" }
- See the package.json for examples
- I recommend using separate prompt.txt, test.txt, promptfooconfig.yaml with a dedicated directory and package.json script for each prompt you want to test.
- This way you can create multiple test + prompt combinations.
- For example, this package.json scripts section showcases running different tests:
"scripts": {
"nlq_to_sql_ten": "source .env && promptfoo eval -c ./nlq_to_sql/promptfooconfig.yaml -t ./nlq_to_sql/test_ten.yaml -p ./nlq_to_sql/prompt.txt --no-cache --output ./nlq_to_sql/output_ten.json",
"nlq_to_sql_twenty": "source .env && promptfoo eval -c ./nlq_to_sql/promptfooconfig.yaml -t ./nlq_to_sql/test_twenty.yaml -p ./nlq_to_sql/prompt.txt --no-cache --output ./nlq_to_sql/output_twenty.json",
"view": "promptfoo view"
},
- You can set a delay between prompt tests by using the `PROMPTFOO_DELAY_MS` env variable.
- Delay Docs
You can reuse `./custom_models/customModelBase.js` to test llama models locally, or you can create a new .js file for your model. See promptfoo custom model docs.
- Read the instructions here and download the llama files
- https://github.com/Mozilla-Ocho/llamafile?tab=readme-ov-file#quickstart
- I recommend installing `mistral-7b-instruct-v0.1-Q4_K_M-main.llamafile` for the best results for 4GB models
- Place the model into the custom_models/ directory
- Make sure the name of your model file matches the
- Create a .js file in custom_models/ for your model and inherit the CustomModelBase class. Use `custom_models/mistral-7b-v0.1-Q4.js` as a template.
- Add the path to `custom_models/<your model name>.js` in the ./*/promptfooconfig.yaml providers section
- Run `promptfoo eval` to evaluate your model
- Use the `sh run_local_llm.sh` script to quickly test prompts on different custom_models. Update the prompt variable to be whatever prompt you want to test.
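For orientation, a promptfoo custom provider can be as small as a class with an `id()` method and an async `callApi(prompt)` method that returns an object with an `output` (or `error`) field. The sketch below is not this repo's `CustomModelBase`. The class name `LocalLlamafileProvider`, the `baseUrl` and `fetchImpl` options, and the `model: 'local'` value are assumptions made for illustration. It also assumes the llamafile is running in server mode on port 8080 with its OpenAI-compatible chat endpoint, and Node 18+ for the built-in `fetch`.

```javascript
// Hypothetical provider for a locally running llamafile server.
// Not the repo's CustomModelBase; names and options are illustrative.
class LocalLlamafileProvider {
  constructor(options = {}) {
    const config = options.config || {};
    this.providerId = options.id || 'local-llamafile';
    // Assumption: the llamafile server listens on localhost:8080.
    this.baseUrl = config.baseUrl || 'http://localhost:8080';
    // Assumption: an injectable fetch, so the class can be tried without a server.
    this.fetchImpl = config.fetchImpl || globalThis.fetch;
  }

  // promptfoo uses this string to label the provider in results.
  id() {
    return this.providerId;
  }

  // Sends the rendered prompt to the server and returns { output } or { error }.
  async callApi(prompt) {
    try {
      const response = await this.fetchImpl(`${this.baseUrl}/v1/chat/completions`, {
        method: 'POST',
        headers: { 'Content-Type': 'application/json' },
        body: JSON.stringify({
          model: 'local', // placeholder value, chosen for this sketch
          messages: [{ role: 'user', content: prompt }],
          temperature: 0,
        }),
      });
      if (!response.ok) {
        return { error: `HTTP ${response.status}` };
      }
      const data = await response.json();
      return { output: data.choices[0].message.content };
    } catch (err) {
      return { error: String(err) };
    }
  }
}

// Export for promptfoo when loaded as a CommonJS module.
if (typeof module !== 'undefined') {
  module.exports = LocalLlamafileProvider;
}
```

Returning `{ error }` instead of throwing keeps a single failed request from aborting the whole evaluation run; whether that is the behavior you want depends on how strict your tests should be.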
If you want to run OpenAI exclusively, comment out the other models in the ./*/promptfooconfig.yaml providers section.
- `promptfoo eval`: load and evaluate in the current directory
- `promptfoo eval --no-cache`: load and evaluate in the current directory without using the cache
- `promptfoo view`: load the UI in the current directory
- Providers
- Prompts
- Assertions
- Variables
- 💰 Save Money & Save Time (Resource Optimization)
- With LLM testing you can determine if you need GPT-4 or if you can save money and time with GPT-3
- You can find the minimum number of tokens you can use without sacrificing quality
- Compare different LLM providers to determine which is the best fit for your application
- 👍 Ship with confidence (Validate Accuracy)
- Gain certainty that your prompt will generate the results you want
- Confidently generate JSON responses
- Compare prompts to determine which is more accurate
- ✅ Prevent Regressions (Consistency)
- Ensure that the output of a prompt is within the bounds of your expectations
- Make sure that when you update your prompt it doesn't break your application
- With version control and CI/CD you can ensure your prompts are always working as expected
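One way to wire regression checks into CI is a small gate that fails the build when the pass rate drops below a threshold. The sketch below is an illustration only: `checkPassRate` and its parameters are hypothetical names, and it takes plain success and failure counts as input, since how you read those counts out of a promptfoo output file depends on the output format of your promptfoo version.

```javascript
// Hypothetical CI gate: decides whether a test run is good enough to ship.
// successes / failures are plain counts supplied by the caller.
function checkPassRate(successes, failures, minPassRate) {
  const total = successes + failures;
  if (total === 0) {
    // Treat an empty run as a failure so a broken pipeline is not silently green.
    return { ok: false, passRate: 0, message: 'No test results found' };
  }
  const passRate = successes / total;
  const ok = passRate >= minPassRate;
  const message = `${successes}/${total} tests passed (minimum pass rate ${minPassRate})`;
  return { ok, passRate, message };
}

// Example usage in a CI script (counts here are made up):
// const gate = checkPassRate(18, 2, 0.9);
// console.log(gate.message);
// if (!gate.ok) process.exit(1);
```

A threshold below 1.0 reflects that LLM output varies between runs; set it to 1.0 if every test case must pass.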
/<name of agent/test 1>
    /prompt.txt - the prompt(s) to test
    /test.yaml - variables and assertions
    /promptfooconfig.yaml - llm config
/<name of agent/test N>
    ...
...
- Vertex Promptfoo Provider
- Vertex AI Pricing
- Great Breakdown of Gemini pro vs gpt-3.5
- Don't repeat test data
- Ensure output is in JSON format and the keys exist
- Reference prompt and test files using globs and lists
- Assertions ('equals', 'contains', 'is-json', 'levenshtein-distance', 'python', 'regex', 'llm-rubric', and more)
- Example using Scenarios for test assertion variables
- LLM Providers
- Vertex Provider Src
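As an illustration of the "JSON format and the keys exist" idea above, promptfoo's `javascript` assertion type can point at a file that exports a function receiving the model output and returning a boolean, a score, or an object with `pass`, `score`, and `reason`. The sketch below is a minimal example under that assumption; the helper name `assertJsonHasKeys` and the keys `sql` and `explanation` are hypothetical and not taken from this repo's tests.

```javascript
// Hypothetical helper: passes only if `output` is a JSON object containing every required key.
function assertJsonHasKeys(output, requiredKeys) {
  let parsed;
  try {
    // Accept either a raw string or an already-parsed value.
    parsed = typeof output === 'string' ? JSON.parse(output) : output;
  } catch (err) {
    return { pass: false, score: 0, reason: `Output is not valid JSON: ${err.message}` };
  }
  if (parsed === null || typeof parsed !== 'object' || Array.isArray(parsed)) {
    return { pass: false, score: 0, reason: 'Output is not a JSON object' };
  }
  const missing = requiredKeys.filter(
    (key) => !Object.prototype.hasOwnProperty.call(parsed, key)
  );
  // Score is the fraction of required keys that are present.
  const score =
    requiredKeys.length === 0 ? 1 : (requiredKeys.length - missing.length) / requiredKeys.length;
  return {
    pass: missing.length === 0,
    score,
    reason: missing.length === 0 ? 'All required keys present' : `Missing keys: ${missing.join(', ')}`,
  };
}

// Assertion entry point; the key names are placeholders for your own schema.
const checkOutput = (output) => assertJsonHasKeys(output, ['sql', 'explanation']);

if (typeof module !== 'undefined') {
  module.exports = checkOutput;
}
```

For simple cases the built-in `is-json` assertion may be enough; a custom function like this is mainly useful when you want a partial score and a readable failure reason.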
Results generated by GPT-4 and then tweaked - take it with two grains of salt.
- Resources:
- Source Blog: https://klu.ai/blog/gemini-pro-vs-gpt-3-5-turbo
- Vertex Pricing: https://cloud.google.com/vertex-ai/pricing
- Promptfoo: https://www.promptfoo.dev/docs/guides/gemini-vs-gpt
Gemini-Pro ⚪️⚪️⚪️⚪️🟢|⚪️⚪️⚪️⚪️⚪️ GPT-3.5 Turbo
- Explanation: Gemini-Pro has a small to medium sized edge and is likely a better fit for most applications.
Gemini-Pro ⚪️⚪️⚪️⚪️⚪️|⚪️⚪️⚪️⚪️⚪️ GPT-3.5 Turbo
- Explanation: Gemini-Pro is the same price as GPT-3.5 Turbo.
Gemini-Pro ⚪️🟢🟢🟢🟢|⚪️⚪️⚪️⚪️⚪️ GPT-3.5 Turbo
- Explanation: Gemini-Pro demonstrates superior speed, processing inputs faster than GPT-3.5 Turbo.
Gemini-Pro ⚪️⚪️🟢🟢🟢|⚪️⚪️⚪️⚪️⚪️ GPT-3.5 Turbo
- Explanation: Gemini-Pro excels at following instructions accurately, outperforming GPT-3.5 Turbo in this regard.
Gemini-Pro ⚪️⚪️⚪️⚪️⚪️|🟢🟢⚪️⚪️⚪️ GPT-3.5 Turbo
- Explanation: GPT-3.5 Turbo has a slight advantage in content generation, producing more nuanced and varied outputs.
Gemini-Pro ⚪️⚪️🟢🟢🟢|⚪️⚪️⚪️⚪️⚪️ GPT-3.5 Turbo
- Explanation: Gemini-Pro shows superior language understanding, especially in complex comprehension tasks.
Gemini-Pro ⚪️⚪️⚪️⚪️⚪️|🟢🟢🟢⚪️⚪️ GPT-3.5 Turbo
- Explanation: Gemini seems to exhibit more Google-specific bias than GPT-3.5 Turbo.
Gemini-Pro ⚪️⚪️⚪️⚪️⚪️|🟢🟢⚪️⚪️⚪️ GPT-3.5 Turbo
- Explanation: OpenAI's API is much easier to use than Vertex AI's API.
Gemini-Pro ⚪️⚪️⚪️⚪️⚪️|🟢🟢🟢🟢⚪️ GPT-3.5 Turbo
- Explanation: Gemini-Pro is extremely restricted in its capabilities.
Gemini-Pro 🟢🟢🟢🟢🟢|⚪️⚪️⚪️⚪️⚪️ GPT-3.5 Turbo
- Explanation: Gemini-Pro supports both text and image inputs, offering a significant advantage over GPT-3.5 Turbo, which is limited to text only.
- Trim your prompts to the minimum number of tokens needed to generate the desired output
- Always compare multiple LLM models to see if you need an expensive, slower model or if a cheaper, faster model will work
- Add as many test assertions as possible to ensure your prompt is generating the output you expect
- Your prompts don't 'always' need to validate every assertion, but they should always validate the most important assertions and most test cases
- Use JSON as the output format for your prompt; this makes it easy to validate the output
- Isolate your prompts into separate files for readability and maintainability
- Identify which parts of your prompt are variables and which are static then separate them into different test variables
- Use the `--no-cache` flag to ensure you are always testing the latest version of your prompt
- Use your users as THE primary source for your test cases. Testing every use case your users will encounter is the best way to ensure your prompt is working as expected. This is especially true for prompts that are used in production.
- Focus on asserting an acceptable range of results over specific, exact results. LLMs are non-deterministic and will generate different results each time. Your tests should account for this.
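To make the last point concrete, here is a minimal sketch of a range-style check: instead of requiring an exact string match, it accepts any output whose normalized edit-distance similarity to the expected text is above a threshold. The function names and the 0.8 default threshold are assumptions for illustration; promptfoo also ships a built-in `levenshtein-distance` assertion, and this standalone version only shows the idea.

```javascript
// Classic Levenshtein edit distance using a single rolling row.
function levenshtein(a, b) {
  const row = Array.from({ length: b.length + 1 }, (_, j) => j);
  for (let i = 1; i <= a.length; i++) {
    let diagonal = row[0]; // value of D[i-1][j-1]
    row[0] = i;
    for (let j = 1; j <= b.length; j++) {
      const above = row[j]; // value of D[i-1][j]
      const cost = a[i - 1] === b[j - 1] ? 0 : 1;
      row[j] = Math.min(above + 1, row[j - 1] + 1, diagonal + cost);
      diagonal = above;
    }
  }
  return row[b.length];
}

// Similarity in [0, 1]: 1 means identical, 0 means nothing in common.
function similarity(a, b) {
  const longest = Math.max(a.length, b.length);
  return longest === 0 ? 1 : 1 - levenshtein(a, b) / longest;
}

// Hypothetical helper: accept outputs that are "close enough" to the expected text.
function isCloseEnough(output, expected, threshold = 0.8) {
  return similarity(output.trim(), expected.trim()) >= threshold;
}
```

For example, `isCloseEnough('SELECT * FROM users;', 'SELECT * FROM users', 0.9)` accepts the trailing semicolon that an exact `equals` check would reject.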