
LLM Evaluation & Testing quick start with Promptfoo

Better, Faster, Cheaper Prompts with LLM Testing & Evaluation


Update on format

  • In the EXPERIMENT branch, we're modernizing things a bit:
  • Run commands explicitly from package.json
  • Use the promptfoo library and additional node modules inside code (for Groq, dotenv, etc.; see the sketch below)
  • Use .env and dotenv
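A minimal sketch of that library-style usage, assuming promptfoo's node API (promptfoo.evaluate) and ESM; the prompt, provider, and test values are illustrative, not the repo's actual code:

// eval.mjs - illustrative sketch
import "dotenv/config";            // load OPENAI_API_KEY from .env
import promptfoo from "promptfoo";

const results = await promptfoo.evaluate(
  {
    prompts: ["Answer concisely: {{question}}"],          // illustrative prompt
    providers: ["openai:gpt-3.5-turbo"],
    tests: [
      {
        vars: { question: "What is prompt evaluation?" }, // illustrative variables
        assert: [{ type: "icontains", value: "prompt" }], // illustrative assertion
      },
    ],
  },
  { maxConcurrency: 2 },
);

console.log(JSON.stringify(results.stats, null, 2));      // pass/fail + token counts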

Watch the video tutorials

  • Last LLM Standing Wins
  • Local On Device LLMs Are The Future
  • Gemini Pro vs GPT-3.5 Turbo: Fast, Cheap, Accurate

Setup

API Keys

  • To get started with OpenAI, set your OPENAI_API_KEY environment variable.
  • To get started with Gemini, set your VERTEX_API_KEY environment variable.
    • export VERTEX_API_KEY=<your key>
    • export VERTEX_PROJECT_ID=<your google cloud project id>
    • Gemini Setup Docs
  • To set up Anthropic, set your ANTHROPIC_API_KEY environment variable.
  • To set up Groq, set your GROQ_API_KEY environment variable (a combined .env sketch follows).
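Putting the keys together, a minimal .env sketch (key names follow each provider's standard convention; values are placeholders):

# .env - placeholders only, never commit real keys
OPENAI_API_KEY=sk-...
VERTEX_API_KEY=...
VERTEX_PROJECT_ID=my-gcp-project
ANTHROPIC_API_KEY=...
GROQ_API_KEY=...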

Global Install

  • Install promptfoo
  • cd into the directory you want to test
  • Run promptfoo eval to evaluate
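Concretely (the directory name is illustrative):

npm install -g promptfoo
cd ./nlq_to_sql
promptfoo eval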

Local Install

  • Install promptfoo
  • Update the package.json file to include the following scripts
    • "scripts": {
        "eval": "promptfoo eval -c `./path/to/your/promptfooconfig.yaml`",
        "view": "promptfoo view"
      }
    • See the package.json for examples

Opinionated Setup

  • I recommend using a separate prompt.txt, test.yaml, and promptfooconfig.yaml with a dedicated directory and package.json script for each prompt you want to test.
  • This way you can create multiple test + prompt combinations.
  • For example, this package.json scripts section showcases running different tests:
"scripts": {
    "nlq_to_sql_ten": "source .env && promptfoo eval -c ./nlq_to_sql/promptfooconfig.yaml -t ./nlq_to_sql/test_ten.yaml -p ./nlq_to_sql/prompt.txt --no-cache --output ./nlq_to_sql/output_ten.json",
    "nlq_to_sql_twenty": "source .env && promptfoo eval -c ./nlq_to_sql/promptfooconfig.yaml -t ./nlq_to_sql/test_twenty.yaml -p ./nlq_to_sql/prompt.txt --no-cache --output ./nlq_to_sql/output_twenty.json",
    "view": "promptfoo view"
  },
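For reference, a sketch of what those files might contain (provider IDs, variable names, and assertion values are illustrative, not the repo's actual contents):

# ./nlq_to_sql/promptfooconfig.yaml (sketch)
providers:
  - openai:gpt-3.5-turbo
  - vertex:gemini-pro

# ./nlq_to_sql/test_ten.yaml (sketch) - a YAML list of test cases
- vars:
    query: "Show all users who signed up in the last 7 days"
  assert:
    - type: icontains
      value: "select"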

Promptfoo delay if you run into API rate limit issues (Anthropic + Groq)

  • You can set a delay between prompt tests by using the PROMPTFOO_DELAY_MS env variable.
  • Delay Docs
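For example (the 3000 ms value is illustrative):

# wait 3 seconds between API calls to stay under rate limits
PROMPTFOO_DELAY_MS=3000 promptfoo eval --no-cache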

Install llamafile to test local models

You can reuse the ./custom_models/customModelBase.js to test llama models locally. Or you can create a new .js file for your model. See promptfoo custom model docs.

  • Read the llamafile instructions and download the llama files
  • Place the model into the custom_models/ directory
  • Make sure the name of your model file matches the name referenced in your custom model's .js file
  • Create a .js file in custom_models/ for your model and inherit the CustomModelBase class. Use custom_models/mistral-7b-v0.1-Q4.js as a template.
  • Add the path to custom_models/<your model name>.js to the providers section of your ./*/promptfooconfig.yaml (see the sketch below)
  • Run promptfoo eval to evaluate your model
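The providers entry might look like this (the mistral filename comes from the repo; the file:// prefix and relative path follow promptfoo's custom-provider convention and may need adjusting for your layout):

# ./nlq_to_sql/promptfooconfig.yaml (sketch)
providers:
  - openai:gpt-3.5-turbo
  - file://../custom_models/mistral-7b-v0.1-Q4.js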

Quickly test your prompts on different custom models

  • Use the run_local_llm.sh script (sh run_local_llm.sh) to quickly test prompts on different custom models. Update the prompt variable to whatever prompt you want to test.

If you want to run OpenAI exclusively, comment out the other models in the ./*/promptfooconfig.yaml providers section, like so:
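A sketch of a providers section with everything but OpenAI commented out (entries are illustrative):

providers:
  - openai:gpt-3.5-turbo
  # - vertex:gemini-pro
  # - file://../custom_models/mistral-7b-v0.1-Q4.js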

Commands

promptfoo eval - load and evaluate in the current directory

promptfoo eval --no-cache - load and evaluate in the current directory without using the cache

promptfoo view - load the UI in the current directory

Prompt Evaluation Elements

  • Providers
  • Prompts
  • Assertions
  • Variables

Use Cases & Value Prop of LLM Testing & Evaluation

  • 💰 Save Money & Save Time (Resource Optimization)
    • With LLM testing you can determine if you need GPT-4 or if you can save money and time with GPT-3.5
    • You can find the minimum number of tokens you can use without sacrificing quality
    • Compare different LLM providers to determine which is the best fit for your application
  • 👍 Ship with confidence (Validate Accuracy)
    • Gain certainty that your prompt will generate the results you want
    • Confidently generate JSON responses (see the assertion sketch after this list)
    • Compare prompts to determine which is more accurate
  • ✅ Prevent Regressions (Consistency)
    • Ensure that the output of a prompt is within the bounds of your expectations
    • Make sure that when you update your prompt it doesn't break your application
    • With version control and CI/CD you can ensure your prompts are always working as expected
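As noted above, promptfoo can validate JSON output directly; a minimal assertion sketch (the schema fields are illustrative):

assert:
  - type: is-json            # fails the test if the output is not valid JSON
    value:                   # optional JSON schema the output must satisfy
      type: object
      required: ["name", "sql"]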

Organizational Pattern

  • /<name of agent/test 1>
    • /prompt.txt - the prompt(s) to test
    • /test.yaml - variables and assertions
    • /promptfooconfig.yaml - llm config
  • /<name of agent/test N>
    • ...
  • ...

Important Docs & Resources

Gemini Pro vs GPT-3.5 Turbo Highlights

Results generated by GPT-4 and then tweaked - take it with two grains of salt.

Overall:

  • Gemini-Pro ⚪️⚪️⚪️⚪️🟢|⚪️⚪️⚪️⚪️⚪️ GPT-3.5 Turbo
  • Explanation: Gemini-Pro has a small-to-medium edge and is likely a better fit for most applications.

Pricing Comparison:

  • Gemini-Pro ⚪️⚪️⚪️⚪️⚪️|⚪️⚪️⚪️⚪️⚪️ GPT-3.5 Turbo
  • Explanation: Gemini-Pro is the same price as GPT-3.5 Turbo.

Speed:

  • Gemini-Pro ⚪️🟢🟢🟢🟢|⚪️⚪️⚪️⚪️⚪️ GPT-3.5 Turbo
  • Explanation: Gemini-Pro demonstrates superior speed, processing inputs faster than GPT-3.5 Turbo.

Instruction Following:

  • Gemini-Pro ⚪️⚪️🟢🟢🟢|⚪️⚪️⚪️⚪️⚪️ GPT-3.5 Turbo
  • Explanation: Gemini-Pro excels at following instructions accurately, outperforming GPT-3.5 Turbo in this regard.

Content Generation:

  • Gemini-Pro ⚪️⚪️⚪️⚪️⚪️|🟢🟢⚪️⚪️⚪️ GPT-3.5 Turbo
  • Explanation: GPT-3.5 Turbo has a slight advantage in content generation, producing more nuanced and varied outputs.

Language Understanding:

  • Gemini-Pro ⚪️⚪️🟢🟢🟢|⚪️⚪️⚪️⚪️⚪️ GPT-3.5 Turbo
  • Explanation: Gemini-Pro shows superior language understanding, especially in complex comprehension tasks.

Bias:

  • Gemini-Pro ⚪️⚪️⚪️⚪️⚪️|🟢🟢🟢⚪️⚪️ GPT-3.5 Turbo
  • Explanation: Gemini seems to exhibit more Google-specific bias than GPT-3.5 Turbo.

API Design and Developer Experience:

  • Gemini-Pro ⚪️⚪️⚪️⚪️⚪️|🟢🟢⚪️⚪️⚪️ GPT-3.5 Turbo
  • Explanation: OpenAI's API is much easier to use than Vertex AI's API.

AI Alignment and Safety:

  • Gemini-Pro ⚪️⚪️⚪️⚪️⚪️|🟢🟢🟢🟢⚪️ GPT-3.5 Turbo
  • Explanation: Gemini-Pro is extremely restricted in its capabilities.

Multimodal Capabilities:

  • Gemini-Pro 🟢🟢🟢🟢🟢|⚪️⚪️⚪️⚪️⚪️ GPT-3.5 Turbo
  • Explanation: Gemini-Pro supports both text and image inputs, offering a significant advantage over GPT-3.5 Turbo, which is limited to text only.

Great LLM Testing & Evaluation Patterns

  • Trim your prompts to the minimum number of tokens needed to generate the desired output
  • Always compare multiple LLM models to see if you need an expensive, slower model or if a cheaper, faster model will work
  • Add as many test assertions as possible to ensure your prompt is generating the output you expect
  • Your prompts don't always need to pass every assertion, but they should always pass the most important assertions and most test cases
  • Use JSON as the output format for your prompt; this makes it easy to validate the output
  • Isolate your prompts into separate files for readability and maintainability
  • Identify which parts of your prompt are variables and which are static, then separate the variable parts into test variables
  • Use the --no-cache flag to ensure you are always testing the latest version of your prompt
  • Use your users as THE primary source for your test cases. Testing every use case your users will encounter is the best way to ensure your prompt is working as expected. This is especially true for prompts that are used in production.
  • Focus on asserting an acceptable range of results over specific, exact results. LLMs are non-deterministic and will generate different results each time; your tests should account for this (see the sketch below).
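A sketch of range-style assertions (the SQL string and threshold are illustrative):

assert:
  - type: icontains          # case-insensitive substring, not an exact match
    value: "select"
  - type: similar            # embedding similarity above a threshold instead of equality
    value: "SELECT * FROM users WHERE created_at >= DATE('now', '-7 days')"
    threshold: 0.8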
