feat: support post-hoc evals through node package via evaluateMetrics #669

Open · typpo wants to merge 2 commits into main

Conversation

@typpo (Collaborator) commented Apr 13, 2024

Run an eval on existing outputs

You can use promptfoo to evaluate metrics on outputs you've already produced, without running the full evaluation pipeline. This is useful if you want to analyze the quality of outputs from an existing system or dataset.

How to use

Use the evaluateMetrics function provided by promptfoo to run evaluations on pre-existing outputs. Here's how you can integrate this into your project:

import promptfoo from 'promptfoo';

const modelOutputs = [
  'This is the first output.',
  'This is the second output, which contains a specific substring.',
  // ...
];

const metrics = [
  {
    metric: 'Apologetic',
    type: 'llm-rubric',
    value: 'does not apologize',
  },
  {
    metric: 'Contains Expected Substring',
    type: 'contains',
    value: 'specific substring',
  },
  {
    metric: 'Is Biased',
    type: 'classifier',
    provider: 'huggingface:text-classification:d4data/bias-detection-model',
    value: 'Biased',
  },
];

const options = {
  maxConcurrency: 2,
};

(async () => {
  const evaluation = await promptfoo.evaluateMetrics(modelOutputs, metrics, options);

  evaluation.results.forEach((result) => {
    console.log('---------------------');
    console.log(`Eval for output: "${result.vars.output}"`);
    console.log('Metrics:');
    console.log(`  Overall: ${result.gradingResult.score}`);
    console.log(`  Components:`);
    for (const [key, value] of Object.entries(result.namedScores)) {
      console.log(`    ${key}: ${value}`);
    }
  });

  console.log('---------------------');
  console.log('Done.');
})();

Parameters

  • modelOutputs: An array of strings, where each string is a response output from the language model that you want to evaluate.

  • metrics: An array of objects specifying the metrics to apply. Each metric object (a partial Assertion) can have different properties depending on its type. Common properties include:

    • metric: The name of the metric.
    • type: The type of the metric (e.g., contains, llm-rubric, classifier).
    • value: The expected value or condition for the metric.
  • options: An EvaluateOptions configuration object for the evaluation process (a rough sketch of these parameter shapes follows this list), such as:

    • maxConcurrency: The maximum number of concurrent evaluations to perform. This helps in managing resource usage when evaluating large datasets.
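
To make the shapes above concrete, here is a rough TypeScript sketch. The names MetricInput and MetricEvalOptions are illustrative only; promptfoo's own types are Assertion and EvaluateOptions, and the fields shown are assumptions drawn from this example rather than the full API:

// Illustrative shapes only, inferred from the example above.
interface MetricInput {
  metric: string;    // display name used in namedScores
  type: string;      // assertion type, e.g. 'contains', 'llm-rubric', 'classifier'
  value?: string;    // expected value or grading condition
  provider?: string; // grading provider, needed for provider-backed types like 'classifier'
}

interface MetricEvalOptions {
  maxConcurrency?: number; // cap on parallel metric evaluations
}

declare function evaluateMetrics(
  modelOutputs: string[],
  metrics: MetricInput[],
  options?: MetricEvalOptions,
): Promise<unknown>; // resolves to an EvaluateSummary (see "Output format" below)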

Output format

The output of the evaluateMetrics function is an EvaluateSummary object that contains detailed results of the evaluation. Each entry in results includes:

  • vars: Variables used in the metric evaluation. In this case, the only variable is output.
  • gradingResult: An object describing the overall grading outcome, including:
    • score: A numeric score representing the overall evaluation result.
    • componentResults: Detailed results for each component of the metric evaluated.
  • namedScores: A breakdown of scores by individual metrics.

Example output

Here's an example of what the output might look like when printed:

{
  "results": [
    {
      "vars": {
        "output": "This is the first output."
      },
      "gradingResult": {
        "score": 0.66,
        "pass": false,
        "reason": "Output failed 1 out of 3 metrics"
      },
      "namedScores": {
        "Apologetic": 1,
        "Contains Expected Substring": 0,
        "Is Biased": 1
      }
    },
    {
      "vars": {
        "output": "This is the second output, which contains a specific substring."
      },
      "gradingResult": {
        "score": 1,
        "pass": true,
        "reason": "Output passed all metrics"
      },
      "namedScores": {
        "Apologetic": 1,
        "Contains Expected Substring": 1,
        "Is Biased": 1
      }
    }
    // ... more result objects for each output
  ],
  "stats": {
    "successes": 1,
    "failures": 1,
    "totalOutputs": 2
  }
}

This code loads a set of model outputs, defines several metrics to evaluate them against, and then calls evaluateMetrics with these outputs and metrics. It then logs the evaluation results for each output, including the overall score and the scores for each individual metric.
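
As a rough follow-on, the summary can be post-processed with ordinary array operations. This sketch assumes only the fields shown in the example output above (results, gradingResult.pass, namedScores) and the evaluation object from the earlier snippet:

// Continues from the example above: `evaluation` is the awaited result of
// promptfoo.evaluateMetrics(modelOutputs, metrics, options).
const failures = evaluation.results.filter((result) => !result.gradingResult.pass);
console.log(`${failures.length} of ${evaluation.results.length} outputs failed`);

// Average score per metric across all outputs.
const totals = {};
for (const result of evaluation.results) {
  for (const [metric, score] of Object.entries(result.namedScores)) {
    totals[metric] = (totals[metric] || 0) + score;
  }
}
for (const [metric, sum] of Object.entries(totals)) {
  console.log(`${metric}: ${(sum / evaluation.results.length).toFixed(2)} average`);
}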

@anthonyivn2 (Contributor) commented:

How would this work with the output-based/RAG-based metrics such as answer-relevance, context-relevance, context-faithfulness?

@typpo (Collaborator, Author) commented Apr 14, 2024

@anthonyivn2 Latest update adds support for a more complex data format such as:

const modelOutputs = [
  {
    vars: {
      question: 'What is the capital of France?',
      context: 'Paris, the capital of France, has a population of 2.148 million.',
    },
    output: 'foo',
  },
  {
    vars: {
      question: 'Which person is known for the theory of relativity?',
      context:
        'Albert Einstein, a German-born theoretical physicist, developed the theory of relativity.',
    },
    output: 'bar',
  },
];
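
For the RAG-style assertions raised above (answer-relevance, context-relevance, context-faithfulness), this richer format supplies the vars those graders read from. A minimal sketch, assuming the proposed evaluateMetrics accepts the object format and passes these assertion types through to promptfoo's standard model-graded checks (the exact var names the graders expect, e.g. question vs. query, is an assumption to verify against the promptfoo docs):

const ragMetrics = [
  { metric: 'Answer Relevance', type: 'answer-relevance', threshold: 0.8 },
  { metric: 'Context Faithfulness', type: 'context-faithfulness', threshold: 0.8 },
];

const evaluation = await promptfoo.evaluateMetrics(modelOutputs, ragMetrics, {
  maxConcurrency: 2,
});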

@anthonyivn2 (Contributor) commented on the diff:

  evaluateOptions: EvaluateOptions = {},
) {
  const testSuite: EvaluateTestSuite = {
    prompts: ['{{output}}'],

would this still make evaluate() run the callApi function?

@anthonyivn2 (Contributor) commented Apr 14, 2024

I created #671 as another potential way to go about making this change, and added an explanation on my end of why this functionality should be available.
