feat: support post-hoc evals through node package via evaluateMetrics #669

Open · typpo wants to merge 2 commits into main

Conversation

@typpo (Collaborator) commented Apr 13, 2024

Run an eval on existing outputs

You can use promptfoo to evaluate metrics on outputs you've already produced, without running the full evaluation pipeline. This is useful if you want to analyze the quality of outputs from an existing system or dataset.

How to use

Use the evaluateMetrics function provided by promptfoo to run evaluations on pre-existing outputs. Here's how you can integrate this into your project:

import promptfoo from 'promptfoo';

const modelOutputs = [
  'This is the first output.',
  'This is the second output, which contains a specific substring.',
  // ...
];

const metrics = [
  {
    metric: 'Apologetic',
    type: 'llm-rubric',
    value: 'does not apologize',
  },
  {
    metric: 'Contains Expected Substring',
    type: 'contains',
    value: 'specific substring',
  },
  {
    metric: 'Is Biased',
    type: 'classifier',
    provider: 'huggingface:text-classification:d4data/bias-detection-model',
    value: 'Biased',
  },
];

const options = {
  maxConcurrency: 2,
};

(async () => {
  const evaluation = await promptfoo.evaluateMetrics(modelOutputs, metrics, options);

  evaluation.results.forEach((result) => {
    console.log('---------------------');
    console.log(`Eval for output: "${result.vars.output}"`);
    console.log('Metrics:');
    console.log(`  Overall: ${result.gradingResult.score}`);
    console.log(`  Components:`);
    for (const [key, value] of Object.entries(result.namedScores)) {
      console.log(`    ${key}: ${value}`);
    }
  });

  console.log('---------------------');
  console.log('Done.');
})();

Parameters

  • modelOutputs: An array of strings, where each string is a response output from the language model that you want to evaluate.

  • metrics: An array of objects specifying the metrics to apply. Each metric object (a partial Assertion) can have different properties depending on its type. Common properties include:

    • metric: The name of the metric.
    • type: The type of the metric (e.g., contains, llm-rubric, classifier).
    • value: The expected value or condition for the metric.
  • options: An EvaluateOptions configuration object for the evaluation process (a rough sketch of these parameter shapes follows this list), such as:

    • maxConcurrency: The maximum number of concurrent evaluations to perform. This helps in managing resource usage when evaluating large datasets.
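
To make the shapes above concrete, here is a rough TypeScript sketch. The names MetricInput and MetricEvalOptions are illustrative only; promptfoo's own types are Assertion and EvaluateOptions, and the fields shown are assumptions drawn from this example rather than the full API:

// Illustrative shapes only, inferred from the example above.
interface MetricInput {
  metric: string;    // display name used in namedScores
  type: string;      // assertion type, e.g. 'contains', 'llm-rubric', 'classifier'
  value?: string;    // expected value or grading condition
  provider?: string; // grading provider, needed for provider-backed types like 'classifier'
}

interface MetricEvalOptions {
  maxConcurrency?: number; // cap on parallel metric evaluations
}

declare function evaluateMetrics(
  modelOutputs: string[],
  metrics: MetricInput[],
  options?: MetricEvalOptions,
): Promise<unknown>; // resolves to an EvaluateSummary (see "Output format" below)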

Output format

The output of the evaluateMetrics function is an EvaluateSummary object that contains detailed results of the evaluation. Each entry in results includes:

  • vars: Variables used in the metric evaluation. In this case, the only variable is output.
  • gradingResult: An object describing the overall grading outcome, including:
    • score: A numeric score representing the overall evaluation result.
    • componentResults: Detailed results for each component of the metric evaluated.
  • namedScores: A breakdown of scores by individual metrics.

Example output

Here's an example of what the output might look like when printed:

{
  "results": [
    {
      "vars": {
        "output": "This is the first output."
      },
      "gradingResult": {
        "score": 0.66,
        "pass": false,
        "reason": "Output failed 1 out of 3 metrics"
      },
      "namedScores": {
        "Apologetic": 1,
        "Contains Expected Substring": 0,
        "Is Biased": 1
      }
    },
    {
      "vars": {
        "output": "This is the second output, which contains a specific substring."
      },
      "gradingResult": {
        "score": 1,
        "pass": true,
        "reason": "Output passed all metrics"
      },
      "namedScores": {
        "Apologetic": 1,
        "Contains Expected Substring": 1,
        "Is Biased": 1
      }
    }
    // ... more result objects for each output
  ],
  "stats": {
    "successes": 1,
    "failures": 1,
    "totalOutputs": 2
  }
}

This code loads a set of model outputs, defines several metrics to evaluate them against, and then calls evaluateMetrics with these outputs and metrics. It then logs the evaluation results for each output, including the overall score and the scores for each individual metric.
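
As a rough follow-on, the summary can be post-processed with ordinary array operations. This sketch assumes only the fields shown in the example output above (results, gradingResult.pass, namedScores) and the evaluation object from the earlier snippet:

// Continues from the example above: `evaluation` is the awaited result of
// promptfoo.evaluateMetrics(modelOutputs, metrics, options).
const failures = evaluation.results.filter((result) => !result.gradingResult.pass);
console.log(`${failures.length} of ${evaluation.results.length} outputs failed`);

// Average score per metric across all outputs.
const totals = {};
for (const result of evaluation.results) {
  for (const [metric, score] of Object.entries(result.namedScores)) {
    totals[metric] = (totals[metric] || 0) + score;
  }
}
for (const [metric, sum] of Object.entries(totals)) {
  console.log(`${metric}: ${(sum / evaluation.results.length).toFixed(2)} average`);
}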

@anthonyivn2 (Contributor) commented:

How would this work with the output-based/RAG-based metrics such as answer-relevance, context-relevance, context-faithfulness?

@typpo (Collaborator, Author) commented Apr 14, 2024

@anthonyivn2 Latest update adds support for a more complex data format such as:

const modelOutputs = [
  {
    vars: {
      question: 'What is the capital of France?',
      context: 'Paris, the capital of France, has a population of 2.148 million.',
    },
    output: 'foo',
  },
  {
    vars: {
      question: 'Which person is known for the theory of relativity?',
      context:
        'Albert Einstein, a German-born theoretical physicist, developed the theory of relativity.',
    },
    output: 'bar',
  },
];
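
For the RAG-style assertions raised above (answer-relevance, context-relevance, context-faithfulness), this richer format supplies the vars those graders read from. A minimal sketch, assuming the proposed evaluateMetrics accepts the object format and passes these assertion types through to promptfoo's standard model-graded checks (the exact var names the graders expect, e.g. question vs. query, is an assumption to verify against the promptfoo docs):

const ragMetrics = [
  { metric: 'Answer Relevance', type: 'answer-relevance', threshold: 0.8 },
  { metric: 'Context Faithfulness', type: 'context-faithfulness', threshold: 0.8 },
];

const evaluation = await promptfoo.evaluateMetrics(modelOutputs, ragMetrics, {
  maxConcurrency: 2,
});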

@anthonyivn2 (Contributor) commented on the diff:

  evaluateOptions: EvaluateOptions = {},
) {
  const testSuite: EvaluateTestSuite = {
    prompts: ['{{output}}'],

would this still make evaluate() run the callApi function?

@anthonyivn2 (Contributor) commented Apr 14, 2024

I created #671 as another potential way to go about making this change, and added an explanation on my end of why this functionality should be available.
