feat: support post-hoc evals through node package via evaluateMetrics #669
## Run an eval on existing outputs
You can use promptfoo to evaluate metrics on outputs you've already produced, without running the full evaluation pipeline. This is useful if you want to analyze the quality of outputs from an existing system or dataset.
### How to use
Use the `evaluateMetrics` function provided by `promptfoo` to run evaluations on pre-existing outputs. Here's how you can integrate this into your project:
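A minimal sketch, assuming `evaluateMetrics` is a named export of the `promptfoo` package; the outputs and metric names below are placeholders for your own data:

```ts
import { evaluateMetrics } from 'promptfoo';

async function main() {
  // Outputs produced earlier by your own system or dataset.
  const modelOutputs = [
    'The capital of France is Paris.',
    'Sorry, I cannot help with that.',
  ];

  // Partial Assertion objects describing how each output is graded.
  // The metric names here are illustrative.
  const metrics = [
    { metric: 'mentions-paris', type: 'contains', value: 'Paris' },
    { metric: 'helpfulness', type: 'llm-rubric', value: 'Is the response helpful?' },
  ];

  const summary = await evaluateMetrics(modelOutputs, metrics, {
    maxConcurrency: 4, // cap the number of concurrent metric evaluations
  });

  console.log(JSON.stringify(summary, null, 2));
}

main();
```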
### Parameters

- `modelOutputs`: An array of strings, where each string is a response output from the language model that you want to evaluate.
- `metrics`: An array of objects specifying the metrics to apply. Each metric object (a partial `Assertion`) can have different properties depending on its type (see the sketch after this list). Common properties include:
  - `metric`: The name of the metric.
  - `type`: The type of the metric (e.g., `contains`, `llm-rubric`, `classifier`).
  - `value`: The expected value or condition for the metric.
- `options`: An `EvaluateOptions` configuration for the evaluation process, such as:
  - `maxConcurrency`: The maximum number of concurrent evaluations to perform. This helps manage resource usage when evaluating large datasets.
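For illustration only, a `metrics` array mixing the three example types might look like the following; the metric names are invented, and the exact shape of a `classifier` assertion depends on how your grader is configured:

```ts
const metrics = [
  // String-matching check: passes if the output contains the value.
  { metric: 'mentions-refund', type: 'contains', value: 'refund' },

  // Model-graded rubric: an LLM judges the output against the rubric text.
  { metric: 'politeness', type: 'llm-rubric', value: 'The response is polite and professional.' },

  // Classifier-based check: a text classifier scores the output
  // (assumes a classifier grader is configured for your project).
  { metric: 'non-toxic', type: 'classifier', value: 'non-toxic' },
];
```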
### Output format

The output of the `evaluateMetrics` function is an `EvaluateSummary` object that includes detailed results of the evaluation. Each result includes:

- `vars`: Variables used in the metric evaluation. In this case, the only variable is `output`.
- `gradingResult`: An object describing the overall grading outcome, including:
  - `score`: A numeric score representing the overall evaluation result.
  - `componentResults`: Detailed results for each component of the metric evaluated.
  - `namedScores`: A breakdown of scores by individual metrics.

### Example output
Here's an example of what the output might look like when printed:
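The following is an illustrative reconstruction of a single result, showing only the fields described above; the values and reasons are made up, and a real summary may carry additional fields:

```json
{
  "vars": { "output": "The capital of France is Paris." },
  "gradingResult": {
    "score": 1,
    "componentResults": [
      { "pass": true, "score": 1, "reason": "Output contains 'Paris'" },
      { "pass": true, "score": 1, "reason": "Response judged helpful by the rubric grader" }
    ],
    "namedScores": {
      "mentions-paris": 1,
      "helpfulness": 1
    }
  }
}
```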
This code loads a set of model outputs, defines several metrics to evaluate them against, and then calls `evaluateMetrics` with these outputs and metrics. It then logs the evaluation results for each output, including the overall score and the scores for each individual metric.