Disclaimer: we are still working on polishing the code here, so please expect frequent changes!
New, clean Docker containers are used for each instance (task + LLM + agent) during evaluation, which consists of output generation and judging. EXP-Bench currently supports evaluation with the following agents and LLMs, with plans to expand support in the future:
- Agents: OpenHands and Iterative Agent
- LLMs: OpenAI models, Claude 3.7 Sonnet, Claude 3.5 Haiku, DeepSeek R1, Amazon Nova Pro
All commands should be executed from this directory. To run the evaluation and judging pipelines, simply build the Docker image and execute the corresponding scripts provided below.
Note: a few Python packages may be missing and need to be installed manually. Also make sure Python 3.12.4 is installed (conda works best; the setup is shown below).
docker images -q exp-bench-image | xargs -r docker rmi -f # removes any existing conflicting image
docker build --progress=plain -t exp-bench-image -f ExpDockerfile_default .
conda create -n exp-bench python=3.12.4
conda activate exp-bench
pip install -e . # this needs to be executed in the conda env that you just activated
These scripts will generate outputs within the outputs/evaluation folder.
Make sure you have configured evaluation/configs/parallel_eval_gen_config_template.json with the correct parameters, e.g., the LLM config and agent name.
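For orientation, here is a minimal sketch of what this config might contain. The key names "llm_config_filename" and "agent_name" are taken from the judging notes later in this section and are assumed to apply to the generation template as well; the filenames and the agent-name string are illustrative placeholders, so consult the template itself for the authoritative schema.

```json
{
  "llm_config_filename": "claude-3-7-sonnet.json",
  "agent_name": "openhands"
}
```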
python evaluation/parallel_eval.py
The following config will work for the Inspect Agent, though make sure to set the other parameters correctly, e.g., the LLM config.
python evaluation/parallel_eval.py --task_config=evaluation/configs/parallel_eval_gen_config_template_inspect_agent.json
The following is an optional script for generating evaluation output for specific tasks only:
python evaluation/run_parallel_gen_evals.py \
--max_duration 0.5 \
--specific_tasks '[["neurips2024", "93022", 1, 0.25], ["neurips2024", "93022", 1, 0.5], ["neurips2024", "93022", 1, 1], ["neurips2024", "93022", 1, 2], ["neurips2024", "93022", 1, 4], ["neurips2024", "93022", 1, 8], ["neurips2024", "94155", 6, 8]]'
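The value passed to --specific_tasks is a JSON list of task specifiers. Judging from the example above (and from the separate --max_duration flag), each entry appears to encode the conference, paper ID, task index, and a time budget in hours; this reading is an assumption, so check run_parallel_gen_evals.py for the authoritative format. A single-task run would then pass something like:

```json
[["neurips2024", "93022", 1, 0.5]]
```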
Make sure you have configured evaluation/configs/parallel_eval_judge_config_template.json with the correct parameters, e.g., the LLM config and agent name (a sketch is shown after the notes below).
- The specific evaluation output folder within outputs/evaluation that will be judged is determined in part by the config keys "llm_config_filename" and "agent_name".
- Leave "judge_agent_name" blank.
- Fill in "llm_judge_config_filename" with the LLM config used for judging. Currently we default to o3-mini.
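Putting the notes above together, a judging config would look roughly like the sketch below. The four key names come directly from the notes above; the specific filenames and the agent-name string are illustrative placeholders (o3-mini is the default judge), so check the shipped template for the exact value format.

```json
{
  "llm_config_filename": "claude-3-7-sonnet.json",
  "agent_name": "openhands",
  "judge_agent_name": "",
  "llm_judge_config_filename": "o3-mini.json"
}
```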
python evaluation/parallel_eval.py --task_config=evaluation/configs/parallel_eval_judge_config_template.json
This is an optional script that will perform judging for all evaluation output configs. Currently, you need to modify the parameters within the script manually.
python evaluation/run_parallel_judge_evals.py