Disclaimer: we are still working on polishing the code here, so please expect frequent changes!
New, clean Docker containers are used for each instance (task + LLM + agent) during evaluation, which consists of output generation and judging. EXP-Bench currently supports evaluation with the following agents and LLMs, with plans to expand support in the future:
- Agents: OpenHands and Iterative Agent
- LLMs: OpenAI models, Claude 3.7 Sonnet, Claude 3.5 Haiku, DeepSeek R1, Amazon Nova Pro
All commands should be executed from this directory. To run the evaluation and judging pipelines, simply build the Docker image and execute the corresponding scripts provided below.
Note: a few Python packages may be missing and need to be installed manually. Also make sure Python 3.12.4 is installed (conda works best; the setup is shown below).
docker images -q exp-bench-image | xargs -r docker rmi -f # removes any existing conflicting image
docker build --progress=plain -t exp-bench-image -f ExpDockerfile_default .
conda create -n exp-bench python=3.12.4
conda activate exp-bench
pip install -e . # this needs to be executed in the conda env that you just activated
These scripts will generate outputs within the outputs/evaluation folder.
Make sure you have configured evaluation/configs/parallel_eval_gen_config_template.json with the correct parameters, e.g., the LLM config and agent name.
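For orientation, here is a minimal sketch of what this config might contain. The key names "llm_config_filename" and "agent_name" are taken from the judging notes later in this section and are assumed to apply to the generation template as well; the filenames and the agent-name string are illustrative placeholders, so consult the template itself for the authoritative schema.

```json
{
  "llm_config_filename": "claude-3-7-sonnet.json",
  "agent_name": "openhands"
}
```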
python evaluation/parallel_eval.py
The following config will work for the Inspect Agent, though make sure to set the other parameters correctly, e.g., the LLM config.
python evaluation/parallel_eval.py --task_config=evaluation/configs/parallel_eval_gen_config_template_inspect_agent.json
The following is an optional script for generating evaluation output for specific tasks only:
python evaluation/run_parallel_gen_evals.py \
--max_duration 0.5 \
--specific_tasks '[["neurips2024", "93022", 1, 0.25], ["neurips2024", "93022", 1, 0.5], ["neurips2024", "93022", 1, 1], ["neurips2024", "93022", 1, 2], ["neurips2024", "93022", 1, 4], ["neurips2024", "93022", 1, 8], ["neurips2024", "94155", 6, 8]]'
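The value passed to --specific_tasks is a JSON list of task specifiers. Judging from the example above (and from the separate --max_duration flag), each entry appears to encode the conference, paper ID, task index, and a time budget in hours; this reading is an assumption, so check run_parallel_gen_evals.py for the authoritative format. A single-task run would then pass something like:

```json
[["neurips2024", "93022", 1, 0.5]]
```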
Make sure you have configured evaluation/configs/parallel_eval_judge_config_template.json with the correct parameters, e.g., the LLM config and agent name (a sketch is shown after the notes below).
- The specific evaluation output folder within outputs/evaluation that will be judged is determined in part by the config keys "llm_config_filename" and "agent_name".
- Leave "judge_agent_name" blank.
- Fill in "llm_judge_config_filename" with the LLM config used for judging. Currently we default to o3-mini.
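Putting the notes above together, a judging config would look roughly like the sketch below. The four key names come directly from the notes above; the specific filenames and the agent-name string are illustrative placeholders (o3-mini is the default judge), so check the shipped template for the exact value format.

```json
{
  "llm_config_filename": "claude-3-7-sonnet.json",
  "agent_name": "openhands",
  "judge_agent_name": "",
  "llm_judge_config_filename": "o3-mini.json"
}
```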
python evaluation/parallel_eval.py --task_config=evaluation/configs/parallel_eval_judge_config_template.json
This is an optional script that will perform judging for all evaluation output configs. Currently, you need to modify the parameters within the script manually.
python evaluation/run_parallel_judge_evals.py