This repository contains the code for the paper The Open Proof Corpus: Building a Large-Scale, Human-Validated Dataset of LLM-Generated Proofs. This README explains how to:
- Reproduce our experiments,
- Run your own experiment using our user interface,
- Train the model as described in the paper.
Note: This repository was also used for the IMO 2025 evaluation of MathArena. The project "imo_2025" contains all configs and parameters to reproduce our results. If, after running your own evaluation, you wish to convert the results to the MathArena format, you can run:
```bash
uv run python scripts/convert_matharena.py --convert imo_2025
```
You can use either UV or Conda to install and manage dependencies.
UV is a fast Python package manager. Install it with:

- macOS & Linux:

  ```bash
  curl -LsSf https://astral.sh/uv/install.sh | sh
  ```

- Windows:

  ```powershell
  powershell -ExecutionPolicy ByPass -c "irm https://astral.sh/uv/install.ps1 | iex"
  ```
If using UV, prepend all commands with `uv run`.
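For example, the solver command shown later in this README becomes:

```bash
uv run python scripts/solve.py --project {project_name}
```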
To use Conda:
```bash
conda create -n opc python=3.12
conda activate opc
python -m pip install -e .
```
This installs the project in editable mode.
To use this codebase for your own project:
- Own a website domain.
- Use the `website` directory (a Flask app) to serve the UI. We used Nginx, pm2, and waitress-serve, but other setups work too. To run locally:
```bash
cd website
pip install -e .
python main.py
```
This runs the user interface locally. You can now use the UI normally through your browser.
You need to define several config files. This repository already includes three fully working example configurations that you can reference. See `src/grading/config.py` for all possible config options; we only mention the most important ones here. The project config (`configs/projects/{project_name}.yaml`) has the following structure:
```yaml
name: [[Name of the project, can be anything]]
admins:
  - [[Judge ID of an admin. How to define a Judge ID is explained later]]
  - ...
model_configs:
  - [[Path to a model config to run for this project, excluding .yaml and relative to configs/models, e.g., openai/o3]]
  - ...
prompt: [[Prompt for questions. Will be formatted using problem={problem_statement}, so make sure that any loose braces {,} appear doubled and that "{problem}" appears somewhere in the prompt]]
prompt_gold: [[Prompt for questions that contain a final answer that can be extracted. Same formatting rules as above.]]
n_solutions: [[Number of solutions to sample per model]]
problem_regexes:
  - [[Regex that has to match the problem path; defaults to everything. Multiple regexes match the union of all of them]]
  - ...
allow_shuffle: [[Whether to shuffle the model answers in each problem (good for anonymity); can break if you run best-of-n selection and do not run the solver all at once]]
overwrite: [[Whether to overwrite existing results]]
```
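As a rough sketch, a filled-in project config could look like the following (the project name, Judge ID `judge_1`, prompts, and parameter values are all illustrative; only `openai/o3` is taken from the example above):

```yaml
name: my_project
admins:
  - judge_1                 # hypothetical Judge ID
model_configs:
  - openai/o3               # model config path, as in the example above
prompt: |
  Provide a complete and rigorous proof for the following problem.
  Problem: {problem}
prompt_gold: |
  Solve the following problem and put your final answer in \boxed{{}}.
  Problem: {problem}
n_solutions: 4
problem_regexes:
  - ".*"
allow_shuffle: true
overwrite: false
```

Note how the braces in `\boxed{{}}` are doubled so they survive the `problem={problem_statement}` formatting step.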
To set up a solver config for best-of-n selection, you additionally have to specify the following keys:
```yaml
type: "best_of_n"
attempt_key: "attempts_singular"
n_solutions: [[Number of solutions to select from]]
judges:
  - name: [[Name of the judge]]
    judge: [[Config of the judge model, relative to configs/models]]
    mode: [[Either "random", "discrete", "continuous", or "dual"]]
    prompt: [[Prompt of the judge. Will be formatted with problem={problem_statement}, solution={model_solution} if the mode is not dual. If the mode is dual, it will be formatted with problem={problem_statement}, solution_a={model_solution_a}, solution_b={model_solution_b}. The prompt should instruct the model to put its final answer in \boxed{{}}.]]
    correct_word: [[If mode=discrete, this word is compared against the model's judgment to decide whether it considers the proof correct]]
    tournament_type: [[If mode=dual, the type of tournament run between model answers. Either "swiss" or "bracket"]]
    tournament_config: [[If mode=dual, the config of the tournament]]
      max_losses: [[Maximum number of losses in a bracket tournament before a solution is eliminated]]
      n_rounds: [[Number of rounds in the swiss tournament]]
      first_win_word: [[The word the model is instructed to output if the first solution wins]]
      second_win_word: [[The word the model is instructed to output if the second solution wins]]
      ...
```
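For illustration, a best-of-n solver config with a single discrete-mode judge might look like the following sketch (the judge name, prompt wording, and numeric values are placeholders; the model path reuses the `openai/o3` example):

```yaml
type: "best_of_n"
attempt_key: "attempts_singular"
n_solutions: 8              # illustrative value
judges:
  - name: o3-discrete       # hypothetical judge name
    judge: openai/o3
    mode: "discrete"
    prompt: |
      You are given a problem and a candidate proof.
      Problem: {problem}
      Solution: {solution}
      Decide whether the proof is fully correct and put your verdict in \boxed{{}}.
    correct_word: "correct"
```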
The following keys control how problems are distributed to human judges:

```yaml
n_problems: [[Number of problems to distribute to judges]]
overlap: [[Float between 0 and 1 indicating the fraction of problems that should be judged by two people]]
judge_file: configs/judges/{project_name}.json
competition_weight:
  [[Competition name]]: [[Relative weight compared to other competitions]]
  ...
```
"[Judge ID]": {
"only_undergraduate": "[[Bool, indicating whether to only give undergraduate problems to this judge (false if empty)]]",
"only_duplicates": "[[Bool, indicating whether to only give duplicate problems to this judge (false if empty)]]",
"load": "[[Float, relative weight of number of problems to give to this judge compared to others]]"
},
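A hypothetical judge file with two judges could look like this (the IDs and values are made up, and the exact value types may differ from what the repository expects):

```json
{
  "judge_1": {
    "only_undergraduate": false,
    "only_duplicates": false,
    "load": 1.0
  },
  "judge_2": {
    "only_undergraduate": true,
    "only_duplicates": false,
    "load": 0.5
  }
}
```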
For automatic grading with an LLM judge, the config contains the following keys:

```yaml
prompt: [[Prompt to give to the LLM grader. Will be formatted with problem={problem}, ground_truth_solution={ground_truth_solution}, proof_solution={proof_solution}. The model should output its answer in valid JSON (shown below)]]
judge: [[Config of the judge model, relative to configs/models]]
overwrite: [[Whether to overwrite existing results]]
```
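A sketch of such a config, with an illustrative prompt (the model path again reuses `openai/o3`):

```yaml
prompt: |
  You are given a problem, its ground-truth solution, and a candidate proof.
  Problem: {problem}
  Ground-truth solution: {ground_truth_solution}
  Candidate proof: {proof_solution}
  Review the candidate proof and report any issues in the JSON format described below.
judge: openai/o3
overwrite: false
```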
Required output format of an LLM judge:
```json
{
  "summary": "A concise summary of the proof solution.",
  "issues": [
    {
      "location": "A description of where the issue occurs in the proof",
      "text": "A citation or excerpt from the proof that contains the issue. If the issue is not contained to a very small part of the proof (e.g., a single sentence), you can leave this field empty.",
      "description": "A brief explanation of the issue.",
      "category": "The category of the issue (Overgeneralization, Oversimplification, Skipping Computation Steps, Citing Non-Standard Works or Theorems, Missing Edge Cases, Wrong Final Answer, Other)."
    }
  ]
}
```
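A purely illustrative judgment in this format:

```json
{
  "summary": "The proof argues by induction on n, verifying the base case n = 1 explicitly.",
  "issues": [
    {
      "location": "Inductive step, second paragraph",
      "text": "Since f(n+1) <= f(n), the claim follows.",
      "description": "The inequality is asserted without justification.",
      "category": "Skipping Computation Steps"
    }
  ]
}
```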
To add a new judge:
- Add a new Judge ID to `configs/judges/judges.json`. This file maps judge IDs to properties.
- Add that ID to `configs/judges/{project_name}.json`.
- If the judge is also an admin, add them to the admin list in `configs/projects/{project_name}.yaml`.
- Problems must be formatted as a list of dictionaries in `data/raw/{filename}.json`:
```json
{
  "problem_id": "[[Problem ID. Anything before the first '_' will be used as competition name]]",
  "metadata": {
    "[[Metadata key, can be anything]]": "[[Value]]",
    "level": "[[If undergraduate, will not be given to judges that cannot do this]]"
  },
  "problem": "[[Problem Statement]]",
  "solutions": [
    {
      "solution": "[[Ground Truth Solution]]",
      "images": {
        "[[key]]": "[[Link to image]]"
      },
      "[[Metadata key, can be anything]]": "[[Value]]"
    }
  ]
}
```
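As an illustrative example, a file with a single problem could look like this (the problem ID, metadata, and text are placeholders):

```json
[
  {
    "problem_id": "imo_2025_p1",
    "metadata": {
      "source": "example",
      "level": "undergraduate"
    },
    "problem": "Prove that ...",
    "solutions": [
      {
        "solution": "We proceed by induction ...",
        "images": {}
      }
    ]
  }
]
```

Here everything before the first underscore ("imo") would be used as the competition name.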
- Then process and validate them:
```bash
python scripts/process.py --project {project_name} --file-name {filename}.json
python scripts/validate.py
```
Validation checks for consistency and uniqueness of paths.
Once configs and API keys are in place, generate model outputs:

```bash
python scripts/solve.py --project {project_name}
```

Assign problems to judges for grading:

```bash
python scripts/distribution.py --project {project_name}
```

Fetch fully judged problems from the website:

```bash
python scripts/recover.py --project {project_name}
```

Track judging progress and judge overlap:

```bash
python scripts/monitor.py --project {project_name}
```

Convert all raw results into a single export file:

```bash
python scripts/convert.py
```

Prepare final results for use as training data:

```bash
python scripts/postprocess.py --project {project_name}
```
Most results are directly included in the Open Proof Corpus. Extract them using `notebooks/extraction.ipynb`.
To evaluate the judging capabilities of a model on the test set after postprocessing, follow these steps:
Install additional libraries:
```bash
pip install -e .
pip install torch==2.6.0
pip install vllm==0.8.5
pip install flash_attn==2.7.4.post1
pip install flashinfer-python==0.2.2.post1
pip install -U ray==2.46.0
```
Then run judge inference with your judge models:

```bash
python ./scripts/judge_inference.py --judge_configs {judge_models} --setting_config discrete --n 5 --skip-existing
```
To obtain the results, run:
```bash
python ./scripts/judge_results.py --setting_config discrete
```
To finetune a verification model using the OPC, we adapt the verl framework. Run the following steps:
To transform the data into parquet format, as well as split the training data into training and validation sets, run:

```bash
python runs/preprocess.py --data_source ../data/{project_name}/train_samples.json --prompt_file runs/prompts/{prompt_file}
```
You are free to change any hyperparameters in `runs/train.sh`, according to the verl documentation.
To train the model, use the following command if you are on a local server:

```bash
slurm/train.sbatch
```

or the following if you are using a Slurm cluster (overriding any of the default parameters if needed):

```bash
sbatch slurm/train.sbatch
```
The checkpoint with the highest validation score is saved at `best_validation/`, while the latest checkpoints can be found in `checkpoints/`.