
Don't Trust: Verify -- Grounding LLM Quantitative Reasoning with Autoformalization

This is the official repo for

Don't Trust: Verify -- Grounding LLM Quantitative Reasoning with Autoformalization (ICLR 2024)

by Jin Peng Zhou, Charles Staats, Wenda Li, Christian Szegedy, Kilian Q Weinberger, Yuhuai Wu

Main Diagram

Abstract

Large language models (LLMs), such as Google's Minerva and OpenAI's GPT families, are becoming increasingly capable of solving mathematical quantitative reasoning problems. However, they still make unjustified logical and computational errors in their reasoning steps and answers. In this paper, we leverage the fact that if the training corpus of LLMs contained sufficiently many examples of formal mathematics (e.g. in Isabelle, a formal theorem proving environment), they can be prompted to translate, i.e. autoformalize, informal mathematical statements into formal Isabelle code --- which can be verified automatically for internal consistency. This provides a mechanism to automatically reject solutions whose formalized versions are inconsistent within themselves or with the formalized problem statement. We evaluate our method on GSM8K, MATH and MultiArith datasets and demonstrate that our approach provides a consistently better heuristic than vanilla majority voting --- the previously best method to identify correct answers, by more than 12% on GSM8K. In our experiments it improves results consistently across all datasets and LLM model sizes.

Prompts

All prompts used in the paper can be found in the prompts directory.

Informal Solution Generation

Because Minerva is proprietary, we cannot release the exact code used for informal solution generation and autoformalization. However, we provide an open-source alternative script based on vLLM. The datasets should first be formatted as JSONL files, and the script can be run as follows:

python informal_prove.py

The script will output the informal solutions as well as majority voting accuracy. Note that for the subsequent steps, we only need to perform autoformalization on the problems where at least one informal solution is correct.
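
A rough sketch of what such a script does is shown below: it samples several informal solutions per problem with vLLM and reports majority voting accuracy. The model name, prompt file, JSONL fields, and answer parser are illustrative assumptions rather than the exact setup used in the paper.

# Minimal sketch of informal solution sampling and majority voting with vLLM.
# Model name, file names, JSONL fields and the answer parser are assumptions.
import json
import re
from collections import Counter
from vllm import LLM, SamplingParams

few_shot = open("prompts/informal_solution.txt").read()              # hypothetical prompt file
problems = [json.loads(line) for line in open("gsm8k_test.jsonl")]   # hypothetical dataset file

def extract_final_answer(text: str) -> str:
    # Placeholder parser: treat the last number in the generation as the final answer.
    numbers = re.findall(r"-?\d+(?:\.\d+)?", text)
    return numbers[-1] if numbers else ""

llm = LLM(model="mistralai/Mistral-7B-v0.1")                         # any open-weight model
params = SamplingParams(temperature=0.6, max_tokens=512, n=16)       # 16 samples per problem

outputs = llm.generate([few_shot + p["question"] for p in problems], params)

correct = 0
for problem, out in zip(problems, outputs):
    answers = [extract_final_answer(o.text) for o in out.outputs]
    majority, _ = Counter(answers).most_common(1)[0]
    correct += int(majority == str(problem["answer"]))
print(f"majority voting accuracy: {correct / len(problems):.3f}")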

Formal Statement Generation

We generate formal statements with the following script:

python formalize_statement.py

The prompts for formalizing statements are in the prompts/formal_statement directory.
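
A hedged sketch of this step is shown below: the model is prompted with few-shot autoformalization examples and an Isabelle statement is recorded for each problem. The prompt file, JSONL fields, model and sampling settings are assumptions for illustration.

# Sketch of formal statement generation; file names, fields and model are assumptions.
import json
from vllm import LLM, SamplingParams

few_shot = open("prompts/formal_statement/gsm8k.txt").read()               # hypothetical prompt file
records = [json.loads(line) for line in open("informal_solutions.jsonl")]  # output of the previous step

llm = LLM(model="mistralai/Mistral-7B-v0.1")
params = SamplingParams(temperature=0.2, max_tokens=256, stop=["Informal statement:"])

prompts = [
    f"{few_shot}\nInformal statement: {r['question']}\nIsabelle statement:" for r in records
]
for record, out in zip(records, llm.generate(prompts, params)):
    record["formal_statement"] = out.outputs[0].text.strip()

with open("formal_statements.jsonl", "w") as f:
    f.writelines(json.dumps(r) + "\n" for r in records)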

Formal Solution Generation

Formal solutions are generated with the following script:

python formalize_proof.py

The prompts for formalizing solutions are in the prompts/formal_solution directory.
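
As with the previous step, a minimal sketch follows: the prompt combines few-shot examples, an informal solution and the formal statement, and several proof candidates are kept per problem for checking. All file names, JSONL fields and settings are assumptions.

# Sketch of formal proof generation; file names, JSONL fields and model are assumptions.
import json
from vllm import LLM, SamplingParams

few_shot = open("prompts/formal_solution/gsm8k.txt").read()    # hypothetical prompt file
records = [json.loads(line) for line in open("formal_statements.jsonl")]

llm = LLM(model="mistralai/Mistral-7B-v0.1")
params = SamplingParams(temperature=0.4, max_tokens=512, n=4)  # several proof attempts per problem

prompts = [
    f"{few_shot}\nInformal solution:\n{r['informal_solution']}\n"
    f"Formal statement:\n{r['formal_statement']}\nFormal proof:"
    for r in records
]
for record, out in zip(records, llm.generate(prompts, params)):
    # Keep each candidate as statement + proof so it can be checked as one Isabelle snippet.
    record["candidate_proofs"] = [record["formal_statement"] + "\n" + o.text.strip() for o in out.outputs]

with open("formal_proofs.jsonl", "w") as f:
    f.writelines(json.dumps(r) + "\n" for r in records)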

Proof Checking

We use the Isabelle proof environment and the PISA interface to verify the statements and proofs. After installing them, place proof_checking_local.py under pisa/src/main/python/. Because proof checking is relatively slow but highly parallelizable, we can first chunk the formalized statements and proofs into smaller files:

python proof_checking_processing.py

and then run the proof checking script:

python proof_checking_local.py
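
The snippet below sketches the chunk-then-check pattern. The check_with_isabelle helper is a hypothetical stand-in for whatever PISA call proof_checking_local.py makes (it is not PISA's actual API), and the chunk paths are illustrative.

# Sketch of parallel proof checking over pre-chunked files.
# check_with_isabelle is a hypothetical placeholder, not PISA's real interface.
import json
from multiprocessing import Pool

def check_with_isabelle(isabelle_text: str) -> bool:
    """Hypothetical wrapper around a PISA session; returns True iff the snippet checks."""
    raise NotImplementedError  # replaced by the real PISA call in proof_checking_local.py

def check_chunk(chunk_path: str) -> list[dict]:
    results = []
    for line in open(chunk_path):
        record = json.loads(line)
        record["verified"] = any(check_with_isabelle(p) for p in record["candidate_proofs"])
        results.append(record)
    return results

if __name__ == "__main__":
    chunks = [f"chunks/formal_proofs_{i:03d}.jsonl" for i in range(32)]  # produced by the chunking step
    with Pool(processes=8) as pool:  # in practice, one Isabelle/PISA server per worker
        all_results = [r for chunk in pool.map(check_chunk, chunks) for r in chunk]

Records marked as verified can then be used to reject inconsistent solutions before selecting a final answer, as described in the paper.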

Citation

If you find this code useful in your research, please consider citing:

@inproceedings{zhou2023don,
  title={Don't Trust: Verify--Grounding LLM Quantitative Reasoning with Autoformalization},
  author={Zhou, Jin Peng and Staats, Charles E and Li, Wenda and Szegedy, Christian and Weinberger, Kilian Q and Wu, Yuhuai},
  booktitle={The Twelfth International Conference on Learning Representations},
  year={2024}
}
