Training Open Instruction-Following Language Models

This repo serves as an open effort on instruction-tuning and post-training popular pretrained language models on publicly available datasets. We release this repo and will keep updating it with:

Code for finetuning language models with latest techniques and instruction datasets in a unified format.
Code for DPO, preference finetuning and reinforcement learning with verifiable rewards (RLVR).
Checkpoints or other useful artifacts that we build in our exploration.

We also support some evaluations natively in the codebase, but these are now unmaintained and instead we suggest using OLMES, which we used for TÜLU 3.

The lastest details on open post-training are found in TÜLU 3: Pushing Frontiers in Open Language Model Post-Training.

Please see our first paper How Far Can Camels Go? Exploring the State of Instruction Tuning on Open Resources for more thoughts behind this project and our initial findings. Please see our second paper Camels in a Changing Climate: Enhancing LM Adaptation with Tulu 2 for results using Llama-2 models and direct preference optimization. We are still working on more models. For more recent results involving PPO and DPO please see our third paper Unpacking DPO and PPO: Disentangling Best Practices for Learning from Preference Feedback.

Try some of the models we train with Open Instruct. There is a free demo or download them from HuggingFace:

Stage	Llama 3.1 8B	Llama 3.1 70B	OLMo-2 7B	OLMo-2 13B
Base Model	meta-llama/Llama-3.1-8B	meta-llama/Llama-3.1-70B	allenai/OLMo2-7B-1124	allenai/OLMo-2-13B-1124
SFT	allenai/Llama-3.1-Tulu-3-8B-SFT	allenai/Llama-3.1-Tulu-3-70B-SFT	allenai/OLMo-2-1124-7B-SFT	allenai/OLMo-2-1124-13B-SFT
DPO	allenai/Llama-3.1-Tulu-3-8B-DPO	allenai/Llama-3.1-Tulu-3-70B-DPO	allenai/OLMo-2-1124-7B-DPO	allenai/OLMo-2-1124-13B-DPO
Final Models (RLVR)	allenai/Llama-3.1-Tulu-3-8B	allenai/Llama-3.1-Tulu-3-70B	allenai/OLMo-2-1124-7B-Instruct	allenai/OLMo-2-1124-13B-Instruct
Reward Model (RM)	allenai/Llama-3.1-Tulu-3-8B-RM	(Same as 8B)	allenai/OLMo-2-1124-7B-RM	(Same as 7B)

News

[2024-11-22] We released TÜLU 3: Pushing Frontiers in Open Language Model Post-Training and updated our entire stack of open post-training recipes with both Llama 3.1 and OLMo 2.
[2024-07-01] We released Unpacking DPO and PPO: Disentangling Best Practices for Learning from Preference Feedback and have majorly updated our codebase to support new models and package versions.
[2023-11-27] We released Camels in a Changing Climate: Enhancing LM Adaptation with Tulu 2. Check out our models here. We have added a DPO finetuning script for replicating our results.
[2023-09-26] We switched to use the official alpaca-eval library to run AlpacaFarm evaluation but use regenerated longer reference outputs. This will change our numbers reported in the paper. We will update the paper soon.
[2023-09-25] Supported using vLLM for our evaluations, which speeds up the evaluation by 10x.
[2023-09-17] Supported LoRA and QLoRA finetuning. See here for more details.
[2023-08-18] Added support for ToxiGen/TruthfulQA evaluation. Check our scripts/eval/ for examples of running them.
[2023-08-08] Supported several new instruction dataset, including LIMA / WizardLM / Open-Orca. See the preparation script for details. Performance hasn't been evaluated yet.
[2023-08-06] Supported LLaMa 2 finetuning and FlashAttention-2 by bumping the version of transformers and many other dependencies.
[2023-06-29] Added licensing info for our released models.
[2023-06-09] Released Tülu (a suite of LLaMa models fully-finetuned on a strong mix of datasets) and many other checkpoints on HuggingFace [Links].
[2023-06-09] Initial release of the codebase containing the training and evaluation code for our arxiv paper.

Setup

Our setup mostly follows our Dockerfile, which uses Python 3.10. Note that Open Instruct is a research codebase and does not guarantee backward compatibility. We offer two installation strategies:

Local installation: This is the recommended way to install Open Instruct. You can install the dependencies by running the following commands:

pip install --upgrade pip "setuptools<70.0.0" wheel 
# TODO, unpin setuptools when this issue in flash attention is resolved
pip install torch==2.6.0 torchvision==0.21.0 --index-url https://download.pytorch.org/whl/cu124
pip install packaging
pip install flash-attn==2.7.2.post2 flashinfer-python>=0.2.7.post1 --no-build-isolation
pip install -r requirements.txt
pip install -e .
python -m nltk.downloader punkt

Local installation with uv (preview): We are experimenting with using uv. You can install via

uv sync
uv sync --extra compile --extra liger # to install flash attention, flash infer, and liger-kernel

Docker installation: You can also use the Dockerfile to build a Docker image. You can build the image with the following command:

# If you are internal at Ai2, defaults are set to use an internal base image.
docker build . -t open_instruct_dev

# if you are external to Ai2, please build the base image first and then build the open_instruct_dev image.
docker build -f Dockerfile.base --build-arg CUDA="12.8.0" --build-arg VARIANT=devel --build-arg DIST=ubuntu22.04 --build-arg TORCH_VER="2.7.0" --build-arg BEAKER_VERSION=v1.5.208 --build-arg VULKAN_SDK_VERSION="1.3.275" --target default-cuda -t open_instruct_dev:latest .
docker build . -t open_instruct_dev --build-arg BASE_IMAGE=open_instruct_dev:latest

# if you are interally at AI2, you can create a beaker image like this:
beaker_user=$(beaker account whoami --format json | jq -r '.[0].name')
beaker image delete $beaker_user/open_instruct_dev 
beaker image create open_instruct_dev -n open_instruct_dev -w ai2/$beaker_user

Optionally you can build the base image with the following command:

docker build -f Dockerfile.base --build-arg CUDA="12.8.0" --build-arg VARIANT=devel --build-arg DIST=ubuntu22.04 --build-arg TORCH_VER="2.7.0" --build-arg BEAKER_VERSION=v1.5.208 --build-arg VULKAN_SDK_VERSION="1.3.275" --target default-cuda -t open_instruct_dev:latest .

Docker with uv: You can also use the Dockerfile to build a Docker image with uv. You can build the image with the following command:

# similar to above, you need to provide the base image if you are external to Ai2.
docker build -f Dockerfile.uv --build-arg UV_CACHE_DIR=$UV_CACHE_DIR -t open_instruct_dev_uv .
# if you are interally at AI2, you can create an image like this:
beaker_user=$(beaker account whoami --format json | jq -r '.[0].name')
beaker image delete $beaker_user/open_instruct_dev_uv 
beaker image create open_instruct_dev_uv -n open_instruct_dev_uv -w ai2/$beaker_user

If you are internally at AI2, you may launch experiments using our always-up-to-date auto-built image nathanl/open_instruct_auto.

Training

After having setup the environment, you are ready to launch some experiments. We provide a few examples below. To learn more about how to reproduce the Tulu 3 models, please refer to the Tulu 3 README. The instructions and documentations for Tulu 1 and Tulu 2 are in Tulu 1 and 2 README.

Finetuning

You can run the following commands for getting started:

# quick debugging run using 1 GPU
bash scripts/train/finetune/mini.sh
# train an 8B tulu3 model using 8 GPU
bash scripts/train/finetune/tulu_finetune_mix.sh

Preference Tuning

# quick debugging run using 1 GPU
bash scripts/train/dpo/mini.sh
# train an 8B tulu3 model using 8 GPU
bash scripts/train/dpo/tulu_preference_mix.sh

Reinforcement Learning with Verifiable Rewards (RLVR)

# quick debugging run using 1 GPU (0.5 for inference, 0.5 for training)
# here we are using a small model, so it's prob not gonna train good models, but it's easy to test run and print stuff.
bash scripts/train/rlvr/ppo_mini.sh
bash scripts/train/rlvr/ppo2_mini.sh # experimental support (ppo2 adds kl to loss directly instead of using KL penalty in rewards)
bash scripts/train/rlvr/grpo_mini.sh

# train an 8B tulu3 model using 8 GPU (1 for inference, 7 for training)
bash scripts/train/rlvr/tulu_rlvr.sh

Contamination checks

We release our scripts for measuring the overlap between instruction tuning datasets and evaluation datasets in ./decontamination. See the README for more details.

Developing

When submitting a PR to this repo, we check the core code in open_instruct/ for style with the following:

make style
make quality

Run the tests with uv run pytest.

Repo structure

├── assets/                     <- Images, licenses, etc.
├── configs/                    
|     ├── beaker_configs/       <- AI2 Beaker configs
|     ├── ds_configs/           <- DeepSpeed configs
|     └── train_configs/        <- Training configs
├── decontamination/            <- Scripts for measuring train-eval overlap
├── eval/                       <- Evaluation suite for fine-tuned models
├── human_eval/                 <- Human evaluation interface (not maintained)
├── open_instruct/              <- Source code (flat)
├── quantize/                   <- Scripts for quantization
├── scripts/                    <- Core training and evaluation scripts
└── Dockerfile                  <- Dockerfile

Licensing

This codebase is licensed under Apache 2.0 as given in LICENSE.

The license we use for V1 models released (along with the base model licenses) can be found in assets/model_licenses/tulu_license.txt - just replace <MODELNAME> with the actual model name (i.e., the name on HuggingFace).

V2 models are licensed under the low-risk AI2 ImpACT license. See here for more details.

Acknowledgements

Open Instruct is a project that benefitd from many open-source projects and libraries. We would like to particularly thank the folloiwng projects:

HuggingFace Transformers: We adapted Hugging Face's Trainer for our finetuning scripts.
HuggingFace TRL and eric-mitchell/direct-preference-optimization: our preference tuning code is adapted from TRL and from Eric Mitchell's DPO code.
OpenAI's lm-human-preferences, summarize-from-feedback, and vwxyzjn/summarize_from_feedback_details: Our core PPO code is adapted from OpenAI's original RLHF code and Huang et al (2024)'s reproduction work of OpenAI's summarize from feedback work.
OpenRLHF: We adapted OpenRLHF's Ray + vLLM distributed code for scaling up PPO RLVR training into the 70B scale.

Citation

If you used this repository or our models, please cite our work:

Tulu 1:

@misc{wang2023far,
   title={How Far Can Camels Go? Exploring the State of Instruction Tuning on Open Resources}, 
   author={Yizhong Wang and Hamish Ivison and Pradeep Dasigi and Jack Hessel and Tushar Khot and Khyathi Raghavi Chandu and David Wadden and Kelsey MacMillan and Noah A. Smith and Iz Beltagy and Hannaneh Hajishirzi},
   year={2023},
   eprint={2306.04751},
   archivePrefix={arXiv},
   primaryClass={cs.CL}
}

Tulu 2:

@misc{ivison2023camels,
      title={Camels in a Changing Climate: Enhancing LM Adaptation with Tulu 2}, 
      author={Hamish Ivison and Yizhong Wang and Valentina Pyatkin and Nathan Lambert and Matthew Peters and Pradeep Dasigi and Joel Jang and David Wadden and Noah A. Smith and Iz Beltagy and Hannaneh Hajishirzi},
      year={2023},
      eprint={2311.10702},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

Tulu 2.5:

@misc{ivison2024unpacking,
      title={Unpacking DPO and PPO: Disentangling Best Practices for Learning from Preference Feedback}, 
      author={Hamish Ivison and Yizhong Wang and Jiacheng Liu and Zeqiu Wu and Valentina Pyatkin and Nathan Lambert and Noah A. Smith and Yejin Choi and Hannaneh Hajishirzi},
      year={2024},
      eprint={2406.09279},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
}

Tulu 3:

@article{lambert2024tulu3,
  title = {Tülu 3: Pushing Frontiers in Open Language Model Post-Training},
  author = {
    Nathan Lambert and Jacob Morrison and Valentina Pyatkin and Shengyi Huang and Hamish Ivison and Faeze Brahman and Lester James V. Miranda and Alisa Liu and Nouha Dziri and Shane Lyu and Yuling Gu and Saumya Malik and Victoria Graf and Jena D. Hwang and Jiangjiang Yang and Ronan Le Bras and Oyvind Tafjord and Chris Wilhelm and Luca Soldaini and Noah A. Smith and Yizhong Wang and Pradeep Dasigi and Hannaneh Hajishirzi
  },
  year = {2024},
  email = {tulu@allenai.org}
}

Name		Name	Last commit message	Last commit date
Latest commit History 816 Commits
.github		.github
assets		assets
configs		configs
decontamination		decontamination
docker		docker
docs		docs
eval		eval
human_eval		human_eval
open_instruct		open_instruct
quantize		quantize
scripts		scripts
tests		tests
.gitignore		.gitignore
.gitmodules		.gitmodules
Dockerfile		Dockerfile
Dockerfile.base		Dockerfile.base
Dockerfile.uv		Dockerfile.uv
LICENSE		LICENSE
Makefile		Makefile
README.md		README.md
mason.py		mason.py
mkdocs.yml		mkdocs.yml
pyproject.toml		pyproject.toml
requirements.txt		requirements.txt
update_command_args.py		update_command_args.py
uv.lock		uv.lock
weight-diff-requirements.txt		weight-diff-requirements.txt
weka_setup_helper.sh		weka_setup_helper.sh

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

Training Open Instruction-Following Language Models

News

Setup

Training

Finetuning

Preference Tuning

Reinforcement Learning with Verifiable Rewards (RLVR)

Contamination checks

Developing

Repo structure

Licensing

Acknowledgements

Citation

About

Uh oh!

Releases

Packages

Uh oh!

Contributors 40

Languages

License

allenai/open-instruct

Folders and files

Latest commit

History

Repository files navigation

Training Open Instruction-Following Language Models

News

Setup

Training

Finetuning

Preference Tuning

Reinforcement Learning with Verifiable Rewards (RLVR)

Contamination checks

Developing

Repo structure

Licensing

Acknowledgements

Citation

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors 40

Languages

Packages