Code style: black

EasyRLHF

EasyRLHF aims to provide an easy and minimal interface to train aligned language models, using off-the-shelf solutions and datasets (i.e. HF Trainer, HF Datasets, Deepspeed, trl).

The following sections cover the rough concepts behind the alignment methods (RLHF, RRHF, DPO, IPO) and show how to run the examples.

RLHF Overview

As shown in the InstructGPT paper, we can train a reward model and use reinforcement learning to make a language model follow human instructions better. We first train a reward model and an SFT LM on the hh-rlhf dataset and the slimorca-dedup dataset respectively. The PPO LM can then be trained with the trl library.

(Figure: RLHF training workflow)

Train a Reward Model

We need a pairwise comparison dataset to train a reward model. In the InstructGPT paper, the authors collected 4 to 9 ranked continuations of the same prompt. For instance, A < B < C = D < E is a ranked sequence, and one can sample two arbitrary completions (say A and C). Here C wins over A in human preference, so we model logit(C) - logit(A) as the log odds of C being a better demonstration than A. The logit of a completion is computed by a linear head attached on top of a transformer decoder. We use the off-the-shelf hh-rlhf dataset from Anthropic. This dataset already comes as flat chosen/rejected pairs, so we don't need to worry about the sampling schemes discussed in the InstructGPT paper.
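As a rough illustration (not the repo's rm_train implementation), the sketch below scores a chosen and a rejected completion with a scalar head and applies the pairwise loss on the logit difference; gpt2 and the example strings are placeholders.

import torch
import torch.nn.functional as F
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_name = "gpt2"  # the repo defaults to gpt2-xl; gpt2 keeps the sketch light
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=1)
model.config.pad_token_id = tokenizer.pad_token_id

chosen = "Human: How do I boil an egg?\n\nAssistant: Simmer it for about 7 minutes."
rejected = "Human: How do I boil an egg?\n\nAssistant: I don't know."

def score(text):
    # Scalar reward logit for a full prompt+completion string.
    inputs = tokenizer(text, return_tensors="pt", truncation=True)
    return model(**inputs).logits.squeeze(-1)

# P(chosen beats rejected) is modelled as sigmoid(logit_chosen - logit_rejected),
# so the pairwise loss is -log sigmoid(difference), i.e. BCE with target 1.
loss = -F.logsigmoid(score(chosen) - score(rejected)).mean()
loss.backward()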

Train an SFT (supervised fine-tuned) model (WIP)

We can train the SFT model with standard next-token prediction using slimorca-dedup.
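Since the SFT script is still WIP, here is a minimal sketch of that next-token-prediction setup with the HF Trainer; the toy instruction/response string and prompt template stand in for formatted slimorca-dedup examples and are not the repo's actual formatting.

from datasets import Dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

model_name = "gpt2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(model_name)

# Toy stand-in for a formatted slimorca-dedup example (prompt and response in one string).
texts = ["### Instruction:\nName a prime number.\n\n### Response:\n7 is a prime number."]
dataset = Dataset.from_dict({"text": texts}).map(
    lambda ex: tokenizer(ex["text"], truncation=True, max_length=512),
    remove_columns=["text"],
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="outputs/sft-sketch",
                           per_device_train_batch_size=1,
                           num_train_epochs=1,
                           report_to=[]),
    train_dataset=dataset,
    # mlm=False makes the labels the shifted inputs, i.e. standard next-token prediction.
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()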

Train a PPO model (WIP)

Now that we have a reward model and an SFT model, we can do reinforcement learning with off-the-shelf RL packages designed for language models. We use trl to reinforce the SFT model. In the PPO stage, we keep a frozen copy of the SFT model as a reference. This reference model lets the behavior model (the policy) learn to increase human preference while avoiding reward hacking. Specifically, the policy first generates a completion given a prompt. Its token distribution is kept close to the reference model's by penalizing the KL divergence between them. The completion is fed to the reward model to obtain a reward score. The KL penalty and the reward score are summed and used as the reward for the PPO algorithm.
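The sketch below shows a single PPO step with trl's PPOTrainer: generate a completion, score it with the reward model, and pass the reward to the trainer, which applies the KL penalty against the reference model internally. It follows the older trl 0.x interface (names may differ in newer releases), and the reward-model path is just the assumed output directory from the Quickstart below.

import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer
from trl import AutoModelForCausalLMWithValueHead, PPOConfig, PPOTrainer

sft_name = "gpt2"                  # stands in for the SFT checkpoint
rm_path = "outputs/my-model"       # assumed path to the trained reward model

tokenizer = AutoTokenizer.from_pretrained(sft_name)
tokenizer.pad_token = tokenizer.eos_token
policy = AutoModelForCausalLMWithValueHead.from_pretrained(sft_name)
ref_model = AutoModelForCausalLMWithValueHead.from_pretrained(sft_name)  # frozen reference copy
reward_model = AutoModelForSequenceClassification.from_pretrained(rm_path, num_labels=1)

config = PPOConfig(model_name=sft_name, batch_size=1, mini_batch_size=1)
ppo_trainer = PPOTrainer(config, policy, ref_model, tokenizer)

query = tokenizer("Human: How do I boil an egg?\n\nAssistant:", return_tensors="pt").input_ids[0]
output = ppo_trainer.generate(query, max_new_tokens=48, pad_token_id=tokenizer.eos_token_id)
response = output.squeeze(0)[len(query):]  # keep only the generated continuation

# Score the full prompt+completion with the reward model; trl adds the KL term
# against the reference model when it computes the final PPO reward.
rm_inputs = tokenizer(tokenizer.decode(torch.cat([query, response])),
                      return_tensors="pt", truncation=True)
reward = reward_model(**rm_inputs).logits.squeeze().detach()
ppo_trainer.step([query], [response], [reward])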

Quickstart

  • Prepare a virtual environment (optional)
conda create -n easy-rlhf python=3.8
conda activate easy-rlhf
  • Clone and install requirements
git clone https://github.com/DaehanKim/EasyRLHF.git
cd EasyRLHF
pip install .
  • Unzip the hh-rlhf dataset and train a reward model with the rm_train command
cd data
find . -name '*.gz' -print0 | xargs -0 gzip -d
rm_train --devices "0,1,2,3" \
--output_dir "outputs/my-model" \
--train_data data/helpful-base/train.jsonl,data/helpful-online/train.jsonl,data/helpful-rejection-sampled/train.jsonl \
--valid_data data/helpful-base/test.jsonl,data/helpful-online/test.jsonl,data/helpful-rejection-sampled/test.jsonl
  • Alternatively, you can use scripts/rm_train.sh for more customized settings

Notes

  • The default model is gpt2-xl (1.5B) and the loss is binary cross entropy.
  • The Deepspeed config is in configs/ds_config.yaml, where you can set your preferred distributed settings. The default is ZeRO Stage 2 parallelism.
  • ToDo
    • basic reward model training
    • basic SFT model training
    • basic PPO model training

RRHF Overview

TBD

DPO Overview

TBD

IPO Overview

TBD


References

License

This project just binds together libraries and datasets from various sources, so it is covered by the license terms of those sources. The binding scripts themselves are licensed under MIT.
