# Reproducing results of the paper "Beyond Reverse KL: Generalizing Direct Preference Optimization with Diverse Divergence Constraints"
The paper compares different divergence functions for direct preference optimization (DPO).
Results notebook on nbviewer: results.ipynb
- Install poetry
- Then run:

```bash
git clone https://github.com/somvy/slic-hf && cd slic-hf
poetry install && poetry shell
wandb login
huggingface-cli login
```
- Specify your HuggingFace username and the desired SFT model in config.py (a sketch of what this might look like follows).
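A minimal sketch of the relevant config.py entries; the variable names here are hypothetical, so check config.py itself for the real ones:

```python
# config.py (illustrative sketch; actual variable names may differ)
HF_USERNAME = "your-hf-username"       # used when pushing datasets/models to the Hub
SFT_MODEL_NAME = "lvwerra/gpt2-imdb"   # a GPT-2 fine-tuned on IMDB is one natural choice
```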
Prompts are the first sentences of movie reviews. A few tricks were used to bias generations toward positive sentiment (see dataset/generation_config.py). Diverse beam search decoding with a diversity penalty of 50 generated 6 answers per prompt, which were then scored with a reward model. Pairs of (top1, top4/5/6) and (top1/2/3, top6) were used as chosen and rejected answers, giving 6 pairs per prompt. The final dataset contains 3600 pairs with a test split of 0.2.

50 prompts were also randomly selected for eval generation - hf link
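A minimal sketch of the generation-and-pairing step, assuming `lvwerra/gpt2-imdb` as the SFT model and `lvwerra/distilbert-imdb` as the reward model (both are assumptions); the actual logic lives in dataset/main.py:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

tokenizer = AutoTokenizer.from_pretrained("lvwerra/gpt2-imdb")
model = AutoModelForCausalLM.from_pretrained("lvwerra/gpt2-imdb")
reward = pipeline("text-classification", model="lvwerra/distilbert-imdb")

prompt = "This movie was"  # first sentence of a review
inputs = tokenizer(prompt, return_tensors="pt")

# Diverse beam search: 6 beams split into 6 groups with diversity penalty 50,
# producing 6 distinct answers per prompt
outputs = model.generate(
    **inputs,
    num_beams=6,
    num_beam_groups=6,
    num_return_sequences=6,
    diversity_penalty=50.0,
    max_new_tokens=60,
    do_sample=False,
)
answers = tokenizer.batch_decode(outputs, skip_special_tokens=True)

# Score with the reward model (signed by sentiment label) and rank best-to-worst
scores = [r["score"] if r["label"] == "POSITIVE" else -r["score"] for r in reward(answers)]
ranked = [a for _, a in sorted(zip(scores, answers), reverse=True)]

# 6 (chosen, rejected) pairs: (top1, top4/5/6) and (top1/2/3, top6);
# note (top1, top6) occurs in both sets, matching the 6-pair count above
pairs = [(ranked[0], ranked[i]) for i in (3, 4, 5)] + [(ranked[i], ranked[5]) for i in (0, 1, 2)]
```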
Use this dataset, or generate your own with:

```bash
set -a && source .env && poetry run python dataset/main.py
```

After generation, update the dataset paths in config.py.
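The contents of .env are not shown here; a plausible sketch, with variable names that are assumptions (both are standard names recognized by huggingface_hub and wandb):

```bash
# .env (illustrative; the repo may expect different variables)
HF_TOKEN=hf_xxx        # HuggingFace access token
WANDB_API_KEY=xxx      # Weights & Biases API key
```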
- Specify the training arguments, DPOTrainer params, and run_name in train_dpo/train.py (a sketch of such a setup follows the command below)
- Run:

```bash
set -a && source .env && poetry run python train_dpo/train.py
```
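A minimal sketch of a train.py setup, assuming a TRL DPOTrainer of this repo's era (model, ref_model, beta, loss_type); the dataset path, beta, and output_dir are placeholders, and the JS / forward-KL losses come from the repo's own trainer code rather than stock TRL:

```python
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments
from trl import DPOTrainer

model_name = "lvwerra/gpt2-imdb"  # assumed SFT checkpoint from config.py
model = AutoModelForCausalLM.from_pretrained(model_name)
ref_model = AutoModelForCausalLM.from_pretrained(model_name)  # frozen reference policy
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token

# Expects "prompt", "chosen", "rejected" columns; the path is a placeholder
train_dataset = load_dataset("your-hf-username/imdb-dpo-pairs", split="train")

args = TrainingArguments(
    output_dir="dpo-sigmoid",       # also serves as the wandb run_name here
    num_train_epochs=3,
    per_device_train_batch_size=4,
    learning_rate=1e-4,             # 1e-4 for sigmoid/hinge, 1e-5 for the other losses
    report_to="wandb",
)

trainer = DPOTrainer(
    model,
    ref_model,
    args=args,
    beta=0.1,                       # assumed value; check train.py
    loss_type="sigmoid",            # "hinge" is also built into TRL
    train_dataset=train_dataset,
    tokenizer=tokenizer,
)
trainer.train()
```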
- (Optional) Generate answers for the eval dataset. Specify the generation params and the desired run_name in train_dpo/generate.py, then run:

```bash
set -a && source .env && poetry run python train_dpo/generate.py
```
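A rough sketch of eval generation, assuming the 50 eval prompts and the trained weights were pushed to the Hub; every name below is a placeholder for values set in config.py and generate.py:

```python
from datasets import load_dataset
from transformers import pipeline

eval_prompts = load_dataset("your-hf-username/imdb-eval-prompts", split="train")
generator = pipeline("text-generation", model="your-hf-username/gpt2-imdb-dpo-sigmoid")

for row in eval_prompts:
    out = generator(row["prompt"], max_new_tokens=60, do_sample=True, top_p=0.9)
    print(out[0]["generated_text"])
```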
The base model is GPT-2 fine-tuned on IMDB reviews. Each run trained for 3 epochs with batch size 4, using lr 1e-4 for the sigmoid and hinge losses and 1e-5 for the others.
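For reference, with the implicit reward $\hat r(x,y) = \beta \log \frac{\pi_\theta(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)}$, the two stock pairwise losses over chosen $y_w$ and rejected $y_l$ are

$$
\mathcal{L}_{\text{sigmoid}} = -\log \sigma\big(\hat r(x, y_w) - \hat r(x, y_l)\big),
\qquad
\mathcal{L}_{\text{hinge}} = \max\big(0,\ 1 - \hat r(x, y_w) + \hat r(x, y_l)\big).
$$

As described in the paper, the generalization swaps the log-ratio (which corresponds to a reverse-KL constraint) for $f'\big(\pi_\theta(y \mid x) / \pi_{\mathrm{ref}}(y \mid x)\big)$ of the chosen $f$-divergence, which yields the JS and forward-KL variants reported below.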
| Loss | Weights | Wandb Report |
|---|---|---|
| Hinge | link | |
| | link | |
| | link | |
| | link | |
| | link | |
| Sigmoid | link | |
| | link | |
| | link | |
| | link | |
| | link | |
| JS divergence | link | |
| | link | |
| | link | |
| Forward KL | link | |
| | link | |
| | link | |
| | link | |
| | link | |
| | link | |
| | link | |
| | link | |
| | link | |
| | link | |