[CODE IMPROVEMENT] Deprecate RLHF with PPO in favor of DPO #545

pascal-pfeiffer · 2023-12-20T10:19:00Z

🔧 Proposed code refactoring

Deprecate RLHF with PPO in favor of DPO.

Motivation

RLHF with PPO involves an extra step to train a reward model (see #175).
For good results, the reward model needs to be of very high quality and tuned on a large amount of data. While OpenAI and Meta have successfully used PPO in their pipelines, the open-source community has struggled to get good results, probably due to the lack of good training data for the reward model.
DPO (#530) is a technique that allows training with human feedback in a more stable way, on pairs of good and bad samples and without an additional reward model. It is also much quicker to train because the generation steps are skipped.
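
For illustration, here is a minimal sketch of the per-pair DPO objective as it is commonly written, not the implementation from #530. The function name `dpo_loss`, its parameter names, and the default `beta=0.1` are assumptions made for this example. It takes summed log-probabilities of the good ("chosen") and bad ("rejected") answers under the policy and under a frozen reference model, so no reward model and no generation step are involved.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Per-pair DPO loss (hypothetical helper, for illustration only).

    Each argument is the summed log-probability of a full answer under
    either the policy being trained or the frozen reference model.
    """
    # How much more (or less) likely each answer is under the policy
    # compared with the reference model.
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp

    # Scaled preference margin: positive when the policy favors the
    # chosen answer more strongly than the reference model does.
    margin = beta * (chosen_logratio - rejected_logratio)

    # -log(sigmoid(margin)), written as softplus(-margin) for stability.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


# Usage: the loss falls as the policy shifts probability toward the chosen answer.
neutral = dpo_loss(-2.0, -2.0, -2.0, -2.0)
improved = dpo_loss(-1.0, -3.0, -2.0, -2.0)
print(improved < neutral)
```

When the policy and the reference model agree, the margin is zero and the loss equals `log(2)`. It drops below that as the policy favors the chosen answer and rises above it when the rejected answer is favored.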

pascal-pfeiffer added the area/core Core code related issue label Dec 20, 2023

pascal-pfeiffer self-assigned this Dec 20, 2023

pascal-pfeiffer mentioned this issue Jan 30, 2024

deprecate RLHF #592

Merged

