Add Conservative DPO, IPO, and KTO #78
base: main
Conversation
assert self.preference_loss in ["dpo", "ipo", "kto"]
if self.preference_loss == "dpo":
    loss = (
        -torch.nn.functional.logsigmoid(self.ref_policy_kl_penalty * logits) * (1.0 - self.label_smoothing)
nit: do you have to compute self.ref_policy_kl_penalty * logits twice in this loss function?
We don't need to compute that twice. I changed the code.
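For reference, a minimal standalone sketch of the single-computation version; beta corresponds to self.ref_policy_kl_penalty in the snippet, and the second, label-smoothed term is the usual conservative-DPO counterpart (assumed here, since the excerpt above cuts off after the first term):

import torch
import torch.nn.functional as F

def cdpo_loss(logits: torch.Tensor, beta: float, label_smoothing: float = 0.0) -> torch.Tensor:
    # Conservative DPO; reduces to vanilla DPO when label_smoothing == 0.
    scaled_logits = beta * logits  # computed once and reused for both terms
    return (
        -F.logsigmoid(scaled_logits) * (1.0 - label_smoothing)
        - F.logsigmoid(-scaled_logits) * label_smoothing
    )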
elif self.preference_loss == "kto":
    rewards_kl = self.get_reduced_masked_logps(pi_logprobs - ref_logprobs, labels, average_log_probs=True)
    chosen_kl, reject_kl = self.split_output_tensor(rewards_kl)
    loss = torch.cat(
another nit: I'm not sure what the performance impact is, but this creates a tensor, fills it with these two terms, and then takes a mean. Why not call .sum() on each of the two tensors and divide ourselves?
I think it also makes it more readable that way.
Changed it!
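For reference, a minimal sketch of the suggested restructuring; chosen_losses and reject_losses are illustrative names for the two per-example tensors that were previously concatenated:

import torch

def combine_pair_losses(chosen_losses: torch.Tensor, reject_losses: torch.Tensor) -> torch.Tensor:
    # Equivalent to torch.cat((chosen_losses, reject_losses)).mean(), but avoids
    # materializing the intermediate concatenated tensor.
    total = chosen_losses.sum() + reject_losses.sum()
    count = chosen_losses.numel() + reject_losses.numel()
    return total / count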
loss = -torch.nn.functional.logsigmoid(chosen_rewards - reject_rewards)
chosen_rewards, reject_rewards = self.split_output_tensor(rewards)
logits = chosen_rewards - reject_rewards
assert self.preference_loss in ["dpo", "ipo", "kto"]
can you put the implementations of these 3 loss functions in separate functions, and then put those functions in a dictionary?
something like:
PREFERENCE_LOSS_FUNCTIONS = {
"dpo": dpo_loss_function,
...
}
and then get it in the model __init__?
I put this dict outside the class. Alternatively, we can also make it a part of the MegatronGPTDPOModel class.
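To illustrate, a hedged sketch of the dispatch pattern; the helper names and signatures are hypothetical and may differ from what the PR actually uses:

import torch
import torch.nn.functional as F

def dpo_loss(logits, beta, label_smoothing=0.0):
    # Covers conservative DPO; label_smoothing == 0 gives vanilla DPO.
    return (
        -F.logsigmoid(beta * logits) * (1.0 - label_smoothing)
        - F.logsigmoid(-beta * logits) * label_smoothing
    )

def ipo_loss(logits, beta, **_):
    # IPO regresses the log-ratio margin towards 1 / (2 * beta).
    return torch.square(logits - 1.0 / (2.0 * beta))

PREFERENCE_LOSS_FUNCTIONS = {
    "dpo": dpo_loss,
    "ipo": ipo_loss,
    # "kto": kto_pair_loss,  # needs the extra per-side KL terms, so a wider signature
}

# Looked up once in the model __init__ (hypothetical attribute name):
# self.preference_loss_fn = PREFERENCE_LOSS_FUNCTIONS[self.preference_loss]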
with this addition, we should also rename dpo -> something else to make it clear we have the option to do other things. currently I'm thinking of calling this gpt_preference_optimization. @odelalleau do you have any thoughts on a better name?
I'm not sure we should rename it. I feel like it may cause more trouble and confusion than keeping the DPO name and documenting it as "DPO and related algorithms / variants" => I wouldn't rush a renaming right now given the limited scope of the changes in this PR (and this is not a criticism: it's great that it's limited!)

Btw apologies but my full review may come late => feel free to merge without me. I want to spend some time actually looking at these papers, but this won't be possible until next week. At first glance the code structure looks good to me!

EDIT: will also need a CHANGELOG entry
thanks for including kto in this @ertkonuk! this kto implementation is the kto-paired version in huggingface that assumes access to paired preference data. the more powerful (and standard) version of kto can work with purely binary data (+1/-1, good/bad), supports extreme data imbalances (e.g., 5% positive examples and 95% negative examples), and has some minor changes to make training more stable.
i think NeMo users would find it more useful to have the latter version of KTO, since it would allow them to align with a much more abundant kind of feedback.
Hi @kawine, thanks for your feedback and for providing the reference implementations. I agree that supporting the unpaired version would be more advantageous, and our plan is to eventually have that in NeMo Aligner. I'll begin making the necessary changes to implement the standard version of KTO very soon.
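For context, a rough standalone sketch of the unpaired objective being discussed, paraphrased from the KTO paper; the reference-point (KL) estimate and the weighting scheme below are simplified assumptions, not this PR's code:

import torch

def kto_unpaired_loss(policy_logps, ref_logps, is_desirable, beta=0.1,
                      desirable_weight=1.0, undesirable_weight=1.0):
    # Works on independent good/bad examples instead of chosen/rejected pairs;
    # the two weights let heavily imbalanced datasets be rebalanced.
    log_ratio = policy_logps - ref_logps  # per-example log pi/pi_ref
    # Shared reference point; the paper estimates this KL term from mismatched
    # completions and does not backpropagate through it -- simplified here.
    kl = log_ratio.mean().clamp(min=0).detach()
    desirable = desirable_weight * (1.0 - torch.sigmoid(beta * (log_ratio - kl)))
    undesirable = undesirable_weight * (1.0 - torch.sigmoid(beta * (kl - log_ratio)))
    return torch.where(is_desirable, desirable, undesirable).mean()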
It looks like there are large discrepancies between fp32 and bf16 runs at the moment. Looking into it.
What does this PR do?
Adds Conservative DPO (CDPO), IPO, and KTO methods
Changelog
Usage
# Add a code snippet demonstrating how to use this
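A hypothetical usage sketch only; the config keys below are guesses based on this PR's diff (a preference_loss selector and label_smoothing on the DPO model config), not verified documentation:

from omegaconf import OmegaConf

# Overlay onto the existing DPO training config (key names are assumptions).
overrides = OmegaConf.create(
    {
        "dpo": {
            "preference_loss": "ipo",  # one of "dpo", "ipo", "kto"
            "label_smoothing": 0.1,    # > 0 turns "dpo" into conservative DPO
        }
    }
)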
Before your PR is "Ready for review"
Pre checks:
Checklist when contributing a new algorithm
max_steps=-1 and validation?
Additional Information