
fixing issue 46 #136

Closed
wants to merge 3 commits into from

Conversation

guyknvda
Collaborator

What does this PR do?

Apply sampling params to the logprobs of the response tokens (see issue #46)

The sampling params are applied by default. To stay consistent with the response generation process (done in text_generation_utils.py), the following parameters are taken into account (a rough sketch of the transformation follows the notes below):

  • temperature
  • top_p
  • top_k

Note that:

  1. If use_greedy is set to True (the default), generation does not change the logits, so the original logits are used to compute the log prob and the other sampling params (top_p, temperature, and top_k) are ignored.
  2. repetition_penalty is currently not taken into account, because it is also not applied during generation (a potential issue: it is only applied when compute_logprob is True).
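For reference, a minimal sketch of this transformation (not the PR's actual code; the helper name, the defaults, and the [batch, seq, vocab] logits shape are assumptions for illustration):

import torch
import torch.nn.functional as F

def apply_sampling_params_to_logits(logits, temperature=1.0, top_k=0, top_p=1.0):
    # Hypothetical helper: transform the logits the same way sampling does, so
    # that the log probs match the distribution the tokens were drawn from.
    if temperature != 1.0:
        logits = logits / temperature
    if top_k > 0:
        # Keep only the k largest logits per position.
        kth_largest = torch.topk(logits, top_k, dim=-1).values[..., -1, None]
        logits = logits.masked_fill(logits < kth_largest, float("-inf"))
    if 0.0 < top_p < 1.0:
        # Nucleus filtering: keep the smallest set of tokens whose cumulative
        # probability exceeds top_p.
        sorted_logits, sorted_idx = torch.sort(logits, descending=True, dim=-1)
        cum_probs = torch.cumsum(F.softmax(sorted_logits, dim=-1), dim=-1)
        remove = cum_probs > top_p
        remove[..., 1:] = remove[..., :-1].clone()  # shift so the top token is always kept
        remove[..., 0] = False
        remove = remove.scatter(-1, sorted_idx, remove)
        logits = logits.masked_fill(remove, float("-inf"))
    return logits

# Example: log probs of the generated tokens under the transformed distribution.
# logprobs = torch.log_softmax(
#     apply_sampling_params_to_logits(logits, temperature=0.7, top_k=50, top_p=0.9), dim=-1
# ).gather(-1, response_tokens.unsqueeze(-1)).squeeze(-1)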

Additional Information

Signed-off-by: gkoren <gkoren@nvidia.com>

@odelalleau left a comment


Sincere apologies for the late review!

I really appreciate you tackling this issue, which is not trivial. Unfortunately it's more complex than that, for two reasons:

  1. We need to handle tensor parallelism, which means that modifying the logits probably needs to be done within DistributedLogprob. It's likely going to be a bit tricky to implement, but it should be doable, at the expense of a few more steps to handle top_k / top_p. Note that it may be more efficient (and less memory intensive) to gather only the top_k logits from each rank (see the sketch after this list).
  2. We also need to modify the logits used in the loss here
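To illustrate the gathering idea from point 1, here is a rough sketch of collecting only the per-rank top_k candidates instead of all-gathering the full vocab dimension. It is not DistributedLogprob itself; the names (vocab_start_index, tp_group) are assumptions, and it assumes top_k is no larger than each rank's vocab shard.

import torch
import torch.distributed as dist

def gather_topk_logits_across_tp(local_logits, k, vocab_start_index, tp_group=None):
    # Each tensor-parallel rank holds a shard of the vocab dimension, so the
    # global top-k is contained in the union of the per-rank top-k candidates.
    local_vals, local_idx = torch.topk(local_logits, k, dim=-1)
    local_idx = local_idx + vocab_start_index  # shard-local -> global vocab ids

    world_size = dist.get_world_size(group=tp_group)
    gathered_vals = [torch.empty_like(local_vals) for _ in range(world_size)]
    gathered_idx = [torch.empty_like(local_idx) for _ in range(world_size)]
    dist.all_gather(gathered_vals, local_vals, group=tp_group)
    dist.all_gather(gathered_idx, local_idx, group=tp_group)

    all_vals = torch.cat(gathered_vals, dim=-1)  # [..., k * world_size] candidates
    all_idx = torch.cat(gathered_idx, dim=-1)
    top_vals, pos = torch.topk(all_vals, k, dim=-1)
    top_idx = torch.gather(all_idx, -1, pos)
    return top_vals, top_idx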

I also think we should add flags to control where exactly these transformations are applied. I'm actually not sure it's a good idea to apply them when computing the KL penalty term, because:

  • If we apply it to the reference policy, it may lead to infinite KL due to top_p / top_k (when we sample a token that has zero probability under the reference policy)
  • If we don't apply it to the reference policy, then we may start with a high KL penalty from the start, which could cause some issues.

I would thus suggest adding some fine-grained control over where we apply this transformation, with the following default values:

model:
  ppo:
    transform_logits_from_sampling_params:
      loss: True
      kl_penalty_actor: False
      kl_penalty_ref: ${.kl_penalty_actor}

This way we will be able to easily experiment with various configurations to see what actually works best in practice.
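As a rough sketch of how these flags might be wired (the config keys mirror the YAML above; the function, its arguments, and apply_sampling_params_to_logits from the PR description are hypothetical, not existing NeMo-Aligner code):

import torch

def logprobs_for_usage(usage, logits, tokens, cfg, sampling_params):
    # usage is one of "loss", "kl_penalty_actor", "kl_penalty_ref".
    flags = cfg["model"]["ppo"]["transform_logits_from_sampling_params"]
    if flags.get(usage, False):
        logits = apply_sampling_params_to_logits(logits, **sampling_params)
    logprobs = torch.log_softmax(logits, dim=-1)
    return logprobs.gather(-1, tokens.unsqueeze(-1)).squeeze(-1)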

# apply the sampling params to the logits - focusing only on the generated tokens.
context_length = context_lengths.min().item()
resp_logits = logits[:, context_length - 1 :].contiguous()
if not samparams.get("use_greedy", False): # if use_greedy is True, use the logits as is


Minor suggestion: move this up two lines and write it

if samparams.get("use_greedy", False):
    return logits

which will avoid a couple of useless ops & extra indent
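Applied to the quoted hunk above, the refactor would look roughly like this (a sketch only, not verified against the full function):

if samparams.get("use_greedy", False):  # greedy generation leaves the logits untouched
    return logits
# apply the sampling params to the logits - focusing only on the generated tokens.
context_length = context_lengths.min().item()
resp_logits = logits[:, context_length - 1 :].contiguous()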


Also: I think we should add a check to skip the temperature scaling if temp == 1, and skip the top_p / top_k filtering if they are equal to 1.0 (or 0.0) / 1, respectively. This way we don't mess with the logits for no good reason.

@guyknvda
Collaborator Author

Replaced by new PR #186.

@guyknvda closed this May 28, 2024