
feat: add soft distillation #736

Open
wants to merge 4 commits into main

Conversation

mattmazzola
Contributor

⚠️ This PR is not intended to be merged directly. Its purpose is to share features that may be useful for Metaseq ⚠️

Background

One of the main goals of our project's fork was to implement "soft" distillation (training on a set of log probabilities rather than on the correctness of a single token class) and to measure the efficacy of this technique compared to normal finetuning.

From our docs:

The motivation for training on log probabilities rather than token classes is to pass as much knowledge from the teacher to the student as possible. [... By the teacher providing] log probabilities of other tokens in the vocabulary [we expect] the student [to] better learn to represent the teacher’s knowledge.

Issue

  • Soft Distillation was not implemented

Solution

  • Add new pipeline task streaming_distillation_language_modeling
    • Add new criterion vocab_parallel_soft_cross_entropy (Note: Soft)
      • Considers multiple possible predictions for each token of the target sequence (a minimal sketch of the idea follows this list)
    • Add new parameters
      --task streaming_distillation_language_modeling
      --distillation-mode logprobs_distillation
      --criterion vocab_parallel_soft_cross_entropy
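
To illustrate what the soft criterion computes, here is a minimal single-GPU sketch of the idea (tensor names are illustrative, not the criterion's actual signature; the real implementation is the Megatron-style vocab-parallel version in this diff):

```python
# Minimal single-GPU sketch of "soft" cross entropy against teacher top-k logprobs.
# Tensor names are illustrative; the criterion in this PR shards the vocab dimension
# across tensor-parallel ranks instead of holding the full vocab on one GPU.
import torch


def soft_cross_entropy(student_logits, teacher_top_tokens, teacher_top_logprobs):
    """
    student_logits:       [batch, seq, vocab]  raw logits from the student
    teacher_top_tokens:   [batch, seq, k]      token ids of the teacher's top-k predictions
    teacher_top_logprobs: [batch, seq, k]      teacher log probabilities for those tokens
    """
    # log(sum(exp(logits))) over the full vocabulary, per position
    log_z = torch.logsumexp(student_logits, dim=-1, keepdim=True)     # [batch, seq, 1]
    # student logits at the teacher's top-k token ids
    picked = student_logits.gather(dim=-1, index=teacher_top_tokens)  # [batch, seq, k]
    # weight each candidate token by the teacher's probability, exp(logprob);
    # the top-k weights need not sum to 1, mirroring the forward pass in the diff
    weights = teacher_top_logprobs.exp()
    # per-position loss: sum_k p_teacher(k) * (log_z - logit_k)
    return ((log_z - picked) * weights).sum(dim=-1)                   # [batch, seq]
```

The vocab-parallel criterion in the diff computes the same quantity, but the gathered logits and the softmax normalizer are summed across tensor-parallel ranks with all_reduce.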

Testing

Did not test

Related to #726

This feature was implemented by @anselmwang and @clarissesimoes


@clarissesimoes left a comment


I think one of the most valuable things we've added is the documentation: both parallel vocabulary cross entropy and soft cross entropy can be hard to understand because they apply simplifications to the formulas, all of which are explained in the docs. Is there a way to bring the MD files we've created into this diff?

@@ -59,10 +60,10 @@ def log_weight_stats(tensor, name):
)


class ModelParallelTransformerDecoder(BaseDecoder):


I don't recall changing this class in our implementation, but this might have been changed by Yu or Sahaj. It would be a good idea to include Sahaj in this review too.

Contributor Author


I don't think we can modify reviewers since we are not contributors/admins on the repo.
However, I think we can mention them here: @anselmwang @sahajgg

Comment on lines +543 to +548
# Gather output if model is in inference mode (i.e. evallm or generation) cause both are not yet compatible with
# parallel vocab embeddings
criterion = getattr(self.args, "criterion")
is_parallel_criterion = criterion.find("vocab_parallel") != -1
if not is_parallel_criterion or getattr(self, "inference", False):
    x = gather_from_tensor_model_parallel_region(x).contiguous()


This is the only change in this class that I can confirm I made. Please confirm whether the others should be double-checked with Sahaj and Yu. My concern is that, in the rest of the diff, we might be reverting important changes from the original code.

Comment on lines +40 to +66
target_mask = (target_tokens < vocab_start_index) | (target_tokens >= vocab_end_index)
masked_target = target_tokens.clone() - vocab_start_index
masked_target[target_mask] = 0

# Get predicted-logits = logits[top_logprobs].
predicted_logits = vocab_parallel_logits.gather(dim=-1, index=masked_target)
predicted_logits[target_mask] = 0.0
# All reduce is needed to get the chunks from other GPUs.
torch.distributed.all_reduce(
    predicted_logits, op=torch.distributed.ReduceOp.SUM, group=get_tensor_model_parallel_group()
)

# Sum of exponential of logits along vocab dimension across all GPUs.
exp_logits = vocab_parallel_logits
torch.exp(vocab_parallel_logits, out=exp_logits)
sum_exp_logits = exp_logits.sum(dim=-1)
torch.distributed.all_reduce(
    sum_exp_logits, op=torch.distributed.ReduceOp.SUM, group=get_tensor_model_parallel_group()
)

# Loss = log(sum(exp(logits))) - predicted-logit.
target_weights = target_predictions.exp()
loss = ((torch.log(sum_exp_logits).unsqueeze(dim=-1) - predicted_logits) * target_weights).sum(-1)

# Store softmax, top_logprobs-mask and masked-top_logprobs for backward pass.
softmax = exp_logits.div(sum_exp_logits.unsqueeze(dim=-1))
ctx.save_for_backward(softmax, target_mask, masked_target, target_weights)
Contributor Author


If I remember correctly, the majority of the work was in this section of code. @clarissesimoes, can you confirm?
If not, can you leave a comment calling out other notable places in the code that reviewers should pay closer attention to?

Contributor Author


Also, there is a documentation file included in the PR to help explain this code.
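
For quick reference, this is my reading of what the forward pass above computes per target position (my own notation, not a quote from the docs):

$$
\mathcal{L} = \sum_{k=1}^{K} e^{\ell^{T}_{k}} \Big( \log \sum_{v} e^{z_v} - z_{t_k} \Big)
            = -\sum_{k=1}^{K} p^{T}_{k} \, \log \mathrm{softmax}(z)_{t_k}
$$

where $z$ are the student's (vocab-parallel) logits, $t_k$ the teacher's top-$K$ token ids, and $\ell^{T}_{k} = \log p^{T}_{k}$ the teacher's log probabilities for them; in other words, cross entropy against the teacher's truncated top-$K$ distribution.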


I confirm, and I'd also mention the file metaseq/tasks/streaming_distillation_language_modeling.py as equally important, since data preprocessing and masks are slightly different for distillation compared to finetuning.

has_megatron_submodule = False


class _VocabParallelMSELoss(torch.autograd.Function):
Contributor Author


From what I remember, we implemented this thinking it would work for both logits AND logprobs, but it only works for logits. Because we could only get logprobs from the OpenAI model output and couldn't convert them to logits, this loss effectively became unused.

However, we left it in because the implementation may be valuable for other applications where logit values are available.


We've tested MSE loss with logprobs, but training never converged. MSE loss can be used if the input is teacher logits, though.
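
For context, here is one plausible single-GPU shape of what this loss targets (a sketch under the assumption that it compares student and teacher logits at the teacher-provided token ids; not a quote of _VocabParallelMSELoss, which shards the vocab across GPUs):

```python
# Sketch only: assumes the loss compares student logits with teacher logits at the
# teacher-provided token ids; tensor names are illustrative.
import torch
import torch.nn.functional as F


def logit_mse(student_logits, teacher_top_tokens, teacher_top_logits):
    # student logits at the teacher's top-k token ids
    picked = student_logits.gather(dim=-1, index=teacher_top_tokens)
    # plain MSE against the teacher's raw logit values; this is why the loss needs
    # logits rather than logprobs from the teacher
    return F.mse_loss(picked, teacher_top_logits)
```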
