Compute grad norm #897
base: main
Conversation
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/torchtune/897. Note: links to docs will display an error until the docs builds have completed. This comment was automatically generated by Dr. CI and updates every 15 minutes.
We can also use
We should also add this parameter to the recipe YAMLs; the default should be 1.0, as in HF and axolotl.
I need to add `max_norm: 1.0` to the recipes. Do you have any trick to do this automatically?
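For concreteness, the field being discussed would look something like this in each recipe YAML (the name and default come from the comments above; its exact placement within torchtune's configs is an assumption):

```yaml
# Gradient clipping threshold; 1.0 matches the HF / axolotl default.
max_norm: 1.0
```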
I am still on the fence about adding grad norm as a field that we log 100% of the time. Mainly I want to make the following separation:
- fields that must be logged every time (i.e. they are fundamental properties of training that most people will need to know almost every time), vs
- fields that are useful but in a more limited set of cases.
In my mind grad norm sits somewhere on the boundary between these two. But I want to make sure we are super clear about the precedent we set here. My concern is that we wind up adding a bunch of stats from category (2) into our core recipe, bloat the training loop, and reduce the ease with which folks can modify copy-pasted versions of the recipes.
Anyways, as I said, I am on the fence here, so I will definitely not block this change on this point alone. The requested changes are more a function of the clip_grad_norm usage. Also cc @joecummings and @kartikayk for any thoughts on logging here.
@@ -476,6 +476,9 @@ def train(self) -> None:
    loss = loss / self._gradient_accumulation_steps
    running_loss += loss
    loss.backward()
    grad_norm = nn.utils.clip_grad_norm_(
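The "pass inf" trick raised elsewhere in this thread can be sketched as a minimal standalone example (the toy model here is illustrative, not the PR's actual diff): with `max_norm=float("inf")` the clip coefficient is clamped to 1.0, so `clip_grad_norm_` leaves the gradients untouched and the call reduces to computing and returning the total norm.

```python
import torch
from torch import nn

# Illustrative toy model, not torchtune recipe code.
model = nn.Linear(4, 2)
loss = model(torch.randn(8, 4)).sum()
loss.backward()

grads_before = [p.grad.clone() for p in model.parameters()]

# With an infinite threshold nothing can exceed max_norm, so this call
# only computes and returns the total gradient norm for logging.
grad_norm = nn.utils.clip_grad_norm_(model.parameters(), max_norm=float("inf"))
# Gradients are numerically unchanged (scaled by exactly 1.0).
```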
Sorry maybe I am misunderstanding the intent of this change, but doesn't clip_grad_norm rescale the gradients? I don't think we want to do that here, right?
Where would you do this? In my understanding, you want to clip after computing the gradients (i.e. after backward) and before the optimizer step.
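The ordering described in this comment can be sketched as follows (the toy model and optimizer are illustrative assumptions, not recipe code):

```python
import torch
from torch import nn

model = nn.Linear(4, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

loss = model(torch.randn(8, 4)).sum()
loss.backward()  # 1. compute gradients

# 2. clip between backward() and step(), so the optimizer sees the
#    rescaled gradients; max_norm=1.0 mirrors the HF / axolotl default.
grad_norm = nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)

optimizer.step()       # 3. the update uses the (possibly) clipped grads
optimizer.zero_grad()
```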
@tcapelle see my other comment. Unless I'm misunderstanding, I don't think we want to use clip_grad_norm here after all. Regardless, you raise a good point: I've also found it a bit annoying to manually modify/add a field in all our configs. I think we should add a tool for this at some point (but to actually answer your question, no such tool exists as of yet).
We can use
I feel that we should have grad clipping enabled by default; the idea is to ship a good finetune recipe out of the box. grad_norm is also a good debugging tool, and this is a good example of its utility: it lets you inspect and analyze the gradients before the optimizer step, so loss spikes can be avoided.
While adding support for gradient clipping is a nice feature to have, I don't think we should conflate it with what's being proposed here, which is a logging change. I definitely do not think we should enable gradient clipping by default without testing the implications of such a change on our various recipes. As I mentioned above, I do see the value in logging grad norm, and evidently clip_grad_norm is a reasonable way to do it (provided that we pass inf, as suggested by @musabgultekin). However, there is a cost to this change: we are calling a method clearly intended to clip gradients and using it in a non-obvious way for logging in the recipe. In my opinion, one of the top considerations for our recipes is that they are easily understandable, and I think such a change harms that a bit. So if the efficiency of the implementations is roughly equivalent, I'd actually prefer a separate utility (assuming we are not adding proper gradient clipping support, which again, should be addressed separately and a bit more carefully imo).
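A separate utility of the kind suggested here might look like the sketch below (the function name is hypothetical, not an existing torchtune API). It computes the same total norm that `clip_grad_norm_` would report, but never rescales anything, which keeps the logging intent explicit in the recipe:

```python
import torch

def total_grad_norm(parameters, norm_type: float = 2.0) -> torch.Tensor:
    # Hypothetical helper: the p-norm of the per-parameter gradient
    # norms, computed without touching the gradients themselves.
    grads = [p.grad.detach() for p in parameters if p.grad is not None]
    if not grads:
        return torch.tensor(0.0)
    return torch.linalg.vector_norm(
        torch.stack([torch.linalg.vector_norm(g, norm_type) for g in grads]),
        norm_type,
    )
```

The recipe would then log `total_grad_norm(model.parameters())` without any risk of accidentally modifying the gradients.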
Agree with everything above. I think we should wait and test if
I can pull data on W&B runs and check when it is being changed to something other than 1.0 in the integrations we already have.
Added the grad norm function that was originally introduced in the improved logging experience.