Compute grad norm #897

Open · tcapelle wants to merge 5 commits into main

Conversation

tcapelle (Contributor)

Added the grad norm function that was originally added as part of the improved logging experience.

  • It may be moved somewhere else, but I think it's a really relevant metric to have when training.

musabgultekin (Contributor)

We can also use torch.nn.utils.clip_grad_norm_ instead of manually calculating the norms:
https://pytorch.org/docs/stable/generated/torch.nn.utils.clip_grad_norm_.html
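
For reference, a minimal sketch of that suggestion (the toy model and the max_norm value here are purely illustrative, not from this PR): clip_grad_norm_ clips in place and returns the total gradient norm, which is the value that would get logged.

```python
import torch
import torch.nn as nn

# Toy model and backward pass, for illustration only.
model = nn.Linear(16, 4)
model(torch.randn(8, 16)).sum().backward()

# clip_grad_norm_ rescales gradients in place when they exceed max_norm,
# and returns the total norm (computed before any clipping), so the
# return value can be logged directly.
total_norm = nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
print(f"grad norm: {total_norm.item():.4f}")
```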

tcapelle (Contributor Author)

[image attached in the original comment]

tcapelle (Contributor Author)

We should also add this parameter to the recipe YAMLs; the default should be 1.0, as in HF and axolotl.

tcapelle (Contributor Author)

I need to add the:

max_norm: 1.0

to the recipes; do you have any trick to do this automatically?

ebsmothers (Contributor) left a comment

I am still on the fence about adding grad norm as a field that we log 100% of the time. Mainly I want to make the following separation:

  1. fields that must be logged every time (i.e. they are fundamental properties of training that most people will need to know almost every time), vs
  2. fields that are useful but in a more limited set of cases.

In my mind grad norm is kinda on the boundary between these two. But I wanna make sure we are super clear about the precedent we set here. My concern is that we wind up adding a bunch of stats from category (2) into our core recipe, bloat the training loop, and reduce the ease with which folks can modify copy-pasted versions of the recipes.

Anyways, as I said, I am on the fence here, so will def not block this change based on this point alone. The changes requested are more a function of the clip_grad_norm usage. Also cc @joecummings and @kartikayk for any thoughts on logging here.

    @@ -476,6 +476,9 @@ def train(self) -> None:
        loss = loss / self._gradient_accumulation_steps
        running_loss += loss
        loss.backward()
        grad_norm = nn.utils.clip_grad_norm_(

ebsmothers (Contributor)

Sorry, maybe I am misunderstanding the intent of this change, but doesn't clip_grad_norm rescale the gradients? I don't think we want to do that here, right?

tcapelle (Contributor Author)

Where would you do this? In my understanding, you want to clip after computing the gradients (backward) and before doing the optimizer step.
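
For context, a minimal sketch of that ordering, i.e. clipping between backward() and the optimizer step. The model, optimizer, and max_norm below are placeholders, not the recipe's actual setup.

```python
import torch
import torch.nn as nn

model = nn.Linear(16, 4)  # placeholder model
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

for _ in range(3):  # placeholder training loop
    optimizer.zero_grad()
    loss = model(torch.randn(8, 16)).sum()
    loss.backward()                              # 1. compute gradients
    grad_norm = nn.utils.clip_grad_norm_(        # 2. clip (and read the total norm)
        model.parameters(), max_norm=1.0
    )
    optimizer.step()                             # 3. step with the clipped gradients
```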

ebsmothers (Contributor)

> I need to add the:
>
> max_norm: 1.0
>
> to the recipes; do you have any trick to do this automatically?

@tcapelle see my other comment. Unless I'm misunderstanding, I don't think we want to use clip_grad_norm here after all. Regardless, you raise a good point. I've also found it a bit annoying to manually modify/add a field in all our configs. I think we should try to add a tool for this at some point (but to actually answer your question, no such tool exists as of yet).
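
In the absence of such a tool, here is a rough sketch of a one-off script that could add the field to every recipe config. The directory, glob pattern, and key name are assumptions about the repo layout, it rewrites files in place, and PyYAML round-tripping drops comments, so treat it only as a starting point.

```python
from pathlib import Path

import yaml  # PyYAML

# Hypothetical one-off script; "recipes/configs" and "max_norm" are assumptions.
for cfg_path in Path("recipes/configs").rglob("*.yaml"):
    cfg = yaml.safe_load(cfg_path.read_text()) or {}
    if "max_norm" not in cfg:
        cfg["max_norm"] = 1.0
        # Caveat: safe_dump drops comments and may reorder keys; a
        # comment-preserving library (e.g. ruamel.yaml) would be safer.
        cfg_path.write_text(yaml.safe_dump(cfg, sort_keys=False))
```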

musabgultekin (Contributor)

We could use float('inf') instead of 1, so it doesn't clip.
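
A quick sketch of that idea (placeholder model, for illustration only): with max_norm=float('inf') the call still returns the total norm but never rescales the gradients.

```python
import torch
import torch.nn as nn

model = nn.Linear(16, 4)  # placeholder model
model(torch.randn(8, 16)).sum().backward()

before = [p.grad.clone() for p in model.parameters()]
total_norm = nn.utils.clip_grad_norm_(model.parameters(), max_norm=float("inf"))

# With an infinite threshold the clip coefficient clamps to 1.0, so the
# gradients are left as-is while the total norm is still returned.
assert all(torch.allclose(b, p.grad) for b, p in zip(before, model.parameters()))
print(f"grad norm: {total_norm.item():.4f}")
```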

tcapelle (Contributor Author)

I feel that we should have grad clipping enabled by default. The idea is to give a good finetune recipe out of the box, and grad_norm is also a good debugging tool. This is a good example of the utility of grad norm: it lets you debug and even analyse the gradients before doing an optimizer step, so we can avoid loss spikes.

ebsmothers (Contributor)

While adding support for gradient clipping as a feature is nice to have, I don’t think we should conflate it with what’s being proposed here, which is a logging change. I definitely do not think we should enable gradient clipping by default without testing the implication of such a change on our various recipes.

As I mentioned above, I do see the value in logging grad norm. And evidently clip_grad_norm is a reasonable way to do it (provided that we pass inf as suggested by @musabgultekin). However, there is a cost to this change: we are calling a method clearly intended to clip gradients and using it in a non-obvious way for logging in the recipe. In my opinion, one of the top considerations for our recipes is that they are easily understandable, and I think such a change harms that a bit. So if the efficiency of the implementations is roughly equivalent, I'd actually prefer a separate utility (assuming we are not adding proper gradient clipping support, which again, should be addressed separately and a bit more carefully imo).
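
If a separate utility were the direction taken, it could be as small as the following sketch. The name get_grad_norm and its placement are hypothetical, not an existing torchtune API; it mirrors the norm that clip_grad_norm_ reports but never rescales anything.

```python
from typing import Iterable

import torch


def get_grad_norm(
    parameters: Iterable[torch.nn.Parameter], norm_type: float = 2.0
) -> torch.Tensor:
    """Total gradient norm across parameters, computed for logging only.

    Hypothetical helper: computes the same total norm that
    torch.nn.utils.clip_grad_norm_ returns, without modifying any gradient.
    """
    grads = [p.grad.detach() for p in parameters if p.grad is not None]
    if not grads:
        return torch.tensor(0.0)
    return torch.linalg.vector_norm(
        torch.stack([torch.linalg.vector_norm(g, norm_type) for g in grads]),
        norm_type,
    )
```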

tcapelle (Contributor Author) commented Apr 30, 2024

Agree with everything above. I think we should wait and test whether max_grad_norm should be used in the recipes by default or not. I can change it now to float('inf') so it does not impact the tests.
What I can say is that I have always seen this parameter being used for LLM finetuning. Some examples:

  • HF Trainer has a default of 1.0
  • axolotl has a default of 1.0
  • the Mistral reference finetuning script defaults to 1.0
  • lit-gpt defaults to 1.0
  • nanoGPT defaults to 1.0

I can pull data on W&B runs and check when it is changed to something other than 1.0 in the integrations we already have.
