Compute grad norm #897
base: main
Conversation
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/torchtune/897. Note: links to docs will display an error until the docs builds have completed. This comment was automatically generated by Dr. CI and updates every 15 minutes.
We can also use
We should also add this parameter to the recipe YAMLs; the default should be 1.0, as in HF and axolotl.
I need to add `max_norm: 1.0` to the recipes. Do you have any trick to do this automatically?
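For concreteness, the field being discussed would look something like this in each recipe YAML (the name and default come from the comments above; its exact placement within torchtune's configs is an assumption):

```yaml
# Gradient clipping threshold; 1.0 matches the HF / axolotl default.
max_norm: 1.0
```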
I am still on the fence about adding grad norm as a field that we log 100% of the time. Mainly I want to make the following separation:
- fields that must be logged every time (i.e. they are fundamental properties of training that most people will need to know almost every time), vs
- fields that are useful but in a more limited set of cases.
In my mind grad norm sits somewhere on the boundary between these two. But I want to make sure we are super clear about the precedent we set here. My concern is that we wind up adding a bunch of stats from category (2) into our core recipe, bloat the training loop, and reduce the ease with which folks can modify copy-pasted versions of the recipes.
Anyways, as I said, I am on the fence here, so I will definitely not block this change on this point alone. The requested changes are more a function of the clip_grad_norm usage. Also cc @joecummings and @kartikayk for any thoughts on logging here.
@@ -476,6 +476,9 @@ def train(self) -> None:
    loss = loss / self._gradient_accumulation_steps
    running_loss += loss
    loss.backward()
    grad_norm = nn.utils.clip_grad_norm_(
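The "pass inf" trick raised elsewhere in this thread can be sketched as a minimal standalone example (the toy model here is illustrative, not the PR's actual diff): with `max_norm=float("inf")` the clip coefficient is clamped to 1.0, so `clip_grad_norm_` leaves the gradients untouched and the call reduces to computing and returning the total norm.

```python
import torch
from torch import nn

# Illustrative toy model, not torchtune recipe code.
model = nn.Linear(4, 2)
loss = model(torch.randn(8, 4)).sum()
loss.backward()

grads_before = [p.grad.clone() for p in model.parameters()]

# With an infinite threshold nothing can exceed max_norm, so this call
# only computes and returns the total gradient norm for logging.
grad_norm = nn.utils.clip_grad_norm_(model.parameters(), max_norm=float("inf"))
# Gradients are numerically unchanged (scaled by exactly 1.0).
```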
Sorry maybe I am misunderstanding the intent of this change, but doesn't clip_grad_norm rescale the gradients? I don't think we want to do that here, right?
Where would you do this? In my understanding, you want to clip after computing the gradients (i.e. after backward) and before the optimizer step.
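The ordering described in this comment can be sketched as follows (the toy model and optimizer are illustrative assumptions, not recipe code):

```python
import torch
from torch import nn

model = nn.Linear(4, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

loss = model(torch.randn(8, 4)).sum()
loss.backward()  # 1. compute gradients

# 2. clip between backward() and step(), so the optimizer sees the
#    rescaled gradients; max_norm=1.0 mirrors the HF / axolotl default.
grad_norm = nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)

optimizer.step()       # 3. the update uses the (possibly) clipped grads
optimizer.zero_grad()
```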
@tcapelle see my other comment. Unless I'm misunderstanding, I don't think we want to use clip_grad_norm here after all. Regardless, you raise a good point: I've also found it a bit annoying to manually modify/add a field in all our configs. I think we should add a tool for this at some point (but to actually answer your question, no such tool exists as of yet).
We can use
I feel that we should have grad clipping enabled by default; the idea is to ship a good finetune recipe out of the box. grad_norm is also a good debugging tool, and this is a good example of its utility: it lets you inspect and analyze the gradients before the optimizer step, so loss spikes can be avoided.
While adding support for gradient clipping is a nice feature to have, I don't think we should conflate it with what's being proposed here, which is a logging change. I definitely do not think we should enable gradient clipping by default without testing the implications of such a change on our various recipes. As I mentioned above, I do see the value in logging grad norm, and evidently clip_grad_norm is a reasonable way to do it (provided that we pass inf, as suggested by @musabgultekin). However, there is a cost to this change: we are calling a method clearly intended to clip gradients and using it in a non-obvious way for logging in the recipe. In my opinion, one of the top considerations for our recipes is that they are easily understandable, and I think such a change harms that a bit. So if the efficiency of the implementations is roughly equivalent, I'd actually prefer a separate utility (assuming we are not adding proper gradient clipping support, which again, should be addressed separately and a bit more carefully imo).
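A separate utility of the kind suggested here might look like the sketch below (the function name is hypothetical, not an existing torchtune API). It computes the same total norm that `clip_grad_norm_` would report, but never rescales anything, which keeps the logging intent explicit in the recipe:

```python
import torch

def total_grad_norm(parameters, norm_type: float = 2.0) -> torch.Tensor:
    # Hypothetical helper: the p-norm of the per-parameter gradient
    # norms, computed without touching the gradients themselves.
    grads = [p.grad.detach() for p in parameters if p.grad is not None]
    if not grads:
        return torch.tensor(0.0)
    return torch.linalg.vector_norm(
        torch.stack([torch.linalg.vector_norm(g, norm_type) for g in grads]),
        norm_type,
    )
```

The recipe would then log `total_grad_norm(model.parameters())` without any risk of accidentally modifying the gradients.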
Agree with everything above. I think we should wait and test if
I can pull data on W&B runs and check when it is being changed to something other than 1.0 in the integrations we already have.
Added the grad norm function that was originally introduced in the improved logging experience.