
Gradient Clipping #902

Draft · wants to merge 8 commits into main

Conversation

@swfsql (Contributor) commented Dec 14, 2023

  • Draft state.

  • Closes Gradient Clipping #596.

  • Adds Storage and Gradient view/mutating methods.

    • Added the dfdx::nn_traits::WithGrads trait and the dfdx_derives::WithGrads proc macro, based on ZeroGrads.
      • The overall design follows the suggestion in Gradient Clipping #596 (comment), allowing custom cpu operations on the elements (see the sketch after this list).
      • The ZeroGrads trait could be merged into WithGrads by mostly just merging their methods.
    • Added the dfdx_core::tensor::WithStorage trait.
    • Changed the interface so Cuda can do more with Cuda kernels, and added the necessary kernels.
      • This could be a separate improvement in a future PR. Since grad updates don't happen that often, I think leaving things on the cpu isn't too bad.
  • Changed some methods on Gradients:

    • Exposed get_mut as pub.
    • Exposed get_ref as pub, and lowered its requirement from &mut self to &self.
  • Added gradient clamping and clipping methods.

    • Added examples for all methods (view/mutate grads, clamp, and clip).
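
To make the element-visiting design concrete, here is a minimal sketch of what such a trait could look like. The names, the Grads placeholder type, and the method signatures are illustrative assumptions for this sketch only, not the actual dfdx traits:

// Hypothetical sketch: Grads and the method names are placeholders,
// not the real dfdx types.

/// Placeholder standing in for dfdx's gradient store.
pub struct Grads(pub Vec<f32>);

/// A model that can expose its gradient elements to custom cpu closures.
pub trait WithGradsSketch {
    /// Run a closure over every gradient element immutably,
    /// e.g. to accumulate a squared norm across all parameters.
    fn grads_for_each(&self, grads: &Grads, f: &mut dyn FnMut(&f32));

    /// Run a closure over every gradient element mutably,
    /// e.g. to scale, clamp, or clip values in place.
    fn grads_for_each_mut(&self, grads: &mut Grads, f: &mut dyn FnMut(&mut f32));
}

In a design like this, helpers such as the grads_norm_squared and grads_clip_norm used below would be thin wrappers over the two visitors.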

Example using clip_norm:

// (...)
// let loss = dfdx::losses::cross_entropy_with_logits_loss(prediction_y, y);
grads = loss.backward();

// accumulates into norm_squared, then applies clip_norm
let mut norm_squared = 0.;
model.grads_norm_squared(&grads, &mut norm_squared);
model.grads_clip_norm(&mut grads, norm_squared.sqrt(), 1e-2);

opt.update(&mut model, &grads).unwrap();

Note that clip_norm doesn't change the grads' "direction", because all grad values are scaled by the same factor, while clip_value does change the direction (because some values are changed while others are left intact). So for gradient descent, where the grads' direction is supposed to be roughly followed, my guess is that clip_norm is better.
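
For intuition, here is a standalone sketch using plain slices rather than the dfdx API, showing why scaling by the global norm preserves the direction while per-element clamping does not. The clip_norm_sketch and clip_value_sketch names are made up for this illustration:

// Standalone illustration; no dfdx types involved.

fn clip_norm_sketch(grads: &mut [f32], max_norm: f32) {
    // Scale every element by the same factor, so the direction is unchanged.
    let norm = grads.iter().map(|g| g * g).sum::<f32>().sqrt();
    if norm > max_norm {
        let scale = max_norm / norm;
        for g in grads.iter_mut() {
            *g *= scale;
        }
    }
}

fn clip_value_sketch(grads: &mut [f32], max_abs: f32) {
    // Clamp each element independently; large elements shrink relative to
    // small ones, so the direction can change.
    for g in grads.iter_mut() {
        *g = g.clamp(-max_abs, max_abs);
    }
}

fn main() {
    let mut a = vec![3.0_f32, 4.0]; // norm 5, direction (0.6, 0.8)
    let mut b = a.clone();

    clip_norm_sketch(&mut a, 1.0); // -> [0.6, 0.8]: same direction, norm 1
    clip_value_sketch(&mut b, 1.0); // -> [1.0, 1.0]: different direction
    println!("clip_norm: {a:?}, clip_value: {b:?}");
}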
