
Documentation of Loss.get_grad confusingly describes independent variable #901

Open
connorbrinton opened this issue Aug 15, 2023 · 1 comment
Labels: docs (Documentation), feat / loss (Loss functions)

Comments

@connorbrinton (Contributor)

How to reproduce the behaviour

Hi there! I was recently working on implementing a custom loss function (binary focal loss) and found some of the documentation to be a bit confusing. The documentation of Loss.get_grad states that it should:

Calculate the gradient of the loss with respect with the model outputs.

However, looking at the implementation of some of Thinc's built-in loss functions, Loss.get_grad actually calculates the gradient of the loss with respect to the logits used as input to the preceding softmax/sigmoid layer.
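
To make that concrete, here's a rough sketch of the kind of call I mean, using CategoricalCrossentropy from thinc.api (the `normalize` flag and exact dtypes are my assumptions here, so treat the snippet as illustrative rather than exact):

```python
import numpy as np
from thinc.api import CategoricalCrossentropy

# Illustrative only: `guesses` are the *probabilities* produced by a softmax
# layer, but the value returned by get_grad is the gradient that the *logits*
# feeding that softmax need, i.e. roughly (guesses - truths).
guesses = np.asarray([[0.1, 0.7, 0.2]], dtype="float32")
truths = np.asarray([[0.0, 1.0, 0.0]], dtype="float32")

loss = CategoricalCrossentropy(normalize=False)
d_logits = loss.get_grad(guesses, truths)
print(d_logits)  # ~ [[0.1, -0.3, 0.2]], i.e. guesses - truths, not dL/d(guesses)
```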

For example, the CategoricalCrossentropy loss class computes the gradient as guesses - targets. This differs from the derivative of the loss wrt the model outputs (probabilities) by a factor of 1 / (p * (1 - p)). That factor is cancelled by the derivative of the logistic function when the chain rule is applied, which is what makes the derivative wrt the logits come out to guesses - targets.
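
A quick numpy check of that chain-rule cancellation for the binary/logistic case (nothing Thinc-specific, just the arithmetic):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z = np.array([0.3, -1.2, 2.0])   # logits
y = np.array([1.0, 0.0, 1.0])    # targets
p = sigmoid(z)                   # model outputs (probabilities)

grad_wrt_p = (p - y) / (p * (1 - p))  # dL/dp for binary cross-entropy
dp_dz = p * (1 - p)                   # derivative of the logistic function
grad_wrt_z = p - y                    # dL/dz, what get_grad actually returns

# The 1 / (p * (1 - p)) factor cancels under the chain rule:
assert np.allclose(grad_wrt_p * dp_dz, grad_wrt_z)
```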

This whole setup works because the softmax part of the softmax layer uses the identity function as its backward pass. This makes the forward and backward passes of the softmax layer inconsistent with each other, but in theory everything balances out.
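
In other words, the pattern looks something like this (a minimal sketch of the idea, not Thinc's actual Softmax layer):

```python
import numpy as np

def softmax_layer(X: np.ndarray):
    # Forward pass: a real softmax over the logits.
    e = np.exp(X - X.max(axis=-1, keepdims=True))
    Y = e / e.sum(axis=-1, keepdims=True)

    def backprop(dY: np.ndarray) -> np.ndarray:
        # Backward pass: the identity. This only gives the right update if
        # dY is already the gradient with respect to the logits, which is
        # exactly what Loss.get_grad returns in practice.
        return dY

    return Y, backprop
```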

I assume that this setup was selected to help improve numerical stability. The focal loss paper actually mentions this explicitly:

we note that the implementation of the loss layer combines the sigmoid operation for computing p with the loss computation, resulting in greater numerical stability.

Anyway, the point of this issue is that the current documentation of Loss.get_grad is confusing: the gradient is actually computed with respect to the logits, not the model outputs, even though the model outputs are what is passed to the method. It would be great to have this clarified in the documentation 🙂

Thanks for maintaining Thinc! 😄

Your Environment

  • Operating System: macOS 13.5
  • Python Version Used: 3.9.16
  • Thinc Version Used: 8.1.10
  • Environment Information: Poetry virtual environment, M1 mac
@rmitsch added the docs (Documentation) and feat / loss (Loss functions) labels on Aug 16, 2023
@rmitsch (Contributor)

rmitsch commented Aug 17, 2023

Hi @connorbrinton, thanks for reporting this! We'll look into it and update this thread.
