
An idea/suggestion on the gradient used #57

Open
DagonArises opened this issue Mar 28, 2022 · 1 comment
Labels
question Further information is requested

Comments

@DagonArises

I would like to ask about the form of the gradients computed in the get_gradients function. It seems you compute dL_t/dW directly, where W is a parameter. While the loss gradient with respect to a weight is straightforward in feedforward NNs, in RNNs the same weight is shared across all time steps, so each dL_t/dW is actually a sum of partial-derivative products of lengths 1, 2, ..., t respectively. Please see this tutorial for the exact form, in particular results (5) and (6).
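To make the point concrete, here is a minimal sketch (my own toy example, not the repository's code) using a scalar linear RNN h_t = w*h_{t-1} + x_t with loss L = h_T. Unrolling the chain rule gives dL/dw as a sum with one term per time step the shared weight is applied, which a finite-difference check confirms:

```python
import numpy as np

# Toy scalar linear RNN: h_t = w * h_{t-1} + x_t, loss L = h_T.
# Shows that dL/dw is a SUM over time steps (one term per application
# of the shared weight w), as in BPTT, not a single partial derivative.

def forward(w, x, h0=0.0):
    h = h0
    hs = [h]
    for xt in x:
        h = w * h + xt
        hs.append(h)
    return hs  # hs[t] = h_t

w = 0.9
x = np.array([1.0, -0.5, 2.0, 0.3])
hs = forward(w, x)
T = len(x)

# Analytic BPTT gradient: dL/dw = sum_{k=1..T} w^(T-k) * h_{k-1}
grad_bptt = sum(w ** (T - k) * hs[k - 1] for k in range(1, T + 1))

# Numerical check via central finite differences
eps = 1e-6
grad_num = (forward(w + eps, x)[-1] - forward(w - eps, x)[-1]) / (2 * eps)

print(grad_bptt, grad_num)  # the two values agree
```

The term with exponent T-k is the length-(T-k+1) partial-derivative product contributed by time step k; for large T-k it is the one prone to vanishing when |w| < 1.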

The longer partial-derivative products correspond to the signals backpropagated over longer temporal dependencies. If these longer products vanish (and they are prone to vanishing), then the weights are updated in a way that cannot retain earlier information.

Therefore it occurs to me that even if dL_t/dW stays away from 0, it is not guaranteed that vanishing gradients did not take place: it might be the shorter, more vanishing-resistant partial-derivative products that keep the magnitude of dL_t/dW away from 0.
A more direct indicator could be dh_t/dh_1, or dh_t/dh_0, where h_t is the hidden state at step t. Both are products of the per-step Jacobians from result (6), multiplied over the time steps. If such a quantity vanishes starting from, say, t = 100, then we can claim the model is unable to retain information for more than 100 steps.
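A rough sketch of what I have in mind, for a generic vanilla tanh RNN h_t = tanh(W h_{t-1} + U x_t) with random placeholder weights and inputs (not the repository's actual model): accumulate the product of per-step Jacobians diag(1 - h_t^2) W and watch its spectral norm over t.

```python
import numpy as np

# Estimate ||dh_t/dh_0|| for a vanilla tanh RNN by accumulating the
# product of per-step Jacobians J_t = diag(1 - h_t^2) @ W.
# Norms shrinking toward 0 over t would signal vanishing long-range
# gradients. W, U, and the inputs are random placeholders.

rng = np.random.default_rng(0)
n = 8
W = rng.normal(scale=0.3 / np.sqrt(n), size=(n, n))  # small-scale init
U = rng.normal(size=(n, n))
x = rng.normal(size=(50, n))

h = np.zeros(n)
J = np.eye(n)  # accumulates dh_t/dh_0
norms = []
for xt in x:
    h = np.tanh(W @ h + U @ xt)
    J = (np.diag(1.0 - h ** 2) @ W) @ J  # chain rule, one step back
    norms.append(np.linalg.norm(J, 2))  # spectral norm of dh_t/dh_0

print(norms[0], norms[-1])  # with this small W, norms[-1] << norms[0]
```

Here the norm decays because each step multiplies by a Jacobian whose spectral norm is below 1; a log of these norms per time step could serve as the statistic I am proposing.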

That being said, I am not really an expert on RNNs, and I am just raising an idea here. I would really appreciate it if you could take a look at whether my understanding is correct, and whether the statistic dh_t/dh_1, or dh_t/dh_0, could be implemented.
Thanks in advance!

@OverLordGoldDragon OverLordGoldDragon added the question Further information is requested label Mar 28, 2022
@OverLordGoldDragon
Owner

Too rusty on RNNs to validate any of this, I fear. Stack Exchange might help. I'm also no longer developing this repository, but I'm open to reviewing merge-ready contributions.
