Currently, the losses (ll-loss and kl-loss) are summed over the time dimension and averaged over the batch dimension. It might be good to average over the time dimension as well.
Pros

* The static scaling factor layer is no longer needed.
* Smaller gradients for the RNN weights (smaller by a factor of 1/sequence_length), which could result in more stable training.
* Different sequence lengths could otherwise need different learning rates (though Adam could adapt to this).
* Easier to compare losses of models with different sequence lengths (if this is needed in the future).
* This is backwards compatible, as the layers and weights are not changed.
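The proposed change can be sketched as follows. This is a minimal numpy illustration, not the project's actual loss code; the function name and array shape are hypothetical:

```python
import numpy as np

def ll_loss(per_timestep_loss, average_over_time=True):
    """Reduce a (batch_size, sequence_length) array of per-time-step
    losses to a scalar. Hypothetical helper for illustration only."""
    if average_over_time:
        # Proposed: mean over time, then mean over batch.
        return per_timestep_loss.mean(axis=1).mean()
    # Current behaviour: sum over time, then mean over batch.
    return per_timestep_loss.sum(axis=1).mean()

rng = np.random.default_rng(0)
losses = rng.random((8, 100))  # batch of 8 sequences, 100 time steps

# Averaging over time rescales the loss (and hence the gradients)
# by 1/sequence_length, which is why the static scaling factor
# layer would no longer be needed.
assert np.isclose(ll_loss(losses, True), ll_loss(losses, False) / 100)
```

Because the change is a constant rescaling of the objective, the optimum is unchanged; only the gradient magnitudes (and the loss values reported across different sequence lengths) differ.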
I think this makes sense.
Is there any downside?
Cheers, Mark.