Hi!

I would like to ask about the `1/sqrt(self.dim_V)` normalization inside the softmax in MAB. Attention scaling is usually implemented with the reciprocal of the square root of the key dimensionality, and since `dim_V` is split here into `num_heads` equal parts, the key vectors actually have size `dim_V // num_heads`.

Is this intentional, or a "bug"? Calling it a bug is an overstatement, since it only introduces an extra `1/sqrt(num_heads)` scale inside the softmax.
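For reference, here is a minimal sketch of the two scalings after the head split (tensor names and shapes are illustrative, not the repo's exact code):

```python
import math
import torch

# Assume Q_, K_ are the per-head query/key tensors of shape
# (num_heads * batch, n, dim_V // num_heads) after splitting dim_V.
dim_V, num_heads = 128, 4
dim_split = dim_V // num_heads          # per-head key size d_k

batch, n = 2, 10
Q_ = torch.randn(num_heads * batch, n, dim_split)
K_ = torch.randn(num_heads * batch, n, dim_split)

# current scaling: divides the logits by sqrt(dim_V)
A_current = torch.softmax(Q_.bmm(K_.transpose(1, 2)) / math.sqrt(dim_V), 2)

# conventional scaling: divides by sqrt(d_k) = sqrt(dim_V // num_heads);
# the two differ only by an extra 1/sqrt(num_heads) factor on the logits
A_conventional = torch.softmax(Q_.bmm(K_.transpose(1, 2)) / math.sqrt(dim_split), 2)
```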
If this is unintentional, I'm happy to make a pull request (it's only a one-word change); if it was intentional, could you explain the idea behind it?
Thanks!