Question about the normalization in the attention weight calculation #18

Open · Ph03n1xdust opened this issue Mar 13, 2023 · 0 comments
Hi!

I would like to ask about the 1/sqrt(self.dim_V) normalization inside the softmax in MAB. Attention scaling is usually implemented with the reciprocal of the square root of the key dimensionality, and since dim_V is split here into num_heads equal parts, each key vector has size dim_V // num_heads.
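For reference, here is a minimal sketch of the two scaling choices (the shapes and variable names are just illustrative, assuming dim_V is split evenly across heads as described above):

```python
import torch
import torch.nn.functional as F

# Illustrative shapes only.
batch, n, dim_V, num_heads = 2, 5, 8, 4
dim_split = dim_V // num_heads  # per-head key/query size

Q = torch.randn(batch * num_heads, n, dim_split)
K = torch.randn(batch * num_heads, n, dim_split)

# Scaling as in MAB: divide by sqrt(dim_V), the full value dimension.
A_repo = F.softmax(Q.bmm(K.transpose(1, 2)) / (dim_V ** 0.5), dim=2)

# Standard scaled dot-product attention: divide by sqrt(d_k) = sqrt(dim_V // num_heads).
A_std = F.softmax(Q.bmm(K.transpose(1, 2)) / (dim_split ** 0.5), dim=2)

# The two differ only by an extra 1/sqrt(num_heads) factor inside the softmax.
```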

Is this intentional, or a "bug"? Calling it a bug is an exaggeration, since it only introduces an extra 1/sqrt(num_heads) factor inside the softmax.
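Spelling out that factor, writing d_V for dim_V, h for num_heads, and d_k = d_V / h for the per-head key size:

```latex
\frac{1}{\sqrt{d_V}} = \frac{1}{\sqrt{h \cdot d_k}}
                     = \frac{1}{\sqrt{h}} \cdot \frac{1}{\sqrt{d_k}}
```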

If this is unintentional, I'm happy to open a pull request (it's only a one-word change); if it was intentional, could you explain the idea behind it?

Thanks!
