Hi!

I would like to ask about the `1/sqrt(self.dim_V)` normalization inside the softmax in MAB. Attention scaling is usually implemented with the reciprocal of the square root of the key dimensionality, and since `dim_V` is split here into `num_heads` equal parts, the key vectors actually have size `dim_V // num_heads`.

Is this intentional, or a "bug"? Calling it a bug is an overstatement, since it only introduces an extra `1/sqrt(num_heads)` scale inside the softmax.
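For reference, here is a minimal sketch of the two scalings after the head split (tensor names and shapes are illustrative, not the repo's exact code):

```python
import math
import torch

# Assume Q_, K_ are the per-head query/key tensors of shape
# (num_heads * batch, n, dim_V // num_heads) after splitting dim_V.
dim_V, num_heads = 128, 4
dim_split = dim_V // num_heads          # per-head key size d_k

batch, n = 2, 10
Q_ = torch.randn(num_heads * batch, n, dim_split)
K_ = torch.randn(num_heads * batch, n, dim_split)

# current scaling: divides the logits by sqrt(dim_V)
A_current = torch.softmax(Q_.bmm(K_.transpose(1, 2)) / math.sqrt(dim_V), 2)

# conventional scaling: divides by sqrt(d_k) = sqrt(dim_V // num_heads);
# the two differ only by an extra 1/sqrt(num_heads) factor on the logits
A_conventional = torch.softmax(Q_.bmm(K_.transpose(1, 2)) / math.sqrt(dim_split), 2)
```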
If this is unintentional, I'm happy to make a pull request (it's only a one-word change); if it was intentional, could you explain the idea behind it?
Thanks!