multihead_attention #17

pengyuchen · 2017-11-21T04:40:01Z

In the multi-head attention model, there should be one more linear projection layer at the end of the output (before the residual connection).

crystina-z · 2018-10-10T05:42:38Z

modules.py

-
+
+        # Linear projections
+        outputs = tf.layers.dense(outputs, num_units, activation=tf.nn.relu) # (N, T_q, C)


I think you are right about the extra projection, but can I ask about the activation function here? In the original paper there seems to be no bias and no activation, only a plain "Concat(head_1, ..., head_h) W^O".
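For reference, here is a minimal NumPy sketch of the projection as the paper writes it, Concat(head_1, ..., head_h) W^O, with no bias and no activation. The function name `output_projection`, the shapes, and the random weights are illustrative assumptions, not code from this repository. In TF 1.x the corresponding change would presumably be `tf.layers.dense(outputs, num_units, use_bias=False)`, leaving `activation` at its default of `None`.

```python
import numpy as np


def output_projection(heads, w_o):
    """Plain linear output projection: Concat(head_1, ..., head_h) W^O.

    heads: list of h arrays, each of shape (N, T_q, d_v)
    w_o:   array of shape (h * d_v, d_model)
    No bias is added and no activation is applied.
    """
    concat = np.concatenate(heads, axis=-1)  # (N, T_q, h * d_v)
    return concat @ w_o                      # (N, T_q, d_model)


# Hypothetical sizes, chosen only for illustration.
N, T_q, h, d_v, d_model = 2, 5, 8, 64, 512
rng = np.random.default_rng(0)
heads = [rng.standard_normal((N, T_q, d_v)) for _ in range(h)]
w_o = rng.standard_normal((h * d_v, d_model))

out = output_projection(heads, w_o)
print(out.shape)

# A ReLU on top, as in the diff above, would zero out every negative entry,
# so the result would no longer be a purely linear map of the heads.
relu_out = np.maximum(out, 0.0)
```

With `activation=tf.nn.relu` the layer computes something like `relu_out` (plus a bias) rather than `out`, which is the difference the question above is about.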

multihead_attention

bf93aa1

crystina-z reviewed Oct 10, 2018
