
Masked attention #141

Open
lethienhoa opened this issue May 9, 2018 · 4 comments

@lethienhoa

Hi,
I see that this implementation is missing masked attention over the encoder outputs. `input_lengths` should be passed to the decoder (not just the encoder) in order to compute it. OpenNMT already provides this in its `sequence_mask` function.
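For reference, a minimal sketch of what such a mask could look like in PyTorch (this mirrors the semantics of OpenNMT's `sequence_mask`; the exact name and shapes here are illustrative, not this repo's API):

```python
import torch

def sequence_mask(lengths, max_len=None):
    """Return a [batch, max_len] boolean mask, True at valid (non-padded) steps."""
    if max_len is None:
        max_len = int(lengths.max())
    positions = torch.arange(max_len, device=lengths.device)  # [max_len]
    # Broadcast compare: row i is True where position < lengths[i]
    return positions.unsqueeze(0) < lengths.unsqueeze(1)
```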
Best,


erogol commented Jul 10, 2018

@lethienhoa why do you need masked attention if you mask the loss?

@valtsblukis

I just noticed the same thing and landed here. The attention mechanism should only include in the weighted sum those encoder outputs that correspond to valid tokens in the input sequences. For example, if the input lengths in a batch are 23, 12, and 7, then for the third element the attention should compute the weighted sum over only its 7 encoder outputs, rather than all 23 (the padded length).

Normally your attention would learn to ignore the extra encoder outputs anyway, but this can become a problem if you train and test with different maximum sentence lengths. A sketch of the fix follows below.
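To make this concrete, here is a minimal sketch of applying such a mask to the attention energies before the softmax, assuming PyTorch and a `[batch, max_enc_len]` score tensor (the function and variable names are illustrative, not this repo's code):

```python
import torch
import torch.nn.functional as F

def masked_softmax(scores, input_lengths):
    """scores: [batch, max_enc_len] raw alignment energies."""
    max_len = scores.size(1)
    positions = torch.arange(max_len, device=scores.device)
    valid = positions.unsqueeze(0) < input_lengths.unsqueeze(1)  # True at real tokens
    # Padded positions get -inf so their softmax weight is exactly zero.
    scores = scores.masked_fill(~valid, float('-inf'))
    return F.softmax(scores, dim=1)

# With the lengths from the example above (23, 12, 7), the third row's
# attention weights beyond step 7 come out as zero:
scores = torch.randn(3, 23)
weights = masked_softmax(scores, torch.tensor([23, 12, 7]))
assert weights[2, 7:].sum().item() == 0.0
```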


erogol commented Jul 12, 2018

@valtsblukis thanks for explaining it. Yes, that was my understanding too, but I'd also assume the model would learn it anyway. I am running an experiment with my model with and without masking to see the difference.

@pskrunner14

@lethienhoa I'll see to it. Thanks for pointing this out.
