
Why is the model outputting UNK tokens? Shouldn't it be able to point to unknown words from the input? #32

Rhuax opened this issue Apr 2, 2019 · 3 comments

Rhuax commented Apr 2, 2019

From: https://github.com/abisee/pointer-generator/blob/master/beam_search.py#L111

In decoding, we change a token's id to the unknown id if t >= vocab.size(). So if the decoder is pointing to that particular token, it produces [UNK] in the output. Is that correct? Following the paper, it seems the decoder should be able to point to that token and copy it rather than emit the unknown token; handling OOVs is the whole point of the pointer-generator model. Yet in my decoding experiments the model often outputs unknown tokens.
I tried replacing the 50k vocabulary with the full vocabulary, but I get CUDA device-side assert errors.
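
For context, here is a self-contained sketch of what the linked line appears to do (the constants `UNK_ID` and `VOCAB_SIZE` are assumptions for illustration; the actual code uses `vocab.word2id(data.UNKNOWN_TOKEN)` and `vocab.size()`):

```python
UNK_ID = 0           # assumed id of [UNK] in the fixed vocabulary
VOCAB_SIZE = 50_000  # assumed fixed-vocabulary size

def remap_for_embedding_lookup(latest_tokens):
    """Map extended (in-article OOV) ids back to [UNK].

    Ids >= VOCAB_SIZE are temporary ids assigned to source-article OOV
    words. The embedding matrix has no rows for them, so they are fed
    into the next decoder step as [UNK]; this remap affects the decoder
    *input*, not the word written to the output summary.
    """
    return [t if t < VOCAB_SIZE else UNK_ID for t in latest_tokens]

print(remap_for_embedding_lookup([17, 49_999, 50_003]))  # -> [17, 49999, 0]
```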

@nefujiangping

There is still an [UNK] token in the fixed vocabulary. The pointer-generator can reduce [UNK] outputs, but it cannot prevent the model from producing them: at any timestep where p_gen is high and the fixed-vocabulary probability of [UNK] is large, the model will still output [UNK].
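
A toy numeric sketch of that failure mode, using the paper's final distribution P(w) = p_gen · P_vocab(w) + (1 − p_gen) · Σ_i a_i (all numbers below are made up for illustration):

```python
import numpy as np

# Toy fixed vocabulary; index 0 is [UNK].
vocab = ["[UNK]", "the", "cat"]
p_vocab = np.array([0.6, 0.3, 0.1])          # generator distribution (made up)

# Source article: ["the", "xylophone"]. "xylophone" is OOV, so the copy
# distribution puts its mass on an extended id (index 3).
copy_dist = np.array([0.0, 0.2, 0.0, 0.8])   # over vocab + ["xylophone"]

p_gen = 0.9  # mostly generating, barely copying

# Final distribution over the extended vocabulary (See et al., 2017):
# P(w) = p_gen * P_vocab(w) + (1 - p_gen) * attention mass on w
final = p_gen * np.append(p_vocab, 0.0) + (1 - p_gen) * copy_dist
ext_vocab = vocab + ["xylophone"]
print(final)                             # [0.54 0.29 0.09 0.08]
print(ext_vocab[int(np.argmax(final))])  # [UNK] -- it still wins
```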


Rhuax commented Apr 28, 2019

Why did the authors include the UNK token in the vocab?

@nefujiangping

The pointer-generator uses the same vocabulary for the encoder/decoder input and the decoder target, and that vocabulary is constructed from the training data only (no val/test data). Unseen words inevitably show up at val/test time, so a fallback id is needed; the [UNK] token could only be excluded if the vocabulary covered every word of the whole corpus (train, val, and test).

By using two vocabularies instead (a source vocabulary and a target vocabulary), the [UNK] token might be excludable from the target vocabulary.
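
A sketch of the usual construction, showing why [UNK] has to stay (the function and token names here are hypothetical, not from the repo):

```python
from collections import Counter

def build_vocab(train_tokens, max_size=50_000):
    """Build the fixed vocabulary from *training* data only."""
    counts = Counter(train_tokens)
    # [UNK] is reserved up front: val/test will contain words that never
    # occurred in training, and they need some id for embedding lookup.
    words = ["[PAD]", "[UNK]", "[START]", "[STOP]"]
    words += [w for w, _ in counts.most_common(max_size - len(words))]
    return {w: i for i, w in enumerate(words)}

vocab = build_vocab("the cat sat on the mat".split())
test_word = "xylophone"                      # unseen at training time
print(vocab.get(test_word, vocab["[UNK]"]))  # falls back to the [UNK] id
```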
