Repeated content #7

Open
nguyenvo09 opened this issue Sep 4, 2018 · 2 comments

nguyenvo09 commented Sep 4, 2018

I used your code and trained a model to generate new sentences. The problem is that there are many repeated tokens in the generated samples.

Any insight into how to deal with this?

For example, the <unk> token appears many times.

https://pastebin.com/caxz43CQ

timbmg (Owner) commented Sep 5, 2018

For how long did you train? What was your final KL/NLL loss? Also, with what min_occ did you train?

Also, looking at them, the samples actually don't look that bad. Certainly there is a problem with <unk> tokens, in that they may be repeated many times before an <eos> token is finally produced. However, I think that is expected, since the network really does not know what <unk> is, so there can be any number of <unk>'s.
If you move on to another dataset where the training and validation sets are more similar, you should get fewer <unk>'s.
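For reference, here is a minimal sketch of the KL/NLL terms mentioned above, assuming a Gaussian posterior and a logistic KL-annealing schedule (as in Bowman et al., "Generating Sentences from a Continuous Space"); the function and argument names, and the annealing constants, are illustrative assumptions rather than this repo's exact API:

    import math
    import torch
    import torch.nn.functional as F

    def vae_loss(logp, target, mean, logv, step, k=0.0025, x0=2500):
        # Reconstruction term: summed NLL of the target tokens under the decoder
        nll = F.nll_loss(logp.view(-1, logp.size(-1)), target.view(-1),
                         reduction='sum')

        # KL divergence between q(z|x) = N(mean, diag(exp(logv))) and the prior N(0, I)
        kl = -0.5 * torch.sum(1 + logv - mean.pow(2) - logv.exp())

        # Logistic annealing weight that phases the KL term in during training;
        # if the KL collapses to near zero, samples drawn from the prior tend
        # to come out bland and repetitive
        kl_weight = 1.0 / (1.0 + math.exp(-k * (step - x0)))

        return nll + kl_weight * kl, nll, kl

Checking how the KL and NLL terms evolve during training (and in particular whether the KL has collapsed) usually narrows down why samples repeat.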

preke commented Apr 5, 2019

Is this a seq2seq-like model you want to implement?
I ran into the same problem.
It seems that during training, the inputs to the decoder also have to be sorted by length.
In evaluation, however, we have no prior knowledge of the lengths of the sentences we want to generate, so this part of the information is lost (see the sketch after the code below).
Also, it seems a seq2seq-like decoder can only be implemented as an RNN language model; is that true? (Like the code below:

            # `hidden` is assumed to come from the latent code z, and `outputs`
            # is assumed to be pre-allocated as a zero tensor of shape
            # (batch_size, max_sequence_length, vocab_size) before this loop.
            with torch.no_grad():  # replaces the deprecated volatile=True flag
                t = 0
                while t < self.max_sequence_length - 1:
                    if t == 0:
                        # start every sequence with the <sos> token
                        input_sequence = torch.LongTensor([self.sos_idx] * batch_size)
                        if torch.cuda.is_available():
                            input_sequence = input_sequence.cuda()
                            outputs        = outputs.cuda()

                    input_sequence  = input_sequence.unsqueeze(1)                     # b x 1
                    input_embedding = self.embedding(input_sequence)                  # b x 1 x e
                    output, hidden  = self.decoder_rnn(input_embedding, hidden)
                    logits          = self.outputs2vocab(output)                      # b x 1 x v
                    outputs[:, t, :] = nn.functional.log_softmax(logits, dim=-1).squeeze(1)  # b x v
                    # feed the sampled token back in as the next input
                    input_sequence  = self._sample(logits)
                    t += 1

            outputs = outputs.view(batch_size, self.max_sequence_length, self.embedding.num_embeddings)
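
On the length-sorting point above: here is a minimal sketch of what teacher-forced decoding with sorted lengths typically looks like during training, assuming a batch-first GRU-style decoder and helper names (embedding, decoder_rnn) that mirror the snippet; it is an illustration, not the repo's exact code. The sorting is only needed so that pack_padded_sequence can skip padding, and at sampling time there are no target lengths, which is why generation has to run token by token up to max_sequence_length as in the loop above.

    import torch
    from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence

    def decode_teacher_forced(embedding, decoder_rnn, input_ids, lengths, hidden):
        # Sort the batch by length (descending), as packing requires
        lengths, sort_idx = torch.sort(lengths, descending=True)
        input_ids = input_ids[sort_idx]
        hidden = hidden[:, sort_idx]          # assumes a GRU-style hidden state

        # Pack so the RNN skips the padded positions
        packed = pack_padded_sequence(embedding(input_ids), lengths.cpu(),
                                      batch_first=True)
        packed_out, _ = decoder_rnn(packed, hidden)
        output, _ = pad_packed_sequence(packed_out, batch_first=True)

        # Undo the sorting so outputs line up with the original batch order
        _, unsort_idx = torch.sort(sort_idx)
        return output[unsort_idx]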
