Repeated content #7

Open
nguyenvo09 opened this issue Sep 4, 2018 · 2 comments

nguyenvo09 commented Sep 4, 2018

I used your code and trained a model to generate new sentences. The problem is that there are many repeated tokens in the generated samples.

Any insight into how to deal with this?

For example, the <unk> token appears many times.

https://pastebin.com/caxz43CQ

timbmg (Owner) commented Sep 5, 2018

For how long did you train? What was your final KL/NLL loss? Also, with what min_occ did you train?

Also, looking at them, the samples actually don't look that bad. Certainly there is a problem with <unk> tokens, in that they may be repeated many times before an <eos> token is finally produced. However, I think that is expected, since the network really does not know what <unk> is, so there can be any number of <unk>'s.
If you move on to another dataset where the training and validation sets are more similar, you should get fewer <unk>'s.
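For reference, here is a minimal sketch of the KL/NLL terms mentioned above, assuming a Gaussian posterior and a logistic KL-annealing schedule (as in Bowman et al., "Generating Sentences from a Continuous Space"); the function and argument names, and the annealing constants, are illustrative assumptions rather than this repo's exact API:

    import math
    import torch
    import torch.nn.functional as F

    def vae_loss(logp, target, mean, logv, step, k=0.0025, x0=2500):
        # Reconstruction term: summed NLL of the target tokens under the decoder
        nll = F.nll_loss(logp.view(-1, logp.size(-1)), target.view(-1),
                         reduction='sum')

        # KL divergence between q(z|x) = N(mean, diag(exp(logv))) and the prior N(0, I)
        kl = -0.5 * torch.sum(1 + logv - mean.pow(2) - logv.exp())

        # Logistic annealing weight that phases the KL term in during training;
        # if the KL collapses to near zero, samples drawn from the prior tend
        # to come out bland and repetitive
        kl_weight = 1.0 / (1.0 + math.exp(-k * (step - x0)))

        return nll + kl_weight * kl, nll, kl

Checking how the KL and NLL terms evolve during training (and in particular whether the KL has collapsed) usually narrows down why samples repeat.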

preke commented Apr 5, 2019

Is this a seq2seq-like model you want to implement?
I ran into the same problem.
It seems that during training, the inputs to the decoder also have to be sorted by length.
In evaluation, however, we have no prior knowledge of the lengths of the sentences we want to generate, so this part of the information is lost (see the sketch after the code below).
Also, it seems a seq2seq-like decoder can only be implemented as an RNN language model; is that true? (Like the code below:

            # `hidden` is assumed to come from the latent code z, and `outputs`
            # is assumed to be pre-allocated as a zero tensor of shape
            # (batch_size, max_sequence_length, vocab_size) before this loop.
            with torch.no_grad():  # replaces the deprecated volatile=True flag
                t = 0
                while t < self.max_sequence_length - 1:
                    if t == 0:
                        # start every sequence with the <sos> token
                        input_sequence = torch.LongTensor([self.sos_idx] * batch_size)
                        if torch.cuda.is_available():
                            input_sequence = input_sequence.cuda()
                            outputs        = outputs.cuda()

                    input_sequence  = input_sequence.unsqueeze(1)                     # b x 1
                    input_embedding = self.embedding(input_sequence)                  # b x 1 x e
                    output, hidden  = self.decoder_rnn(input_embedding, hidden)
                    logits          = self.outputs2vocab(output)                      # b x 1 x v
                    outputs[:, t, :] = nn.functional.log_softmax(logits, dim=-1).squeeze(1)  # b x v
                    # feed the sampled token back in as the next input
                    input_sequence  = self._sample(logits)
                    t += 1

            outputs = outputs.view(batch_size, self.max_sequence_length, self.embedding.num_embeddings)
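
On the length-sorting point above: here is a minimal sketch of what teacher-forced decoding with sorted lengths typically looks like during training, assuming a batch-first GRU-style decoder and helper names (embedding, decoder_rnn) that mirror the snippet; it is an illustration, not the repo's exact code. The sorting is only needed so that pack_padded_sequence can skip padding, and at sampling time there are no target lengths, which is why generation has to run token by token up to max_sequence_length as in the loop above.

    import torch
    from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence

    def decode_teacher_forced(embedding, decoder_rnn, input_ids, lengths, hidden):
        # Sort the batch by length (descending), as packing requires
        lengths, sort_idx = torch.sort(lengths, descending=True)
        input_ids = input_ids[sort_idx]
        hidden = hidden[:, sort_idx]          # assumes a GRU-style hidden state

        # Pack so the RNN skips the padded positions
        packed = pack_padded_sequence(embedding(input_ids), lengths.cpu(),
                                      batch_first=True)
        packed_out, _ = decoder_rnn(packed, hidden)
        output, _ = pad_packed_sequence(packed_out, batch_first=True)

        # Undo the sorting so outputs line up with the original batch order
        _, unsort_idx = torch.sort(sort_idx)
        return output[unsort_idx]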
