Taking <Pad> as a regular token could make the model learn only the <Pad> information? #50
Comments
Hi,
I have the same issue. I modified the code to run with PyTorch Lightning, but for me as well it learned only pads.
I am running the QQP experiments and have changed the loss computation in the training code: `loss_mask = [0]*(len(src)+1) + [1]*len(trg) + [0]*pad_length`. Here are my results; the suffix "with_loss_mask" means the loss is computed only over tokens in the target sentence: `terms["loss"] = terms["mse_with_loss_mask"] + terms["decoder_nll_with_loss_mask"] + tT_loss_with_loss_mask`
| Metric | Value |
| --- | --- |
| decoder_nll | 7.04e-09 |
| decoder_nll_q0 | 1.75e-08 |
| decoder_nll_q1 | 1.36e-08 |
| decoder_nll_q2 | 1.16e-08 |
| decoder_nll_q3 | 2.7e-09 |
| decoder_nll_with_loss_mask | 2.56e-08 |
| decoder_nll_with_loss_mask_q0 | 5.69e-08 |
| decoder_nll_with_loss_mask_q1 | 6.13e-08 |
| decoder_nll_with_loss_mask_q2 | 3.49e-08 |
| decoder_nll_with_loss_mask_q3 | 9.4e-09 |
| grad_norm | 0.0356 |
| loss | 0.00671 |
| loss_q0 | 0.00704 |
| loss_q1 | 0.00685 |
| loss_q2 | 0.00674 |
| loss_q3 | 0.00663 |
| mse | 1.5 |
| mse_q0 | 3.58 |
| mse_q1 | 2.92 |
| mse_q2 | 2.24 |
| mse_q3 | 0.699 |
| mse_with_loss_mask | 0.00671 |
| mse_with_loss_mask_q0 | 0.00704 |
| mse_with_loss_mask_q1 | 0.00685 |
| mse_with_loss_mask_q2 | 0.00674 |
| mse_with_loss_mask_q3 | 0.00663 |
| nll | 51.2 |
| nll_q0 | 115 |
| nll_q1 | 95.9 |
| nll_q2 | 77.8 |
| nll_q3 | 25.1 |
| nll_with_loss_mask | 1.11 |
| nll_with_loss_mask_q0 | 0.0114 |
| nll_with_loss_mask_q1 | 0.14 |
| nll_with_loss_mask_q2 | 0.608 |
| nll_with_loss_mask_q3 | 1.62 |
| samples | 9.8e+08 |
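For readers following along, the masked-loss computation described above can be sketched as follows. This is a minimal sketch, not the repo's actual code; the toy lengths, the `per_token_mse` tensor, and the variable names are assumptions:

```python
import numpy as np

# Toy sequence layout assumed here: [src tokens] + [sep] + [trg tokens] + [pads]
src_len, trg_len, pad_length = 4, 3, 2
seq_len = src_len + 1 + trg_len + pad_length

# 0 for source/separator and pad positions, 1 for target positions,
# mirroring: loss_mask = [0]*(len(src)+1) + [1]*len(trg) + [0]*pad_length
loss_mask = np.array([0] * (src_len + 1) + [1] * trg_len + [0] * pad_length,
                     dtype=np.float32)

# Hypothetical per-token squared error between predicted and target embeddings
rng = np.random.default_rng(0)
per_token_mse = rng.random(seq_len).astype(np.float32)

# Masked mean: only target-sentence tokens contribute to the loss
mse_with_loss_mask = (per_token_mse * loss_mask).sum() / loss_mask.sum()
```

Without the mask, the mean is dominated by the many pad positions, which is one way the model can drive the loss down by predicting `<Pad>` everywhere.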
Here is an example of the generated texts; the model doesn't generate `<Pad>`, but it still can't generate the expected text.
When I use the original loss (without the loss mask), I get the following result:
The loss doesn't become very small, but the generated texts are much better.
Apart from the loss mask, the only difference between the two experiments above is the number of training steps. Maybe we just need to train for more steps and set a proper learning rate.
Did you only modify the trg (target) loss during training?
When I use the original loss (without the loss mask), I do not modify any code. The model trained with the loss mask did not perform well; maybe I need to train for more steps? I hope someone can give me some advice.
Did you modify `p_sample()` at the end? I find that if we change the `seq_len`, too many pads can seriously affect the results.
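One workaround along these lines is to suppress `<Pad>` at the rounding/decoding step so it can never be emitted. This is only a sketch of the idea, not the repo's actual `p_sample()`; `pad_id` and the `logits` array are assumed, hypothetical names:

```python
import numpy as np

pad_id = 0  # assumed index of <Pad> in the vocabulary

# Toy per-position vocabulary scores; in row 0, <Pad> would win the argmax
logits = np.array([[5.0, 1.2, 0.3],
                   [0.1, 0.2, 4.0]])

# Exclude <Pad> from the argmax by setting its score to -inf
masked = logits.copy()
masked[:, pad_id] = -np.inf
tokens = masked.argmax(axis=-1)  # -> array([1, 2]), never pad_id
```

This does not fix the underlying training signal, but it prevents pad-dominated sampling from flooding the generated sequence.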
Hello! Could you please show me your modified `training_losses_seq2seq`?
Hi,
In my project, I discovered that when taking `<Pad>` as a regular token, the diffusion model usually learns only the `<Pad>` information. In other words, the model tends to predict the `<Pad>` token instead of other words during generation.
How can I avoid this issue?