This repository has been archived by the owner on Oct 31, 2023. It is now read-only.

couldn't match SOTA performance on wmt14 EnDe #32

Closed

yilinyang7 opened this issue Mar 6, 2019 · 3 comments

Comments


yilinyang7 commented Mar 6, 2019

Dear authors,

I understand this repo isn't primarily aimed at supervised MT, but your codebase contains a Transformer encoder-decoder model and, more importantly, it is much simpler than the standard supervised MT codebases (e.g. T2T, Fairseq, OpenNMT).

Intending to reproduce SOTA performance on WMT14 En-De, I used the data & BPE from Fairseq and trained a Transformer base (emb_dim=512) with only mt_steps="en-de" on 4x 2080 Ti (a single-GPU run scored even lower). I finally got a tokenized BLEU score of 25.63 with beam_size 4 and length_penalty 0.6, which is more than 1 BLEU below the score reported in the Transformer paper.
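
For concreteness, length-penalized re-ranking of beam hypotheses typically looks like the sketch below; the GNMT-style penalty is one common convention, and the exact normalization XLM applies may differ:

```python
def length_penalty(length: int, alpha: float = 0.6) -> float:
    # GNMT-style penalty: lp(Y) = ((5 + |Y|) / 6) ** alpha
    return ((5.0 + length) / 6.0) ** alpha

def rerank(hypotheses, alpha=0.6):
    # hypotheses: list of (tokens, sum_log_prob) pairs from beam search;
    # pick the hypothesis with the best penalized score
    return max(hypotheses, key=lambda h: h[1] / length_penalty(len(h[0]), alpha))
```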

Training script:

```bash
export NGPU=4
python -m torch.distributed.launch --nproc_per_node=$NGPU train.py \
    --exp_name wmt14_ende \
    --dump_path ./dumped/ \
    --data_path ./data/processed/wmt14_de-en/fairseq \
    --lgs 'en-de' \
    --encoder_only false \
    --emb_dim 512 \
    --n_layers 6 \
    --n_heads 8 \
    --dropout 0.1 \
    --attention_dropout 0.1 \
    --gelu_activation true \
    --tokens_per_batch 6000 \
    --bptt 256 \
    --optimizer adam_inverse_sqrt,beta1=0.9,beta2=0.98,lr=0.0001 \
    --epoch_size 200000 \
    --eval_bleu true \
    --stopping_criterion 'valid_en-de_mt_bleu,10' \
    --validation_metrics 'valid_en-de_mt_bleu' \
    --mt_steps "en-de" \
    --gpus '0,1,2,3'
```
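
For reference, adam_inverse_sqrt warms the learning rate up linearly and then decays it with the inverse square root of the update count. A minimal sketch, assuming fairseq-style defaults (warmup_updates=4000, warmup_init_lr=1e-7), which XLM may override:

```python
def adam_inverse_sqrt_lr(step, lr=1e-4, warmup_updates=4000, warmup_init_lr=1e-7):
    """Inverse-sqrt schedule: linear warmup to `lr`, then lr * sqrt(warmup/step)."""
    if step < warmup_updates:
        return warmup_init_lr + step * (lr - warmup_init_lr) / warmup_updates
    return lr * (warmup_updates ** 0.5) * (step ** -0.5)
```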

Translate results:

```
valid_en-de_mt_ppl  -> 5.401580
valid_en-de_mt_acc  -> 65.806969
valid_en-de_mt_bleu -> 28.990000
test_en-de_mt_ppl   -> 5.942769
test_en-de_mt_acc   -> 66.605212
test_en-de_mt_bleu  -> 25.630000
```
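
For anyone comparing these numbers: XLM reports tokenized BLEU via multi-bleu.perl. A rough equivalent on already-tokenized files using sacrebleu (hyp.tok and ref.tok are placeholder paths):

```python
import sacrebleu

# Placeholder paths: one already-tokenized sentence per line.
with open("hyp.tok") as f:
    hyps = [line.rstrip("\n") for line in f]
with open("ref.tok") as f:
    refs = [line.rstrip("\n") for line in f]

# tokenize="none" scores the text as-is, approximating multi-bleu.perl;
# force=True silences the warning about pre-tokenized input.
bleu = sacrebleu.corpus_bleu(hyps, [refs], tokenize="none", force=True)
print(bleu.score)
```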

My intuition is that the model structure is slightly different (GELU, layer norm, etc.). May I ask whether you have tried this codebase on the supervised WMT14 benchmark, and what your thoughts are on this?

Best.


glample commented Mar 6, 2019

Hi,

Yes, unfortunately I also tried, and I have never been able to reproduce the fairseq results with XLM on supervised tasks; there was always a gap of 1 or 2 BLEU. This is a bit annoying, because if we could match the supervised results we would probably also do better in the unsupervised / semi-supervised settings.

I really don't think that the architectural differences (we have one extra layer norm after the embeddings, I believe, and when I compared I didn't use GELU) can explain the gap in BLEU. There are a couple of things fairseq has that we don't, such as label smoothing and checkpoint averaging, and I think those are the features we are missing to reach SOTA results in supervised MT. If you see other features that may explain the difference, I can try to implement them and retry on the supervised task.
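
For readers unfamiliar with the two missing features: label smoothing replaces the one-hot cross-entropy target with a slightly flattened distribution, and checkpoint averaging averages the parameters of the last few saved checkpoints before decoding. Minimal PyTorch sketches of both (the exact loss formulation and the "model" key in the checkpoint dict are assumptions, not XLM's or fairseq's actual code):

```python
import torch
import torch.nn.functional as F

def label_smoothed_nll_loss(logits, target, eps=0.1):
    # logits: (batch, vocab), target: (batch,) gold token ids
    lprobs = F.log_softmax(logits, dim=-1)
    nll = -lprobs.gather(dim=-1, index=target.unsqueeze(-1)).squeeze(-1)
    smooth = -lprobs.mean(dim=-1)  # cross-entropy against a uniform prior
    return ((1.0 - eps) * nll + eps * smooth).mean()

def average_checkpoints(paths):
    # Average parameter tensors across several checkpoints (assumes each
    # file stores its parameters under a "model" key).
    avg = None
    for p in paths:
        state = torch.load(p, map_location="cpu")["model"]
        if avg is None:
            avg = {k: v.clone().float() for k, v in state.items()}
        else:
            for k in avg:
                avg[k] += state[k].float()
    return {k: v / len(paths) for k, v in avg.items()}
```

The Transformer paper used label smoothing with eps=0.1 and averaged the last checkpoints before evaluation, so both features plausibly matter for closing the gap.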

@yilinyang7 (Author)

Thank you for the clarification. I'll keep looking into it and let you know if I find anything.

@sugeeth14

Hi @yilinyang7, I wanted to pretrain a language model with the MLM objective and then use it for supervised MT training on En-De, but I am unable to do so because of an error. If you managed to run supervised machine translation with a pretrained language model, could you please elaborate on the steps to follow? It would be great if you could share the commands you tried.
Thanks in advance.

glample closed this as completed Jun 5, 2019