# Benchmark of NJUNMT-pytorch

## DL4MT

### Chinese to English

See the configuration in `dl4mt_config.yaml`.

| Decay Method | Granularity | MT03 (Dev) | MT04 | MT05 | MT06 |
| --- | --- | --- | --- | --- | --- |
| Loss | Word (30K) | 38.84 | 41.02 | 36.46 | 35.26 |
| Loss | BPE (30K) | 37.72 | 38.64 | 35.09 | 33.73 |
| Noam | Word (30K) | 38.31 | 39.82 | 35.84 | 33.96 |
| Noam | BPE (30K) | 38.48 | 40.47 | 36.79 | 35.21 |

Word (30K): Train the NMT model at the word level, keeping the 30K most frequent words and mapping the rest to a special unknown token.

BPE (30K): Use Byte Pair Encoding to split words into subword sequences. We apply 30K BPE merge operations and keep all of the resulting subword tokens.
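
For illustration, here is a minimal sketch of what the word-level 30K vocabulary amounts to; the function and token names (`build_word_vocab`, `<UNK>`) are hypothetical and not the repository's actual code:

```python
from collections import Counter

UNK_TOKEN = "<UNK>"  # illustrative placeholder for the special unknown token


def build_word_vocab(sentences, max_size=30000):
    """Keep the `max_size` most frequent words; all other words map to UNK."""
    counter = Counter(tok for sent in sentences for tok in sent.split())
    vocab = {UNK_TOKEN: 0}
    for word, _ in counter.most_common(max_size):
        vocab[word] = len(vocab)
    return vocab


def map_to_ids(sentence, vocab):
    """Convert a sentence to ids, replacing out-of-vocabulary words with the UNK id."""
    return [vocab.get(tok, vocab[UNK_TOKEN]) for tok in sentence.split()]
```

The BPE setting avoids this truncation entirely: after 30K merge operations every word decomposes into in-vocabulary subword units, so no unknown token is needed.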

When using Loss as the learning-rate decay method, the BPE model performs abnormally worse than the word-level model. This result is confusing; one possible reason is that the first learning-rate decay occurs too late, which makes this scheduling policy degenerate into vanilla Adam.
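
For reference, a rough sketch of the two schedules being compared is shown below; the hyperparameters (`warmup_steps`, `decay_factor`, and the improvement check) are illustrative assumptions, not the exact experimental settings:

```python
def noam_lr(step, d_model=512, warmup_steps=4000):
    """Noam schedule: linear warmup followed by inverse-square-root decay."""
    step = max(step, 1)
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)


def loss_decay_lr(current_lr, dev_loss, best_dev_loss, decay_factor=0.5):
    """Loss-based schedule: shrink the learning rate when dev loss stops improving."""
    if dev_loss >= best_dev_loss:
        return current_lr * decay_factor
    return current_lr
```

Under the loss-based schedule, the learning rate stays constant until the first non-improving validation, so if that happens very late the optimizer behaves like plain Adam for most of training.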

## Transformer

### NIST ZH2EN

We use the same settings as `transformer_basev2`.

| System | MT03 (dev) | MT04 | MT05 | MT06 |
| --- | --- | --- | --- | --- |
| Word (maxlen=80) | 43.88 | 45.68 | 42.14 | 41.32 |
| BPE (maxlen=100) | 45.83 | 46.66 | 43.36 | 42.17 |

You can reproduce these results by using `transformer_nist_zh2en.yaml` and `transformer_nist_zh2en_bpe.yaml` under the `configs` folder.

### WMT14 EN2DE

| System | newstest2013 (dev) | newstest2014 |
| --- | --- | --- |
| Ours | 25.58 | 24.31 |
| (Vaswani et al., 2017) | 25.8 | n/a |

We also try to reproduce the result reported in "Attention Is All You Need". We use the settings of `transformer_basev2`. We do not use the modified Adam parameters $\beta_1 = 0.9, \beta_2 = 0.98$, as we find they perform worse than the original parameters. The BLEU scores reported here come from the best model within the first 100K steps, which matches the setup of the ablation experiments in that paper. We also report the BLEU score on the dev data from that paper.
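
As a concrete illustration, the difference comes down to how the Adam optimizer is constructed. Below is a minimal PyTorch sketch with a stand-in model and an arbitrary base learning rate; only the `betas` and `eps` values are the point of comparison:

```python
import torch
import torch.nn as nn

model = nn.Linear(512, 512)  # stand-in for the actual NMT model

# Original (default) Adam parameters, which worked better in our runs:
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3,
                             betas=(0.9, 0.999), eps=1e-8)

# Modified parameters from "Attention Is All You Need" (not used for these results):
# optimizer = torch.optim.Adam(model.parameters(), lr=1e-3,
#                              betas=(0.9, 0.98), eps=1e-9)
```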

You can reproduce this result by using `transformer_wmt14_en2de.yaml` under the `configs` folder.