See the configuration in dl4mt_config.yaml.
Decay Method | Granularity | MT03(Dev) | MT04 | MT05 | MT06 |
---|---|---|---|---|---|
Loss | Word(30K) | 38.84 | 41.02 | 36.46 | 35.26 |
Loss | BPE(30K) | 37.72 | 38.64 | 35.09 | 33.73 |
Noam | Word(30K) | 38.31 | 39.82 | 35.84 | 33.96 |
Noam | BPE(30K) | 38.48 | 40.47 | 36.79 | 35.21 |
Word(30K): train the NMT model at the word level, keeping the 30K most frequent words and mapping the rest to a special unknown token.
BPE(30K): use Byte Pair Encoding to split words into subword sequences. We apply 30K BPE merge operations here and keep all of the resulting BPE tokens.
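As a rough sketch of these two preprocessing choices (assuming the third-party subword_nmt Python package, which is not part of this repo; file paths below are illustrative):

```python
# Sketch of the Word(30K) / BPE(30K) preprocessing described above.
# Assumes the `subword_nmt` package; paths are placeholders.
from collections import Counter
from subword_nmt.learn_bpe import learn_bpe
from subword_nmt.apply_bpe import BPE

# Word(30K): keep the 30K most frequent words; everything else would be
# mapped to the unknown token at training time.
def build_word_vocab(corpus_path, size=30000):
    counter = Counter()
    with open(corpus_path, encoding="utf-8") as f:
        for line in f:
            counter.update(line.split())
    return {w for w, _ in counter.most_common(size)}

# BPE(30K): learn 30K merge operations, then segment every sentence.
with open("train.zh", encoding="utf-8") as infile, \
     open("bpe30k.codes", "w", encoding="utf-8") as codes_out:
    learn_bpe(infile, codes_out, num_symbols=30000)

with open("bpe30k.codes", encoding="utf-8") as codes_in:
    bpe = BPE(codes_in)

# Rare words are split into subword units joined by "@@".
print(bpe.process_line("今天 天气 很 好"))
```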
When Loss is used as the learning-rate decay method, the BPE model performs abnormally worse than the word-level model. This result is confusing; one possible reason is that the first decay occurs too late, which makes this scheduling policy degenerate into vanilla Adam.
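For reference, a rough sketch of the two learning-rate policies being compared (not this repo's exact implementation; d_model, warmup_steps, and the decay factor below are illustrative):

```python
def noam_lr(step, d_model=512, warmup_steps=4000, factor=1.0):
    """Noam schedule from "Attention Is All You Need": warm up linearly,
    then decay with the inverse square root of the step."""
    step = max(step, 1)
    return factor * d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)

def loss_decay_lr(current_lr, best_dev_loss, new_dev_loss, decay_factor=0.5):
    """Loss-based decay: shrink the learning rate whenever the dev loss stops
    improving. If the first decay comes very late, training is effectively
    vanilla Adam with a constant rate until then."""
    if new_dev_loss >= best_dev_loss:
        return current_lr * decay_factor
    return current_lr
```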
We use the same settings as transformer_basev2.
System | MT03(dev) | MT04 | MT05 | MT06 |
---|---|---|---|---|
Word(maxlen=80) | 43.88 | 45.68 | 42.14 | 41.32 |
BPE(maxlen=100) | 45.83 | 46.66 | 43.36 | 42.17 |
You can reproduce these results by using transformer_nist_zh2en.yaml and transformer_nist_zh2en_bpe.yaml under the configs folder.
System | newstest2013(dev) | newstest2014 |
---|---|---|
Ours | 25.58 | 24.31 |
(Vaswani et al., 2017) | 25.8 | n/a |
We also tried to reproduce the result reported in "Attention Is All You Need", using the settings of transformer_basev2. We do not use the modified Adam hyperparameters described in that paper.
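For concreteness, a hedged PyTorch sketch of that difference: the paper uses Adam with beta2 = 0.98 and eps = 1e-9, while we keep the library defaults (the model line below is only a placeholder):

```python
import torch

model = torch.nn.Linear(512, 512)  # placeholder standing in for the Transformer model

# Adam as configured in "Attention Is All You Need".
paper_adam = torch.optim.Adam(model.parameters(), lr=1e-3, betas=(0.9, 0.98), eps=1e-9)

# PyTorch defaults, which our run keeps: betas=(0.9, 0.999), eps=1e-8.
default_adam = torch.optim.Adam(model.parameters(), lr=1e-3)
```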
You can reproduce this result by using transformer_wmt14_en2de.yaml under the configs folder.