HLT-MT: High-resource Language-specific Training for Multilingual Neural Machine Translation


Abstract

Multilingual neural machine translation (MNMT), trained on multiple language pairs, has attracted considerable attention because sharing knowledge among languages reduces model parameters and training costs. Nonetheless, multilingual training suffers from performance degeneration in the shared parameters because of negative interference among different translation directions, especially for high-resource languages. In this paper, we propose HLT-MT, a multilingual translation model with high-resource language-specific training, to alleviate this negative interference; it adopts two-stage training with a language-specific selection mechanism. Specifically, we first train the multilingual model only on the high-resource pairs and select language-specific modules at the top of the decoder to enhance the translation quality of high-resource directions. Next, the model is further trained on all available corpora to transfer knowledge from high-resource languages (HRLs) to low-resource languages (LRLs). Experimental results show that HLT-MT outperforms various strong baselines on the WMT-10 and OPUS-100 benchmarks. Furthermore, analytic experiments validate the effectiveness of our method in mitigating negative interference in multilingual training.

Data

  • WMT-10
    • Multilingual dataset (from the WMT corpus) with 11 languages: 10 English-centric language pairs.
    • English (En), French (Fr), Czech (Cs), German (De), Finnish (Fi), Latvian (Lv), Estonian (Et), Romanian (Ro), Hindi (Hi), Turkish (Tr), and Gujarati (Gu).
  • OPUS-100
    • Massively multilingual dataset (from the OPUS-100 corpus) with 100 languages.
    • 94 English-centric language pairs are used, after dropping the 5 languages that lack corresponding test sets.

Dataset Statistics

Statistics and sources of the training, validation, and test sets from WMT between English and the other languages:

| Language | #Bitext | Training | Valid | Test |
|---|---|---|---|---|
| Fr (French) | 10M | WMT15 | Newstest13 | Newstest15 |
| Cs (Czech) | 10M | WMT19 | Newstest16 | Newstest18 |
| De (German) | 4.6M | WMT19 | Newstest16 | Newstest18 |
| Fi (Finnish) | 4.8M | WMT19 | Newstest16 | Newstest18 |
| Lv (Latvian) | 1.4M | WMT17 | Newsdev17 | Newstest17 |
| Et (Estonian) | 0.7M | WMT18 | Newsdev18 | Newstest18 |
| Ro (Romanian) | 0.5M | WMT16 | Newsdev16 | Newstest16 |
| Hi (Hindi) | 0.26M | WMT14 | Newsdev14 | Newstest14 |
| Tr (Turkish) | 0.18M | WMT18 | Newstest16 | Newstest18 |
| Gu (Gujarati) | 0.08M | WMT19 | Newsdev19 | Newstest19 |

OPUS-100 is introduced in "Improving Massively Multilingual Neural Machine Translation and Zero-Shot Translation" (Zhang et al., ACL 2020).
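
The training command in the next section reads binarized fairseq data from a directory such as /path/to/data-bin/. As a rough sketch only (file names, the shared dict.txt, and the SentencePiece step are assumptions, and the repository may provide its own preprocessing recipe), one English-centric pair could be binarized as follows, with the remaining pairs written to the same destination directory:

# Hypothetical example: binarize one pair (en-de) into a shared data-bin directory.
# Assumes train/valid/test files are already encoded with a shared SentencePiece model
# and that dict.txt is the shared vocabulary used for all languages.
fairseq-preprocess \
  --source-lang en --target-lang de \
  --trainpref spm/train.en-de --validpref spm/valid.en-de --testpref spm/test.en-de \
  --srcdict dict.txt --tgtdict dict.txt \
  --destdir /path/to/data-bin/ --workers 16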

Environment

git clone https://github.com/YuweiYin/HLT-MT
cd HLT-MT
pip install --editable ./
  • For faster training, install NVIDIA's apex library:
git clone https://github.com/NVIDIA/apex
cd apex
pip install -v --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" \
  --global-option="--deprecated_fused_adam" --global-option="--xentropy" \
  --global-option="--fast_multihead_attn" ./
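A quick sanity check that the fused extensions were built (hedged; this only imports one of apex's fused optimizers):

# Should print "apex OK" if the CUDA extensions installed correctly.
python -c "import apex; from apex.optimizers import FusedAdam; print('apex OK')"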

HLT-MT Training

# parameters
NODES=${1}           # number of nodes
MAX_TOKENS=${2}      # max tokens per GPU per batch
UPDATE_FREQ=${3}     # gradient accumulation steps
MAX_EPOCH=${4}       # maximum number of training epochs
LR=${5}              # peak learning rate
WARMUP_STEPS=${6}    # warmup updates for the LR scheduler
WEIGHT_DECAY=${7}    # weight decay
HIGH_LANGS=${8}      # comma-separated high-resource languages
LOW_LANGS=${9}       # comma-separated low-resource languages
LANG_PAIRS=${10}     # comma-separated training language pairs
ADAPTER_NUM=${11}    # number of language-specific adapter modules
ADAPTER_DIM=${12}    # hidden dimension of each adapter
APPEND_CMDS=${13}    # extra flags appended to the training command

# default values
if [ -z "${NODES}" ]; then
    NODES=4
fi
if [ -z "${MAX_TOKENS}" ]; then
    MAX_TOKENS=4096
fi
if [ -z "${UPDATE_FREQ}" ]; then
    UPDATE_FREQ=4
fi
if [ -z "${LR}" ]; then
    LR=3e-4
fi
if [ -z "${WARMUP_STEPS}" ]; then
    WARMUP_STEPS=4000
fi
if [ -z "${WEIGHT_DECAY}" ]; then
    WEIGHT_DECAY=0
fi
if [ -z "${ADAPTER_NUM}" ]; then
    ADAPTER_NUM=3
fi
if [ -z "${ADAPTER_DIM}" ]; then
    ADAPTER_DIM=3072
fi

LANGS="en,fr,cs,de,fi,lv,et,ro,hi,tr,gu"
GPUS=8
bsz=$((${MAX_TOKENS}*${UPDATE_FREQ}*${NODES}*${GPUS}))

TEXT=/path/to/data-bin/
MODEL=/path/to/model/
PRETRAINED_ENCODER_MODEL=/path/to/xlmr_model

python -m torch.distributed.launch \
  --nproc_per_node=${GPUS} --nnodes=${NODES} --node_rank=${OMPI_COMM_WORLD_RANK} \
  --master_addr=${MASTER_ADDR} --master_port=${MASTER_PORT} train.py ${TEXT} \
  --save-dir ${MODEL} --arch "two_stage_sparse_transformer" \
  --variant addffn --pretrained-infoxlm-checkpoint ${PRETRAINED_ENCODER_MODEL} \
  --init-encoder-only --init-decoder-only --task "translation_multi_simple_epoch" \
  --sampling-method "linear" --sampling-temperature 5.0 --min-sampling-temperature 1.0 \
  --encoder-langtok "tgt" --langtoks '{"main":("tgt",None)}' --langs ${LANGS} \
  --high-langs ${HIGH_LANGS} --low-langs ${LOW_LANGS} --lang-pairs ${LANG_PAIRS} \
  --ddp-backend=no_c10d --enable-reservsed-directions-shared-datasets \
  --share-all-embeddings --max-source-positions 256 --max-target-positions 256 \
  --criterion "label_smoothed_cross_entropy_with_sparse" --label-smoothing 0.1 \
  --optimizer adam --adam-betas '(0.9, 0.98)' --lr-scheduler inverse_sqrt --lr ${LR} \
  --warmup-epoch 5 --warmup-updates ${WARMUP_STEPS} \
  --max-update 400000 --max-epoch ${MAX_EPOCH} --max-tokens ${MAX_TOKENS} --update-freq ${UPDATE_FREQ} \
  --dropout 0.1 --attention-dropout 0.0 --weight-decay ${WEIGHT_DECAY} \
  --seed 1 --log-format simple --skip-invalid-size-inputs-valid-test \
  --fp16 --truncate-source --same-lang-per-batch --enable-lang-ids \
  --use-adapter --adapter-num ${ADAPTER_NUM} --adapter-dim ${ADAPTER_DIM} \
  --swap-adapter 0 --start-hard-epoch 1 --hard-adapter 0.5 --end-soft-epoch 5 --disparity-weight 1.0 \
  --log-file ${MODEL}/train.log --tensorboard-logdir ${MODEL}/logs ${APPEND_CMDS}
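
A hypothetical invocation of the script above (assuming it is saved as train_hlt_mt.sh; all argument values are illustrative, and MASTER_ADDR, MASTER_PORT, and OMPI_COMM_WORLD_RANK must be provided by your launcher on each node):

# Positional arguments: NODES MAX_TOKENS UPDATE_FREQ MAX_EPOCH LR WARMUP_STEPS WEIGHT_DECAY
#                       HIGH_LANGS LOW_LANGS LANG_PAIRS ADAPTER_NUM ADAPTER_DIM APPEND_CMDS
bash train_hlt_mt.sh 4 4096 4 50 3e-4 4000 0.0 \
  "fr,cs,de,fi,lv,et" \
  "ro,hi,tr,gu" \
  "en-fr,en-cs,en-de,en-fi,en-lv,en-et,en-ro,en-hi,en-tr,en-gu" \
  3 3072 ""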

Evaluation

  • Metric: case-sensitive detokenized BLEU, computed with sacreBLEU:
    • BLEU+case.mixed+lang.{src}-{tgt}+numrefs.1+smooth.exp+tok.13a+version.1.3.1
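
For example, one direction can be scored with the sacreBLEU command-line tool (file names are hypothetical; both hypotheses and references must be detokenized):

# Hypothetical example: score detokenized En-De output against the WMT reference.
cat hyp.en-de.detok.txt | sacrebleu ref.en-de.de.txt -l en-de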

Experiments

Baselines

The compared baselines (listed in the tables below) are BiNMT (a separate bilingual model for each language pair), MNMT, mBART, XLM-R (MNMT initialized with XLM-R), LS-MNMT, and the official OPUS-100 baseline.

Experiment Results

En-X evaluation results for bilingual (1-1), one-to-many (1-N), and many-to-many (N-N) models on WMT-10. The languages are ordered from high-resource languages (left) to low-resource languages (right).

| Setting | Models | #Params | Fr | Cs | De | Fi | Lv | Et | Ro | Hi | Tr | Gu | Avg (all) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1-1 | BiNMT | 242M/10M | 36.3 | 22.3 | 40.2 | 15.2 | 16.5 | 15.0 | 23.0 | 12.2 | 13.3 | 7.9 | 20.2 |
| 1-N | MNMT | 242M | 34.2 | 20.9 | 40.0 | 15.0 | 18.1 | 20.9 | 26.0 | 14.5 | 17.3 | 13.2 | 22.0 |
| 1-N | mBART | 611M | 33.7 | 20.8 | 38.9 | 14.5 | 18.2 | 20.5 | 26.0 | 15.3 | 16.8 | 12.9 | 21.8 |
| 1-N | XLM-R | 362M | 34.7 | 21.5 | 40.1 | 15.2 | 18.6 | 20.8 | 26.4 | 15.6 | 17.4 | 14.9 | 22.5 |
| 1-N | LS-MNMT | 409M | 35.0 | 21.7 | 40.6 | 15.5 | 18.9 | 21.0 | 26.2 | 14.8 | 16.5 | 12.8 | 22.3 |
| 1-N | HLT-MT | 381M | 36.2 | 22.2 | 41.8 | 16.6 | 19.5 | 21.1 | 26.6 | 15.8 | 17.1 | 14.6 | 23.2 |
| N-N | MNMT | 242M | 34.2 | 21.0 | 39.4 | 15.2 | 18.6 | 20.4 | 26.1 | 15.1 | 17.2 | 13.1 | 22.0 |
| N-N | mBART | 611M | 32.4 | 19.0 | 37.0 | 13.2 | 17.0 | 19.5 | 25.1 | 15.7 | 16.7 | 14.2 | 21.0 |
| N-N | XLM-R | 362M | 34.2 | 21.4 | 39.7 | 15.3 | 18.9 | 20.6 | 26.5 | 15.6 | 17.5 | 14.5 | 22.4 |
| N-N | LS-MNMT | 409M | 34.8 | 21.1 | 39.3 | 15.2 | 18.7 | 20.5 | 26.3 | 14.9 | 17.3 | 12.3 | 22.0 |
| N-N | HLT-MT | 381M | 35.8 | 22.4 | 41.5 | 16.3 | 19.6 | 21.0 | 26.6 | 15.7 | 17.6 | 14.7 | 23.1 |

X-En test BLEU for high/medium/low-resource language pairs in the many-to-many setting on the OPUS-100 test sets. BLEU scores are averaged across all language pairs in the respective groups. "WR": win ratio (%) compared to "ref" (MNMT).

| Models | #Params | High (45) | Med (21) | Low (28) | Avg (94) | WR |
|---|---|---|---|---|---|---|
| OPUS-100 baseline | 254M | 30.3 | 32.6 | 31.9 | 31.4 | - |
| MNMT | 242M | 32.3 | 35.1 | 35.8 | 33.9 | ref |
| XLM-R | 362M | 33.1 | 35.7 | 36.1 | 34.6 | - |
| LS-MNMT | 456M | 33.4 | 35.8 | 35.9 | 34.7 | - |
| HLT-MT | 391M | 34.1 | 36.6 | 36.1 | 35.3 | 72.3 |

En-X test BLEU for high/medium/low-resource language pairs in the many-to-many setting on the OPUS-100 test sets.

| Models | #Params | High (45) | Med (21) | Low (28) | Avg (94) | WR |
|---|---|---|---|---|---|---|
| OPUS-100 baseline | 254M | 23.7 | 25.6 | 22.2 | 24.0 | - |
| MNMT | 242M | 26.3 | 31.4 | 31.2 | 28.9 | ref |
| XLM-R | 362M | 26.9 | 31.9 | 31.7 | 29.4 | - |
| LS-MNMT | 456M | 27.5 | 31.6 | 31.5 | 29.6 | - |
| HLT-MT | 391M | 27.6 | 33.3 | 31.8 | 30.1 | 77.7 |

Ablation Study

Ablation study of our proposed approach on the WMT-10 benchmark. Our method can easily be initialized from the cross-lingual pretrained model XLM-R to further improve performance.

| XLM-R | Two-stage Training | SLP | Avg (high) | Avg (low) | Avg (all) |
|---|---|---|---|---|---|
| ✗ | ✗ | ✗ | 24.9 | 17.8 | 22.0 |
| ✗ | ✓ | ✗ | 25.4 | 18.0 | 22.4 |
| ✗ | ✓ | ✓ | 26.0 | 18.1 | 22.8 |
| ✓ | ✗ | ✗ | 25.2 | 18.5 | 22.5 |
| ✓ | ✓ | ✗ | 26.0 | 17.9 | 22.8 |
| ✓ | ✓ | ✓ | 26.2 | 18.5 | 23.2 |

Citation

@inproceedings{hltmt,
  title     = {High-resource Language-specific Training for Multilingual Neural Machine Translation},
  author    = {Yang, Jian and Yin, Yuwei and Ma, Shuming and Zhang, Dongdong and Li, Zhoujun and Wei, Furu},
  booktitle = {Proceedings of the Thirty-First International Joint Conference on
               Artificial Intelligence, {IJCAI-22}},
  publisher = {International Joint Conferences on Artificial Intelligence Organization},
  editor    = {Luc De Raedt},
  pages     = {4461--4467},
  year      = {2022},
  month     = {7},
  note      = {Main Track},
  doi       = {10.24963/ijcai.2022/619},
  url       = {https://doi.org/10.24963/ijcai.2022/619},
}

License

Please refer to the LICENSE file for more details.

Contact

If you have any questions, feel free to open a GitHub issue or contact us by email.
