libeineu/SDT-Training

Shallow-to-Deep Training for Neural Machine Translation

Bei Li, Ziyang Wang, Hui Liu, Yufan Jiang, Quan Du, Tong Xiao, Huizhen Wang and Jingbo Zhu. Shallow-to-Deep Training for Neural Machine Translation. In Proceedings of EMNLP, 2020. [paper][code]

SDT Transformer on Fairseq

The SDT model is built on Fairseq v0.6.2, the Transformer system implemented by Facebook.

Runtime Environment

This system has been tested in the following environment.

  • Python version >=3.6
  • PyTorch version >=1.0.0

For SDT-Transformer:

First, go into the SDT-training directory.

The training script for each stage is the same as in Fairseq, with one additional parameter required when stacking encoder layers:

  • add --reset-optimizer for SDT-Transformer.

Once the warm-up phase is over, resetting the optimizer state also resets the learning rate to its peak value.
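The effect of the reset on the learning rate can be sketched as follows. This is a minimal illustration, not the repository's code: the warm-up and decay formulas follow the usual inverse square root schedule (the example script uses --lr-scheduler inverse_sqrt), while sdt_lr, reset_steps, and the peak value 0.002 are hypothetical names and numbers chosen for the example.

```python
import math


def inverse_sqrt_lr(step, peak_lr, warmup_updates=16000, warmup_init_lr=1e-7):
    """Linear warm-up to peak_lr, then decay proportional to 1/sqrt(step)."""
    if step < warmup_updates:
        return warmup_init_lr + step * (peak_lr - warmup_init_lr) / warmup_updates
    return peak_lr * math.sqrt(warmup_updates / step)


def sdt_lr(step, peak_lr, reset_steps=(), warmup_updates=16000, warmup_init_lr=1e-7):
    """Assumed behaviour: a reset after warm-up restarts the decay from the peak.

    reset_steps is a hypothetical list of update numbers at which the
    optimizer state was reset (i.e. where a new stage began).
    """
    past = [r for r in reset_steps if warmup_updates <= r <= step]
    if not past:
        return inverse_sqrt_lr(step, peak_lr, warmup_updates, warmup_init_lr)
    # Count updates since the latest reset, starting from the end of warm-up.
    effective_step = warmup_updates + (step - max(past))
    return inverse_sqrt_lr(effective_step, peak_lr, warmup_updates, warmup_init_lr)


# Without a reset the rate has decayed; with a reset at 64000 it is back at the peak.
print(inverse_sqrt_lr(64000, 0.002))
print(sdt_lr(64000, 0.002, reset_steps=[64000]))
```

Under these assumptions each new stage starts its decay from the peak again instead of continuing from an already small learning rate.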

For --arch and related arguments, use sdt_ as the prefix for the SDT Transformer, for example:

  • --arch transformer_t2t_wmt_en_de_big -> --arch sdt_transformer_t2t_wmt_en_de_big

The arch we use is sdt_transformer_t2t_wmt_en_de_nl, where n is the depth of the encoder (for example, sdt_transformer_t2t_wmt_en_de_6l for a 6-layer encoder).
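The naming convention can be captured in two small helpers. They are hypothetical (not part of the repository) and only build strings; whether a given depth is actually registered as an architecture is not checked here.

```python
def to_sdt_arch(fairseq_arch):
    """Add the sdt_ prefix to a Fairseq architecture name (idempotent)."""
    if fairseq_arch.startswith("sdt_"):
        return fairseq_arch
    return "sdt_" + fairseq_arch


def sdt_arch_for_depth(encoder_depth):
    """Build sdt_transformer_t2t_wmt_en_de_<n>l for an encoder of depth n."""
    if encoder_depth < 1:
        raise ValueError("encoder_depth must be a positive integer")
    return f"sdt_transformer_t2t_wmt_en_de_{encoder_depth}l"


print(to_sdt_arch("transformer_t2t_wmt_en_de_big"))
print(sdt_arch_for_depth(6))
```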

Example script for a single training phase:

export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
max_tokens=2048
data_dir=google
save_dir_1=
lr_1=
python3 -u train.py data-bin/$data_dir \
--distributed-world-size 8 -s en -t de \
--ddp-backend no_c10d \
--arch sdt_transformer_t2t_wmt_en_de_6l \
--optimizer adam --clip-norm 0.0 \
--lr-scheduler inverse_sqrt --warmup-init-lr 1e-07 --warmup-updates 16000 \
--lr $lr_1 --min-lr 1e-09 \
--weight-decay 0.0 \
--criterion label_smoothed_cross_entropy --label-smoothing 0.1 \
--max-tokens $max_tokens \
--update-freq 4 \
--no-progress-bar \
--fp16 \
--adam-betas '(0.9, 0.997)' \
--log-interval 100 \
--share-all-embeddings \
--max-epoch 2 \
--save-dir $save_dir_1 \
--keep-last-epochs 5 \
--tensorboard-logdir $save_dir_1 > $save_dir_1/train.log
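The batch size implied by this script can be worked out directly. The sketch below assumes the usual Fairseq semantics: --max-tokens is an upper bound on tokens per GPU per step, and --update-freq accumulates gradients over that many steps before each parameter update. The function name is illustrative.

```python
def max_tokens_per_update(max_tokens, world_size, update_freq):
    """Upper bound on the number of tokens contributing to one parameter update."""
    return max_tokens * world_size * update_freq


# Values taken from the script: --max-tokens 2048, 8 GPUs, --update-freq 4.
print(max_tokens_per_update(2048, 8, 4))  # 65536
```

Since --max-tokens is only an upper bound, the actual number of tokens per update will usually be somewhat lower.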

When a training stage is over, stack the encoder layers with stack.py to construct a deeper network:

save_dir_1=
save_dir_2=
num_layer=
python3 stack.py $save_dir_1/checkpoint_last.pt $save_dir_2/checkpoint_last.pt $num_layer

  • save_dir_1 is the directory holding the model from the previous stage.
  • save_dir_2 is the directory for the deeper model to be trained in the current stage.
  • num_layer is the number of encoder layers to be copied.
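Conceptually, stacking copies encoder-layer parameters into new layer slots on top of the existing encoder. The sketch below shows one plausible way to do that on a plain dictionary; it is not the code of stack.py. It assumes parameter keys of the form encoder.layers.<i>.<name> and that the topmost num_layer layers are the ones copied; stack_encoder_layers and the toy keys are hypothetical.

```python
import re

# Matches keys such as "encoder.layers.3.self_attn.out_proj.weight".
LAYER_KEY = re.compile(r"^encoder\.layers\.(\d+)\.(.+)$")


def stack_encoder_layers(model_state, num_layer):
    """Return a new state dict with the top num_layer encoder layers duplicated."""
    indices = [int(m.group(1)) for m in map(LAYER_KEY.match, model_state) if m]
    if not indices:
        raise ValueError("no encoder layer parameters found")
    depth = max(indices) + 1
    if not 0 < num_layer <= depth:
        raise ValueError("num_layer must be between 1 and the current depth")

    stacked = dict(model_state)
    for key, value in model_state.items():
        m = LAYER_KEY.match(key)
        if m and int(m.group(1)) >= depth - num_layer:
            new_index = int(m.group(1)) + num_layer
            # Values are shared here; real tensors would likely need a copy.
            stacked[f"encoder.layers.{new_index}.{m.group(2)}"] = value
    return stacked


toy_state = {
    "encoder.layers.0.weight": "w0",
    "encoder.layers.1.weight": "w1",
    "decoder.layers.0.weight": "d0",
}
print(sorted(stack_encoder_layers(toy_state, 1)))
```

Decoder parameters and all other keys are carried over unchanged in this sketch.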

To train a 48-layer model from scratch with the SDT method, run SDT_train.sh:

nohup sh SDT_train.sh > train.log &
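The contents of SDT_train.sh are not shown here, but the overall procedure alternates training and stacking until the target depth is reached. The sketch below only lists the encoder depth of each stage. The 6-layer start matches the example script above; stacking 6 layers per stage is an assumption made purely for illustration, and sdt_stage_depths is a hypothetical helper.

```python
def sdt_stage_depths(start_depth, layers_per_stack, target_depth):
    """List the encoder depth trained at each stage, from shallow to deep."""
    if start_depth < 1 or layers_per_stack < 1:
        raise ValueError("start_depth and layers_per_stack must be positive")
    depths = [start_depth]
    while depths[-1] < target_depth:
        # Never overshoot the target depth on the final stacking step.
        depths.append(min(depths[-1] + layers_per_stack, target_depth))
    return depths


for stage, depth in enumerate(sdt_stage_depths(6, 6, 48), start=1):
    print(f"stage {stage}: --arch sdt_transformer_t2t_wmt_en_de_{depth}l")
```

Between consecutive stages, stack.py would be run on the previous stage's checkpoint_last.pt, and the next training run would use --reset-optimizer.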

Citation

Please cite as:

@article{li2020shallow,
  title={Shallow-to-Deep Training for Neural Machine Translation},
  author={Li, Bei and Wang, Ziyang and Liu, Hui and Jiang, Yufan and Du, Quan and Xiao, Tong and Wang, Huizhen and Zhu, Jingbo},
  journal={arXiv preprint arXiv:2010.03737},
  year={2020}
}




Fairseq(-py) is a sequence modeling toolkit that allows researchers and developers to train custom models for translation, summarization, language modeling and other text generation tasks.

Features:

Fairseq provides reference implementations of various sequence-to-sequence models, including:

Additionally:

  • multi-GPU (distributed) training on one machine or across multiple machines
  • fast generation on both CPU and GPU with multiple search algorithms implemented
  • large mini-batch training even on a single GPU via delayed updates
  • mixed precision training (trains faster with less GPU memory on NVIDIA tensor cores)
  • extensible: easily register new models, criterions, tasks, optimizers and learning rate schedulers

We also provide pre-trained models for translation and language modeling with a convenient torch.hub interface:

import torch

en2de = torch.hub.load('pytorch/fairseq', 'transformer.wmt19.en-de.single_model')
en2de.translate('Hello world', beam=5)
# 'Hallo Welt'

See the PyTorch Hub tutorials for translation and RoBERTa for more examples.

Requirements and Installation

  • PyTorch version >= 1.4.0
  • Python version >= 3.6
  • For training new models, you'll also need an NVIDIA GPU and NCCL
  • For faster training install NVIDIA's apex library with the --cuda_ext and --deprecated_fused_adam options

To install fairseq:

pip install fairseq

On macOS:

CFLAGS="-stdlib=libc++" pip install fairseq

If you use Docker, make sure to increase the shared memory size with either --ipc=host or --shm-size as a command-line option to nvidia-docker run.

Installing from source

To install fairseq from source and develop locally:

git clone https://github.com/pytorch/fairseq
cd fairseq
pip install --editable .

Getting Started

The full documentation contains instructions for getting started, training new models and extending fairseq with new model types and tasks.

Pre-trained models and examples

We provide pre-trained models and pre-processed, binarized test sets for several tasks listed below, as well as example training and evaluation commands.

  • Translation: convolutional and transformer models are available
  • Language Modeling: convolutional and transformer models are available
  • wav2vec: wav2vec large model is available

We also have more detailed READMEs to reproduce results from specific papers:

Join the fairseq community

License

fairseq(-py) is MIT-licensed. The license applies to the pre-trained models as well.

Citation

Please cite as:

@inproceedings{ott2019fairseq,
  title = {fairseq: A Fast, Extensible Toolkit for Sequence Modeling},
  author = {Myle Ott and Sergey Edunov and Alexei Baevski and Angela Fan and Sam Gross and Nathan Ng and David Grangier and Michael Auli},
  booktitle = {Proceedings of NAACL-HLT 2019: Demonstrations},
  year = {2019},
}
