Simulated Multiple Reference Training (SMRT)

This repo contains a fork of fairseq sufficient to replicate the experiments in Simulated Multiple Reference Training Improves Low-Resource Machine Translation by Huda Khayrallah, Brian Thompson, Matt Post, Philipp Koehn.

Here's the abstract from the paper:

Many valid translations exist for a given sentence, yet machine translation (MT) is trained with a single reference translation, exacerbating data sparsity in low-resource settings. We introduce Simulated Multiple Reference Training (SMRT), a novel MT training method that approximates the full space of possible translations by sampling a paraphrase of the reference sentence from a paraphraser and training the MT model to predict the paraphraser’s distribution over possible tokens. We demonstrate the effectiveness of SMRT in low-resource settings when translating to English, with improvements of 1.2 to 7.0 BLEU. We also find SMRT is complementary to back-translation.

An illustration of our method is below. Each time a target sentence in the training data is used, a paraphrase (blue) is sampled from the many possible paraphrases (grey). The training objective also takes into account the distribution over all the possible options (green) along the sampled path.

[Image: paraphrase lattice]

Our method is implemented as a fairseq criterion, enabled via:

--criterion smrt_cross_entropy

It takes a pretrained fairseq paraphraser model and the data directory containing the corresponding dictionary:

--paraphraser-model /path/to/paraphraser/paraphraser.pt  
--paraphraser-data-dir /directory/containing/dictionary/

During training, our new objective is mixed with standard label-smoothed cross-entropy training. To control the fraction of the time the new objective is used:

--prob-use-smrt 0.5

Finally, to control the amount of variation introduced when sampling a path from the paraphraser:

--paraphraser-sample-topN 100
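
These options plug into an otherwise standard fairseq training command. As a rough sketch (the paths below are placeholders and the remaining options are kept minimal; see the Replication Example below for the exact command used in the paper):

python train.py /path/to/databin \
    --arch transformer --label-smoothing 0.2 --max-tokens 4000 \
    --criterion smrt_cross_entropy \
    --paraphraser-model /path/to/paraphraser/paraphraser.pt \
    --paraphraser-data-dir /path/to/paraphraser/data/ \
    --prob-use-smrt 0.5 \
    --paraphraser-sample-topN 100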

Installation Instructions

To install in a conda environment:

git clone git@github.com:thompsonb/fairseq-smrt.git #or git clone https://github.com/thompsonb/fairseq-smrt.git
cd fairseq-smrt
conda create -n fairseq-smrt python=3.7 pytorch=1.4.0 torchvision=0.5.0 -c pytorch
source activate fairseq-smrt
pip install --editable . #this command needs to be run from inside the fairseq-smrt directory. 
pip install 'tensorboard==2.2.1'
pip install 'tensorboardX==2.0'
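
As a quick sanity check (our addition, not part of the original instructions) that PyTorch and this fairseq fork are importable and that CUDA is visible:

python -c "import torch, fairseq; print(torch.__version__, torch.cuda.is_available())"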

Paraphraser Model

We release a pre-trained paraphraser model trained on the ParaBank2 dataset, along with its associated vocabulary and the SentencePiece model.

To use this paraphraser in MT training, you must apply this SentencePiece model to the target side of your training data and use this dictionary in training.
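
For example, raw target-side text can be encoded with the released SentencePiece model using spm_encode from the sentencepiece package (a sketch; the .model filename below is illustrative and should match the file shipped in the paraphraser release):

spm_encode --model=smrt-parabank2-fairseq/sentencepiece.model \
    < train.raw.en > train.sp.en  # model filename inside the release may differ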

Replication Example

Below are commands to replicate one of the experiments (hu-en) from the paper. We assume you have followed the installation instructions above. You should run on a machine with a GPU and CUDA. Other language pairs follow the same pattern.

source activate fairseq-smrt
fairseq_smrt=`pwd` #set this to repo path
cd $fairseq_smrt

#download and extract the paraphrase model 
wget data.statmt.org/smrt/smrt-parabank2-fairseq.tgz
tar -xf smrt-parabank2-fairseq.tgz

# download and extract the data split
wget data.statmt.org/smrt/smrt-globalvoices-splits.tgz
tar -xf smrt-globalvoices-splits.tgz

#apply fairseq-preprocess to the files that already have SentencePiece applied
trg=en
src=hu

data=$fairseq_smrt/smrt-globalvoices-splits/$src-$trg
databin=$data/databin
mkdir -p $databin

python $fairseq_smrt/preprocess.py --source-lang $src --target-lang $trg \
        --trainpref $data/train.sp --validpref $data/valid.sp \
        --testpref $data/test.sp  --workers 30 \
        --tgtdict smrt-parabank2-fairseq/dict.en.txt \
        --destdir $databin

# run training
traindir=$fairseq_smrt/expts/$src-$trg
mkdir -p $traindir
mkdir -p $traindir/tensorboard

# The total batch size should be 16000 tokens. Due to the way batching works in fairseq,
# this should be the product of (# of GPUs) * (max-tokens) * (update-freq).
# The exact setting of each of these will vary based on your hardware.
batchsize=4000
updatefreq=4  # set to 4 for 1 GPU, 2 for 2 GPUs, 1 for 4 GPUs if you use a batchsize of 4000
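
# Optional sanity check (not part of the original recipe): verify that the
# effective batch size of (# of GPUs) * max-tokens * update-freq is 16000.
num_gpus=1  # set to the number of GPUs you will train with
echo "effective batch size: $((num_gpus * batchsize * updatefreq)) tokens"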

python $fairseq_smrt/train.py \
  $databin \
 --source-lang $src \
 --target-lang $trg \
 --save-dir $traindir \
 --patience 50 --criterion smrt_cross_entropy \
 --paraphraser-model  $fairseq_smrt/smrt-parabank2-fairseq/paraphraser.pt  \
 --paraphraser-data-dir $fairseq_smrt/smrt-parabank2-fairseq/ \
 --paraphraser-sample-topN 100 \
 --prob-use-smrt 0.5 \
 --label-smoothing 0.2 \
 --share-decoder-input-output-embed \
 --arch transformer  --encoder-layers 5 --decoder-layers 5 \
 --encoder-embed-dim 512 --decoder-embed-dim 512 \
 --encoder-ffn-embed-dim 2048 --decoder-ffn-embed-dim 2048 \
 --encoder-attention-heads 2 --decoder-attention-heads 2 \
 --encoder-normalize-before --decoder-normalize-before \
 --dropout 0.4 --attention-dropout 0.2 --relu-dropout 0.2 \
 --weight-decay 0.0001 \
 --optimizer adam --adam-betas '(0.9, 0.98)' --clip-norm 0 \
 --lr-scheduler inverse_sqrt --warmup-updates 4000 --warmup-init-lr 1e-7 \
 --lr 1e-3 --min-lr 1e-9 --no-epoch-checkpoints \
 --max-tokens $batchsize --tensorboard-logdir $traindir/tensorboard \
 --max-epoch 200 --save-interval 10 --update-freq $updatefreq \
 --log-format json --log-interval 500   &> $traindir/train.log
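
# Optional (our addition): monitor training progress by tailing the log in a
# separate shell, or by pointing tensorboard at the log directory, e.g.:
#   tail -f $traindir/train.log
#   tensorboard --logdir $traindir/tensorboard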

# generate translations
python $fairseq_smrt/generate.py $databin \
      --beam 5 --remove-bpe sentencepiece --gen-subset test  \
      --batch-size 100 --lenpen 1.2 --source-lang $src --target-lang $trg \
      --path $traindir/checkpoint_best.pt  \
      &> $traindir/test.log

# score with sacrebleu
grep '^H' $traindir/test.log  | cut -d- -f 2- | sort -n | cut -f 3-  >  $traindir/test.txt 
cat $traindir/test.txt |  sacrebleu $data/test.raw.$trg > $traindir/test.sacrebleu
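
# Optional check (our addition): the hypothesis and reference line counts
# should match, and the final score can be inspected with:
wc -l $traindir/test.txt $data/test.raw.$trg
cat $traindir/test.sacrebleu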

License

fairseq(-py) is MIT-licensed.

Citation

Please cite this work as:

@inproceedings{khayrallah-etal-2020-simulated,
    title = "Simulated multiple reference training improves low-resource machine translation",
    author = "Khayrallah, Huda  and
      Thompson, Brian  and
      Post, Matt  and
      Koehn, Philipp",
    booktitle = "Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)",
    month = nov,
    year = "2020",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/2020.emnlp-main.7",
    pages = "82--89"
    }

in addition to fairseq.

If you replicate the parameters we use, please cite FLoRes. If you use any data preprocessed with SentencePiece, please cite it as well.

The data used to train the paraphraser comes from ParaBank2; please cite that work if you use the released paraphraser.

If you use the GlobalVoices data, please cite OPUS.
