Constrained-LevT

This repository contains the code for the ACL 2020 paper "Lexically Constrained Neural Machine Translation with Levenshtein Transformer". If you use this repository in your work, please cite:

@article{susanto2020lexically,
  title={Lexically Constrained Neural Machine Translation with Levenshtein Transformer},
  author={Susanto, Raymond Hendy and Chollampatt, Shamil and Tan, Liling},
  journal={arXiv preprint arXiv:2004.12681},
  year={2020}
}

Requirements and Installation

PyTorch version >= 1.2.0
Python version >= 3.6

git clone https://github.com/raymondhs/constrained-levt
cd constrained-levt
pip install --editable .

Usage

To replicate the experiments in our paper, download our pretrained models and evaluation sets into the root directory of this repository. These models were trained following the original instructions for training the Levenshtein Transformer model. To preserve each constraint in the output, pass --preserve-constraint. For example:

mkdir -p data-bin
tar -xvzf const_levt_en_de.tgz -C data-bin
cat data-bin/const_levt_en_de/newstest2014-wikt.en \
| python interactive_with_constraints.py \
    data-bin/const_levt_en_de \
    -s en -t de \
    --task translation_lev \
    --path data-bin/const_levt_en_de/checkpoint_best.pt \
    --iter-decode-max-iter 9 \
    --iter-decode-eos-penalty 0 \
    --beam 1 \
    --print-step \
    --batch-size 400 \
    --buffer-size 4000 \
    --preserve-constraint | tee /tmp/gen.out
# ...
# | Translated 3003 sentences (87040 tokens) in 11.5s (261.37 sentences/s, 7575.50 tokens/s)

# Compute term usage rate
cat /tmp/gen.out \
| grep ^H \
| sed 's/^H\-//' \
| sort -n -k 1 \
| cut -f 3 > /tmp/gen.out.sys
python scripts/term_usage_rate.py \
    -i data-bin/const_levt_en_de/newstest2014-wikt.en \
    -s /tmp/gen.out.sys
# Term use rate: 100.000
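The hypothesis-extraction step of the pipeline above (grep, sed, sort, cut) can also be sketched in Python. This is a minimal illustration, not part of the repository: the function name extract_hypotheses is hypothetical, and it assumes each hypothesis line has the form "H-<id><TAB><score><TAB><text>", which is what the shell commands rely on.

```python
def extract_hypotheses(lines):
    """Return hypothesis strings ordered by sentence id.

    Assumption: hypothesis lines look like "H-<id>\t<score>\t<text>",
    mirroring what the grep/sed/sort/cut pipeline expects.
    """
    hyps = []
    for line in lines:
        # Keep only hypothesis lines (the shell version uses `grep ^H`).
        if not line.startswith("H-"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue
        sent_id = int(fields[0][2:])      # strip the "H-" prefix
        hyps.append((sent_id, fields[2])) # third column is the text
    # Restore the original sentence order (the shell version uses `sort -n`).
    hyps.sort(key=lambda pair: pair[0])
    return [text for _, text in hyps]


# Made-up sample lines, for illustration only.
sample = [
    "S-1\tguten morgen\n",
    "H-1\t-0.2\tgood morning\n",
    "H-0\t-0.1\thello world\n",
    "I-0\t2\n",
]
print(extract_hypotheses(sample))  # ['hello world', 'good morning']
```

To apply it to a real run, open /tmp/gen.out, pass the file object to extract_hypotheses, and write the returned strings one per line.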

Each input line is tab-separated: the first column contains the source text, and each remaining column contains one constraint in the format source|||target. A preprocessing script (tokenize.sh) is provided in case you want to try your own input. It runs tokenization, BPE segmentation, and additional preprocessing for Romanian. For example:

echo 'Hello world!' | ./tokenize.sh en data-bin/const_levt_en_de/ende.code
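The input layout described above (source text, then one source|||target constraint per tab-separated column) can be assembled with a small helper. The sketch below is illustrative only: format_input_line and parse_input_line are hypothetical names, not functions from this repository, and the code handles only the column layout. It does not perform the tokenization or BPE segmentation that tokenize.sh runs.

```python
def format_input_line(source, constraints):
    """Join source text and (source_term, target_term) pairs into one line."""
    columns = [source]
    for src_term, tgt_term in constraints:
        columns.append(f"{src_term}|||{tgt_term}")
    return "\t".join(columns)


def parse_input_line(line):
    """Split a line back into the source text and its constraint pairs."""
    source, *columns = line.rstrip("\n").split("\t")
    constraints = []
    for col in columns:
        src_term, sep, tgt_term = col.partition("|||")
        if sep:  # skip columns that do not contain the ||| separator
            constraints.append((src_term, tgt_term))
    return source, constraints


# Made-up example sentence and constraint, for illustration only.
line = format_input_line("Hello world !", [("world", "Welt")])
print(repr(line))              # 'Hello world !\tworld|||Welt'
print(parse_input_line(line))  # ('Hello world !', [('world', 'Welt')])
```

A sentence without constraints is simply the source text with no extra columns.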

License

The code and models in this repository are licensed under the MIT License. The evaluation datasets are licensed under CC-BY-SA 3.0.
